HA router state change takes too much time to notify neutron server

Bug #1612069 reported by LIU Yulong
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
neutron | Fix Released | Low | LIU Yulong |

Bug Description

The ha state change BatchNotifier uses the following calculated interval.

    def _calculate_batch_duration(self):
        # Slave becomes the master after not hearing from it 3 times
        detection_time = self.conf.ha_vrrp_advert_int * 3

        # Keepalived takes a couple of seconds to configure the VIPs
        configuration_time = 2

        # Give it enough slack to batch all events due to the same failure
        return (detection_time + configuration_time) * 2

With the default ha_vrrp_advert_int of 2s, this comes to (2 * 3 + 2) * 2 = 16s before a single HA router state change is reported to the neutron server.
By the time this notification is sent, the ip MonitorDaemon has already set the router to its relevant state.
So there is no need to wait this long.
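To make the arithmetic concrete, here is a minimal standalone sketch of the same calculation. The free function `calculate_batch_duration` and its parameter are illustrative stand-ins for the method above and `self.conf.ha_vrrp_advert_int`; they are not neutron's actual API.

```python
def calculate_batch_duration(ha_vrrp_advert_int=2):
    """Standalone copy of the formula quoted in the bug description."""
    # Failover is detected after 3 missed VRRP advertisements.
    detection_time = ha_vrrp_advert_int * 3
    # Fixed allowance for keepalived to configure the VIPs.
    configuration_time = 2
    # The sum is doubled to batch all events caused by the same failure.
    return (detection_time + configuration_time) * 2


# Print the resulting batch interval for a few advertisement intervals.
for advert_int in (1, 2, 5):
    print(advert_int, calculate_batch_duration(advert_int))
```

With the default of 2s the result is 16s, and the delay grows by 6s for every extra second of advertisement interval.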

Tags: l3-ha
LIU Yulong (dragon889)
description: updated
summary: - HA router state change take too much time to notify neutron server
+ HA router state change takes too much time to notify neutron server
tags: added: l3-ha
LIU Yulong (dragon889) wrote :

Suppose an HA router goes through 8 state changes within this 16s window.
After the 16s, the first change is dequeued and sent to the neutron server, then the 2nd, the 3rd, and so on.
This is where it gets interesting: if during this time you run
`neutron l3-agent-list-hosting-router ha_router_id`
you may see the router state on one agent alternating between active and standby.
It does not reflect the real state, because the notifications are delayed.
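The effect can be illustrated with a toy model. This is only a sketch under a simplifying assumption (every state change reaches the server a fixed 16s late, in order); it is not neutron's BatchNotifier implementation, and `state_at`, the event list, and the timestamps are all hypothetical.

```python
def state_at(events, t, delay=0.0, initial="standby"):
    """Return the last state whose (timestamp + delay) is <= t.

    `events` is a list of (timestamp, state) pairs sorted by timestamp.
    """
    state = initial
    for timestamp, new_state in events:
        if timestamp + delay <= t:
            state = new_state
    return state


# Hypothetical flapping: 8 state changes, one every 2 seconds,
# starting with "active" at t=0 and ending with "standby" at t=14.
events = [(i * 2.0, "active" if i % 2 == 0 else "standby") for i in range(8)]

for t in (17.0, 19.0, 21.0):
    real = state_at(events, t)              # state on the agent
    seen = state_at(events, t, delay=16.0)  # state known to the server
    print(t, real, seen)
```

In this model the router has settled on the agent after the last change, while the state known to the server keeps alternating as the old changes are replayed.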

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/364803

Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: New → In Progress
John Schwarz (jschwarz)
Changed in neutron:
importance: Undecided → Low
LIU Yulong (dragon889) wrote :

After several Rally scenario tests, we found that a 2-second state change notification interval is not overly aggressive.
On the contrary, I would argue that a 16s interval may make the L3 agent pile up more notifications and put the MQ under high pressure.

ENV:
0) stable/mitaka with patch: https://review.openstack.org/#/c/364803/
1) 2 controller nodes: httpd, keystone, glance-api/registry, nova-api, nova-conductor, neutron-server, RabbitMQ
2) 2 DB nodes: mariadb-10.1.12-4.el7
3) 2 network nodes: L3 agent, ovs-agent, openvswitch
4) 40 compute nodes: ovs-agent, openvswitch, nova-compute
5) All nodes have corresponding 10G NICs for both data and external networking.

Here is a quick look at one of our tests:
Rally input: http://paste.openstack.org/show/575963/
Rally output: http://paste.openstack.org/show/575964/
The attachment is the Rally output in HTML format with full test data and results.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/364803
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e795a3fcf882ad8130018f32b57f2f887a1d20da
Submitter: Jenkins
Branch: master

commit e795a3fcf882ad8130018f32b57f2f887a1d20da
Author: LIU Yulong <email address hidden>
Date: Thu Aug 11 16:58:48 2016 +0800

    Make the HA router state change notification more faster

    HA router state change takes too much time to notify neutron server.
    It takes almost 16s, by default ha_vrrp_advert_int 2s, for a single
    HA router state change.

    In this 16s time, assuming that a HA router meets 8 times HA router
    state change. After this 16s, the first change dequeue and notify the
    neutron server, then the 2nd, 3rd, and so on. Things are now becoming
    interesting, after this 16 seconds if you run
    `neutron l3-agent-list-hosting-router ha_router_id`, you may see the
    router state in one specific agent is alternatively changing in active
    and standby. It's not stay in the real state, because of the delay
    notification.

    This patch sets the BatchNotifier interval to ha_vrrp_advert_int (default
    2s) to make the HA router state change notification more faster.

    NOTE: the BatchNotifier event queue is needed, because the HA router state
    change needs to be sent in a proper order. Then the neutron server could set
    the HA state properly.

    Closes-Bug: #1612069
    Change-Id: Ife687038d31bd1e1ee264ff8b6ae1264fdd05489

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b3

This issue was fixed in the openstack/neutron 11.0.0.0b3 development milestone.
