L3 HA: 2 masters after reboot of controller

Bug #1597461 reported by Ann Taraday on 2016-06-29
This bug affects 9 people
Affects: neutron | Importance: High | Assigned to: venkata anil

Bug Description

ENV: Mitaka, 3 controllers, 45 computes, DVR + L3 HA (plain L3 HA is affected as well)

After a reboot of the controller on which an l3 agent is active, another l3 agent becomes active. When the rebooted node recovers, its l3 agent becomes active as well; this leads to extra loss of external connectivity in the tenant network. After some time only one agent remains active, the one from the rebooted node. Sometimes connectivity does not come back at all, as the SNAT port ends up on the wrong host.

The root cause of this problem is that routers are processed by the l3 agent before the openvswitch agent sets up the appropriate HA ports, so for some time the recovered HA routers are isolated from the HA routers on the other hosts and become active.

A possible solution is proper serialization: the l3 agent should process an HA router only after its HA network has been set up on the controller.

With 100 routers and networks this issue has been reproduced on every reboot.

This is actually an L3 HA problem; it is just amplified by DVR, since the number of ports the openvswitch agent has to handle is higher.
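The failure mode above can be illustrated with a toy model (a hypothetical sketch with made-up names, not neutron code): a keepalived instance elects itself master whenever it hears no VRRP adverts from a peer, so spawning it before the l2 agent has wired the HA port guarantees an isolated, wrongly-elected master.

```python
# Toy model of the race: if the l3 agent starts keepalived before the l2 agent
# wires the HA port, the instance hears no peers and elects itself master even
# though another node already holds master state.

class HaRouterInstance:
    def __init__(self, name):
        self.name = name
        self.port_wired = False   # True once the l2 agent plugs the HA port
        self.state = "backup"

    def hears_peer_advert(self, peers):
        # VRRP adverts only flow between instances whose HA ports are wired.
        return self.port_wired and any(
            p.port_wired and p.state == "master" for p in peers if p is not self)

    def elect(self, peers):
        self.state = "backup" if self.hears_peer_advert(peers) else "master"

node1 = HaRouterInstance("node1")
node2 = HaRouterInstance("node2")
node1.port_wired = True
node1.elect([node2])   # node1 becomes master while node2 is rebooting

# node2's l3 agent spawns keepalived BEFORE its HA port is wired:
node2.elect([node1])   # isolated, so it also elects itself master
masters = [n for n in (node1, node2) if n.state == "master"]
print([n.name for n in masters])  # ['node1', 'node2'] -> two masters
```

Once `node2.port_wired` were set before the election, `node2` would hear `node1`'s adverts and stay backup, which is exactly the ordering the serialization fix enforces.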

Rossella Sblendido (rossella-o) wrote :

We were hitting this problem too; we solved it by making sure that the l3 agent is started only after the l2 agent is running.

summary: - L3 HA + DVR: 2 masters after reboot of controller
+ L3 HA: 2 masters after reboot of controller
description: updated
Ann Taraday (akamyshnikova) wrote :

@rossella-o

Yes, but this is a workaround, not a solution. I want to raise a discussion in this bug, as I think we should come up with a solid idea here.

One variant could be putting the HA port in BUILD status until the l2 agent is able to handle it, and only after that processing it with the l3 agent.
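The BUILD-status idea above can be sketched as follows (a minimal toy model with hypothetical names, not the neutron API): the port starts in BUILD, the l2 agent flips it to ACTIVE once wired, and the l3 agent refuses to spawn keepalived for anything not ACTIVE.

```python
# Sketch of gating keepalived on HA port status: BUILD until wired, then ACTIVE.

class HaPort:
    def __init__(self):
        self.status = "BUILD"     # created in BUILD, not ACTIVE

    def l2_agent_wired(self):
        self.status = "ACTIVE"    # the l2 agent flips it once wiring is done

def l3_agent_process(port, spawned):
    """Process the router; spawn keepalived only for an ACTIVE HA port."""
    if port.status != "ACTIVE":
        return False              # defer until the l2 agent notifies us
    spawned.append("keepalived")
    return True

port = HaPort()
spawned = []
assert not l3_agent_process(port, spawned)   # too early: port still BUILD
port.l2_agent_wired()
assert l3_agent_process(port, spawned)       # now safe to enable keepalived
```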

Changed in neutron:
assignee: nobody → Ann Taraday (akamyshnikova)
description: updated
Changed in neutron:
status: New → Confirmed
Brian Haley (brian-haley) wrote :

We've been able to reproduce this internally as well, without even rebooting. I'll add the info here, since multiple active routers seem to be the issue, not necessarily the reboot.

$ neutron router-create --tenant-id 8beb0d59ef0a448d9e0da918931f3c22 --distributed True --ha True r1
Created a new router:
+-------------------------+--------------------------------------+
| Field | Value |
+-------------------------+--------------------------------------+
| admin_state_up | True |
| availability_zone_hints | |
| availability_zones | |
| description | |
| distributed | True |
| external_gateway_info | |
| ha | True |
| id | cb484e8c-4de6-4a50-89d6-e8c53e6f6d4b |
| name | r1 |
| routes | |
| status | ACTIVE |
| tenant_id | 8beb0d59ef0a448d9e0da918931f3c22 |
+-------------------------+--------------------------------------+

$ neutron router-gateway-set cb484e8c-4de6-4a50-89d6-e8c53e6f6d4b 72e69016-085b-40a5-94aa-bdacafd5a075
Set gateway for router cb484e8c-4de6-4a50-89d6-e8c53e6f6d4b

$ neutron net-create n1
Created a new network:
+-------------------------+--------------------------------------+
| Field | Value |
+-------------------------+--------------------------------------+
| admin_state_up | True |
| availability_zone_hints | |
| availability_zones | |
| created_at | 2016-06-15T12:28:46 |
| description | |
| id | 5b644fea-50b7-49b3-b4b3-88a05509f9a0 |
| ipv4_address_scope | |
| ipv6_address_scope | |
| mtu | 1450 |
| name | n1 |
| router:external | False |
| shared | False |
| status | ACTIVE |
| subnets | |
| tags | |
| tenant_id | 8beb0d59ef0a448d9e0da918931f3c22 |
| updated_at | 2016-06-15T12:28:46 |
+-------------------------+--------------------------------------+

$ neutron subnet-create n1 99.99.99.0/24
Created a new subnet:
+-------------------+------------------------------------------------+
| Field | Value ...


John Schwarz (jschwarz) wrote :

Would like to note that https://bugs.launchpad.net/neutron/+bug/1580648 might be the same bug, so this might also block upstream work. I think we should set the priority higher than "Undecided" if it is :)

Changed in neutron:
importance: Undecided → High

Fix proposed to branch: master
Review: https://review.openstack.org/357458

Changed in neutron:
status: Confirmed → In Progress
Changed in neutron:
assignee: Ann Taraday (akamyshnikova) → venkata anil (anil-venkata)
Changed in neutron:
assignee: venkata anil (anil-venkata) → Ann Taraday (akamyshnikova)
Changed in neutron:
assignee: Ann Taraday (akamyshnikova) → venkata anil (anil-venkata)
Changed in neutron:
assignee: venkata anil (anil-venkata) → Ann Taraday (akamyshnikova)
Randeep Jalli (jallirs) wrote :

Is this meant to fix just the split brain, or also the shuffle back and forth between both masters?

Randeep Jalli (jallirs) wrote :

Also, for curiosity's sake, it would be interesting to know which version of keepalived is being used.

Ann Taraday (akamyshnikova) wrote :

@jallirs

The issue described here is not a classical split-brain issue (see https://bugs.launchpad.net/neutron/+bug/1365461); it is reproduced by a reboot of the node, or by restarting the l2 and l3 agents simultaneously.

Keepalived versions used: v1.2.13, v1.2.19

I don't know if the following is the expected behaviour, nor if it is related to this issue, but when we reboot a backup node in a 2-node setup, it becomes the new master after starting up, producing an unnecessary failover and SNAT downtime.

Using DVR neutron 8.1.2, keepalived v1.2.19

Changed in neutron:
assignee: Ann Taraday (akamyshnikova) → John Schwarz (jschwarz)
Hemachandra Reddy (hr858f) wrote :

@Gustavo Randich, it is not just the backup node; it is happening even with the master node. When the master node is rebooted, it assumes the master state once it is up, causing an unnecessary failover once again. All nodes are set to BACKUP state with the same priority in keepalived.conf.

Reviewed: https://review.openstack.org/357458
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=25f5912cf8f69f18d111bd60a6cc6ee488755ff3
Submitter: Jenkins
Branch: master

commit 25f5912cf8f69f18d111bd60a6cc6ee488755ff3
Author: AKamyshnikova <email address hidden>
Date: Thu Aug 18 23:18:40 2016 +0300

    Check for ha port to become ACTIVE

    After reboot(restart of l3 and l2 agents) of the node routers
    can be processed by l3 agent before openvswitch agent sets up
    appropriate ha ports. This change add notification for l3 agent
    that ha port becomes ACTIVE and keepalived can be enabled.

    Closes-bug: #1597461

    Co-Authored-By: venkata anil <email address hidden>

    Change-Id: Iedad1ccae45005efaaa74d5571df04197757d07a

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

Just wanted to clarify the behaviour described in my last comment. It is not related to this issue, but to keepalived's VRRP implementation, which preempts equal-priority BACKUP nodes when a node with a higher IP address comes back online (see https://github.com/acassen/keepalived/issues/107).

To avoid this unnecessary failback when the l2 and l3 services are restarted on any node, I've configured one of my two nodes with a higher priority, as suggested here: http://serverfault.com/a/579979 . Nonetheless, if I reboot the higher-priority node, it preempts the other node when it comes back online.

This is obviously a keepalived limitation, but I wanted to make it clear that with the default generated keepalived.conf we are experiencing VIP flapping (and extra downtime).

Hemachandra Reddy (hr858f) wrote :

Thank you for those details; they are very useful. We see exactly the same issue.

Change abandoned by Dongcan Ye (<email address hidden>) on branch: master
Review: https://review.openstack.org/342730
Reason: Fixed in https://review.openstack.org/#/c/366493/

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/382191
Reason: Liberty is in CVE only mode.

Change abandoned by venkata anil (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/385395
Reason: in favor of https://review.openstack.org/#/c/364407/

Reviewed: https://review.openstack.org/364407
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5860fb21e966ab8f1e011654dd477d7af35f7a27
Submitter: Jenkins
Branch: stable/mitaka

commit 5860fb21e966ab8f1e011654dd477d7af35f7a27
Author: venkata anil <email address hidden>
Date: Wed Oct 12 10:57:46 2016 +0000

    Check for ha port to become ACTIVE

    After reboot(restart of l3 and l2 agents) of the node routers
    can be processed by l3 agent before openvswitch agent sets up
    appropriate ha ports. This change add notification for l3 agent
    that ha port becomes ACTIVE and keepalived can be enabled.

    note: Release notes added to specify l3 agent dependency on neutron
    server.

    Closes-bug: #1597461

    Co-Authored-By: venkata anil <email address hidden>

    (cherry picked from commit 25f5912cf8f69f18d111bd60a6cc6ee488755ff3)

    Conflicts:
            neutron/db/l3_hascheduler_db.py
            neutron/services/l3_router/l3_router_plugin.py
            neutron/tests/unit/plugins/ml2/test_plugin.py
            neutron/tests/functional/agent/l3/test_ha_router.py
            releasenotes/notes/l3ha-agent-server-dependency-1fcb775328ac4502.yaml

    Change-Id: Iedad1ccae45005efaaa74d5571df04197757d07a
    (cherry picked from commit 4ad841c4cf1b23695a792ea6facf1dbf66cb48e9)

    split out l3-ha specific test from TestMl2PortsV2

    split out test_update_port_status_notify_port_event_after_update
    from ml2.test_plugin.TestMl2PortsV2 into TestMl2PortsV2WithL3

    The change set of 25f5912cf8f69f18d111bd60a6cc6ee488755ff3
    change id of Iedad1ccae45005efaaa74d5571df04197757d07a
    introduced a test,
    test_update_port_status_notify_port_event_after_update, that is valid
    only when l3 plugin support l3-ha. Such assumption isn't always true
    depending on actual ml2 driver.
    Since test cases in ml2.test_plugin is used as a common base for
    multiple drivers,
    test_update_port_status_notify_port_event_after_update, may or may not
    pass. So split out tests with very specific assumption into a new
    dedicated testcase so that each driver can safely reuse tests in
    tests/unit/plugin/ml2 based on their characteristics.

    Conflicts:
            neutron/tests/unit/plugins/ml2/test_plugin.py

    Closes-Bug: #1618601
    Change-Id: Ie81dde976649111d029a7d107c99960aded64915
    (cherry picked from commit 03c412ff011a8d4e86afbada24db675028861728)

    Change-Id: Iedad1ccae45005efaaa74d5571df04197757d07a
    (cherry picked from commit 4ad841c4cf1b23695a792ea6facf1dbf66cb48e9)

tags: added: in-stable-mitaka

This issue was fixed in the openstack/neutron 8.4.0 release.

venkata anil (anil-venkata) wrote :

https://review.openstack.org/#/c/357458/ can't completely resolve the issue.

I have a two node setup; node2 is hosting some HA master routers. I am trying to reboot node2. Before the reboot, the 'status' of all HA network ports on node2 is 'ACTIVE' (the same status is stored in the DB). I have rebooted node2.
1) Before node2 is up, keepalived on node1 turns some routers to master.
2) When node2 is up, it will try to run the l2 and l3 agents.
3) Then the l3 agent, through fetch_and_sync_all_routers, gets all the HA router ports it was hosting, but with status 'ACTIVE', as that status was stored in the DB before the shutdown. The l3 agent will now spawn keepalived because the HA network port status is ACTIVE, even though the l2 agent has (sometimes) not yet wired the port.
As the wiring is not yet done on node2, keepalived on node2 transitions the HA router to master.
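The residual race just described boils down to trusting a stale stored status. A hypothetical sketch (toy names, not neutron code):

```python
# The DB still reports ACTIVE from before the reboot, so an l3 agent that
# trusts the stored status spawns keepalived on a port nothing has rewired yet.

db_status = {"ha-port-node2": "ACTIVE"}   # persisted before node2 went down

def spawns_keepalived_prematurely(db_status, port_id, port_actually_wired):
    # The agent only checks the stored status; if the DB is stale, keepalived
    # starts on an isolated port and that router elects itself master.
    return db_status[port_id] == "ACTIVE" and not port_actually_wired

# Right after the reboot: DB says ACTIVE, but the l2 agent has wired nothing.
premature = spawns_keepalived_prematurely(db_status, "ha-port-node2", False)
print(premature)  # True: this is the window that produces two masters
```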

Changed in neutron:
status: Fix Released → Confirmed
Changed in neutron:
assignee: John Schwarz (jschwarz) → venkata anil (anil-venkata)

Fix proposed to branch: master
Review: https://review.openstack.org/470905

Changed in neutron:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/470905
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d730b1010277138136512eb6efb12ab893ca6793
Submitter: Jenkins
Branch: master

commit d730b1010277138136512eb6efb12ab893ca6793
Author: venkata anil <email address hidden>
Date: Mon Jun 5 09:56:18 2017 +0000

    Set HA network port to DOWN when l3 agent starts

    When l3 agent node is rebooted, if HA network port status is already
    ACTIVE in DB, agent will get this status from server and then spawn
    the keepalived (though l2 agent might not have wired the port),
    resulting in multiple HA masters active at the same time.

    To fix this, when the L3 agent starts up we can have it explicitly
    set the port status to DOWN for all of the HA ports on that node.
    Then we are guaranteed that when they go to ACTIVE it will be because
    the L2 agent has wired the ports.

    Closes-bug: #1597461
    Change-Id: Ib0c8a71b6ff97e43a414f3db4882914b12170d53

Changed in neutron:
status: In Progress → Fix Released
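The merged fix can be sketched as a toy model (hypothetical names, not the actual neutron code): at l3 agent startup, every HA port on the node is forced to DOWN, so the only path back to ACTIVE runs through the l2 agent actually rewiring the port.

```python
# Reset HA port statuses to DOWN on l3 agent start; ACTIVE then implies wired.

db = {"port-1": "ACTIVE", "port-2": "ACTIVE"}   # stale statuses from before reboot

def l3_agent_startup(db):
    for port_id in db:
        db[port_id] = "DOWN"    # server-side reset of this node's HA ports

def l2_agent_wire(db, port_id):
    db[port_id] = "ACTIVE"      # only the l2 agent flips a port back up

def may_spawn_keepalived(db, port_id):
    return db[port_id] == "ACTIVE"

l3_agent_startup(db)
assert not may_spawn_keepalived(db, "port-1")   # stale ACTIVE no longer trusted
l2_agent_wire(db, "port-1")
assert may_spawn_keepalived(db, "port-1")       # ACTIVE now means "really wired"
```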

Reviewed: https://review.openstack.org/473819
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=633b452e28b7a95ced1917257ca0e200cbffa4ba
Submitter: Jenkins
Branch: stable/ocata

commit 633b452e28b7a95ced1917257ca0e200cbffa4ba
Author: venkata anil <email address hidden>
Date: Mon Jun 5 09:56:18 2017 +0000

    Set HA network port to DOWN when l3 agent starts

    When l3 agent node is rebooted, if HA network port status is already
    ACTIVE in DB, agent will get this status from server and then spawn
    the keepalived (though l2 agent might not have wired the port),
    resulting in multiple HA masters active at the same time.

    To fix this, when the L3 agent starts up we can have it explicitly
    set the port status to DOWN for all of the HA ports on that node.
    Then we are guaranteed that when they go to ACTIVE it will be because
    the L2 agent has wired the ports.

    Closes-bug: #1597461
    Change-Id: Ib0c8a71b6ff97e43a414f3db4882914b12170d53
    (cherry picked from commit d730b1010277138136512eb6efb12ab893ca6793)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/473820
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=90c24e9d263eaef66369210f06c158b1425996aa
Submitter: Jenkins
Branch: stable/newton

commit 90c24e9d263eaef66369210f06c158b1425996aa
Author: venkata anil <email address hidden>
Date: Mon Jun 5 09:56:18 2017 +0000

    Set HA network port to DOWN when l3 agent starts

    When l3 agent node is rebooted, if HA network port status is already
    ACTIVE in DB, agent will get this status from server and then spawn
    the keepalived (though l2 agent might not have wired the port),
    resulting in multiple HA masters active at the same time.

    To fix this, when the L3 agent starts up we can have it explicitly
    set the port status to DOWN for all of the HA ports on that node.
    Then we are guaranteed that when they go to ACTIVE it will be because
    the L2 agent has wired the ports.

    Closes-bug: #1597461
    Change-Id: Ib0c8a71b6ff97e43a414f3db4882914b12170d53
    (cherry picked from commit d730b1010277138136512eb6efb12ab893ca6793)

tags: added: in-stable-newton

Change abandoned by Joshua Hesketh (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/473821
Reason: This branch (stable/mitaka) is at End Of Life

Reviewed: https://review.openstack.org/471575
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9647d68fdbaabf1a340eab0116cb4889e9ae7962
Submitter: Jenkins
Branch: master

commit 9647d68fdbaabf1a340eab0116cb4889e9ae7962
Author: venkata anil <email address hidden>
Date: Wed Jul 5 16:35:37 2017 +0300

    New RPC to set HA network port status to DOWN

    In commit 500b255278ab41974fe6febd9a3ed13de5ddf3f6 we are using
    "get_router_ids" RPC to update HA network port status. But that
    was needed to backport that commit to other branches.
    As "get_router_ids" RPC is expected to fetch only router ids and
    not to have any other processing, we are adding new RPC
    "update_ha_network_port_status". L3 agent will call this new RPC
    to set HA network port status to DOWN.

    Related-bug: #1597461
    Change-Id: I8f34c4f5178d2b422cfcfd082dfc9cf3f89a5d95

This issue was fixed in the openstack/neutron 11.0.0.0b3 development milestone.

This issue was fixed in the openstack/neutron 9.4.1 release.

This issue was fixed in the openstack/neutron 10.0.3 release.

Reviewed: https://review.openstack.org/522641
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9ed693228f90251c0f03fb842ef19628b439f9bc
Submitter: Zuul
Branch: master

commit 9ed693228f90251c0f03fb842ef19628b439f9bc
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595

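The feedback loop described in the commit message above can be modeled in a few lines (hypothetical names, a sketch rather than neutron code): resetting HA ports inside every full sync lets each AGENT_REVIVED re-sync trigger another reset and more churn, whereas resetting once at startup breaks the loop.

```python
# Compare: reset-per-sync (old behaviour) vs. reset-once-at-start (the fix).

resets = []

def fetch_and_sync_all_routers(reset_ports_here):
    if reset_ports_here:
        resets.append("reset")      # old behaviour: reset on every full sync

def agent_start(revived_syncs, reset_on_start):
    if reset_on_start:
        resets.append("reset")      # new behaviour: reset exactly once
    fetch_and_sync_all_routers(not reset_on_start)     # initial full sync
    for _ in range(revived_syncs):  # AGENT_REVIVED re-syncs under load
        fetch_and_sync_all_routers(not reset_on_start)

agent_start(revived_syncs=3, reset_on_start=False)
old_resets = len(resets)            # one reset per sync, feeding the loop
resets.clear()
agent_start(revived_syncs=3, reset_on_start=True)
new_resets = len(resets)            # a single reset at startup
print(old_resets, new_resets)       # 4 1
```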
Reviewed: https://review.openstack.org/522784
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f6560d14b6125906048b74c65f1f974b31206df3
Submitter: Zuul
Branch: stable/pike

commit f6560d14b6125906048b74c65f1f974b31206df3
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
    (cherry picked from commit 9ed693228f90251c0f03fb842ef19628b439f9bc)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/522792
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=385ac553e33f12c34e8a23459337b2f0af0b75eb
Submitter: Zuul
Branch: stable/ocata

commit 385ac553e33f12c34e8a23459337b2f0af0b75eb
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7
    Conflicts:
     neutron/agent/l3/agent.py
            neutron/api/rpc/handlers/l3_rpc.py

    Note: This RPC update_all_ha_network_port_statuses is added in only pike
    and later branches. In older branches, we were using get_router_ids RPC
    to invoke _update_ha_network_port_status. As we need to invoke this
    functionality during l3 agent start and get_service_plugin_list() is the
    only available RPC which is called during l3 agent start, we call
    _update_ha_network_port_status from get_service_plugin_list.

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
    (cherry picked from commit 9ab1ad1433d54fec3e5b04f1edf8ca436e1f7af1)
    (cherry picked from commit a6d985bbca57b5027eecaa43071964b14d9075d9)
