L3 HA: 2 masters after reboot of controller

Bug #1597461 reported by Ann Taraday on 2016-06-29
This bug affects 9 people
Affects: neutron | Importance: High | Assigned to: venkata anil

Bug Description

ENV: Mitaka, 3 controllers, 45 computes, DVR + L3 HA (plain L3 HA is affected as well)

After a reboot of the controller on which an l3 agent is active, another l3 agent becomes active. When the rebooted node recovers, its l3 agent becomes active as well; this leads to extra loss of external connectivity in the tenant network. After some time only one agent remains active, the one from the rebooted node. Sometimes connectivity does not come back at all, as the SNAT port ends up on the wrong host.

The root cause of this problem is that routers are processed by the l3 agent before the openvswitch agent sets up the appropriate HA ports, so for some time the recovered HA routers are isolated from the HA routers on the other hosts and become active.

A possible solution is proper serialization: the l3 agent should process an HA router only after its HA network has been set up on the controller.

With 100 routers and networks this issue has been reproduced on every reboot.

This is actually an L3 HA problem; it is just amplified by DVR, since the number of ports the openvswitch agent has to handle is higher.
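The failure mode above can be illustrated with a toy model (a hypothetical sketch with made-up names, not neutron code): a keepalived instance elects itself master whenever it hears no VRRP adverts from a peer, so spawning it before the l2 agent has wired the HA port guarantees an isolated, wrongly-elected master.

```python
# Toy model of the race: if the l3 agent starts keepalived before the l2 agent
# wires the HA port, the instance hears no peers and elects itself master even
# though another node already holds master state.

class HaRouterInstance:
    def __init__(self, name):
        self.name = name
        self.port_wired = False   # True once the l2 agent plugs the HA port
        self.state = "backup"

    def hears_peer_advert(self, peers):
        # VRRP adverts only flow between instances whose HA ports are wired.
        return self.port_wired and any(
            p.port_wired and p.state == "master" for p in peers if p is not self)

    def elect(self, peers):
        self.state = "backup" if self.hears_peer_advert(peers) else "master"

node1 = HaRouterInstance("node1")
node2 = HaRouterInstance("node2")
node1.port_wired = True
node1.elect([node2])   # node1 becomes master while node2 is rebooting

# node2's l3 agent spawns keepalived BEFORE its HA port is wired:
node2.elect([node1])   # isolated, so it also elects itself master
masters = [n for n in (node1, node2) if n.state == "master"]
print([n.name for n in masters])  # ['node1', 'node2'] -> two masters
```

Once `node2.port_wired` were set before the election, `node2` would hear `node1`'s adverts and stay backup, which is exactly the ordering the serialization fix enforces.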

Rossella Sblendido (rossella-o) wrote :

We were hitting this problem too; we solved it by making sure that the l3 agent is started only after the l2 agent is running.

summary: - L3 HA + DVR: 2 masters after reboot of controller
+ L3 HA: 2 masters after reboot of controller
description: updated
Ann Taraday (akamyshnikova) wrote :

@rossella-o

Yes, but this is a workaround, not a solution. I want to raise a discussion in this bug, as I think we should come up with a solid idea here.

One variant could be putting the HA port in BUILD status until the l2 agent is able to handle it, and only after that processing it with the l3 agent.
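The BUILD-status idea above can be sketched as follows (a minimal toy model with hypothetical names, not the neutron API): the port starts in BUILD, the l2 agent flips it to ACTIVE once wired, and the l3 agent refuses to spawn keepalived for anything not ACTIVE.

```python
# Sketch of gating keepalived on HA port status: BUILD until wired, then ACTIVE.

class HaPort:
    def __init__(self):
        self.status = "BUILD"     # created in BUILD, not ACTIVE

    def l2_agent_wired(self):
        self.status = "ACTIVE"    # the l2 agent flips it once wiring is done

def l3_agent_process(port, spawned):
    """Process the router; spawn keepalived only for an ACTIVE HA port."""
    if port.status != "ACTIVE":
        return False              # defer until the l2 agent notifies us
    spawned.append("keepalived")
    return True

port = HaPort()
spawned = []
assert not l3_agent_process(port, spawned)   # too early: port still BUILD
port.l2_agent_wired()
assert l3_agent_process(port, spawned)       # now safe to enable keepalived
```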

Changed in neutron:
assignee: nobody → Ann Taraday (akamyshnikova)
description: updated
Changed in neutron:
status: New → Confirmed
Brian Haley (brian-haley) wrote :

We've been able to reproduce this internally as well, without even rebooting. I'll add the info here, since multiple active routers seem to be the issue, not necessarily the reboot.

$ neutron router-create --tenant-id 8beb0d59ef0a448d9e0da918931f3c22 --distributed True --ha True r1
Created a new router:
+-------------------------+--------------------------------------+
| Field | Value |
+-------------------------+--------------------------------------+
| admin_state_up | True |
| availability_zone_hints | |
| availability_zones | |
| description | |
| distributed | True |
| external_gateway_info | |
| ha | True |
| id | cb484e8c-4de6-4a50-89d6-e8c53e6f6d4b |
| name | r1 |
| routes | |
| status | ACTIVE |
| tenant_id | 8beb0d59ef0a448d9e0da918931f3c22 |
+-------------------------+--------------------------------------+

$ neutron router-gateway-set cb484e8c-4de6-4a50-89d6-e8c53e6f6d4b 72e69016-085b-40a5-94aa-bdacafd5a075
Set gateway for router cb484e8c-4de6-4a50-89d6-e8c53e6f6d4b

$ neutron net-create n1
Created a new network:
+-------------------------+--------------------------------------+
| Field | Value |
+-------------------------+--------------------------------------+
| admin_state_up | True |
| availability_zone_hints | |
| availability_zones | |
| created_at | 2016-06-15T12:28:46 |
| description | |
| id | 5b644fea-50b7-49b3-b4b3-88a05509f9a0 |
| ipv4_address_scope | |
| ipv6_address_scope | |
| mtu | 1450 |
| name | n1 |
| router:external | False |
| shared | False |
| status | ACTIVE |
| subnets | |
| tags | |
| tenant_id | 8beb0d59ef0a448d9e0da918931f3c22 |
| updated_at | 2016-06-15T12:28:46 |
+-------------------------+--------------------------------------+

$ neutron subnet-create n1 99.99.99.0/24
Created a new subnet:
+-------------------+------------------------------------------------+
| Field | Value ...


John Schwarz (jschwarz) wrote :

Would like to note that https://bugs.launchpad.net/neutron/+bug/1580648 might be the same bug, so this might also block upstream work. I think we should set the priority higher than "Undecided" if it is :)

Changed in neutron:
importance: Undecided → High

Fix proposed to branch: master
Review: https://review.openstack.org/357458

Changed in neutron:
status: Confirmed → In Progress
Changed in neutron:
assignee: Ann Taraday (akamyshnikova) → venkata anil (anil-venkata)
Changed in neutron:
assignee: venkata anil (anil-venkata) → Ann Taraday (akamyshnikova)
Changed in neutron:
assignee: Ann Taraday (akamyshnikova) → venkata anil (anil-venkata)
Changed in neutron:
assignee: venkata anil (anil-venkata) → Ann Taraday (akamyshnikova)
Randeep Jalli (jallirs) wrote :

Is this meant to fix just the split brain, or also the shuffle back and forth between both masters?

Randeep Jalli (jallirs) wrote :

Also, for curiosity's sake, it would be interesting to know which version of keepalived is being used.

Ann Taraday (akamyshnikova) wrote :

@jallirs

The issue described here is not a classical split-brain issue (see https://bugs.launchpad.net/neutron/+bug/1365461); it is reproduced by a reboot of the node, or by restarting the l2 and l3 agents simultaneously.

Keepalived versions used: v1.2.13, v1.2.19

I don't know if the following is the expected behaviour, nor if it is related to this issue, but when we reboot a backup node in a 2-node setup, it becomes the new master after starting up, producing an unnecessary failover and SNAT downtime.

Using DVR neutron 8.1.2, keepalived v1.2.19

Changed in neutron:
assignee: Ann Taraday (akamyshnikova) → John Schwarz (jschwarz)
Hemachandra Reddy (hr858f) wrote :

@Gustavo Randich, it is not just the backup node; it is happening even with the master node. When the master node is rebooted, it assumes the master state once it is up, causing an unnecessary failover once again. All nodes are set to BACKUP state with the same priority in keepalived.conf.

Reviewed: https://review.openstack.org/357458
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=25f5912cf8f69f18d111bd60a6cc6ee488755ff3
Submitter: Jenkins
Branch: master

commit 25f5912cf8f69f18d111bd60a6cc6ee488755ff3
Author: AKamyshnikova <email address hidden>
Date: Thu Aug 18 23:18:40 2016 +0300

    Check for ha port to become ACTIVE

    After reboot(restart of l3 and l2 agents) of the node routers
    can be processed by l3 agent before openvswitch agent sets up
    appropriate ha ports. This change add notification for l3 agent
    that ha port becomes ACTIVE and keepalived can be enabled.

    Closes-bug: #1597461

    Co-Authored-By: venkata anil <email address hidden>

    Change-Id: Iedad1ccae45005efaaa74d5571df04197757d07a

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

Just wanted to clarify the behaviour described in my last comment. It is not related to this issue, but to keepalived's VRRP implementation, which preempts equal-priority BACKUP nodes when a node with a higher IP address comes back online (see https://github.com/acassen/keepalived/issues/107).

To avoid this unnecessary failback when the l2 and l3 services are restarted on any node, I've configured one of my two nodes with a higher priority, as suggested here: http://serverfault.com/a/579979 . Nonetheless, if I reboot the higher-priority node, it preempts the other node when it comes back online.

This is obviously a keepalived limitation, but I wanted to make it clear that with the default generated keepalived.conf we are experiencing VIP flapping (and extra downtime).

Hemachandra Reddy (hr858f) wrote :

Thank you for those details; they are very useful. We see exactly the same issue.

Change abandoned by Dongcan Ye (<email address hidden>) on branch: master
Review: https://review.openstack.org/342730
Reason: Fixed in https://review.openstack.org/#/c/366493/

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/382191
Reason: Liberty is in CVE only mode.

Change abandoned by venkata anil (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/385395
Reason: in favor of https://review.openstack.org/#/c/364407/

Reviewed: https://review.openstack.org/364407
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5860fb21e966ab8f1e011654dd477d7af35f7a27
Submitter: Jenkins
Branch: stable/mitaka

commit 5860fb21e966ab8f1e011654dd477d7af35f7a27
Author: venkata anil <email address hidden>
Date: Wed Oct 12 10:57:46 2016 +0000

    Check for ha port to become ACTIVE

    After reboot(restart of l3 and l2 agents) of the node routers
    can be processed by l3 agent before openvswitch agent sets up
    appropriate ha ports. This change add notification for l3 agent
    that ha port becomes ACTIVE and keepalived can be enabled.

    note: Release notes added to specify l3 agent dependency on neutron
    server.

    Closes-bug: #1597461

    Co-Authored-By: venkata anil <email address hidden>

    (cherry picked from commit 25f5912cf8f69f18d111bd60a6cc6ee488755ff3)

    Conflicts:
            neutron/db/l3_hascheduler_db.py
            neutron/services/l3_router/l3_router_plugin.py
            neutron/tests/unit/plugins/ml2/test_plugin.py
            neutron/tests/functional/agent/l3/test_ha_router.py
            releasenotes/notes/l3ha-agent-server-dependency-1fcb775328ac4502.yaml

    Change-Id: Iedad1ccae45005efaaa74d5571df04197757d07a
    (cherry picked from commit 4ad841c4cf1b23695a792ea6facf1dbf66cb48e9)

    split out l3-ha specific test from TestMl2PortsV2

    split out test_update_port_status_notify_port_event_after_update
    from ml2.test_plugin.TestMl2PortsV2 into TestMl2PortsV2WithL3

    The change set of 25f5912cf8f69f18d111bd60a6cc6ee488755ff3
    change id of Iedad1ccae45005efaaa74d5571df04197757d07a
    introduced a test,
    test_update_port_status_notify_port_event_after_update, that is valid
    only when l3 plugin support l3-ha. Such assumption isn't always true
    depending on actual ml2 driver.
    Since test cases in ml2.test_plugin is used as a common base for
    multiple drivers,
    test_update_port_status_notify_port_event_after_update, may or may not
    pass. So split out tests with very specific assumption into a new
    dedicated testcase so that each driver can safely reuse tests in
    tests/unit/plugin/ml2 based on their characteristics.

    Conflicts:
            neutron/tests/unit/plugins/ml2/test_plugin.py

    Closes-Bug: #1618601
    Change-Id: Ie81dde976649111d029a7d107c99960aded64915
    (cherry picked from commit 03c412ff011a8d4e86afbada24db675028861728)

    Change-Id: Iedad1ccae45005efaaa74d5571df04197757d07a
    (cherry picked from commit 4ad841c4cf1b23695a792ea6facf1dbf66cb48e9)

tags: added: in-stable-mitaka

This issue was fixed in the openstack/neutron 8.4.0 release.

venkata anil (anil-venkata) wrote :

https://review.openstack.org/#/c/357458/ can't completely resolve the issue.

I have a two node setup; node2 is hosting some HA master routers. I am trying to reboot node2. Before the reboot, the 'status' of all HA network ports on node2 is 'ACTIVE' (the same status is stored in the DB). I have rebooted node2.
1) Before node2 is up, keepalived on node1 turns some routers to master.
2) When node2 is up, it will try to run the l2 and l3 agents.
3) Then the l3 agent, through fetch_and_sync_all_routers, gets all the HA router ports it was hosting, but with status 'ACTIVE', as that status was stored in the DB before the shutdown. The l3 agent will now spawn keepalived because the HA network port status is ACTIVE, even though the l2 agent has (sometimes) not yet wired the port.
As the wiring is not yet done on node2, keepalived on node2 transitions the HA router to master.
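The residual race just described boils down to trusting a stale stored status. A hypothetical sketch (toy names, not neutron code):

```python
# The DB still reports ACTIVE from before the reboot, so an l3 agent that
# trusts the stored status spawns keepalived on a port nothing has rewired yet.

db_status = {"ha-port-node2": "ACTIVE"}   # persisted before node2 went down

def spawns_keepalived_prematurely(db_status, port_id, port_actually_wired):
    # The agent only checks the stored status; if the DB is stale, keepalived
    # starts on an isolated port and that router elects itself master.
    return db_status[port_id] == "ACTIVE" and not port_actually_wired

# Right after the reboot: DB says ACTIVE, but the l2 agent has wired nothing.
premature = spawns_keepalived_prematurely(db_status, "ha-port-node2", False)
print(premature)  # True: this is the window that produces two masters
```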

Changed in neutron:
status: Fix Released → Confirmed
Changed in neutron:
assignee: John Schwarz (jschwarz) → venkata anil (anil-venkata)

Fix proposed to branch: master
Review: https://review.openstack.org/470905

Changed in neutron:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/470905
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d730b1010277138136512eb6efb12ab893ca6793
Submitter: Jenkins
Branch: master

commit d730b1010277138136512eb6efb12ab893ca6793
Author: venkata anil <email address hidden>
Date: Mon Jun 5 09:56:18 2017 +0000

    Set HA network port to DOWN when l3 agent starts

    When l3 agent node is rebooted, if HA network port status is already
    ACTIVE in DB, agent will get this status from server and then spawn
    the keepalived (though l2 agent might not have wired the port),
    resulting in multiple HA masters active at the same time.

    To fix this, when the L3 agent starts up we can have it explicitly
    set the port status to DOWN for all of the HA ports on that node.
    Then we are guaranteed that when they go to ACTIVE it will be because
    the L2 agent has wired the ports.

    Closes-bug: #1597461
    Change-Id: Ib0c8a71b6ff97e43a414f3db4882914b12170d53

Changed in neutron:
status: In Progress → Fix Released
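The merged fix can be sketched as a toy model (hypothetical names, not the actual neutron code): at l3 agent startup, every HA port on the node is forced to DOWN, so the only path back to ACTIVE runs through the l2 agent actually rewiring the port.

```python
# Reset HA port statuses to DOWN on l3 agent start; ACTIVE then implies wired.

db = {"port-1": "ACTIVE", "port-2": "ACTIVE"}   # stale statuses from before reboot

def l3_agent_startup(db):
    for port_id in db:
        db[port_id] = "DOWN"    # server-side reset of this node's HA ports

def l2_agent_wire(db, port_id):
    db[port_id] = "ACTIVE"      # only the l2 agent flips a port back up

def may_spawn_keepalived(db, port_id):
    return db[port_id] == "ACTIVE"

l3_agent_startup(db)
assert not may_spawn_keepalived(db, "port-1")   # stale ACTIVE no longer trusted
l2_agent_wire(db, "port-1")
assert may_spawn_keepalived(db, "port-1")       # ACTIVE now means "really wired"
```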

Reviewed: https://review.openstack.org/473819
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=633b452e28b7a95ced1917257ca0e200cbffa4ba
Submitter: Jenkins
Branch: stable/ocata

commit 633b452e28b7a95ced1917257ca0e200cbffa4ba
Author: venkata anil <email address hidden>
Date: Mon Jun 5 09:56:18 2017 +0000

    Set HA network port to DOWN when l3 agent starts

    When l3 agent node is rebooted, if HA network port status is already
    ACTIVE in DB, agent will get this status from server and then spawn
    the keepalived (though l2 agent might not have wired the port),
    resulting in multiple HA masters active at the same time.

    To fix this, when the L3 agent starts up we can have it explicitly
    set the port status to DOWN for all of the HA ports on that node.
    Then we are guaranteed that when they go to ACTIVE it will be because
    the L2 agent has wired the ports.

    Closes-bug: #1597461
    Change-Id: Ib0c8a71b6ff97e43a414f3db4882914b12170d53
    (cherry picked from commit d730b1010277138136512eb6efb12ab893ca6793)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/473820
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=90c24e9d263eaef66369210f06c158b1425996aa
Submitter: Jenkins
Branch: stable/newton

commit 90c24e9d263eaef66369210f06c158b1425996aa
Author: venkata anil <email address hidden>
Date: Mon Jun 5 09:56:18 2017 +0000

    Set HA network port to DOWN when l3 agent starts

    When l3 agent node is rebooted, if HA network port status is already
    ACTIVE in DB, agent will get this status from server and then spawn
    the keepalived (though l2 agent might not have wired the port),
    resulting in multiple HA masters active at the same time.

    To fix this, when the L3 agent starts up we can have it explicitly
    set the port status to DOWN for all of the HA ports on that node.
    Then we are guaranteed that when they go to ACTIVE it will be because
    the L2 agent has wired the ports.

    Closes-bug: #1597461
    Change-Id: Ib0c8a71b6ff97e43a414f3db4882914b12170d53
    (cherry picked from commit d730b1010277138136512eb6efb12ab893ca6793)

tags: added: in-stable-newton

Change abandoned by Joshua Hesketh (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/473821
Reason: This branch (stable/mitaka) is at End Of Life

Reviewed: https://review.openstack.org/471575
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9647d68fdbaabf1a340eab0116cb4889e9ae7962
Submitter: Jenkins
Branch: master

commit 9647d68fdbaabf1a340eab0116cb4889e9ae7962
Author: venkata anil <email address hidden>
Date: Wed Jul 5 16:35:37 2017 +0300

    New RPC to set HA network port status to DOWN

    In commit 500b255278ab41974fe6febd9a3ed13de5ddf3f6 we are using
    "get_router_ids" RPC to update HA network port status. But that
    was needed to backport that commit to other branches.
    As "get_router_ids" RPC is expected to fetch only router ids and
    not to have any other processing, we are adding new RPC
    "update_ha_network_port_status". L3 agent will call this new RPC
    to set HA network port status to DOWN.

    Related-bug: #1597461
    Change-Id: I8f34c4f5178d2b422cfcfd082dfc9cf3f89a5d95

This issue was fixed in the openstack/neutron 11.0.0.0b3 development milestone.

This issue was fixed in the openstack/neutron 9.4.1 release.

This issue was fixed in the openstack/neutron 10.0.3 release.

Reviewed: https://review.openstack.org/522641
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9ed693228f90251c0f03fb842ef19628b439f9bc
Submitter: Zuul
Branch: master

commit 9ed693228f90251c0f03fb842ef19628b439f9bc
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595

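The feedback loop described in the commit message above can be modeled in a few lines (hypothetical names, a sketch rather than neutron code): resetting HA ports inside every full sync lets each AGENT_REVIVED re-sync trigger another reset and more churn, whereas resetting once at startup breaks the loop.

```python
# Compare: reset-per-sync (old behaviour) vs. reset-once-at-start (the fix).

resets = []

def fetch_and_sync_all_routers(reset_ports_here):
    if reset_ports_here:
        resets.append("reset")      # old behaviour: reset on every full sync

def agent_start(revived_syncs, reset_on_start):
    if reset_on_start:
        resets.append("reset")      # new behaviour: reset exactly once
    fetch_and_sync_all_routers(not reset_on_start)     # initial full sync
    for _ in range(revived_syncs):  # AGENT_REVIVED re-syncs under load
        fetch_and_sync_all_routers(not reset_on_start)

agent_start(revived_syncs=3, reset_on_start=False)
old_resets = len(resets)            # one reset per sync, feeding the loop
resets.clear()
agent_start(revived_syncs=3, reset_on_start=True)
new_resets = len(resets)            # a single reset at startup
print(old_resets, new_resets)       # 4 1
```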
Reviewed: https://review.openstack.org/522784
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f6560d14b6125906048b74c65f1f974b31206df3
Submitter: Zuul
Branch: stable/pike

commit f6560d14b6125906048b74c65f1f974b31206df3
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
    (cherry picked from commit 9ed693228f90251c0f03fb842ef19628b439f9bc)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/522792
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=385ac553e33f12c34e8a23459337b2f0af0b75eb
Submitter: Zuul
Branch: stable/ocata

commit 385ac553e33f12c34e8a23459337b2f0af0b75eb
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7
    Conflicts:
     neutron/agent/l3/agent.py
            neutron/api/rpc/handlers/l3_rpc.py

    Note: This RPC update_all_ha_network_port_statuses is added in only pike
    and later branches. In older branches, we were using get_router_ids RPC
    to invoke _update_ha_network_port_status. As we need to invoke this
    functionality during l3 agent start and get_service_plugin_list() is the
    only available RPC which is called during l3 agent start, we call
    _update_ha_network_port_status from get_service_plugin_list.

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
    (cherry picked from commit 9ab1ad1433d54fec3e5b04f1edf8ca436e1f7af1)
    (cherry picked from commit a6d985bbca57b5027eecaa43071964b14d9075d9)
