L3 HA: multiple agents are active at the same time

Bug #1731595 reported by Xav Paice on 2017-11-11
This bug affects 4 people
Affects                                           Importance  Assigned to
Ubuntu Cloud Archive (status tracked in Queens)
  Mitaka                                          High        Unassigned
  Newton                                          High        Unassigned
  Ocata                                           High        Corey Bryant
  Pike                                            High        Corey Bryant
  Queens                                          High        Corey Bryant
neutron                                           High        venkata anil
neutron (Ubuntu) (status tracked in Bionic)
  Xenial                                          High        Unassigned
  Zesty                                           High        Corey Bryant
  Artful                                          High        Corey Bryant
  Bionic                                          High        Corey Bryant

Bug Description

OS: Xenial, Ocata from Ubuntu Cloud Archive
We have three neutron-gateway hosts, with L3 HA enabled and a min of 2, max of 3. There are approx. 400 routers defined.

At some point (we weren't monitoring exactly) a number of the routers changed from having one active agent and one or more standby agents to having more than one active agent. Each of the 'active' namespaces had the same IP addresses allocated, which caused traffic problems reaching instances.

Removing the routers from all but one agent, and re-adding, resolved the issue. Restarting one l3 agent also appeared to resolve the issue, but very slowly, to the point where we needed the system alive again faster and reverted to removing/re-adding.

At the same time, a number of routers were listed without any agents active at all. This situation appears to have been resolved by adding routers to agents, after several minutes downtime.

I'm finding it very difficult to find relevant keepalived messages to indicate what's going on, but what I do notice is that all the agents have equal priority and are configured as 'backup'.

I am trying to figure out a way to reproduce this; it might be that we need a large number of routers configured on a small number of gateways.
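
One way to spot the symptom described above is to group each router's per-agent HA state and flag routers that have more than one active agent, or none at all. The sketch below is only illustrative: the `ha_states` mapping, the router ids and the host names are hypothetical, and the function does not query Neutron. In practice the states would have to be collected separately, for example from the `ha_state` column of `neutron l3-agent-list-hosting-router`.

```python
from typing import Dict, List, Tuple


def classify_ha_routers(
    ha_states: Dict[str, Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Return (routers with >1 active agent, routers with no active agent).

    ha_states maps router id -> {agent host: "active" or "standby"}.
    This is a hypothetical input shape, not a Neutron API structure.
    """
    multi_active: List[str] = []
    none_active: List[str] = []
    for router_id, per_agent in sorted(ha_states.items()):
        active = [host for host, state in per_agent.items() if state == "active"]
        if len(active) > 1:
            multi_active.append(router_id)
        elif not active:
            none_active.append(router_id)
    return multi_active, none_active


# Hypothetical sample: three gateway hosts, three routers.
sample = {
    "router-a": {"gw1": "active", "gw2": "standby", "gw3": "standby"},
    "router-b": {"gw1": "active", "gw2": "active", "gw3": "standby"},
    "router-c": {"gw1": "standby", "gw2": "standby", "gw3": "standby"},
}
multi, missing = classify_ha_routers(sample)
print("multiple active agents:", multi)
print("no active agent:", missing)
```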

Xav Paice (xavpaice) wrote :

See https://bugs.launchpad.net/neutron/+bug/1597461 which could be related, but we're running 10.0.3-0ubuntu1~cloud0.

Keepalived is 1.2.19-1ubuntu0.2

tags: added: l3-ha
Brian Haley (brian-haley) wrote :

Do you see any failures in the keepalived logs? Something like "Netlink: Received message overrun (No buffer space available)" ?

I've seen another report of this, and looking through the keepalived bugs/changes it seems there was a fix for that, then a bigger change in 1.3.6 titled "Add notify FIFO":

https://github.com/acassen/keepalived/commit/04905cdcb7d2b2fe4aaee9eabdf7f6945726f3c4

https://github.com/acassen/keepalived/issues/584
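
The overrun message quoted above can be searched for mechanically. The following is a minimal sketch that filters log lines for that message; the sample lines and their prefixes are made up for illustration, and only the message text itself comes from the comment above.

```python
import re
from typing import Iterable, List

# Message text quoted in the comment above; the log prefix is not assumed.
OVERRUN = re.compile(r"Netlink: Received message overrun")


def find_netlink_overruns(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain the Netlink overrun message."""
    return [line.rstrip("\n") for line in lines if OVERRUN.search(line)]


# Hypothetical log excerpt; a real run would iterate over a syslog file.
sample_log = [
    "host Keepalived_vrrp[123]: Netlink: Received message overrun (No buffer space available)\n",
    "host Keepalived_vrrp[123]: some unrelated message\n",
]
for hit in find_netlink_overruns(sample_log):
    print(hit)
```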

venkata anil (anil-venkata) wrote :

https://bugs.launchpad.net/neutron/+bug/1597461 is not related; it fixes the l3 agent restart scenario. It looks like they are seeing multiple masters without restarting the agent.

Brian Haley (brian-haley) wrote :

So I have heard of someone trying with keepalived version 1.3.9 and still seeing this failure, so that "Add notify FIFO" change wasn't the silver bullet it seemed like.

Xav Paice (xavpaice) wrote :

In answer to "Do you see any failures in the keepalived logs?", no, unfortunately no indication of the reason for switching to master, just that it did. Same for syslog.

Brian Haley (brian-haley) wrote :

Thanks for the info.

I realize it's hard to reproduce, but if you know a time when it happened and could attach logs from the neutron server, l3-agent, and keepalived from around that timeframe, it might help to narrow down what the possible problem is.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
venkata anil (anil-venkata) wrote :

In https://review.openstack.org/#/c/470905/4/neutron/api/rpc/handlers/l3_rpc.py we wanted to set the status of all HA network ports (of an l3 agent) to DOWN when that l3 agent is restarted. But we assumed that fetch_and_sync_all_routers (which invokes get_router_ids [1]) is called only once, during l3 agent restart.

In our customer setup, we have sometimes seen the l3 agent unable to report state (because it is busy setting HA network port statuses to DOWN and handling the corresponding router update notifications), which results in the l3 agent state being set to AGENT_REVIVED.
When the agent state is AGENT_REVIVED, the HA network port statuses are set to DOWN again, resulting in
1) the ovs agent rebinding the ports
2) the l3 agent receiving multiple router updates
As the server, ovs agent and l3 agent are busy with this unnecessary processing and these RPC calls, the l3 agent again fails to report state (AGENT_REVIVED state again) and to run periodic syncs.

To fix this, we need to make sure _update_ha_network_port_status() is called only when the l3 agent is restarted.

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L593
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L743
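
The feedback loop described above can be illustrated with a toy model. This is not neutron code: `ToyL3Agent`, `simulate` and the `reset_on_every_sync` flag are hypothetical names invented for the sketch. It only counts how often the HA ports would be reset to DOWN when the reset is tied to every full sync (which also runs on AGENT_REVIVED) versus only to agent start.

```python
class ToyL3Agent:
    """Toy model only; it mirrors the discussion, not the real neutron classes."""

    def __init__(self, reset_on_every_sync: bool) -> None:
        self.reset_on_every_sync = reset_on_every_sync
        self.port_resets = 0

    def _reset_ha_ports(self) -> None:
        # Stands in for setting all HA network ports of this agent to DOWN.
        self.port_resets += 1

    def start(self) -> None:
        if not self.reset_on_every_sync:
            self._reset_ha_ports()  # proposed behaviour: once, at agent start
        self.full_sync()

    def full_sync(self) -> None:
        # Stands in for fetch_and_sync_all_routers, which also runs
        # whenever the server marks the agent AGENT_REVIVED.
        if self.reset_on_every_sync:
            self._reset_ha_ports()  # problematic behaviour: on every full sync


def simulate(reset_on_every_sync: bool, revivals: int) -> int:
    """Start the agent, then replay `revivals` AGENT_REVIVED full syncs."""
    agent = ToyL3Agent(reset_on_every_sync)
    agent.start()
    for _ in range(revivals):
        agent.full_sync()
    return agent.port_resets


print("resets when tied to every sync:", simulate(True, revivals=5))
print("resets when tied to agent start:", simulate(False, revivals=5))
```

In the model, each extra reset is what would trigger another round of port rebinding and router updates, so keeping the count at one per agent start is the point of the proposed change.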

Fix proposed to branch: master
Review: https://review.openstack.org/522641

Changed in neutron:
status: Confirmed → In Progress
Ryan Beisner (1chb1n) on 2017-11-27
Changed in cloud-archive:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Corey Bryant (corey.bryant)
Changed in neutron (Ubuntu):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Corey Bryant (corey.bryant)
John George (jog) wrote :

This bug falls under the Canonical Cloud Engineering service-level agreement (SLA) process, as a field critical bug.

Changed in neutron (Ubuntu Artful):
status: New → Triaged
Changed in neutron (Ubuntu Zesty):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Corey Bryant (corey.bryant)
Changed in neutron (Ubuntu Artful):
assignee: nobody → Corey Bryant (corey.bryant)
importance: Undecided → High
Ryan Beisner (1chb1n) wrote :

@jog ack, confirmed. We're tracking it as such.

Hello Xav, or anyone else affected,

Accepted neutron into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:11.0.2-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Artful):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-artful

Reviewed: https://review.openstack.org/522641
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9ed693228f90251c0f03fb842ef19628b439f9bc
Submitter: Zuul
Branch: master

commit 9ed693228f90251c0f03fb842ef19628b439f9bc
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595

Changed in neutron:
status: In Progress → Fix Released

Hello Xav, or anyone else affected,

Accepted neutron into pike-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:pike-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-pike-needed to verification-pike-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-pike-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Bionic):
status: Triaged → Fix Released
tags: added: verification-pike-needed

Reviewed: https://review.openstack.org/522784
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f6560d14b6125906048b74c65f1f974b31206df3
Submitter: Zuul
Branch: stable/pike

commit f6560d14b6125906048b74c65f1f974b31206df3
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
    (cherry picked from commit 9ed693228f90251c0f03fb842ef19628b439f9bc)

Hello Xav, or anyone else affected,

Accepted neutron into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed

Reviewed: https://review.openstack.org/522792
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=385ac553e33f12c34e8a23459337b2f0af0b75eb
Submitter: Zuul
Branch: stable/ocata

commit 385ac553e33f12c34e8a23459337b2f0af0b75eb
Author: venkata anil <email address hidden>
Date: Thu Nov 23 18:40:30 2017 +0000

    Call update_all_ha_network_port_statuses on agent start

    As explained in bug [1] when l3 agent fails to report state to the
    server, its state is set to AGENT_REVIVED, triggering
    fetch_and_sync_all_routers, which will set all its HA network ports
    to DOWN, resulting in
    1) ovs agent rewiring these ports and setting status to ACTIVE
    2) when these ports are active, server sends router update to l3 agent
    As server, ovs and l3 agents are busy with this processing, l3 agent
    may fail again reporting state, repeating this process.

    As l3 agent is repeatedly processing same routers, SIGHUPs are
    frequently sent to keepalived, resulting in multiple masters.

    To fix this, we call update_all_ha_network_port_statuses in l3 agent
    start instead of calling from fetch_and_sync_all_routers.

    [1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7
    Conflicts:
            neutron/agent/l3/agent.py
            neutron/api/rpc/handlers/l3_rpc.py

    Note: This RPC update_all_ha_network_port_statuses is added in only pike
    and later branches. In older branches, we were using get_router_ids RPC
    to invoke _update_ha_network_port_status. As we need to invoke this
    functionality during l3 agent start and get_service_plugin_list() is the
    only available RPC which is called during l3 agent start, we call
    _update_ha_network_port_status from get_service_plugin_list.

    Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
    Related-bug: #1597461
    Closes-Bug: #1731595
    (cherry picked from commit 9ab1ad1433d54fec3e5b04f1edf8ca436e1f7af1)
    (cherry picked from commit a6d985bbca57b5027eecaa43071964b14d9075d9)

This issue was fixed in the openstack/neutron 12.0.0.0b2 development milestone.

Corey Bryant (corey.bryant) wrote :

SRU details for Ubuntu:

[Impact]
Details of the issue are described thoroughly in this bug report. The fix prevents multiple L3HA masters from existing at the same time, and is already upstream for all affected branches.

[Test Case]
The following SRU process will be followed:
https://wiki.ubuntu.com/OpenStackUpdates

In order to avoid regression of existing consumers, the OpenStack team will run their continuous integration test against the packages that are in -proposed. A successful run of all available tests will be required before the proposed packages can be let into -updates.

The OpenStack team will be in charge of attaching the output summary of the executed tests. The OpenStack team members will not mark ‘verification-done’ until this has happened.

[Regression Potential]
The regression potential is lowered as the fix is cherry-picked without change from corresponding upstream stable branches. In order to mitigate the regression potential, the results of the aforementioned tests are attached to this bug.

[Discussion]

Hello Xav, or anyone else affected,

Accepted neutron into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:10.0.4-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Zesty):
status: Triaged → Fix Committed
tags: added: verification-needed-zesty
Corey Bryant (corey.bryant) wrote :

Hello Xav, or anyone else affected,

Accepted neutron into ocata-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ocata-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ocata-needed to verification-ocata-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ocata-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ocata-needed
Xav Paice (xavpaice) wrote :

Please note, we now have a client affected by this running Mitaka as well.

Corey Bryant (corey.bryant) wrote :

Hi Xav,

I took a look at the code and confirmed that this does appear to affect Newton and Mitaka so I've targeted those releases as well. We'll need to backport the Ocata patches to Newton and then Mitaka.

Corey

Changed in neutron (Ubuntu Xenial):
importance: Undecided → High
status: New → Triaged
Corey Bryant (corey.bryant) wrote :

SRU addendum for mitaka (xenial) and newton.

[Regression Potential]

For mitaka (xenial) and newton, the regression potential is a little higher than that of ocata+ as the patch(es) aren't available on the corresponding upstream branches (they're EOL). These patches were cherry picked from the upstream stable/ocata branch, and required slight modifications in order to apply to mitaka and newton. For mitaka, an additional patch was required as a pre-req (set-ha-network-port-to-down-when-l3-agent-starts.patch). This patch is already in the upstream branches and packages for newton+.

Corey Bryant (corey.bryant) wrote :

I've uploaded new versions of neutron with backported patches to fix this issue to xenial (awaiting SRU team review) and newton-staging.

Corey Bryant (corey.bryant) wrote :

Regression testing has completed successfully for artful, zesty, xenial-pike, and xenial-ocata.

artful-pike-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1551.5835 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 732.9467 sec.

artful-pike-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 1419.3399 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 695.6942 sec.

zesty-ocata-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1665.0215 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 939.2629 sec.

zesty-ocata-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 1744.4931 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 916.1659 sec.

xenial-pike-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1591.8960 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 695.8174 sec.

xenial-pike-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 1609.1086 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 708.7850 sec.

xenial-ocata-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1650.0841 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 858.3489 sec.

xenial-ocata-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 2173.3217 sec.
 - Passed: 93
 - Skipped: 9
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 1031.2947 sec.

Corey Bryant (corey.bryant) wrote :

Hello Xav, or anyone else affected,

Accepted neutron into newton-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:newton-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-newton-needed to verification-newton-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-newton-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-newton-needed
Xav Paice (xavpaice) wrote :

We have installed the Ocata -proposed package, however the situation is this:

- there's 464 routers configured, on 3 Neutron gateway hosts, using l3-ha, and each router is scheduled to all 3 hosts.
- we installed the package because we were in the middle of an incident with multiple l3 agents active, hoping the package update would solve the problem. One of the gateway hosts was being rebooted at the time to also try to do a King Canute and halt the tidal wave of arp.
- We later found that openvswitch had run out of filehandles, see LP: #1737866
- Resolving that allowed ovs to create a ton more filehandles.
- Removing/re-adding the routers to agents seemed to clean things up; we saw some routers with multiple agents active, and some with none active (all 3 agents 'standby').
- After a few iterations of that, things cleaned up.
- 15-20 mins later, we saw more routers with multiple agents active (ones which weren't before), and ran through the same cleanup steps. At this time, there were a large number of keepalived messages in syslog, particularly routers becoming MASTER then BACKUP again. (https://pastebin.canonical.com/205361/)
- after another hour or two, we're still clean.

I can't tell at this stage whether the fix actually fixed the problem or not - I need to dig further to find out if there could have been some process running cleanups.
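
The MASTER/BACKUP flapping mentioned above can be quantified by counting state changes per VRRP instance in syslog. The sketch below assumes the keepalived 1.2.x message shape `VRRP_Instance(NAME) Transition to MASTER STATE` / `Entering BACKUP STATE`; that format, the `VR_n` instance names and the sample lines are assumptions and should be checked against the real logs.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumed keepalived 1.2.x message shape; adjust the pattern to the real logs.
TRANSITION = re.compile(
    r"VRRP_Instance\((?P<name>[^)]+)\) "
    r"(?:Transition to|Entering) (?P<state>MASTER|BACKUP) STATE"
)


def count_state_changes(lines: Iterable[str]) -> Counter:
    """Count MASTER<->BACKUP changes per instance, ignoring the first state seen."""
    changes: Counter = Counter()
    last_state: Dict[str, str] = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match is None:
            continue
        name, state = match.group("name"), match.group("state")
        # "Transition to MASTER" followed by "Entering MASTER" is one change.
        if name in last_state and last_state[name] != state:
            changes[name] += 1
        last_state[name] = state
    return changes


# Hypothetical excerpt for illustration.
sample_syslog = [
    "Keepalived_vrrp[1]: VRRP_Instance(VR_1) Entering BACKUP STATE",
    "Keepalived_vrrp[1]: VRRP_Instance(VR_1) Transition to MASTER STATE",
    "Keepalived_vrrp[1]: VRRP_Instance(VR_1) Entering MASTER STATE",
    "Keepalived_vrrp[1]: VRRP_Instance(VR_1) Entering BACKUP STATE",
    "Keepalived_vrrp[2]: VRRP_Instance(VR_2) Entering BACKUP STATE",
]
for instance, count in count_state_changes(sample_syslog).most_common():
    print(instance, count)
```

Instances with a high count over a short window are the ones worth correlating with l3-agent and neutron-server logs from the same timeframe.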

Corey Bryant (corey.bryant) wrote :

I'm marking this as fix released for artful/pike as it has passed regression testing successfully and it is paired up with a stable point release that has CVE fixes.

Xav, I know you are testing ocata still and it is up in the air still as to whether this has fixed your problem. Please keep us posted on further results.

tags: added: verification-done-artful verification-pike-done
removed: verification-needed-artful verification-pike-needed
Corey Bryant (corey.bryant) wrote :

s/fix released/verified

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:11.0.2-0ubuntu1.1

---------------
neutron (2:11.0.2-0ubuntu1.1) artful; urgency=medium

  * d/gbp.conf: Set debian-branch to stable/pike.
  * New upstream version.
  * New stable point release for OpenStack Pike (LP: #1734990).
  * d/p/call-update_all_ha_network_port_statuses-on-agent-start.patch:
    Cherry-pick from upstream to prevent multiple masters for L3HA
    (LP: #1731595).

 -- Corey Bryant <email address hidden> Tue, 28 Nov 2017 14:55:02 -0500

Changed in neutron (Ubuntu Artful):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:11.0.2-0ubuntu1.1~cloud0
---------------

 neutron (2:11.0.2-0ubuntu1.1~cloud0) xenial-pike; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:11.0.2-0ubuntu1.1) artful; urgency=medium
 .
   * d/gbp.conf: Set debian-branch to stable/pike.
   * New upstream version.
   * New stable point release for OpenStack Pike (LP: #1734990).
   * d/p/call-update_all_ha_network_port_statuses-on-agent-start.patch:
     Cherry-pick from upstream to prevent multiple masters for L3HA
     (LP: #1731595).

Hello Xav, or anyone else affected,

Accepted neutron into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:8.4.0-0ubuntu6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed-xenial
Corey Bryant (corey.bryant) wrote :

Hello Xav, or anyone else affected,

Accepted neutron into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
Corey Bryant (corey.bryant) wrote :

Hi Xav, do you have any more feedback on ocata-proposed testing?

tags: added: neutron-proactive-backport-potential
James Hebden (jhebden) wrote :

Hi Corey,

Unfortunately, in the case of the cloud where we are seeing this behaviour, the updated package which Xav installed per his previous comment does not seem to have addressed the issue. This was neutron 10.0.4-0ubuntu1~cloud0 from Cloud Archive xenial-updates/ocata.

I did notice that the packages being released for other Ubuntu releases appear to be a newer version, 2:11.0.2-0ubuntu1.1 - is this intended?

As an update, the workaround in place for this particular issue has been to disable L3HA on individual routers as we detect this issue. We have this particular cloud down to a number of routers where things seem relatively stable, now that we are closer to the 400 L3HA router mark.

Let me know if you need further information or testing performed.

Akash (taloleakash) wrote :

Hi,

The same issue occurs on OpenStack Pike with openstack-neutron-11.0.3 on CentOS 7.
Is there any solution or patch?

Corey Bryant (corey.bryant) wrote :

Hi jhebden, thanks for the feedback. Yes, Pike has newer package versions. Once a release is GA, we stay at the major version (i.e. 10 in the case of Ocata, 10.0.4) so as not to introduce any new features to a stable release.

Corey Bryant (corey.bryant) wrote :

It seems as if this bug surfaces due to load issues. While the fix provided by Venkata (https://review.openstack.org/#/c/522641/) should help clean things up at the time of l3 agent restart, issues seem to come back later down the line in some circumstances. xavpaice mentioned he saw multiple routers active at the same time when they had 464 routers configured on 3 neutron gateway hosts using L3HA, and each router was scheduled to all 3 hosts. However, jhebden mentions that things seem stable at the 400 L3HA router mark, and it's worth noting this is the same deployment that xavpaice was referring to.

It seems to me that something is being pushed to its limit, and possibly once that limit is hit, master router advertisements aren't being received, causing a new master to be elected. If this is the case it would be great to get to the bottom of what resource is getting constrained.
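
The election hypothesis above can be made concrete with the VRRPv2 timers from RFC 3768: a backup declares the master down after 3 x advertisement interval + skew time, where skew time = (256 - priority) / 256 seconds. The sketch below assumes a 2-second advertisement interval and a priority of 50; these are believed to be the neutron defaults but are assumptions here, not values taken from this deployment.

```python
def skew_time(priority: int) -> float:
    """VRRPv2 skew time in seconds (RFC 3768): (256 - priority) / 256."""
    return (256 - priority) / 256.0


def master_down_interval(advert_int: float, priority: int) -> float:
    """Seconds a backup waits without advertisements before taking over."""
    return 3 * advert_int + skew_time(priority)


# Assumed values: 2 s advertisement interval, priority 50 on every agent.
ADVERT_INT = 2
PRIORITY = 50
print("master down interval (s):", master_down_interval(ADVERT_INT, PRIORITY))
```

With those assumed values each backup waits roughly 6.8 seconds. Because all instances share the same priority (as noted in the bug description), they also share the same skew time, so if advertisements are delayed past that window every backup for a router can time out at the same moment.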

venkata anil (anil-venkata) wrote :

As I am unable to reproduce it, I will be happy if someone takes over this issue.

Corey Bryant (corey.bryant) wrote :

I think we need to get a new bug opened for this. As it's been marked fix released upstream it's probably not on anyone's radar.

Corey Bryant (corey.bryant) wrote :

I wasn't able to change the upstream status back to "New" so I've opened a new bug to track this at https://bugs.launchpad.net/ubuntu/artful/+source/neutron/+bug/1744062.
