L3 HA: Unstable behavior

Bug #1563298 reported by Ann Taraday
This bug affects 2 people
Affects             Status                      Importance  Assigned to   Milestone
Mirantis OpenStack  (status tracked in 10.0.x)
  10.0.x            Fix Committed               Medium      Ann Taraday
  9.x               Won't Fix                   Medium      Ann Taraday

Bug Description

During manual testing of L3 HA at scale (env-11: 3 controllers, 46 computes, VxLAN), unstable router rescheduling behavior was found. As a result, connectivity was established after a longer period of time than expected.

During multiple reboots of (primary and non-primary) controllers (i.e. multiple restarts of the L3 agents) with a large number of routers (>=175), after the rebooted node recovers (the banned L3 agent is started again), multiple agents try to become active (master):

http://paste.openstack.org/show/491593/
http://paste.openstack.org/show/491598/

After around 5 minutes only one agent remains "active" and connectivity is established normally.

The issue is intermittent: it was reproduced with 175 routers but not with 200 routers (while rebooting controllers), and then it was reproduced with 200 routers but not with 250 routers (ban/clear of the L3 agent).
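
For reference, one way to watch the convergence described above is to poll the ha_state that each L3 agent reports for a given router. The snippet below is a minimal sketch, assuming python-neutronclient and keystoneauth1 are available and admin credentials are exported as the usual OS_* environment variables; ROUTER_ID is a hypothetical placeholder. During the unstable window more than one agent may report "active":

    # Minimal sketch: poll the ha_state of the L3 agents hosting one HA router.
    # Assumes admin credentials in OS_* environment variables; ROUTER_ID is a
    # hypothetical placeholder, not a value from this bug report.
    import os
    import time

    from keystoneauth1 import loading, session
    from neutronclient.v2_0 import client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url=os.environ['OS_AUTH_URL'],
        username=os.environ['OS_USERNAME'],
        password=os.environ['OS_PASSWORD'],
        project_name=os.environ['OS_PROJECT_NAME'],
        user_domain_name=os.environ.get('OS_USER_DOMAIN_NAME', 'Default'),
        project_domain_name=os.environ.get('OS_PROJECT_DOMAIN_NAME', 'Default'))
    neutron = client.Client(session=session.Session(auth=auth))

    ROUTER_ID = '<router-uuid>'

    while True:
        agents = neutron.list_l3_agent_hosting_routers(ROUTER_ID)['agents']
        states = [(a['host'], a.get('ha_state')) for a in agents]
        print(states)
        # A healthy HA router should have exactly one agent in "active" state.
        if sum(1 for _, s in states if s == 'active') == 1:
            break
        time.sleep(10)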

Changed in mos:
importance: Undecided → Medium
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Changed in mos:
status: New → Confirmed
status: Confirmed → New
Changed in mos:
status: New → Confirmed
Revision history for this message
Ann Taraday (akamyshnikova) wrote :

On the Neutron labs I was not able to reproduce this problem. I will continue the investigation on the scale lab.

Changed in mos:
status: Confirmed → Won't Fix
Revision history for this message
Dina Belova (dbelova) wrote :

Added the move-to-10.0 tag because the bug was transferred from 9.0 to 10.0.

tags: added: move-to-10.0
Revision history for this message
John Schwarz (jschwarz) wrote :

Do we have any server/agent logs as an example of this error? Were there any exceptions noticeable when encountering this?

Revision history for this message
Assaf Muller (amuller) wrote :

@Ann, do you have https://review.openstack.org/#/c/162260/ applied in the environments you're testing? The issue you're seeing *might* be related to the bug described in https://bugs.launchpad.net/neutron/+bug/1525901 and is mitigated by the patch I linked.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

@Assaf, thanks for pointing out this patch and the bug! Yes, it seems that it was missing at the time and was synced later. I will recheck this; however, on a virtual env with 5 nodes running Mitaka I once saw something similar, but it is really hard to reproduce :(

Neutron server logs from that moment: http://paste.openstack.org/show/495528/

tags: added: 10.0-reviewed
Revision history for this message
Ann Taraday (akamyshnikova) wrote :

Scale testing on 9.0 helped to find the root cause of this issue. It is described in the upstream bug https://bugs.launchpad.net/neutron/+bug/1597461.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

The bug https://bugs.launchpad.net/neutron/+bug/1597461 is fixed upstream, but the fix is not backportable, so it is set to Won't Fix for 9.x.

Changed in mos:
status: In Progress → Won't Fix
Revision history for this message
Alexander Ignatov (aignatov) wrote :