neutron-l3-agent restart: some random HA routers get the wrong state

Bug #2009043 reported by Maximilian Stinsky
Affects: neutron
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

For a couple of weeks we have had a problem in our production environment when restarting our l3-agent. (Our assumption is that this might have something to do with our upgrade to Wallaby, as we never saw this problem on prior releases.)

The l3-agent is hosting around 300 HA routers, so restarting it takes a couple of seconds, during which its alive state goes down and all active routers hosted on that agent flip to the standby state. Once the agent has finished its startup it should set the correct active state for its routers again, but it fails to do so for a random subset of routers. It does not log any exceptions or errors, so we started debugging the problem in our lab environment, which has at most 10-20 routers.

To reproduce this, we stopped an l3-agent completely until its alive state went down and its routers flipped to standby. After starting the agent again, some routers, just as in production, do not get back into the active state.
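For reference, here is a minimal sketch of how the per-agent HA state of a router can be polled while reproducing this, assuming python-neutronclient and keystoneauth1 are available; the auth values and router ID are placeholders:

    # Sketch (assumption): poll which l3 agents host a router and which HA
    # state each one reports while stopping/starting the agent.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from neutronclient.v2_0 import client

    auth = v3.Password(
        auth_url="http://keystone:5000/v3",  # placeholder
        username="admin",
        password="secret",
        project_name="admin",
        user_domain_name="Default",
        project_domain_name="Default",
    )
    neutron = client.Client(session=session.Session(auth=auth))

    router_id = "ROUTER_UUID"  # placeholder
    hosting = neutron.list_l3_agent_hosting_routers(router_id)
    for agent in hosting.get("agents", []):
        # ha_state should read "active" or "standby" for HA routers
        print(agent.get("host"), agent.get("alive"), agent.get("ha_state"))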

We dug quite deep into the code, and what we see for the routers that are not functioning correctly is that they only reach the _process_added_router function [1] and never reach the _process_updated_router function [2].

For all the other routers, which work, we see that they first hit [1] and then, a couple of seconds later, go into [2], which sets the correct state again.
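For context, the two linked lines sit in the agent's per-router processing path. Paraphrased (not a verbatim copy of the 18.5.0 source), the dispatch looks roughly like this:

    # Paraphrased sketch of the dispatch around [1] and [2] in
    # neutron/agent/l3/agent.py; not a verbatim copy of 18.5.0.
    def _process_router_if_compatible(self, router):
        # ... compatibility checks elided ...
        if router['id'] not in self.router_info:
            # router not yet known to the agent, e.g. right after a restart -> [1]
            self._process_added_router(router)
        else:
            # a later update for an already-known router -> [2]
            self._process_updated_router(router)

So for the broken routers, the later update that would send them through [2] apparently never reaches the agent.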

What is quite confusing is that it happens to different routers on each stop/start sequence of the l3-agent, and restarting the agent sometimes fixes it and sometimes does not.

At this point we are not sure how to debug this further, as we do not have much experience with how and where router update events originate.
Does anyone have an idea where this could be broken, or can anyone point us in a direction for debugging this further?

Neutron is running on Wallaby (18.5.0).

Thanks in advance

[1] https://github.com/openstack/neutron/blob/18.5.0/neutron/agent/l3/agent.py#L631
[2] https://github.com/openstack/neutron/blob/18.5.0/neutron/agent/l3/agent.py#L633

Tags: l3-ha
Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

I think I found the commit that is the culprit of this issue.

When I revert patch [1] in my lab environment, all states get set correctly again after each stop/start sequence. So it seems [1] may have introduced some kind of race condition?
Does anyone have an idea where the problem with this patch comes from?

[1] https://github.com/openstack/neutron/commit/0d8ae1576724b196580c9af3782fd3c9e9072bef

Revision history for this message
Lajos Katona (lajos-katona) wrote :

Hi, thanks for debugging this issue further. The original bug, which was solved by the patch you found (https://review.opendev.org/c/openstack/neutron/+/776423), is https://bugs.launchpad.net/neutron/+bug/1916022.

I will ask Slawek about it; perhaps he has an idea of what the next step should be.

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

@Lajos, did you by any chance get any feedback from Slawek about what the problem might be here?

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Sorry, but I haven't had time to investigate this. What I would like to ask you first is the version of oslo_log, as we have recently seen an issue with HA routers in our CI that was caused by https://github.com/openstack/oslo.log/commit/94b9dc32ec1f52a582adbd97fe2847f7c87d6c17, which is in oslo_log 5.3.0.
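For anyone who wants to check, the installed oslo.log version can be printed from inside the l3-agent's environment with the standard library alone (assumes Python 3.8+):

    # Minimal sketch: print the oslo.log version installed in the
    # environment (venv or container) that the l3-agent runs in.
    from importlib.metadata import version

    print(version("oslo.log"))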

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

Hi Slawek,

FYI, in the meantime we upgraded neutron to Xena (19.6.0).
We just tested whether we are still hitting this issue and can confirm that it is still present in the Xena release.

Regarding your question about oslo_log: we build kolla containers for our installation, and they install oslo_log version 4.6.0 for the Xena release.

We will most likely upgrade neutron in our test environment to Yoga in the next couple of weeks, so we will be able to see whether we still hit the same issue with Yoga, and we will report back. My assumption is that this will most likely still be a problem.
