neutron-l3-agent restart: some random HA routers get wrong state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Undecided | Unassigned |
Bug Description
For a couple of weeks we have had a problem in our production environment when restarting our l3-agent. (Our assumption is that this might have something to do with our upgrade to Wallaby, as we never saw this problem on prior releases.)
The l3 agent hosts around 300 HA routers, so restarting it takes a couple of seconds, during which its alive state goes down and all active routers hosted on that agent flip to standby. When the agent finishes its startup it should set the correct active state for its routers again, but it fails to do so for a random subset of them. It does not log any exceptions or errors, so we started to debug this in our lab environment, which has at most 10-20 routers.
To reproduce this, we stopped an l3-agent completely until its alive state went down and the routers flipped to standby; after starting the agent again, some routers, just as in production, don't get back to active.
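The before/after comparison in the reproduction above can be scripted. A minimal sketch (the helper and the example data are ours, not part of neutron; on a real deployment the two dicts would be filled from `openstack network agent list --router <id> --long` output before stopping and after restarting the agent):

```python
def find_stuck_routers(before, after):
    """Return router IDs that were active on the agent before the
    restart but did not return to active after it came back up.

    before/after map router_id -> HA state ('active' or 'standby')
    as reported for the restarted agent.
    """
    return sorted(
        r for r, state in before.items()
        if state == "active" and after.get(r) != "active"
    )

# Hypothetical states collected before stopping the agent and again
# a minute after it reports alive.
before = {"r1": "active", "r2": "active", "r3": "standby"}
after = {"r1": "active", "r2": "standby", "r3": "standby"}
print(find_stuck_routers(before, after))  # ['r2']
```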
We dug quite deep into the code, and what we see for the routers that are not functioning correctly is that they only get into the _process_
For all the other routers that work, we see that they first hit [1] and then, a couple of seconds later, go into [2], which sets the correct state again.
What is quite confusing is that a different set of routers is affected on each stop/start sequence of the l3-agent, and restarting the agent sometimes fixes this and sometimes it does not.
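That per-restart randomness is the classic signature of a race between the agent's initial full sync and asynchronous state-change notifications. A toy, self-contained sketch of that pattern (all names invented, no neutron code): a notification that fires before its router has been registered is silently dropped, and which routers lose their notification varies from run to run.

```python
import random
import threading
import time

class ToyAgent:
    """Toy model: a state-change event that arrives before its router
    is registered is dropped, so the router stays in its initial state."""

    def __init__(self):
        self.registered = set()
        self.state = {}

    def register(self, router_id):
        time.sleep(random.uniform(0, 0.002))  # variable startup work
        self.registered.add(router_id)
        self.state.setdefault(router_id, "standby")

    def on_state_change(self, router_id, new_state):
        if router_id in self.registered:
            self.state[router_id] = new_state
        # else: event silently dropped -- the race

def restart_cycle(routers):
    agent = ToyAgent()
    # keepalived-style notifications race with router registration
    notifier = threading.Thread(
        target=lambda: [agent.on_state_change(r, "active") for r in routers])
    notifier.start()
    for r in routers:
        agent.register(r)
    notifier.join()
    return [r for r in routers if agent.state.get(r) != "active"]

stuck = restart_cycle([f"r{i}" for i in range(10)])
print("stuck in standby:", stuck)  # varies run to run -- nondeterministic
```

The point of the toy is only to show why the affected set differs on every restart: the outcome depends on thread timing, not on any property of the routers themselves.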
At this point we are not really sure how to debug this further, as we are not very experienced with how and where update events come from.
Does anyone have an idea where this could be broken, or can you point us in any direction to debug this further?
Neutron is running on Wallaby (18.5.0).
Thanks in advance
[1] https:/
[2] https:/
tags: added: l3-ha
I think I found the commit that is the culprit of this issue.
When I revert patch [1] in my lab environment, all states get set correctly again after each stop/start sequence. So it seems [1] introduced some kind of race condition, maybe?
Does anyone have an idea where the problem with this patch comes from?
[1] https://github.com/openstack/neutron/commit/0d8ae1576724b196580c9af3782fd3c9e9072bef
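Whatever the exact mechanism in [1] turns out to be, the usual fix for this class of race is to buffer state-change events that arrive while a router is not yet registered and replay them at registration time, so nothing is lost during the restart/full-sync window. A sketch of that pattern (invented names, not neutron's actual implementation):

```python
from collections import defaultdict

class BufferingAgent:
    """Queue state-change events for routers that are not yet
    registered and replay them once registration completes, so no
    notification is lost during a restart/full-sync window."""

    def __init__(self):
        self.registered = set()
        self.state = {}
        self.pending = defaultdict(list)

    def on_state_change(self, router_id, new_state):
        if router_id in self.registered:
            self.state[router_id] = new_state
        else:
            self.pending[router_id].append(new_state)  # buffer, don't drop

    def register(self, router_id):
        self.registered.add(router_id)
        self.state.setdefault(router_id, "standby")
        for s in self.pending.pop(router_id, []):  # replay in arrival order
            self.state[router_id] = s

agent = BufferingAgent()
agent.on_state_change("r1", "active")   # arrives before registration
agent.register("r1")                    # buffered event replayed here
print(agent.state["r1"])  # active
```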