l3ha router delete race condition if state changes to master
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Medium | Edward Hope-Morley |
Bug Description
We have hit an issue whereby nodes running Neutron Ussuri with ML2 L3HA occasionally run into the following:
On a 3-node neutron-gateway with approximately 280 HA routers, the l3-agent sometimes gets into a state where it repeatedly tries to spawn a metadata proxy (haproxy) for a router that no longer exists, and fails because the namespace is no longer there. This happens thousands of times a day and effectively blocks the l3-agent from processing other updates. The error looks like:
2022-05-21 06:26:12.882 30127 DEBUG neutron.
conf', 'ip', 'netns', 'exec', 'qrouter-
-4a1b-9393-
2022-05-21 06:26:13.116 30127 ERROR neutron.
Some background: when the l3-agent starts, it subscribes callback methods to certain events. One of those events is "before_delete" [1] and the method is neutron.
A successful callback execution looks like:
2022-05-21 03:05:54.676 30127 DEBUG neutron_
2022-05-21 03:05:54.676 30127 DEBUG neutron.
And an unsuccessful one looks like:
2022-05-10 23:36:10.480 30127 INFO neutron.agent.l3.ha [-] Router 57837a95-
...
2022-05-10 23:36:10.646 30127 DEBUG neutron_
2022-05-10 23:36:10.853 30127 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 57837a95-
The difference is that instead of killing the proxy monitor it actually spawns a new one! The other thing to notice is that in the second case it isn't servicing the same request (the log context is "[-]"). I looked at the code and found that this is because when the router transitions to master, the l3-agent puts the thread to eventlet.sleep(2) [2] and then proceeds with the update, i.e. spawning the metadata proxy; while that thread was asleep, the agent started to process the before_delete callback, but that callback was then pre-empted.
So this looks like a straightforward race condition that occurs when a router transitions to master at the same time as it is being deleted.
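The interleaving can be reproduced with a small self-contained sketch. This uses stdlib threads and short sleeps in place of eventlet greenthreads and eventlet.sleep(2), and all identifiers are hypothetical stand-ins, not neutron code:

```python
import threading
import time

# Hypothetical stand-ins for l3-agent state (not neutron code).
namespaces = {"qrouter-a"}              # router namespace exists initially
proxy_running = {}                      # router namespace -> proxy state
events = []                             # records the interleaving

def on_state_change(ns):
    """Router transitioned to master: sleep, then spawn the metadata proxy."""
    events.append("master transition: sleeping before spawning proxy")
    time.sleep(0.2)                     # stands in for eventlet.sleep(2)
    if ns in namespaces:
        proxy_running[ns] = True
        events.append("spawned proxy")
    else:
        # The bug: the namespace was deleted while this thread slept,
        # so the spawn fails, and the agent keeps retrying it.
        events.append("spawn FAILED: namespace gone")

def on_before_delete(ns):
    """Router is being deleted: kill the proxy and remove the namespace."""
    proxy_running.pop(ns, None)
    namespaces.discard(ns)
    events.append("before_delete: namespace removed")

t1 = threading.Thread(target=on_state_change, args=("qrouter-a",))
t2 = threading.Thread(target=on_before_delete, args=("qrouter-a",))
t1.start()
time.sleep(0.05)                        # the delete arrives during the sleep
t2.start()
t1.join(); t2.join()
print(events)
```

Run in this order, the delete wins the race and the master-transition thread wakes up to find the namespace gone, mirroring the failure mode in the logs above.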
A simple interim workaround is to manually create the non-existent namespace, which allows the respawn to succeed; the callback should then get a chance to run and delete it again, cleaning up. Alternatively, restart the neutron-l3-agent service.
[1] https:/
[2] https:/
tags: added: l3-ha
Changed in neutron:
assignee: nobody → Edward Hope-Morley (hopem)
Thanks for the detailed report. Now let me ask:
So, you are experiencing a situation where, thousands of times a day, a router transitions to master and is deleted at the same time? Is this a test / CI environment where you are creating this situation?
In other words, in what use case are you seeing these transition / deletion situations?