l3ha router delete race condition if state changes to master

Bug #1979089 reported by Edward Hope-Morley
Affects: neutron
Status: New
Importance: Medium
Assigned to: Edward Hope-Morley

Bug Description

We have hit an issue whereby nodes running Neutron Ussuri with ML2 L3HA are occasionally running into the following:

On a 3 node neutron-gateway with approx 280 HA routers, the l3-agent sometimes gets into a state where it is repeatedly trying to spawn a metadata proxy (haproxy) for a router that no longer exists and fails because the namespace is no longer there. This happens thousands of times a day and basically blocks the l3-agent from processing other updates. The error looks like:

2022-05-21 06:26:12.882 30127 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3', 'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/57837a95-ed3b-4a1b-9393-1374a8c744c3.conf'] create_process /usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88
2022-05-21 06:26:13.116 30127 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot open network namespace "qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3": No such file or directory

Some background: when the l3-agent starts it subscribes callback methods to certain events. One of those events is "before_delete" [1] and the method is neutron.agent.metadata.driver.before_router_removed. The idea is that when an update to delete a router is received, this method is executed first and removes the "neutron metadata-proxy monitor", which watches the haproxy pid and respawns haproxy if it dies.
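
For reference, the wiring looks roughly like the sketch below. This is a paraphrase rather than verbatim neutron code (the exact callback signature may differ between releases) and the callback body is elided; see [1] for the real implementation.

from neutron_lib.callbacks import events, registry, resources


def before_router_removed(resource, event, l3_agent, payload=None):
    """Stop the metadata-proxy monitor and kill haproxy for the router
    being deleted, so nothing respawns it once the namespace is gone."""
    # (Body elided; see [1] for the real implementation.)


registry.subscribe(before_router_removed,
                   resources.ROUTER,
                   events.BEFORE_DELETE)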

A successful callback execution looks like:

2022-05-21 03:05:54.676 30127 DEBUG neutron_lib.callbacks.manager [req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - - -] Notify callbacks ['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] for router, before_delete _notify_loop /usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-21 03:05:54.676 30127 DEBUG neutron.agent.linux.utils [req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - - -] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-15', '26363'] create_process /usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88

And an unsuccessful one looks like:

2022-05-10 23:36:10.480 30127 INFO neutron.agent.l3.ha [-] Router 57837a95-ed3b-4a1b-9393-1374a8c744c3 transitioned to master on agent sgdemr0114bp007
...
2022-05-10 23:36:10.646 30127 DEBUG neutron_lib.callbacks.manager [req-6bfaa057-0ab9-450c-b27f-d4008fd7f9f1 - a87539ab4d2e4423b28ae6634e0d9c25 - - -] Notify callbacks ['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] for router, before_delete _notify_loop /usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-10 23:36:10.853 30127 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 57837a95-ed3b-4a1b-9393-1374a8c744c3 _update_metadata_proxy /usr/lib/python3/dist-packages/neutron/agent/l3/ha.py:219

The difference is that instead of killing the proxy monitor it is actually spawning a new one! The other thing to notice is that in the second case it isn't servicing the same request (it has "[-]"). I looked at the code and found that this is because when the router transitions to master the l3-agent puts that thread to sleep with eventlet.sleep(2) [2] before proceeding with the update, i.e. spawning the metadata proxy; while it was asleep the agent started to process the before_delete callback, which then got pre-empted.

So this looks like a simple race condition that occurs if a router transitions to master at the same time as it is being deleted.
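
To make the interleaving concrete, here is a toy sketch of the race using eventlet the same way the l3-agent does. The names, timings and print statements are illustrative only, not neutron code:

import eventlet

namespace_exists = True


def handle_master_transition():
    # The agent sleeps ~2s after a vrrp transition before acting on it [2].
    eventlet.sleep(2)
    # By now the delete path may already have removed the namespace.
    if namespace_exists:
        print("spawn metadata proxy: OK")
    else:
        print("spawn metadata proxy: FAILS, namespace is gone "
              "(and the process monitor will keep retrying)")


def handle_router_delete():
    global namespace_exists
    # The before_delete callback starts here, but the transition thread
    # wakes up and acts before the proxy monitor has been torn down.
    namespace_exists = False
    print("router deleted, namespace removed")


t1 = eventlet.spawn(handle_master_transition)
t2 = eventlet.spawn(handle_router_delete)
t1.wait()
t2.wait()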

A simple interim workaround is to manually create the missing namespace, which allows the respawn to succeed; the callback should then eventually get to run and delete it again to clean up. Alternatively, restart your neutron-l3-agent service.

[1] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/metadata/driver.py#L186
[2] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/l3/ha.py#L149

Tags: l3-ha
Miguel Lavalle (minsel) wrote (last edit ):

Thanks for the detailed report. Now let me ask:

So, are you experiencing a situation where, thousands of times a day, a router transitions to master and is deleted at the same time? Is this a test / CI environment where you are creating this situation?

In other words, in what use case are you finding these transition / deletion situations?

Changed in neutron:
importance: Undecided → Medium
status: New → Incomplete
Miguel Lavalle (minsel) wrote :

Marking as incomplete for now while we gather more info

Miguel Lavalle (minsel) wrote :

Listed this bug as requiring further follow up in the weekly bugs report sent to the mailing list

Edward Hope-Morley (hopem) wrote :

@minsel no, that's not quite right. What I have is a production env where there are circa 280 HA routers at any one time. Sometimes routers get deleted (at least once a day, let's say) and, due to this bug, if the vrrp transitions to master while the router is being deleted, you may hit this issue. The "thousands a day" bit refers to the number of times the l3-agent unsuccessfully tries to spawn a new haproxy for the deleted router, i.e. it keeps doing it over and over because the spawn fails since the namespace is no longer there, then the metadata monitor notices it failed and tries again (and it never gives up). Hope that makes sense.
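
To illustrate the "never gives up" part: the respawn is driven by the agent's external process monitor, whose behaviour boils down to something like the toy loop below (a paraphrase of what neutron.agent.linux.external_process does, not the actual code):

import time


def monitor_loop(process_is_active, respawn, interval=60):
    # Periodically check the monitored process and respawn it whenever it
    # is not running; there is no give-up condition, so once the namespace
    # is gone every respawn attempt fails and the loop just tries again.
    while True:
        if not process_is_active():
            respawn()
        time.sleep(interval)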

Changed in neutron:
status: Incomplete → In Progress
status: In Progress → New
tags: added: l3-ha
Bence Romsics (bence-romsics) wrote :

Hi Edward,

I'm no expert of l3ha, but in the hope of moving this bug forward, I have a suggestion and a question:

1) It's hard to work with a bug without having a reproduction method. With a race condition being the likely root cause, everything is of course dependent on timing and largely based on chance. When I face a bug like this I try the following: after having an idea of where the race is, I may have a hunch where to add a sleep() to intentionally open the race window wide. After adding a sufficiently big sleep() I find I can usually create a practically deterministic reproduction method (see the sketch after this list for the kind of change I mean). Could you maybe have a go at creating a reproduction method like this?

2) Why does the failed spawn repeat infinitely? Is it because of the process monitor? As an idea for a workaround: could the process monitor exit when the failure reason is that the namespace does not exist?
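
For what it's worth, the kind of change suggested in 1) could look something like the sketch below. Purely illustrative: "handle_transition" is a stand-in name, and the real sleep lives in the transition handling around neutron/agent/l3/ha.py:149 ([2] in the description).

import eventlet


def handle_transition(router_id, state):
    # The stock code sleeps 2 seconds at this point; bumping it to e.g. 30
    # seconds should open the race window wide enough that deleting the
    # router during the sleep reproduces the failure deterministically.
    eventlet.sleep(30)
    # ...then continue with the update, i.e. (re)spawn the metadata proxy.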

Edward Hope-Morley (hopem) wrote :

Hi @bence-romsics, I believe the logs I pasted and the code referenced show fairly clearly where and why the problem is happening. If the vrrp master transitions during the final stages of deleting a router, the l3-agent's keepalived-state-change monitor enqueues an action to update the api that leads to the sleep() at [1]. If, while that sleep is happening, the router delete proceeds to delete the namespace but the other thread wakes up before the delete gets to remove the metadata-proxy monitor, the metadata monitor realises that the haproxy is gone, tries to spawn a new one and fails because the namespace is gone, and this keeps happening over and over because the monitor never gives up. When I get some time I will look closer, but I believe the solution is to kill the keepalived-state-change monitor earlier in the router delete sequence so that it won't respond to any vrrp transitions that occur while the router is being deleted. Hope that helps.

[1] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/l3/ha.py#L149
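
In other words, something along these lines (names are illustrative, not the actual l3-agent methods; the real teardown lives in the agent's HA router delete path):

def delete_ha_router(router):
    # Stop the keepalived-state-change monitor first, so a vrrp transition
    # that lands mid-delete can no longer enqueue work that respawns the
    # metadata proxy after the namespace is gone.
    router.stop_keepalived_state_change_monitor()
    # ...then proceed with the existing teardown: kill the metadata-proxy
    # monitor and haproxy, stop keepalived, delete the namespace, etc.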

Changed in neutron:
assignee: nobody → Edward Hope-Morley (hopem)
Edward Hope-Morley (hopem) wrote :

This is proving quite a hard issue to reproduce in the lab, but further analysis of the logs seems to indicate this might be related to the router being deleted immediately after creation. Here is the sequence of updates and transitions that occur for the mentioned router. Immediately after the delete update finishes, the neutron.agent.linux.external_process.ProcessMonitor wakes up, notices haproxy is no longer running and attempts to spawn a new one, which then fails because the namespace no longer exists. I still believe this is related to the agent putting the transition thread to sleep just before the delete is processed; it re-awakens afterwards assuming the router is still there.

router update dfe9df14-d58f-4114-a909-dddb0679736b
    type: ADD_UPDATE_ROUTER
    start: 2022-05-10 23:33:09.473
    finish: 2022-05-10 23:33:40.168
router transition:
    state: backup
    enqueue: 2022-05-10 23:33:23.272
    transition: 2022-05-10 23:33:23.272
router update 285660c8-58df-4fb0-9401-13b079eadb
    type: ADD_UPDATE_ROUTER
    start: 2022-05-10 23:33:40.168
    finish: 2022-05-10 23:33:47.745
router update 4c9f263a-51d5-440d-bc44-7a10a2beb36e
    type: ADD_UPDATE_ROUTER
    start: 2022-05-10 23:34:10.909
    finish: 2022-05-10 23:34:33.659
router update 9478e3bb-b68d-41b6-8088-0888d6f9a681
    type: ADD_UPDATE_ROUTER
    start: 2022-05-10 23:35:29.244
    finish: 2022-05-10 23:35:37.686
router update 6137df6c-237c-4740-9c91-aa97f1777a7e
    type: ADD_UPDATE_ROUTER
    start: 2022-05-10 23:35:37.687
    finish: 2022-05-10 23:35:45.696
router update 49c31d3c-44ba-4fe7-9e3c-6af6b5679875
    type: ADD_UPDATE_ROUTER
    start: 2022-05-10 23:35:56.177
    finish: 2022-05-10 23:36:10.645
router transition:
    state: master
    enqueue: 2022-05-10 23:36:08.479
    transition: 2022-05-10 23:36:10.480
router update f3825e32-e03e-486d-8891-5fd38a8389a7
    type: DELETE_ROUTER
    start: 2022-05-10 23:36:10.646
    finish: 2022-05-10 23:36:17.032

Maximilian Stinsky (mstinsky) wrote :

Greetings,

I think we ran into a similar issue, if not the same one.
In our prod environment we had an l3 agent stuck trying to start a metadata haproxy for a router which had already been removed. This even seemed to leave the l3 agent unable to update the state of any new routers, so all new routers ended up in an unknown state.

A restart of the agent seemed to fix this, and it no longer tried to create the haproxy after that.

We are running neutron wallaby (18.5.0).

Maximilian Stinsky (mstinsky) wrote :

Just wanted to note that we had this happen a second time in our production environment.
I took a deeper look at the logs from the exact time the exception started to occur and I think Edward's assumption is right, as I can see that the keepalived process was promoted to master just before the router got deleted.

What I could see is that the l3 agent handled a router delete and around 44s later it started trying to respawn the metadata proxy, which fails because the namespace had already been deleted.
After that, all newly created routers fail to update their state and stay in an unknown state.
As before, a restart of the l3 agent fixes the issue and the agent stops trying to respawn the metadata-proxy of the deleted router.

Feb 28, 2023 @ 02:24:04.000
Tue Feb 28 01:24:04 2023: (VR_120) Entering MASTER STATE

Feb 28, 2023 @ 02:24:14.000:
Finished a router delete for 0828f373-2d43-4959-ba25-68ac9f178af2, update_id 12f83512-1442-45f6-a6f5-a3eee69fc6ae. Time elapsed: 4.929

Feb 28, 2023 @ 02:24:58.000
Respawning metadata-proxy for uuid 0828f373-2d43-4959-ba25-68ac9f178af2

Maximilian Stinsky (mstinsky) wrote :

Since we first hit this bug we have upgraded a couple of releases further and are now on Yoga (20.4.0), and we are still occasionally hitting it.
