L3-agent queue is processed by single worker
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mirantis OpenStack | Fix Released | Critical | Eugene Nikanorov |
7.0.x | Fix Released | Critical | Eugene Nikanorov |
8.0.x | Fix Released | Critical | Eugene Nikanorov |
Bug Description
Steps to reproduce:
1. Deploy MOS with Neutron with DVR enabled
2. Restart all l3 agents at once. You can use the following one-liner on the master node for that:
fuel nodes | grep comp | awk '{ print $1; }' | xargs -I@ ssh node-@ initctl restart neutron-l3-agent
With some probability all agents will go down (as displayed by 'neutron agent-list | grep L3'). The issue does not heal by itself over time.
So far this has been reproduced only on a 200-node environment, with around 200 l3 agents running on compute nodes. It may not be reproducible on smaller environments.
======== Other symptoms
neutron-server logs on all three controllers are full of errors like this:
http://
Also if one executes
rabbitmqctl list_queues messages consumers name
it can be seen that queue 'q-l3-plugin' is full of messages.
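A minimal sketch of how the output of the command above could be filtered for backlogged queues. It assumes the output has one queue per line with whitespace-separated columns in the order requested (messages, consumers, name). The function name, the threshold and the sample numbers are made up for illustration and are not taken from the affected environment.

```python
def find_backlogged_queues(listing, min_messages=1000):
    """Return (name, messages, consumers) for queues at or above min_messages.

    `listing` is the text printed by
    `rabbitmqctl list_queues messages consumers name`.
    Lines that do not look like "<int> <int> <name>" (for example a
    banner line) are skipped.
    """
    backlogged = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        messages, consumers, name = parts
        if not (messages.isdigit() and consumers.isdigit()):
            continue
        if int(messages) >= min_messages:
            backlogged.append((name, int(messages), int(consumers)))
    return backlogged


# Illustrative input only; the numbers are invented.
sample = "Listing queues ...\n12000\t1\tq-l3-plugin\n0\t3\tq-plugin\n"
print(find_backlogged_queues(sample))
```

A large message count combined with a small consumer count on 'q-l3-plugin' is the pattern to look for here.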
======== RCA
1. The l3 agent periodically makes RPC calls to neutron-server.
2. If an agent is restarted after it has sent an RPC request and before it has received the reply, neutron-server has to send the reply to a reply queue that no longer exists. The queue disappears because a restarted l3 agent creates a new reply queue with a different name, while the old queue is removed because it has the auto-delete flag.
3. oslo.messaging spends 60 seconds trying to send a message to a non-existent queue. After that the message is discarded.
4. After start, the l3 agent makes an initial RPC call to neutron-server. If the call goes unanswered 5 times in a row, after 5 minutes the l3 agent dies with a critical error and is respawned by systemd.
5. Assume the following situation: the 'q-l3-plugin' queue is full of 'old' RPC requests from l3 agents that have already died. It takes one neutron-server thread at least 60 seconds to process one such message (the 60 seconds are spent trying to send the reply to a non-existent queue). In that case new requests from an l3 agent are not processed within 5 minutes, so the agent dies and restarts, which means it has just contributed 5 more messages with an invalid reply queue to 'q-l3-plugin'. When the number of agents is greater than the number of Neutron threads processing requests, the issue never resolves by itself, because the l3 agents produce more messages than neutron-server can process. The 'q-l3-plugin' queue grows constantly.
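The arithmetic behind step 5 can be sketched as a simple worst-case model. It assumes every queued request has a dead reply queue (so each one costs a worker the full 60 seconds) and that every agent adds 5 requests per 5-minute restart cycle. The function and parameter names are hypothetical. This is not a model of actual neutron-server scheduling.

```python
def simulate_backlog(agents, workers, minutes,
                     reply_timeout_s=60, calls_per_cycle=5, cycle_s=300):
    """Return the modelled 'q-l3-plugin' depth at the end of each minute.

    Worst case: every request in the queue is stale, so a worker
    handles one request per reply_timeout_s seconds.
    """
    # Requests added per minute by all agents together.
    arrivals_per_min = agents * calls_per_cycle * 60 / cycle_s
    # Requests drained per minute by all workers together.
    served_per_min = workers * 60 / reply_timeout_s
    backlog = 0.0
    history = []
    for _ in range(minutes):
        backlog = max(0.0, backlog + arrivals_per_min - served_per_min)
        history.append(backlog)
    return history


# 200 agents against a single worker: the queue only grows.
print(simulate_backlog(agents=200, workers=1, minutes=10)[-1])
# As many workers as agents: the queue stays empty in this model.
print(simulate_backlog(agents=200, workers=200, minutes=10)[-1])
```

With these rates each agent adds one request per minute and each worker drains one per minute, so in this model the backlog grows exactly when there are more agents than workers, which matches the condition stated in step 5.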
Changed in mos: | |
assignee: | nobody → MOS Oslo (mos-oslo) |
Changed in mos: | |
milestone: | none → 7.0 |
tags: | added: scale |
Changed in mos: | |
status: | Confirmed → In Progress |
tags: | added: on-verification |
RCA shows that only 1 of the configured neutron-server processes was listening to q-l3-agent.
That is not enough for DVR environments with many L3 agents.