AMQP disconnects, q-reports-plugin queue grows, leading to DBDeadlocks while trying to update agent heartbeats
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Won't Fix | High | Unassigned |
Bug Description
Since upgrading to Rocky, we have seen this issue pop up in several environments, small and large. First we see various AMQP/RabbitMQ-related errors - missed heartbeats from neutron-server to rabbitmq, then repeated errors such as Socket Closed, Broken Pipe, etc.
This continues for a while until all agents report as dead. On the agent side, we see RPC timeouts when trying to report state. Meanwhile, the q-reports-plugin queue in RabbitMQ grows to 10k+ messages - presumably because neutron-server can't connect to RabbitMQ and consume them.
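For context on the reporting path: agent state reports are ordinary oslo.messaging RPC calls on the q-reports-plugin topic, so a backed-up queue turns directly into the agent-side timeouts we see. A minimal Python sketch of that call - the method name, payload shape, and 60s timeout are illustrative, not neutron's exact code:

# Illustrative sketch of an agent's state report over RPC.
# 'report_state' and the payload are assumptions; the
# q-reports-plugin topic is taken from the symptoms above.
import oslo_messaging as messaging
from oslo_config import cfg

transport = messaging.get_rpc_transport(cfg.CONF)
target = messaging.Target(topic='q-reports-plugin', version='1.0')
client = messaging.RPCClient(transport, target, timeout=60)

def report_state(context, agent_state):
    # A blocking call(): if neutron-server cannot drain the queue,
    # this is where the agent raises MessagingTimeout.
    cctxt = client.prepare()
    return cctxt.call(context, 'report_state', agent_state=agent_state)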
Eventually, some time later, we see "DBDeadlock: (_mysql_" errors while neutron-server tries to update agent heartbeats (full stacktrace linked below).
Examples of the various AMQP-related errors - all slightly different:
2019-11-18 07:38:55,200.200 22488 ERROR oslo.messaging.
2019-11-18 07:40:22,454.454 22489 ERROR oslo.messaging.
2019-11-18 07:40:22,586.586 22489 ERROR oslo.messaging.
2019-11-18 07:42:06,010.010 22487 WARNING oslo.messaging.
2019-11-18 07:58:26,692.692 22489 WARNING oslo.messaging.
2019-11-18 07:58:26,696.696 22489 ERROR oslo.messaging.
Along with the following Broken Pipe stacktrace in oslo.messaging: http://
This continues for some time (30 min - 1 hour) until all agents report as dead, and we see the following errors in the rabbitmq broker logs - first missed-heartbeat errors, then handshake_timeout errors:
2019-11-18 07:41:01.448 [error] <0.6126.71> closing AMQP connection <0.6126.71> (127.0.0.1:39817 -> 127.0.0.1:5672 - neutron-
missed heartbeats from client, timeout: 60s
2019-11-18 07:41:07.665 [error] <0.18727.72> closing AMQP connection <0.18727.72> (127.0.0.1:51762 -> 127.0.0.1:5672):
{handshake_
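Since the queue backlog is the most visible early symptom, polling the RabbitMQ management API for queue depth can flag the condition before every agent flaps. A hypothetical watchdog sketch - host, port, credentials, vhost, and the 10k threshold are all assumptions:

# Hypothetical watchdog: alert when q-reports-plugin backs up.
# Requires the rabbitmq_management plugin; guest/guest and the
# default '/' vhost are assumptions.
import time
import requests

URL = 'http://localhost:15672/api/queues/%2F/q-reports-plugin'

def queue_depth():
    resp = requests.get(URL, auth=('guest', 'guest'), timeout=5)
    resp.raise_for_status()
    return resp.json()['messages']

while True:
    if queue_depth() > 10000:
        print('q-reports-plugin backlog exceeds 10k messages')
    time.sleep(30)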
Eventually we see that the RabbitMQ q-reports-plugin queue has grown and neutron-server reports the following DBDeadlock stacktrace:
2019-11-18 08:51:14,505.505 22493 ERROR oslo_db.api [req-231004a2-
Full stacktrace here: http://
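For reference, oslo.db ships a decorator that retries a function when MySQL reports a deadlock, which is the usual guard around writes like heartbeat updates. A minimal sketch of the pattern - the session handling and model are placeholders, not neutron's actual code:

# Sketch of the oslo.db deadlock-retry pattern. 'agent' stands in
# for the real agents table row; only the decorator is real API.
import datetime

from oslo_db import api as oslo_db_api

@oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True,
                           inc_retry_interval=True)
def update_agent_heartbeat(session, agent):
    # On a deadlock the decorator re-runs the whole function after a
    # backoff instead of letting DBDeadlock bubble up.
    with session.begin(subtransactions=True):
        agent.heartbeat_timestamp = datetime.datetime.utcnow()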
The only way to recover is to stop neutron-server and rabbitmq, kill any neutron workers still dangling (which they usually are), then restart. But the problem manifests again days or a week later.
RabbitMQ is on the same host as neutron-server - it is all localhost communication - so we are unsure why it can't heartbeat or connect. The subsequent DBDeadlock also leads me to think there is some synchronization issue when neutron gets overwhelmed with outstanding RPC messages.
Changed in neutron:
importance: Undecided → High
Changed in neutron:
assignee: nobody → Arjun Baindur (abaindur)
Changed in neutron:
assignee: Arjun Baindur (abaindur) → nobody
I see the exact same in our environment. Each time it is just the LBaaSv2 agent that is still ":-)" in neutron agent-list; all the other neutron-related agents are dead. After restarting neutron-server it recovers for some period of time. I couldn't get hold of what is triggering it.