AMQP disconnects, q-reports-plugin queue grows, leading to DBDeadlocks while trying to update agent heartbeats

Bug #1853071 reported by Arjun Baindur
This bug affects 9 people
Affects: neutron
Status: Won't Fix
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Since upgrading to Rocky, we have seen this issue pop up in several environments, small and large. First we see various AMQP/RabbitMQ-related errors: missed heartbeats from neutron-server to rabbitmq, then repeated errors such as Socket closed, Broken pipe, etc.

This continues for a while until all agents are reported as dead. On the agent side, we see RPC timeouts when trying to report state. Meanwhile, the q-reports-plugin queue in rabbit grows to 10k+ messages, presumably because neutron-server can't connect to Rabbit and process them.

Eventually we see "DBDeadlock: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')" errors while neutron-server is trying to update stale agent heartbeats.

Examples of the various AMQP-related errors, all slightly different:

2019-11-18 07:38:55,200.200 22488 ERROR oslo.messaging._drivers.impl_rabbit [req-cba0d0fa-8e5a-42f1-a93b-4bb398b22275 - - - - -] [eba472a9-021d-4738-801b-7944aad3e3af] AMQP server 127.0.0.1:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed

2019-11-18 07:40:22,454.454 22489 ERROR oslo.messaging._drivers.impl_rabbit [-] [6eb8d074-02c7-4add-8d91-768cbae60fdc] AMQP server on 127.0.0.1:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.: ConnectionForced: Too many heartbeats missed

2019-11-18 07:40:22,586.586 22489 ERROR oslo.messaging._drivers.impl_rabbit [req-0b9f092f-27f2-4be1-bdf5-2c5208e54321 - - - - -] [4b43df2c-cc3e-4442-807c-dcfd4057cb3d] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.: error: [Errno 32] Broken pipe

2019-11-18 07:42:06,010.010 22487 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 32] Broken pipe

2019-11-18 07:58:26,692.692 22489 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed

2019-11-18 07:58:26,696.696 22489 ERROR oslo.messaging._drivers.impl_rabbit [-] [84273ffb-1610-44b1-aff7-d5e4606b7f59] AMQP server on 127.0.0.1:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.: RecoverableConnectionError: <AMQPError: unknown error>

Along with the following Broken pipe stacktrace in oslo.messaging: http://paste.openstack.org/show/786312/

This continues for some time (30 min - 1 hour) until all agents are reported as dead, and we see the following errors in the rabbitmq broker logs: first missed-heartbeat errors, then handshake_timeout errors:

2019-11-18 07:41:01.448 [error] <0.6126.71> closing AMQP connection <0.6126.71> (127.0.0.1:39817 -> 127.0.0.1:5672 - neutron-server:22487:ee468e25-42d7-45b8-aea0-4f6fb58a9034):
missed heartbeats from client, timeout: 60s
2019-11-18 07:41:07.665 [error] <0.18727.72> closing AMQP connection <0.18727.72> (127.0.0.1:51762 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

Eventually we see that the RabbitMQ q-reports-plugin queue has grown and neutron-server reports the following DBDeadlock stacktrace:

2019-11-18 08:51:14,505.505 22493 ERROR oslo_db.api [req-231004a2-d988-47b3-9730-d6b5276fdcf8 - - - - -] DB exceeded retry limit.: DBDeadlock: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE agents SET heartbeat_timestamp=%s WHERE agents.id = %s'] [parameters: (datetime.datetime(2019, 11, 18, 8, 50, 23, 804716), '223c754e-9d7f-4df3-b5a5-9be4eb8692b0')] (Background on this error at: http://sqlalche.me/e/e3q8)

Full stacktrace here: http://paste.openstack.org/show/786313/

The only way to recover is to stop neutron-server and rabbitmq, kill any neutron workers still dangling (which they usually are), then restart. But then we see the problem manifest again days or a week later.

RabbitMQ is on the same host as neutron-server - it is all localhost communication - so we are unsure why it can't heartbeat or connect. Also, the subsequent DBDeadlock leads me to think there is some synchronization issue when neutron gets overwhelmed with outstanding RPC messages.
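
To illustrate that suspicion, here is a minimal standalone sketch, not neutron code, of how a long-held row lock on the agents table turns a second heartbeat UPDATE into the same 1205 error. The connection URL, credentials, agent id and column names are assumptions for illustration only:

import datetime

import sqlalchemy as sa

# Hypothetical connection details; adjust to the actual neutron DB.
engine = sa.create_engine("mysql+pymysql://neutron:secret@127.0.0.1/neutron")
AGENT_ID = "223c754e-9d7f-4df3-b5a5-9be4eb8692b0"
stmt = sa.text(
    "UPDATE agents SET heartbeat_timestamp = :ts WHERE id = :agent_id")

# Transaction 1 updates the row and is deliberately left open, holding the
# row lock (compare the "ACTIVE 1397 sec" transactions in the INNODB STATUS
# output further down the thread).
conn1 = engine.connect()
tx1 = conn1.begin()
conn1.execute(stmt, {"ts": datetime.datetime.utcnow(), "agent_id": AGENT_ID})

# Transaction 2 tries to update the same agents row: it blocks until
# innodb_lock_wait_timeout (50s by default) expires and then fails with
# error 1205, the same "Lock wait timeout exceeded" that neutron-server logs.
conn2 = engine.connect()
tx2 = conn2.begin()
try:
    conn2.execute(stmt, {"ts": datetime.datetime.utcnow(), "agent_id": AGENT_ID})
except sa.exc.OperationalError as exc:
    print("second heartbeat update failed:", exc)
    tx2.rollback()
finally:
    tx1.rollback()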

Revision history for this message
Dyon van Bavel (dvbavel) wrote :

I see the exact same thing in our environment; each time it is just the LBaaSv2 agent that is still ":-)" in neutron agent-list, while all other neutron-related agents are dead. After restarting neutron-server it recovers for some period of time. We couldn't get hold of what is triggering it.

Revision history for this message
Arjun Baindur (abaindur) wrote :

This is easy to reproduce. The setup here is a 2 vCPU controller VM. MySQL is on the same host as neutron-server, at 127.0.0.1:3306. There are only 3 remote hypervisors, each with 4 neutron agents (ovs, dhcp, l3, and metadata), for a total of 12 agents.

Use iptables to simulate a DB connection failure by blocking traffic from the neutron user's UID to the mysql port:

iptables -I OUTPUT -o lo -p tcp --dport 3306 --match owner --uid-owner 988 -j DROP

Let this sit for 5-10 minutes. You should see neutron-server blocked in logs. Remove the iptables rule:

iptables -D OUTPUT -o lo -p tcp -m tcp --dport 3306 -m owner --uid-owner 988 -j DROP

Immediately upon removing the rule we saw neutron resume, but the RabbitMQ q-reports-plugin queue keeps growing (a small polling sketch follows the output below):

[<email address hidden> rabbitmq(admin)]# rabbitmqctl list_queues | grep -vw "0$"
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name messages
q-reports-plugin 30
q-plugin 4
[<email address hidden> rabbitmq(admin)]#
[<email address hidden> rabbitmq(admin)]# rabbitmqctl list_queues | grep -vw "0$"
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name messages
q-reports-plugin 151
q-plugin 7
pf9-loopback-queue 1
[<email address hidden> rabbitmq(admin)]# rabbitmqctl list_queues | grep -vw "0$"
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name messages
q-reports-plugin 193
q-plugin 8
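
For reference, the queue depth above was sampled by hand with rabbitmqctl; the same can be done with a small poller against the RabbitMQ management HTTP API, assuming the management plugin is enabled on the default port 15672 and that guest credentials work (both are assumptions here):

import time

import requests

# Default vhost "/" is URL-encoded as %2F; port and credentials are assumptions.
URL = "http://127.0.0.1:15672/api/queues/%2F/q-reports-plugin"

while True:
    resp = requests.get(URL, auth=("guest", "guest"), timeout=10)
    resp.raise_for_status()
    # The "messages" field is the total queue depth reported by the broker.
    print(time.strftime("%H:%M:%S"),
          "q-reports-plugin messages:", resp.json().get("messages"))
    time.sleep(30)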

At the same time, we are seeing a lot of the same AMQP errors in the neutron logs:

2019-11-19 21:48:10,386.386 12595 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 32] Broken pipe
2019-11-19 21:48:10,386.386 12595 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 32] Broken pipe
2019-11-19 21:49:01,521.521 12595 ERROR oslo.messaging._drivers.impl_rabbit [-] [2bebb99d-aac7-41e4-9c0d-7c8b9db62fc8] AMQP server on 127.0.0.1:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.: RecoverableConnectionError: <AMQPError: unknown error>

Revision history for this message
Arjun Baindur (abaindur) wrote :

Not seeing the DBDeadlock stacktraces in the neutron logs yet, but here is some more info from the mysql side:

These are the queries that we can see mysql repeatedly executing. The age/time increases to 60, then a new query from a different localhost:PORTNUM is observed. Sometimes the agents.id is different, but when it retries the same agents.id, what is strange is that the heartbeat timestamp is off by a fraction of a second. Is this a different state report RPC altogether, or is the timestamp recalculated on the server side? (A rough sketch of the latter possibility follows the processlist output below.)

mysql> SELECT host,time, INFO FROM INFORMATION_SCHEMA.PROCESSLIST where INFO is NOT NULL and DB ='neutron' ORDER BY time DESC LIMIT 30;
+-----------------+------+----------------------------------------------------------------------------------------------------------------------------------+
| host | time | INFO |
+-----------------+------+----------------------------------------------------------------------------------------------------------------------------------+
| localhost:50606 | 6 | UPDATE agents SET heartbeat_timestamp='2019-11-19 21:54:18.817098' WHERE agents.id = '17443741-0a8f-4cbb-9bc3-94b6407d849b' |
| localhost:39315 | 6 | UPDATE agents SET heartbeat_timestamp='2019-11-19 21:49:01.483936' WHERE agents.id = '17443741-0a8f-4cbb-9bc3-94b6407d849b' |
| localhost | 0 | SELECT host,time, INFO FROM INFORMATION_SCHEMA.PROCESSLIST where INFO is NOT NULL and DB ='neutron' ORDER BY time DESC LIMIT 30 |
+-----------------+------+----------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT host,time, INFO FROM INFORMATION_SCHEMA.PROCESSLIST where INFO is NOT NULL and DB ='neutron' ORDER BY time DESC LIMIT 30;
+-----------------+------+----------------------------------------------------------------------------------------------------------------------------------+
| host | time | INFO |
+-----------------+------+----------------------------------------------------------------------------------------------------------------------------------+
| localhost:50608 | 42 | UPDATE agents SET heartbeat_timestamp='2019-11-19 21:54:18.823186' WHERE agents.id = '5e0f0692-4f55-4a35-9efa-12aa866153a2' |
| localhost:39317 | 42 | UPDATE agents SET heartbeat_timestamp='2019-11-19 21:49:01.487855' WHERE agents.id = 'e2682460-c019-43fd-848b-871a5b7c13bf' |
| localhost | 0 | SELECT host,time, INFO FROM INFORMATION_SCHEMA.PROCESSLIST where INFO is NOT NULL and DB ='neutron' ORDER BY time DESC LIMIT 30 |
+-----------------+------+----------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.01 sec)

mysql> SELECT host,time, INFO FROM INFORMATION_SCHEMA.PROCESSLIST where INFO is NOT NULL and DB ='neutron' ORDER BY time DESC LIMIT 30;
+...
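
One possible explanation for the near-duplicate timestamps, under the assumption that the server stamps the heartbeat at processing time rather than reusing a value sent by the agent: two workers draining a backlog of state reports for the same agent would each compute their own timestamp, so redelivered or queued-up reports end up as separate UPDATEs that differ slightly. This is a rough sketch of that pattern, not neutron's actual handler:

import datetime

import sqlalchemy as sa

# Hypothetical connection details, for illustration only.
engine = sa.create_engine("mysql+pymysql://neutron:secret@127.0.0.1/neutron")

def handle_report_state(agent_id):
    # Each worker computes its own "now" at the moment it finally processes
    # the queued report, which would explain the slightly different
    # heartbeat_timestamp values seen in the processlist above.
    now = datetime.datetime.utcnow()
    with engine.begin() as conn:
        conn.execute(
            sa.text("UPDATE agents SET heartbeat_timestamp = :ts "
                    "WHERE id = :agent_id"),
            {"ts": now, "agent_id": agent_id})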


Revision history for this message
Arjun Baindur (abaindur) wrote :

Finally, since I'm not an SQL expert, here is the output of SHOW ENGINE INNODB STATUS: http://paste.openstack.org/show/786369/

There are a bunch of neutron transactions that have been active for 1397 seconds, some for 96 seconds, and a bunch that are "not started". (A short script to list the blocker/waiter pairs directly is sketched after the dump below.)

Two of them indicate they have been waiting on a lock like so:

---TRANSACTION 57170620, ACTIVE 1397 sec
MySQL thread id 169, OS thread handle 0x7fa3375f5700, query id 36577 localhost 127.0.0.1 neutron
Trx read view will not see trx with id >= 57170621, sees < 57170585
---TRANSACTION 57170619, ACTIVE 1397 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 360, 1 row lock(s)
MySQL thread id 168, OS thread handle 0x7fa337637700, query id 123513 localhost 127.0.0.1 neutron updating
UPDATE agents SET heartbeat_timestamp='2019-11-19 20:54:45.480969' WHERE agents.id = '1d7c5726-a336-4156-ace5-952d5363ad50'
Trx read view will not see trx with id >= 57170620, sees < 57170585
------- TRX HAS BEEN WAITING 17 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 204 page no 3 n bits 88 index `PRIMARY` of table `neutron`.`agents` trx id 57170619 lock_mode X locks rec but not gap waiting
Record lock, heap no 6 PHYSICAL RECORD: n_fields 16; compact format; info bits 0
 0: len 30; hex 31643763353732362d613333362d343135362d616365352d393532643533; asc 1d7c5726-a336-4156-ace5-952d53; (total 36 bytes);
 1: len 6; hex 000003685a9e; asc hZ ;;
 2: len 7; hex 560000026d1233; asc V m 3;;
 3: len 14; hex 4d65746164617461206167656e74; asc Metadata agent;;
 4: len 22; hex 6e657574726f6e2d6d657461646174612d6167656e74; asc neutron-metadata-agent;;
 5: len 3; hex 4e2f41; asc N/A;;
 6: len 30; hex 64333062303339302d323066642d343366652d393566382d313037396330; asc d30b0390-20fd-43fe-95f8-1079c0; (total 36 bytes);
 7: len 1; hex 81; asc ;;
 8: len 5; hex 99a34f6033; asc O`3;;
 9: len 5; hex 99a41b7ab4; asc z ;;
 10: len 5; hex 99a4a74dad; asc M ;;
 11: SQL NULL;
 12: len 30; hex 7b226c6f675f6167656e745f68656172746265617473223a2066616c7365; asc {"log_agent_heartbeats": false; (total 158 bytes);
 13: len 4; hex 80000000; asc ;;
 14: SQL NULL;
 15: SQL NULL;

Changed in neutron:
importance: Undecided → High
Revision history for this message
Arjun Baindur (abaindur) wrote :

I have some more updates/clues:

There appear to be a lot of queries causing the blocking that show a NULL SQL query. It seems these are actually "rollback" operations.

As you can see, there are 2 blocked transactions, each waiting on another transaction. If you view these transactions, or view the processlist, you will see a NULL query reported: http://paste.openstack.org/show/786464/

We can use the events_transactions_history and _current tables in performance_schema to show the actual SQL history for these blocking transactions. Many appear to be rollbacks from other agent timestamp update operations themselves; others are sometimes different transactions altogether. We can also see many threads currently trying to perform operations, with the majority being ROLLBACK:

http://paste.openstack.org/show/786465/
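
The kind of lookup behind those pastes, sketched with illustrative values (it assumes performance_schema statement-history instrumentation is enabled): map a blocking connection's processlist id to its recent statements, which is where the ROLLBACKs show up even though the processlist itself reports NULL:

import pymysql

# Illustrative: substitute the blocker's processlist id, e.g. one of the
# long-running "MySQL thread id" values from the INNODB STATUS output above.
BLOCKING_PROCESSLIST_ID = 169

# Credentials are placeholders.
conn = pymysql.connect(host="127.0.0.1", user="root", password="secret",
                       database="performance_schema")
with conn.cursor() as cur:
    cur.execute(
        "SELECT t.thread_id, h.event_id, h.sql_text "
        "FROM threads t "
        "JOIN events_statements_history h ON h.thread_id = t.thread_id "
        "WHERE t.processlist_id = %s "
        "ORDER BY h.event_id",
        (BLOCKING_PROCESSLIST_ID,))
    for thread_id, event_id, sql_text in cur.fetchall():
        print(thread_id, event_id, sql_text)
conn.close()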

Arjun Baindur (abaindur)
Changed in neutron:
assignee: nobody → Arjun Baindur (abaindur)
Arjun Baindur (abaindur)
Changed in neutron:
assignee: Arjun Baindur (abaindur) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/697215

Changed in neutron:
assignee: nobody → Arjun Baindur (abaindur)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/697396

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/697404

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/697405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/697407

Revision history for this message
wang (yunhua) wrote :

stay tuned

Revision history for this message
jichenjc (jichenjc) wrote :

We are having this issue as well, with errors very much like those pasted above.

@Arjun Baindur, have you fixed the issue locally, or is there any additional info you can share?

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Arjun Baindur (abaindur) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/697215
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/stein)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/697404
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/rocky)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/697405
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/train)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/697396
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status: New → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Rodolfo Alonso <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/697407
Reason: This patch has been abandoned due to the lack of activity. Please propose it again if needed. Thanks!
