MariaDB: database backup causes DB deadlock

Bug #2034018 reported by Dongcan Ye
This bug affects 2 people
Affects: kolla
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Using kolla-ansible to back up a MariaDB-based database causes DB deadlocks in some OpenStack projects, especially Octavia.
The Octavia health-manager uses the amphora_health table as the basis for failover; if the amphora_health table can't be updated, the affected amphorae are failed over.

2023-08-05 03:06:28.789 49 ERROR octavia.amphorae.drivers.health.heartbeat_udp [-] Health update for amphora 467e8b0b-8314-4936-ae85-a039643e26c8 encountered error (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: UPDATE amphora_health SET last_update=%(last_update)s WHERE amphora_health.amphora_id = %(amphora_id_1)s]
[parameters: {'last_update': datetime.datetime(2023, 8, 4, 19, 5, 38, 784220), 'amphora_id_1': '467e8b0b-8314-4936-ae85-a039643e26c8'}]
(Background on this error at: https://sqlalche.me/e/14/e3q8). Skipping health update.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: UPDATE amphora_health SET last_update=%(last_update)s WHERE amphora_health.amphora_id = %(amphora_id_1)s]
[parameters: {'last_update': datetime.datetime(2023, 8, 4, 19, 5, 38, 784220), 'amphora_id_1': '467e8b0b-8314-4936-ae85-a039643e26c8'}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

The same problem can break the Nova/Neutron/Cinder service heartbeats.
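
For reference, error 1205 is a lock wait timeout (which oslo.db surfaces as DBDeadlock, as in the traceback above). A minimal two-session sketch that reproduces the same error, where the long-held lock stands in for whatever the backup holds and the amphora id is a placeholder:

-- Session 1: hold a row lock on a health record and leave the transaction open.
BEGIN;
SELECT * FROM amphora_health WHERE amphora_id = '<amphora-id>' FOR UPDATE;

-- Session 2: the health-manager's update blocks behind that lock and, after
-- innodb_lock_wait_timeout (50 seconds by default), fails with the same error:
UPDATE amphora_health SET last_update = NOW() WHERE amphora_id = '<amphora-id>';
-- ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction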

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Can you please post:
Kolla-Ansible version
MariaDB version
MariaDB log with relevant entries
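
A hedged sketch of queries that pull the MariaDB-side details from the server itself (the Kolla-Ansible version has to come from the deploy host):

SELECT VERSION();                              -- MariaDB server version
SHOW GLOBAL VARIABLES LIKE 'log_error';        -- where the server log (with the WSREP entries) is written, if set
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';  -- confirms Galera cluster membership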

Changed in kolla:
status: New → Incomplete
Revision history for this message
Matthew Heler (mheler) wrote :

The same issue is seen in my environment during backups.

kolla-ansible version: stable/zed
mariadb version: 10.6.15+maria~ubu2204

neutron logs:
4a8e80ebfc - - default default] 10.4.24.13,10.4.24.73 "GET /v2.0/subnetpools HTTP/1.1" status: 200 len: 192 time: 0.0103774
2023-10-24 16:47:18.464 755 INFO neutron.wsgi [-] 10.4.24.13 "GET / HTTP/1.1" status: 200 len: 228 time: 0.0014348
2023-10-24 16:47:21.171 741 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovsdb_monitor [None req-e34fec0a-5936-4023-bbfa-e0cee5569e42 - - - - - -] HashRing is empty, error: Hash Ring returned empty when hashing "b'85ed18dc-b427-4bbd-ab24-1c5c42e6801d'". All 72 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster: neutron.common.ovn.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'85ed18dc-b427-4bbd-ab24-1c5c42e6801d'". All 72 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster
2023-10-24 16:47:21.176 744 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovsdb_monitor [None req-a93dfb9c-e658-4815-aeda-28aec097a248 - - - - - -] HashRing is empty, error: Hash Ring returned empty when hashing "b'a83cced1-3212-4fb9-88ac-2ea22edd66f9'". All 72 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster: neutron.common.ovn.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'a83cced1-3212-4fb9-88ac-2ea22edd66f9'". All 72 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster

octavia logs:
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp [SQL: UPDATE amphora_health SET last_update=%(last_update)s WHERE amphora_health.amphora_id = %(amphora_id_1)s]
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp [parameters: {'last_update': datetime.datetime(2023, 10, 24, 16, 46, 27, 315309), 'amphora_id_1': 'adc858be-c42f-4cca-a59e-d35da8819ee9'}]
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp (Background on this error at: https://sqlalche.me/e/14/e3q8)
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp

Revision history for this message
Michal Nasiadka (mnasiadka) wrote (last edit):

Anything in mariadb logs?
If it worked in an earlier version, it might be more of a MariaDB bug.

Revision history for this message
Matthew Heler (mheler) wrote (last edit):

Hi Michal,

Looks that way, yes. I reverted to an earlier mariadb container with 10.6.13+maria~ubu2204 and the issue isn't present during backups.

Revision history for this message
Dongcan Ye (hellochosen) wrote :

Sorry for the late reply.

kolla-ansible version: stable/yoga

MariaDB version: MariaDB 10.6.10-1.el8

From the attachment, the mariadb log shows:
2023-08-24 3:05:03 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-24 3:05:05 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-24 3:05:05 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
2023-08-24 3:05:06 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-24 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-24 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-24 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-24 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
2023-08-25 3:05:05 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-25 3:05:06 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-25 3:05:06 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
2023-08-25 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-25 3:05:08 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-25 3:05:08 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-25 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-25 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
...

2023-08-30 3:05:05 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-30 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-30 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
2023-08-30 3:05:08 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-30 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-30 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-30 3:05:11 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-30 3:05:11 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
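
The desync/resync pairs all fall in the ~03:05 backup window, which matches the backup taking the node out of sync while it runs. As a hedged sketch, running these on the backup node while a backup is in progress should confirm the node state and the timeout that produces the 1205 errors (the Octavia log shows the failing updates stalled for roughly 50 seconds, consistent with the default):

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';    -- expected 'Donor/Desynced' while the backup runs
SHOW GLOBAL VARIABLES LIKE 'wsrep_desync';              -- ON while the node is desynced
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';  -- 50 seconds by default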

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for kolla because there has been no activity for 60 days.]

Changed in kolla:
status: Incomplete → Expired