MariaDB: database backup causes DB deadlock

Bug #2034018 reported by Dongcan Ye
This bug affects 2 people
Affects: kolla
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Using kolla-ansible to back up a MariaDB-based database causes DB deadlocks in some OpenStack projects, especially Octavia.
The Octavia health-manager uses the amphora_health table as the basis for failover; if the amphora_health table can't be updated, the affected amphorae are failed over.

2023-08-05 03:06:28.789 49 ERROR octavia.amphorae.drivers.health.heartbeat_udp [-] Health update for amphora 467e8b0b-8314-4936-ae85-a039643e26c8 encountered error (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: UPDATE amphora_health SET last_update=%(last_update)s WHERE amphora_health.amphora_id = %(amphora_id_1)s]
[parameters: {'last_update': datetime.datetime(2023, 8, 4, 19, 5, 38, 784220), 'amphora_id_1': '467e8b0b-8314-4936-ae85-a039643e26c8'}]
(Background on this error at: https://sqlalche.me/e/14/e3q8). Skipping health update.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: UPDATE amphora_health SET last_update=%(last_update)s WHERE amphora_health.amphora_id = %(amphora_id_1)s]
[parameters: {'last_update': datetime.datetime(2023, 8, 4, 19, 5, 38, 784220), 'amphora_id_1': '467e8b0b-8314-4936-ae85-a039643e26c8'}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

The same problem can break the Nova/Neutron/Cinder service heartbeats.
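
For reference, error 1205 is a lock wait timeout (which oslo.db surfaces as DBDeadlock, as in the traceback above). A minimal two-session sketch that reproduces the same error, where the long-held lock stands in for whatever the backup holds and the amphora id is a placeholder:

-- Session 1: hold a row lock on a health record and leave the transaction open.
BEGIN;
SELECT * FROM amphora_health WHERE amphora_id = '<amphora-id>' FOR UPDATE;

-- Session 2: the health-manager's update blocks behind that lock and, after
-- innodb_lock_wait_timeout (50 seconds by default), fails with the same error:
UPDATE amphora_health SET last_update = NOW() WHERE amphora_id = '<amphora-id>';
-- ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction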

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Can you please post:
Kolla-Ansible version
MariaDB version
MariaDB log with relevant entries
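
A hedged sketch of queries that pull the MariaDB-side details from the server itself (the Kolla-Ansible version has to come from the deploy host):

SELECT VERSION();                              -- MariaDB server version
SHOW GLOBAL VARIABLES LIKE 'log_error';        -- where the server log (with the WSREP entries) is written, if set
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';  -- confirms Galera cluster membership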

Changed in kolla:
status: New → Incomplete
Revision history for this message
Matthew Heler (mheler) wrote :

The same issue is seen in my environment during backups.

kolla-ansible version: stable/zed
mariadb version: 10.6.15+maria~ubu2204

neutron logs:
4a8e80ebfc - - default default] 10.4.24.13,10.4.24.73 "GET /v2.0/subnetpools HTTP/1.1" status: 200 len: 192 time: 0.0103774
2023-10-24 16:47:18.464 755 INFO neutron.wsgi [-] 10.4.24.13 "GET / HTTP/1.1" status: 200 len: 228 time: 0.0014348
2023-10-24 16:47:21.171 741 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovsdb_monitor [None req-e34fec0a-5936-4023-bbfa-e0cee5569e42 - - - - - -] HashRing is empty, error: Hash Ring returned empty when hashing "b'85ed18dc-b427-4bbd-ab24-1c5c42e6801d'". All 72 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster: neutron.common.ovn.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'85ed18dc-b427-4bbd-ab24-1c5c42e6801d'". All 72 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster
2023-10-24 16:47:21.176 744 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovsdb_monitor [None req-a93dfb9c-e658-4815-aeda-28aec097a248 - - - - - -] HashRing is empty, error: Hash Ring returned empty when hashing "b'a83cced1-3212-4fb9-88ac-2ea22edd66f9'". All 72 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster: neutron.common.ovn.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'a83cced1-3212-4fb9-88ac-2ea22edd66f9'". All 72 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster

octavia logs:
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp [SQL: UPDATE amphora_health SET last_update=%(last_update)s WHERE amphora_health.amphora_id = %(amphora_id_1)s]
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp [parameters: {'last_update': datetime.datetime(2023, 10, 24, 16, 46, 27, 315309), 'amphora_id_1': 'adc858be-c42f-4cca-a59e-d35da8819ee9'}]
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp (Background on this error at: https://sqlalche.me/e/14/e3q8)
2023-10-24 16:47:17.318 747 ERROR octavia.amphorae.drivers.health.heartbeat_udp

Revision history for this message
Michal Nasiadka (mnasiadka) wrote (last edit):

Anything in mariadb logs?
If it worked in an earlier version, it might be more of a MariaDB bug.

Revision history for this message
Matthew Heler (mheler) wrote (last edit):

Hi Michal,

Looks that way, yes. I reverted to an earlier mariadb container with 10.6.13+maria~ubu2204 and the issue isn't present during backups.

Revision history for this message
Dongcan Ye (hellochosen) wrote :

Sorry for the late reply.

kolla-ansible version: stable/yoga

MariaDB version: MariaDB 10.6.10-1.el8

From the attachment, the mariadb log shows:
2023-08-24 3:05:03 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-24 3:05:05 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-24 3:05:05 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
2023-08-24 3:05:06 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-24 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-24 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-24 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-24 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
2023-08-25 3:05:05 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-25 3:05:06 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-25 3:05:06 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
2023-08-25 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-25 3:05:08 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-25 3:05:08 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-25 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-25 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
...

2023-08-30 3:05:05 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-30 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-30 3:05:07 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
2023-08-30 3:05:08 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-30 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-30 3:05:09 0 [Note] WSREP: Member 0.0 (node-2) desyncs itself from group
2023-08-30 3:05:11 0 [Note] WSREP: Member 0.0 (node-2) resyncs itself to group.
2023-08-30 3:05:11 0 [Note] WSREP: Member 0.0 (node-2) synced with group.
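
The desync/resync pairs all fall in the ~03:05 backup window, which matches the backup taking the node out of sync while it runs. As a hedged sketch, running these on the backup node while a backup is in progress should confirm the node state and the timeout that produces the 1205 errors (the Octavia log shows the failing updates stalled for roughly 50 seconds, consistent with the default):

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';    -- expected 'Donor/Desynced' while the backup runs
SHOW GLOBAL VARIABLES LIKE 'wsrep_desync';              -- ON while the node is desynced
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';  -- 50 seconds by default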

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for kolla because there has been no activity for 60 days.]

Changed in kolla:
status: Incomplete → Expired