Percona XtraDB Cluster - HA scalable solution for MySQL

applier fails with lock wait timeout exceeded on rsync SST donor

Reported by Teemu Ollakka on 2013-11-27
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Galera
Status tracked in 3.x
2.x
High
Teemu Ollakka
3.x
High
Teemu Ollakka
MySQL patches by Codership
Status tracked in 5.6
5.6
High
Teemu Ollakka
Percona XtraDB Cluster
Status tracked in Trunk
5.6
Undecided
Unassigned
Trunk
Undecided
Unassigned

Bug Description

Test: Standard seesaw test with rsync SST.

During one full SST donor got stuck in " Flushing tables for SST...". Normally this stage is resolved when some client transaction lock wait times out and FTWRL can proceed. However, this time applier (or replayer) could not apply events. The question is, why would the BF transaction ever had to wait for lock until it times out.

Log snippet:

2013-11-27 02:08:26 32076 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 16875835)
2013-11-27 02:08:26 32076 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2013-11-27 02:08:26 32076 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address 'gw:10023/rsync_sst' --auth 'root:rootpass' --socket '/run/shm/galera/local3/mysql/var/mysqld.sock' --datadir '/run/shm/galera/local3/mysql/var/' --defaults-file '/run/shm/galera/local3/mysql/etc/my.cnf' '' --gtid '8fc7b0b3-56a5-11e3-b5fe-1aa9415117a6:16875835''
2013-11-27 02:08:26 32076 [Note] WSREP: sst_donor_thread signaled with 0
2013-11-27 02:08:26 32076 [Note] WSREP: Flushing tables for SST...
2013-11-27 02:09:17 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:09:17 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:09:17 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2013-11-27 02:10:08 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:10:08 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:10:08 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2013-11-27 02:10:59 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:10:59 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:10:59 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2013-11-27 02:11:50 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:11:50 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:11:50 32076 [ERROR] WSREP: Failed to apply trx: source: fabb5d8b-56f7-11e3-97a2-bad682cc23e2 version: 3 local: 0 state: APPLYING flags: 129 conn_id: 15 trx_id: 56293974 seqnos (l: 19474, g: 16875911, s: 16875909, d: 16875910, ts: 630454850094192)
2013-11-27 02:11:50 32076 [ERROR] WSREP: Failed to apply trx 16875911 4 times
2013-11-27 02:11:50 32076 [ERROR] WSREP: Node consistency compromized, aborting...

Teemu Ollakka (teemu-ollakka) wrote :

It appeared that 5.6 branch is missing protection for BF threads against innodb lock wait time out.

Teemu Ollakka (teemu-ollakka) wrote :

Affects Galera too.

Long applier lock wait happens due to race between three threads:
1) Donor thread that is trying to pause provider
2) Applier thread that has BF aborted committing thread
3) BF aborted thread is hanging in local monitor pre_enter() because thread 1) has grabbed monitor lock

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers