MySQL patches by Codership

applier fails with lock wait timeout exceeded on rsync SST donor

Bug #1255501 reported by Teemu Ollakka on 2013-11-27

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Galera	Status tracked in 3.x
2.x	Fix Released	High	Teemu Ollakka	Galera 25.2.9
3.x	Fix Released	High	Teemu Ollakka	Galera 25.3.4
MySQL patches by Codership	Status tracked in 5.6
5.6	Fix Released	High	Teemu Ollakka	MySQL patches by Codership 5.6.15-25.2
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.5	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.5.37-25.10
5.6	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.6.15-25.2

Bug Description

Test: Standard seesaw test with rsync SST.

During one full SST donor got stuck in " Flushing tables for SST...". Normally this stage is resolved when some client transaction lock wait times out and FTWRL can proceed. However, this time applier (or replayer) could not apply events. The question is, why would the BF transaction ever had to wait for lock until it times out.

Log snippet:

2013-11-27 02:08:26 32076 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 16875835)
2013-11-27 02:08:26 32076 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2013-11-27 02:08:26 32076 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address 'gw:10023/rsync_sst' --auth 'root:rootpass' --socket '/run/shm/galera/local3/mysql/var/mysqld.sock' --datadir '/run/shm/galera/local3/mysql/var/' --defaults-file '/run/shm/galera/local3/mysql/etc/my.cnf' '' --gtid '8fc7b0b3-56a5-11e3-b5fe-1aa9415117a6:16875835''
2013-11-27 02:08:26 32076 [Note] WSREP: sst_donor_thread signaled with 0
2013-11-27 02:08:26 32076 [Note] WSREP: Flushing tables for SST...
2013-11-27 02:09:17 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:09:17 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:09:17 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2013-11-27 02:10:08 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:10:08 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:10:08 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2013-11-27 02:10:59 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:10:59 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:10:59 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2013-11-27 02:11:50 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:11:50 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:11:50 32076 [ERROR] WSREP: Failed to apply trx: source: fabb5d8b-56f7-11e3-97a2-bad682cc23e2 version: 3 local: 0 state: APPLYING flags: 129 conn_id: 15 trx_id: 56293974 seqnos (l: 19474, g: 16875911, s: 16875909, d: 16875910, ts: 630454850094192)
2013-11-27 02:11:50 32076 [ERROR] WSREP: Failed to apply trx 16875911 4 times
2013-11-27 02:11:50 32076 [ERROR] WSREP: Node consistency compromized, aborting...

Revision history for this message

Teemu Ollakka (teemu-ollakka) wrote on 2013-11-28:

It appeared that 5.6 branch is missing protection for BF threads against innodb lock wait time out.

Revision history for this message

Teemu Ollakka (teemu-ollakka) wrote on 2013-11-30:

Affects Galera too.

Long applier lock wait happens due to race between three threads:
1) Donor thread that is trying to pause provider
2) Applier thread that has BF aborted committing thread
3) BF aborted thread is hanging in local monitor pre_enter() because thread 1) has grabbed monitor lock

Revision history for this message

Teemu Ollakka (teemu-ollakka) wrote on 2013-11-30:

Fix to mysql 5.6 part pushed in http://bazaar.launchpad.net/~codership/codership-mysql/5.6/revision/4022