applier fails with lock wait timeout exceeded on rsync SST donor

Bug #1255501 reported by Teemu Ollakka
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Galera
Status tracked in 3.x
2.x
Fix Released
High
Teemu Ollakka
3.x
Fix Released
High
Teemu Ollakka
MySQL patches by Codership
Status tracked in 5.6
5.6
Fix Released
High
Teemu Ollakka
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
Fix Released
Undecided
Unassigned
5.6
Fix Released
Undecided
Unassigned

Bug Description

Test: Standard seesaw test with rsync SST.

During one full SST donor got stuck in " Flushing tables for SST...". Normally this stage is resolved when some client transaction lock wait times out and FTWRL can proceed. However, this time applier (or replayer) could not apply events. The question is, why would the BF transaction ever had to wait for lock until it times out.

Log snippet:

2013-11-27 02:08:26 32076 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 16875835)
2013-11-27 02:08:26 32076 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2013-11-27 02:08:26 32076 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address 'gw:10023/rsync_sst' --auth 'root:rootpass' --socket '/run/shm/galera/local3/mysql/var/mysqld.sock' --datadir '/run/shm/galera/local3/mysql/var/' --defaults-file '/run/shm/galera/local3/mysql/etc/my.cnf' '' --gtid '8fc7b0b3-56a5-11e3-b5fe-1aa9415117a6:16875835''
2013-11-27 02:08:26 32076 [Note] WSREP: sst_donor_thread signaled with 0
2013-11-27 02:08:26 32076 [Note] WSREP: Flushing tables for SST...
2013-11-27 02:09:17 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:09:17 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:09:17 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2013-11-27 02:10:08 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:10:08 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:10:08 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2013-11-27 02:10:59 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:10:59 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:10:59 32076 [Warning] WSREP: Failed to apply app buffer: seqno: 16875911, status: 1
  at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2013-11-27 02:11:50 32076 [Warning] Slave SQL: Could not execute Update_rows event on table test.comm01; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 202, Error_code: 1205
2013-11-27 02:11:50 32076 [Warning] WSREP: RBR event 3 Update_rows apply warning: 146, 16875911
2013-11-27 02:11:50 32076 [ERROR] WSREP: Failed to apply trx: source: fabb5d8b-56f7-11e3-97a2-bad682cc23e2 version: 3 local: 0 state: APPLYING flags: 129 conn_id: 15 trx_id: 56293974 seqnos (l: 19474, g: 16875911, s: 16875909, d: 16875910, ts: 630454850094192)
2013-11-27 02:11:50 32076 [ERROR] WSREP: Failed to apply trx 16875911 4 times
2013-11-27 02:11:50 32076 [ERROR] WSREP: Node consistency compromized, aborting...

Revision history for this message
Teemu Ollakka (teemu-ollakka) wrote :

It appeared that 5.6 branch is missing protection for BF threads against innodb lock wait time out.

Revision history for this message
Teemu Ollakka (teemu-ollakka) wrote :

Affects Galera too.

Long applier lock wait happens due to race between three threads:
1) Donor thread that is trying to pause provider
2) Applier thread that has BF aborted committing thread
3) BF aborted thread is hanging in local monitor pre_enter() because thread 1) has grabbed monitor lock

Revision history for this message
Teemu Ollakka (teemu-ollakka) wrote :
Revision history for this message
Teemu Ollakka (teemu-ollakka) wrote :
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1530

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.