pt-slave-restart skips binlog position on lock timeout

Bug #1463314 reported by Przemek
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Confirmed
Undecided
Unassigned

Bug Description

By default, when the error numbers are not explicitly listed with --error-numbers, pt-slave-restart uses SQL_SLAVE_SKIP_COUNTER=1 before resuming replication, even or errors like 'Lock wait timeout exceeded' (1205). Such behaviour is dangerous and introduces data inconsistencies on the slave.
IMHO for the slave thread stopped on lock timeout, the tool should just do 'start slave sql_thread' couple of times before taking further action.
Then, either abort with warning that some user thread is blocking the slave sql_thread, or skip the event. Though the second option I think should be only used when user consciously forces it knowing it may (or almost for sure it will) introduce new data inconsistencies on the slave. Hence a new option like '--force-skip-on-lock-timout' maybe could be added.

Current behaviour example(version 2.2.14):
$ pt-slave-restart --user=msandbox --password=msandbox -S /tmp/mysql_sandbox20485.sock
2015-06-09T04:41:11 S=/tmp/mysql_sandbox20485.sock,p=...,u=msandbox mysql_sandbox20485-relay-bin.000026 6987 1205

2015-06-09 04:37:01 5614 [Warning] Slave SQL: Could not execute Write_rows event on table test.address_detail; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.000005, end_log_pos 97297, Error_code: 1205
2015-06-09 04:37:01 5614 [ERROR] Slave SQL: Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205
2015-06-09 04:37:01 5614 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
2015-06-09 04:37:01 5614 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000005' position 95652
2015-06-09 04:41:11 5614 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000005' at position 95652, relay log './mysql_sandbox20485-relay-bin.000026' position: 6987
2015-06-09 04:41:11 5614 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./mysql_sandbox20485-relay-bin.000026', relay_log_pos='6987', master_log_name='mysql-bin.000005', master_log_pos='95652' and new position at relay_log_file='./mysql_sandbox20485-relay-bin.000026', relay_log_pos='8663', master_log_name='mysql-bin.000005', master_log_pos='97328'

Przemek (pmalkowski)
description: updated
Changed in percona-toolkit:
status: New → Confirmed
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1290

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.