Safe-slave-backup gets stuck on a long query
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona XtraBackup moved to https://jira.percona.com/projects/PXB | Status tracked in 2.4 | | |
2.3 | Fix Released | Wishlist | Vasily Nemkov |
2.4 | Fix Released | Wishlist | Vasily Nemkov |
Bug Description
When using --safe-slave-backup, the following can happen:
t1: a temporary table is created.
t2: a long DML statement starts.
t3: safe-slave-backup issues STOP SLAVE.
t4: the slave SQL thread is killed and the long DML is rolled back.
t5: because a temporary table is still open (Slave_open_temp_tables > 0), safe-slave-backup issues START SLAVE... and replication restarts from the point where the long DML started.
t6: safe-slave-backup sleeps a bit, but far less than the long DML needs, so it issues STOP SLAVE again while the long DML is still running, repeating the above ad infinitum.
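A toy simulation can make this livelock concrete. It is a minimal sketch under stated assumptions, not XtraBackup's actual code: `ToyReplica`, `safe_slave_backup_current`, the statement names, the 30-second DML, the 3-second sleep and the 300-second timeout are all hypothetical values chosen for illustration. As at t4, stopping the slave throws away the work done on the in-flight statement without moving the position.

```python
class ToyReplica:
    """Hypothetical toy model of a replica SQL thread (not MySQL/XtraBackup code)."""

    def __init__(self, events, open_temp_tables):
        self.events = events          # list of (statement_name, seconds_to_apply)
        self.position = 0             # index of the next event to apply
        self.progress = 0             # seconds already spent on events[position]
        self.open_temp_tables = open_temp_tables  # plays the role of Slave_open_temp_tables
        self.running = True           # t2: the long DML is already executing

    def start_slave(self):
        self.running = True

    def stop_slave(self):
        # t4: the SQL thread is killed and the in-flight statement is rolled
        # back, so the work done on it is lost and the position stays put.
        self.running = False
        self.progress = 0

    def advance(self, seconds):
        """Let the SQL thread apply events for `seconds` of wall-clock time."""
        while self.running and seconds > 0 and self.position < len(self.events):
            name, duration = self.events[self.position]
            remaining = duration - self.progress
            if seconds < remaining:
                self.progress += seconds   # statement still in flight
                return
            seconds -= remaining           # statement finishes and commits
            self.progress = 0
            self.position += 1
            if name == "CREATE_TMP":
                self.open_temp_tables += 1
            elif name == "DROP_TMP":
                self.open_temp_tables -= 1


def safe_slave_backup_current(replica, sleep_time=3, timeout=300):
    """Retry loop as described in the report: stop, check, restart, sleep, repeat."""
    replica.stop_slave()                        # t3: STOP SLAVE
    for _ in range(timeout // sleep_time):
        if replica.open_temp_tables == 0:
            return True                         # no temp tables: safe to back up
        replica.start_slave()                   # t5: START SLAVE
        replica.advance(sleep_time)             # t6: sleep a bit...
        replica.stop_slave()                    # ...then STOP SLAVE again
    return replica.open_temp_tables == 0


# t1 already happened (one temp table is open); the DML needs 30 s to finish.
replica = ToyReplica([("LONG_DML", 30), ("DROP_TMP", 1)], open_temp_tables=1)
ok = safe_slave_backup_current(replica)
print(ok, replica.position)  # False 0: every attempt killed the same DML
```

Every retry gets only 3 seconds of a 30-second statement, so the position never moves and all retries are used up on the same query.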
One proposed fix would be for safe-slave-backup to check the replication position each time it stops the slave and make sure the position has moved forward since the previous stop. That way we at least won't loop over the exact same query every time, exhausting all retries.
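One way to sketch the position check, again with a hypothetical toy model rather than XtraBackup's real code: `ToyReplica`, `safe_slave_backup_fixed` and all durations are made up, and the 300-second limit is only assumed to play the role of a timeout such as --safe-slave-backup-timeout. Before re-issuing STOP SLAVE, the loop compares the current position with the one recorded at the last stop; if it has not moved, the SQL thread is left running for another interval.

```python
class ToyReplica:
    """Hypothetical toy model of a replica SQL thread (not MySQL/XtraBackup code)."""

    def __init__(self, events, open_temp_tables):
        self.events = events          # list of (statement_name, seconds_to_apply)
        self.position = 0             # index of the next event to apply
        self.progress = 0             # seconds already spent on events[position]
        self.open_temp_tables = open_temp_tables  # plays the role of Slave_open_temp_tables
        self.running = True

    def start_slave(self):
        self.running = True

    def stop_slave(self):
        # The in-flight statement is rolled back; the position does not move.
        self.running = False
        self.progress = 0

    def advance(self, seconds):
        """Let the SQL thread apply events for `seconds` of wall-clock time."""
        while self.running and seconds > 0 and self.position < len(self.events):
            name, duration = self.events[self.position]
            remaining = duration - self.progress
            if seconds < remaining:
                self.progress += seconds
                return
            seconds -= remaining
            self.progress = 0
            self.position += 1
            if name == "CREATE_TMP":
                self.open_temp_tables += 1
            elif name == "DROP_TMP":
                self.open_temp_tables -= 1


def safe_slave_backup_fixed(replica, sleep_time=3, timeout=300):
    """Only re-issue STOP SLAVE once the position has moved since the last stop."""
    replica.stop_slave()
    last_stop_position = replica.position
    elapsed = 0
    while replica.open_temp_tables > 0 and elapsed < timeout:
        replica.start_slave()                   # no-op if still running
        replica.advance(sleep_time)
        elapsed += sleep_time
        if replica.position > last_stop_position:
            # At least one statement was applied: stop again and re-check.
            replica.stop_slave()
            last_stop_position = replica.position
        # Otherwise leave the SQL thread running for another sleep_time so the
        # in-flight statement can finish instead of being rolled back again.
    replica.stop_slave()                        # always end with the SQL thread stopped
    return replica.open_temp_tables == 0


replica = ToyReplica([("LONG_DML", 30), ("DROP_TMP", 1)], open_temp_tables=1)
ok = safe_slave_backup_fixed(replica)
print(ok, replica.position)  # True 2: the DML and the DROP both got applied
```

The timeout still bounds the total wait, so a statement that never finishes makes the backup fail instead of hanging.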
That makes sense to me. Before executing STOP SLAVE, we will check whether the binlog position has changed, and if it hasn't, we will wait 3 more seconds. However, if the master is running
CREATE TEMPORARY TABLE tmp_table; INSERT INTO ... SELECT ...; DROP TEMPORARY TABLE tmp_table
in a loop, this will not help much: the chances of catching the moment right after tmp_table is dropped are small.
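The limitation can be illustrated with the same kind of toy model. Everything here is hypothetical: `ToyReplica`, `safe_slave_backup_fixed`, the statement names and the durations (a 1-second CREATE, a 10-second INSERT ... SELECT and a 1-second DROP, repeated) are made up. The position moves forward before every stop, so the position check is satisfied, yet a temporary table is open when each check happens.

```python
class ToyReplica:
    """Hypothetical toy model of a replica SQL thread (not MySQL/XtraBackup code)."""

    def __init__(self, events, open_temp_tables):
        self.events = events          # list of (statement_name, seconds_to_apply)
        self.position = 0             # index of the next event to apply
        self.progress = 0             # seconds already spent on events[position]
        self.open_temp_tables = open_temp_tables  # plays the role of Slave_open_temp_tables
        self.running = True

    def start_slave(self):
        self.running = True

    def stop_slave(self):
        # The in-flight statement is rolled back; the position does not move.
        self.running = False
        self.progress = 0

    def advance(self, seconds):
        """Let the SQL thread apply events for `seconds` of wall-clock time."""
        while self.running and seconds > 0 and self.position < len(self.events):
            name, duration = self.events[self.position]
            remaining = duration - self.progress
            if seconds < remaining:
                self.progress += seconds
                return
            seconds -= remaining
            self.progress = 0
            self.position += 1
            if name == "CREATE_TMP":
                self.open_temp_tables += 1
            elif name == "DROP_TMP":
                self.open_temp_tables -= 1


def safe_slave_backup_fixed(replica, sleep_time=3, timeout=300):
    """Only re-issue STOP SLAVE once the position has moved since the last stop."""
    replica.stop_slave()
    last_stop_position = replica.position
    elapsed = 0
    while replica.open_temp_tables > 0 and elapsed < timeout:
        replica.start_slave()
        replica.advance(sleep_time)
        elapsed += sleep_time
        if replica.position > last_stop_position:
            replica.stop_slave()
            last_stop_position = replica.position
    replica.stop_slave()
    return replica.open_temp_tables == 0


# The first CREATE has already run; then CREATE / INSERT ... SELECT / DROP loops.
loop = [("CREATE_TMP", 1), ("INSERT_SELECT", 10), ("DROP_TMP", 1)]
replica = ToyReplica([("INSERT_SELECT", 10), ("DROP_TMP", 1)] + loop * 100,
                     open_temp_tables=1)
ok = safe_slave_backup_fixed(replica)
print(ok, replica.open_temp_tables)  # False 1: a temp table was open at every check
```

With these particular durations, the 12-second cycle lines up with the 3-second sleep so that every stop lands just after the next CREATE. Other timings may occasionally hit the short gap after DROP, which is why the chances are small rather than zero.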