Safe-slave-backup gets stuck with a long query

Bug #1717158 reported by markus_albe
This bug affects 1 person
Affects: Percona XtraBackup (moved to https://jira.percona.com/projects/PXB), status tracked in 2.4

Series   Status         Importance   Assigned to
2.3      Fix Released   Wishlist     Vasily Nemkov
2.4      Fix Released   Wishlist     Vasily Nemkov

Bug Description

When using --safe-slave-backup the following can happen:

t1: temporary table is created
t2: long DML starts
t3: safe-slave-backup STOP SLAVE arrives
t4: slave SQL thread is killed and the long DML is rolled back.
t5: safe-slave-backup START SLAVE arrives, and replication restarts from the beginning of the long DML.
t6: safe-slave-backup sleeps a bit, but far less than the long DML takes, so it issues STOP SLAVE again while the long DML is still running, repeating the above ad infinitum.

Proposed fix would be for safe-slave-backup to check the replication position when it stops the slave, and make sure the position has moved forward since the last time it stopped. This way at least we won't loop over the exact same query every time, depleting all retries.
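
To make the proposal concrete, here is a minimal Python sketch of the retry loop with a position check. It is not the XtraBackup implementation: the `Replica` interface, `safe_stop`, and the `max_stalled_waits` bound are hypothetical names introduced for illustration, and `position()` is assumed to return the executed replication position (something like Exec_Master_Log_Pos).

```python
import time
from typing import Callable, Optional, Protocol


class Replica(Protocol):
    """Hypothetical interface; a real tool would issue SQL instead."""

    def position(self) -> int: ...          # executed replication position
    def open_temp_tables(self) -> int: ...  # open temporary tables on the SQL thread
    def stop_sql_thread(self) -> None: ...
    def start_sql_thread(self) -> None: ...


def safe_stop(
    replica: Replica,
    retries: int = 3,
    interval: float = 3.0,
    max_stalled_waits: int = 100,  # assumption: bound so the sketch cannot wait forever
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Stop the SQL thread at a point with no open temporary tables.

    Returns True with the SQL thread stopped on success, False otherwise.
    """
    last_stop_pos: Optional[int] = None
    stalled_waits = 0
    while retries > 0:
        # If the position has not moved since the last stop, the same
        # statement is still running: wait instead of killing it again.
        if last_stop_pos is not None and replica.position() == last_stop_pos:
            if stalled_waits >= max_stalled_waits:
                return False
            stalled_waits += 1
            sleep(interval)
            continue

        replica.stop_sql_thread()
        if replica.open_temp_tables() == 0:
            return True

        # Not safe yet: remember where we stopped, resume, spend one retry.
        last_stop_pos = replica.position()
        stalled_waits = 0
        replica.start_sql_thread()
        retries -= 1
        sleep(interval)
    return False
```

Waiting does not consume a retry in this sketch, so a single long DML costs at most one retry instead of all of them.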

Tags: i204247
Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

It makes sense to me. Before executing STOP SLAVE we will check whether the binlog position has changed, and if it hasn't, we will wait for 3 more seconds. However, if the master is running

CREATE tmp_table; INSERT INTO SELECT; DROP tmp_table

in a loop, it will not help much.

The chances of catching the moment right after tmp_table is dropped are small.

Revision history for this message
markus_albe (markus-albe) wrote :

Yeah, the loop with CREATE TEMPORARY...; INSERT...SELECT; DROP TEMPORARY was an example to show the effect; I reckon it would be impossible to make --safe-slave-backup work in such a case. I guess the best alternative (better than waiting for the position to advance) is to inspect the binlog, find the DROP TEMPORARY, stop immediately after it and check... This would be the ultimate fix, as it only makes sense to check open_slave_temp_tables again after a DROP TEMPORARY.
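
A rough Python sketch of the "find the DROP TEMPORARY" step, assuming the binlog has already been decoded into `(end_position, statement)` pairs by some other means. The function name and the input shape are hypothetical, and the pattern only covers statement-based events in the two spellings shown in the comment.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches "DROP TEMPORARY TABLE ..." and the version-commented spelling
# "DROP /*!40005 TEMPORARY */ TABLE ..." (assumed forms; others are not covered).
DROP_TEMP_RE = re.compile(
    r"^\s*DROP\s+(?:/\*!\d+\s+)?TEMPORARY\s*(?:\*/)?\s*TABLE\b",
    re.IGNORECASE,
)


def next_drop_temporary_end(
    events: Iterable[Tuple[int, str]], after_pos: int
) -> Optional[int]:
    """Return the end position of the first DROP TEMPORARY past after_pos.

    That position is the earliest point where re-checking the open
    temporary table count can give a different answer.
    """
    for end_pos, statement in events:
        if end_pos > after_pos and DROP_TEMP_RE.match(statement):
            return end_pos
    return None
```

The returned position could then serve as the point at which to stop the SQL thread and check the temporary table count again.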

Revision history for this message
Vasily Nemkov (vasily.nemkov) wrote :

PR merged to 2.3

Revision history for this message
Vasily Nemkov (vasily.nemkov) wrote :

PR merged to 2.4

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXB-1039
