FTWRL should only run when safe to do so
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Percona XtraBackup moved to https://jira.percona.com/projects/PXB | Invalid | Undecided | Unassigned | | |
Bug Description
FLUSH TABLES WITH READ LOCK (FTWRL) can be issued even though a query may have been running for hours. In that case everything locks up in the "Waiting for table flush" or "Waiting for master to send event" states. Killing the FTWRL does not correct the issue either. The only way to get the server operating normally again is to kill the long-running selects that blocked it to begin with.
With the above in mind we should make FTWRL safer to prevent production downtime. To do this I suggest the following:
Add a configurable flush-time (default 1800 seconds?) that innobackupex will wait before issuing FTWRL. During this time innobackupex waits for running queries to finish: it polls the process list and, once there are no actively running queries, issues the FTWRL. If the --rsync option is set, it should still run rsync prior to the FTWRL.
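A minimal sketch of that waiting phase, in Python for brevity. This is not innobackupex code. The rows are assumed to be dictionaries keyed by the SHOW PROCESSLIST column names (`Id`, `Command`, and so on), and `fetch_rows`, `own_id`, `poll_interval`, `clock` and `sleep` are hypothetical names introduced only for illustration. The proposal does not say what should happen when flush-time expires, so the sketch just reports the timeout to the caller.

```python
import time


def active_queries(rows, own_id):
    """Return processlist rows that are executing a statement right now.

    The polling connection itself shows up as a running query
    (its own SHOW PROCESSLIST), so it is excluded by Id.
    """
    return [
        r for r in rows
        if r["Id"] != own_id and r["Command"] == "Query"
    ]


def wait_until_idle(fetch_rows, own_id, flush_time=1800, poll_interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until no query is running or flush_time seconds have passed.

    fetch_rows is a hypothetical callable returning the current
    processlist as a list of dicts. Returns True if an idle moment was
    observed, False if flush_time ran out first.
    """
    deadline = clock() + flush_time
    while True:
        if not active_queries(fetch_rows(), own_id):
            return True   # caller would issue FTWRL now
        if clock() >= deadline:
            return False  # caller decides: abort, or lock anyway
        sleep(poll_interval)
```

Even when this returns True, a new query can start between the last poll and the FTWRL itself, so the check narrows the window without closing it.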
Once FTWRL has been issued, innobackupex should start another process that checks that the FTWRL is not blocked by anything (for example, a query that started just as FTWRL was issued). If anything is blocking at this point, it should immediately kill that query so that FTWRL can finish successfully and the backup can complete. Logging what it killed would be nice.
The problem with using a flush-time is that there is no way to
prevent newer queries from starting (unless there is a write lock
on all tables or something like that). So, it is possible that
innobackupex will wait forever. It is also not possible to do "It
will poll the process list and once there is no actively running
queries it will issue the FTWRL.", since another query may have
sneaked in during the time between polls.
FTWRL ensures a barrier of sorts: if FTWRL is waiting in 'Waiting
for table flush', all the queries issued after it queue up in
FIFO order and complete only after FTWRL does (and then only read
queries complete; the writes/updates still wait).
What can be done (as the last paragraph of the description suggests) is to run FTWRL and, if it waits too long for the table flush (due to a bug or bad queries), kill the blocking queries after a configurable timeout. This can be unsafe too. Note, however, that it is not subject to the race condition described earlier, since MySQL ensures that queries issued after FTWRL queue up behind it.
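A rough sketch of how the blocking queries might be selected once the timeout expires, in Python and under several assumptions. Rows are dictionaries keyed by the SHOW PROCESSLIST column names. `ftwrl_id` and `kill_timeout` are hypothetical names. A blocker is assumed to be any running statement that is at least as old as the FTWRL and is not itself in a "Waiting for ..." state. This heuristic is not taken from XtraBackup and may select too much or too little on a real server.

```python
def blockers_to_kill(rows, ftwrl_id, kill_timeout=60):
    """Pick the queries to kill once FTWRL has waited too long.

    Returns an empty list while FTWRL is not stuck in
    'Waiting for table flush' or has not yet waited kill_timeout seconds.
    """
    ftwrl = next((r for r in rows if r["Id"] == ftwrl_id), None)
    if ftwrl is None or ftwrl["State"] != "Waiting for table flush":
        return []
    if ftwrl["Time"] < kill_timeout:
        return []
    return [
        r for r in rows
        if r["Id"] != ftwrl_id
        and r["Command"] == "Query"
        # Started before (or together with) the FTWRL.
        and r["Time"] >= ftwrl["Time"]
        # Queries queued behind FTWRL are waiting themselves; skip them.
        and not (r["State"] or "").startswith("Waiting for")
    ]


def kill_statements(blockers):
    """Build the KILL QUERY statements, which also serve as a log of what was killed."""
    return [f"KILL QUERY {r['Id']}" for r in blockers]
```

Because queries issued after FTWRL queue up behind it, the set of blockers can only shrink while FTWRL waits, so one selection pass after the timeout should be enough.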