pt-table-checksum "Waiting for the --replicate table to replicate" forever

Bug #1144759 reported by Roman Vynar
This bug affects 1 person
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Triaged
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

PT 2.1.7

pt-table-checksum prints "Waiting for the --replicate table to replicate to some.host.com" messages forever, with no timeout and no maximum number of check attempts.

Say I have the following in processlist:
814095 repl db1:36772 NULL Binlog Dump 883714 Has sent all binlog to slave; waiting for binlog to be updated NULL
820633 repl db2:36321 NULL Binlog Dump 875287 Has sent all binlog to slave; waiting for binlog to be updated NULL
823574 repl db4:35662 NULL Binlog Dump 871764 Has sent all binlog to slave; waiting for binlog to be updated NULL
994242 user1 db4:40468 NULL Binlog Dump 434446 Has sent all binlog to slave; waiting for binlog to be updated NULL

The first three are slaves, but the fourth process is a binlog puller that backs up binlogs.

In my case checksumming never runs because the tool waits forever for db4, which is not actually a slave. The tool should time out or have a maximum number of attempts, as it does for connecting.
"Waiting for the --replicate table to replicate to some.host.com" was printed for more than 24 hours, and I had to kill the tool.

I would like to avoid using a DSN table, because the script should have some threshold on an operation it cannot complete.

summary: - pt-table-checksum "Waiting for the --replicate table to replicate to
- some.host.com" forever
+ pt-table-checksum "Waiting for the --replicate table to replicate"
+ forever
tags: added: pt-table-checksum
Daniel Nichter (daniel-nichter) wrote :

I'll have to think about this one. In general, the DSN table is what one would use to prevent this. Or, in "normal" cases where all slaves are legit and one is running the tool regularly with --resume, then --run-time makes the tool time out (for example, if some slave crashed during the night).

Picking timeouts is difficult. Someone will say 10 minutes is long enough, then someone else will say it needs to be 1 hour, then someone will say 1 hour is too long, etc. In essence, all such values are arbitrary.

If the tool timed out and stopped checking a slave, then the results aren't really trustworthy, and --resume won't help either because the tool finished on the other slaves. So in this case, one might say that's a bad thing: better to do nothing or have no results than to do something that's useless and have to do it again.

I suppose if we introduced a timeout or threshold that was off by default, then the user could choose their own value.
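
To make the idea of an opt-in threshold concrete, here is a minimal sketch in Python. It is not pt-table-checksum's code (the tool is written in Perl), and the function name, parameters, and the `table_exists_on_replica` callback are all hypothetical. With `timeout_seconds=None` it waits forever, which matches the behaviour reported here; with a value set, it gives up once the deadline passes.

```python
import time
from typing import Callable, Optional


def wait_for_replicate_table(
    table_exists_on_replica: Callable[[], bool],  # hypothetical check, supplied by the caller
    timeout_seconds: Optional[float] = None,      # None = off by default, wait forever
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the check succeeds; return False if the optional timeout expires."""
    deadline = None if timeout_seconds is None else clock() + timeout_seconds
    while True:
        if table_exists_on_replica():
            return True
        # Only give up when the user explicitly asked for a threshold.
        if deadline is not None and clock() >= deadline:
            return False
        sleep(poll_interval)
```

A caller that gets `False` back still has to decide what to do with the partial results, which is the trustworthiness problem described above.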

In this particular case there's the issue of how to specify a timeout only for the 4th slave. I don't see a clear or easy way to do that.

It seems to me that the DSN table is really the best choice. :-)
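
For anyone going the DSN table route, here is a rough Python sketch of how the list of DSNs might be derived from processlist rows like the ones in the report. The `ProcessRow` type and `replica_dsns` function are hypothetical helpers, not part of Percona Toolkit, and the sketch assumes that real slaves connect with a dedicated replication account (`repl` in the listing) while other binlog readers use a different one.

```python
from typing import Iterable, List, NamedTuple, Set


class ProcessRow(NamedTuple):
    user: str
    host: str      # "hostname:port" as shown in SHOW PROCESSLIST
    command: str


def replica_dsns(rows: Iterable[ProcessRow], replication_users: Set[str]) -> List[str]:
    """Return one "h=<hostname>" DSN per host with a Binlog Dump thread owned by a replication user."""
    seen = set()
    dsns = []
    for row in rows:
        # Skip binlog readers that are not real slaves, e.g. a binlog backup job.
        if row.command != "Binlog Dump" or row.user not in replication_users:
            continue
        # The port in the processlist is the client's source port, not the
        # slave's MySQL port, so only the hostname is kept.
        hostname = row.host.rsplit(":", 1)[0]
        if hostname not in seen:
            seen.add(hostname)
            dsns.append(f"h={hostname}")
    return dsns
```

Each resulting string would go into the `dsn` column of the DSN table that --recursion-method dsn=... points at. The filter is on the account name rather than the host, because in the listing above the binlog puller connects as user1 while the slaves connect as repl.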

Changed in percona-toolkit:
status: New → Triaged
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1083
