pt-table-checksum "Waiting for the --replicate table to replicate" forever
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona Toolkit moved to https://jira.percona.com/projects/PT | Triaged | Undecided | Unassigned |
Bug Description
PT 2.1.7
pt-table-checksum prints "Waiting for the --replicate table to replicate to some.host.com" messages forever, with no timeout and no maximum number of check attempts.
Say I have the following in processlist:
814095 repl db1:36772 NULL Binlog Dump 883714 Has sent all binlog to slave; waiting for binlog to be updated NULL
820633 repl db2:36321 NULL Binlog Dump 875287 Has sent all binlog to slave; waiting for binlog to be updated NULL
823574 repl db4:35662 NULL Binlog Dump 871764 Has sent all binlog to slave; waiting for binlog to be updated NULL
994242 user1 db4:40468 NULL Binlog Dump 434446 Has sent all binlog to slave; waiting for binlog to be updated NULL
The first three are slaves, but the fourth process is a binlog puller that backs up binlogs.
In my case checksumming never runs because the tool waits forever for db4 (user1), which is not actually a slave. The tool should time out or have a maximum number of attempts, as it does for connecting.
"Waiting for the --replicate table to replicate to some.host.com" was printed for more than 24 hours, and I had to kill the tool.
I would like to avoid using a DSN table, because the script should have some threshold for an operation it cannot complete.
summary:
- pt-table-checksum "Waiting for the --replicate table to replicate to some.host.com" forever
+ pt-table-checksum "Waiting for the --replicate table to replicate" forever
tags: added: pt-table-checksum
I'll have to think about this one. In general, the DSN table is what one would use to prevent this. Alternatively, in "normal" cases where all slaves are legitimate and the tool is run regularly with --resume, --run-time makes the tool time out (for example, if some slave crashed during the night).
Picking timeouts is difficult. Someone will say 10 minutes is long enough, then someone else will say it needs to be 1 hour, then someone will say 1 hour is too long, etc. In essence, all such values are arbitrary.
If the tool timed out and stopped checking a slave, the results would not really be trustworthy, and --resume would not help either, because the tool would already have finished on the other slaves. So in this case, one might say that's a bad thing: better to do nothing or have no results than to do something useless and have to do it again.
I suppose if we introduced a timeout or threshold that was off by default, then the user could choose their own value.
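To make the idea concrete, here is a minimal sketch of a wait loop with an opt-in timeout. It is written in Python purely for illustration and is not how pt-table-checksum is implemented; the function name, the `max_wait_seconds` parameter, and the `table_exists_on_slave` callback are all hypothetical.

```python
import time
from typing import Callable, Optional


def wait_for_replicate_table(
    table_exists_on_slave: Callable[[], bool],
    max_wait_seconds: Optional[float] = None,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the check succeeds; give up after max_wait_seconds if set.

    max_wait_seconds=None (the default) waits forever, which matches the
    behaviour described in this report. Returns True if the table showed
    up on the slave, False if the wait timed out.
    """
    start = clock()
    while not table_exists_on_slave():
        # Only enforce a limit when the user explicitly asked for one.
        if max_wait_seconds is not None and clock() - start >= max_wait_seconds:
            return False
        sleep(poll_interval)
    return True
```

With the default of `None` nothing changes for existing users; a caller who passes, say, `max_wait_seconds=600` gets `False` back after ten minutes and can decide whether to skip the host or abort the run.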
In this particular case there's also the issue of how to specify that the timeout should apply only to the fourth "slave". I don't see a clear or easy way to do that.
It seems to me that the DSN table is really the best choice. :-)
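As a rough illustration of why the DSN table sidesteps the problem, the sketch below parses the processlist from the report, shows that all four connections look alike as "Binlog Dump" threads, and then keeps only the hosts connected as the replication user. The helper names are hypothetical, and filtering on the user `repl` is an assumption based on the processlist above, not something the tool does.

```python
from typing import List, NamedTuple


class BinlogDumpThread(NamedTuple):
    thread_id: int
    user: str
    host: str


# Processlist rows copied from the bug description.
PROCESSLIST = """\
814095 repl db1:36772 NULL Binlog Dump 883714 Has sent all binlog to slave; waiting for binlog to be updated NULL
820633 repl db2:36321 NULL Binlog Dump 875287 Has sent all binlog to slave; waiting for binlog to be updated NULL
823574 repl db4:35662 NULL Binlog Dump 871764 Has sent all binlog to slave; waiting for binlog to be updated NULL
994242 user1 db4:40468 NULL Binlog Dump 434446 Has sent all binlog to slave; waiting for binlog to be updated NULL
"""


def binlog_dump_threads(processlist: str) -> List[BinlogDumpThread]:
    """Return every connection whose command is 'Binlog Dump'."""
    threads = []
    for line in processlist.splitlines():
        if "Binlog Dump" not in line:
            continue
        fields = line.split()
        # fields: id, user, host:port, ...; the port is the client-side port.
        host = fields[2].rsplit(":", 1)[0]
        threads.append(BinlogDumpThread(int(fields[0]), fields[1], host))
    return threads


def slave_dsns(threads: List[BinlogDumpThread], replication_user: str = "repl") -> List[str]:
    """Build DSN strings for hosts connected as the replication user.

    Assumption: real slaves connect as `replication_user`; the binlog
    puller in the report connects as a different user (user1).
    """
    hosts: List[str] = []
    for thread in threads:
        if thread.user == replication_user and thread.host not in hosts:
            hosts.append(thread.host)
    return [f"h={host}" for host in hosts]


threads = binlog_dump_threads(PROCESSLIST)
print(len(threads))         # 4: the binlog puller is indistinguishable by command
print(slave_dsns(threads))  # ['h=db1', 'h=db2', 'h=db4']
```

Strings like these would go into the `dsn` column of the DSN table that the tool is pointed at with `--recursion-method dsn=...`, so only the listed hosts are checked and the binlog puller's connection is never waited on.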