pt-table-checksum doesn't honor --run-time while checking replication lag
Bug #1043438 reported by Baron Schwartz
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Percona Toolkit moved to https://jira.percona.com/projects/PT | Fix Released | High | Daniel Nichter | |
Bug Description
I ran pt-table-checksum against a server with badly lagging replication, using --run-time=6h so that it would start at 2am and end at 8am. Much later, well past the 8am cutoff, I shut down and restarted the replica, and the tool, which was still running, reported:
08-29T11:11:16 Fatal error checksumming table <....>: Lost connection to replica <....> while attempting to get its lag
Related branches
lp:~percona-toolkit-dev/percona-toolkit/fix-run-time-bug-1043438
- Daniel Nichter: Approve
- Diff: 67 lines (+19/-17), 1 file modified: bin/pt-table-checksum (+19/-17)
tags: | added: wrong-behavior |
Changed in percona-toolkit: | |
milestone: | none → 2.1.5 |
status: | New → Confirmed |
Changed in percona-toolkit: | |
importance: | Undecided → High |
assignee: | nobody → Daniel Nichter (daniel-nichter) |
Changed in percona-toolkit: | |
status: | Confirmed → In Progress |
tags: | added: run-time removed: wrong-behavior |
Changed in percona-toolkit: | |
status: | Fix Committed → Fix Released |
This is becoming a problem because I'm ending up with dozens of pt-table-checksum instances running for many days. If the replica ever does catch up, they will all dive-bomb the server at the same time and probably interact in undesirable ways. I am making the following change in my local copy of the tool:
@@ -7048,6 +7048,7 @@
 my $master_cxn = $make_cxn->(set_vars => 1, dsn_string => shift @ARGV);
 my $master_dbh = $master_cxn->dbh(); # just for brevity
 my $master_dsn = $master_cxn->dsn(); # just for brevity
+my $have_time;
 # ########################################################################
 # If this is not a dry run (--explain was not specified), then we're
@@ -7231,7 +7232,7 @@
 $replica_lag = new ReplicaLagWaiter(
    slaves => $slave_lag_cxns,
    max_lag => $o->get('max-lag'),
-   oktorun => sub { return $oktorun },
+   oktorun => sub { return $oktorun && $have_time->() },
    get_lag => $get_lag,
    sleep => $sleep,
 );
@@ -7334,7 +7335,6 @@
 # ########################################################################
 # Set up the run time, if any.
 # ########################################################################
-my $have_time;
 if ( my $run_time = $o->get('run-time') ) {
    my $end = time() + $o->get('run-time');
    $have_time = sub { return time() < $end };
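
The idea behind the patch above can be tried in isolation. The following is a minimal sketch, not the tool's actual code: wait_for_lag and make_have_time are hypothetical names, the loop is only an assumed stand-in for whatever ReplicaLagWaiter does internally, and the clock is faked so the example finishes immediately. It also returns an always-true check when no run time is given, because, as far as the quoted hunks show, $have_time is only assigned inside the if block and calling $have_time->() without --run-time would otherwise fail.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for a lag-wait loop: poll get_lag until the lag is at
# or below max_lag, or until the oktorun callback says to stop.
sub wait_for_lag {
    my (%args) = @_;
    my ($get_lag, $oktorun, $sleep, $max_lag)
        = @args{qw(get_lag oktorun sleep max_lag)};
    my $polls = 0;
    while ( $oktorun->() ) {
        $polls++;
        return { caught_up => 1, polls => $polls } if $get_lag->() <= $max_lag;
        $sleep->();
    }
    return { caught_up => 0, polls => $polls };
}

# Build the deadline check. With no run time, always report that time remains,
# so callers can invoke the returned sub unconditionally.
sub make_have_time {
    my ($run_time, $now) = @_;
    $now ||= sub { time() };
    return sub { 1 } unless $run_time;
    my $end = $now->() + $run_time;
    return sub { $now->() < $end };
}

# Demo: a fake clock that advances one "second" per sleep, a run time of
# 3 seconds, and a replica that never catches up.
my $clock     = 0;
my $now       = sub { $clock };
my $oktorun   = 1;
my $have_time = make_have_time(3, $now);

my $result = wait_for_lag(
    get_lag => sub { 9999 },
    max_lag => 1,
    sleep   => sub { $clock++ },
    oktorun => sub { return $oktorun && $have_time->() },
);

printf "caught_up=%d polls=%d\n", $result->{caught_up}, $result->{polls};
```

With the original callback (sub { return $oktorun }) this loop would never exit while the replica stays behind, which matches the behavior described in this report.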