pt-table-checksum doesn't honor --run-time while checking replication lag

Bug #1043438 reported by Baron Schwartz
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Fix Released
Importance: High
Assigned to: Daniel Nichter
Milestone: 2.1.5

Bug Description

I ran pt-table-checksum against a server whose replication was lagging badly, with --run-time=6h so that a run started at 2am would end at 8am. The tool did not stop at the deadline while it was waiting on replica lag. Much later, when I shut down and restarted the replica, I got:

08-29T11:11:16 Fatal error checksumming table <....>: Lost connection to replica <....> while attempting to get its lag

Baron Schwartz (baron-xaprb) wrote :

This is becoming a problem because I'm ending up with dozens of pt-table-checksum instances running for many days. If the replica ever does catch up, they will all dive-bomb the server at the same time and probably interact in undesirable ways. I am making the following change in my local copy of the tool:

@@ -7048,6 +7048,7 @@
    my $master_cxn = $make_cxn->(set_vars => 1, dsn_string => shift @ARGV);
    my $master_dbh = $master_cxn->dbh(); # just for brevity
    my $master_dsn = $master_cxn->dsn(); # just for brevity
+   my $have_time;

    # ########################################################################
    # If this is not a dry run (--explain was not specified), then we're
@@ -7231,7 +7232,7 @@
       $replica_lag = new ReplicaLagWaiter(
          slaves => $slave_lag_cxns,
          max_lag => $o->get('max-lag'),
-         oktorun => sub { return $oktorun },
+         oktorun => sub { return $oktorun && $have_time->() },
          get_lag => $get_lag,
          sleep => $sleep,
       );
@@ -7334,7 +7335,6 @@
    # ########################################################################
    # Set up the run time, if any.
    # ########################################################################
-   my $have_time;
    if ( my $run_time = $o->get('run-time') ) {
       my $end = time() + $o->get('run-time');
       $have_time = sub { return time() < $end };
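
The idea behind the patch can be sketched in isolation. The snippet below is only an illustration under stated assumptions, not the tool's code: wait_for_lag is a hypothetical stand-in for the lag waiter, the clock is a fake counter so the example runs instantly, and the "!$have_time ||" guard is an addition for the case where no run time is set. It shows why $have_time has to be declared before the callback that captures it, and how the wait loop then ends at the deadline even though the replica never catches up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lag waiter: poll get_lag until the lag
# drops to max_lag or the oktorun callback returns false.
sub wait_for_lag {
    my (%args) = @_;
    my ($get_lag, $oktorun, $sleep, $max_lag)
        = @args{qw(get_lag oktorun sleep max_lag)};
    my $polls = 0;
    while ( $oktorun->() ) {
        $polls++;
        return { caught_up => 1, polls => $polls }
            if $get_lag->() <= $max_lag;
        $sleep->();
    }
    return { caught_up => 0, polls => $polls };
}

# Declared up front so the callback below captures this variable.
# The closure sees whatever is assigned to it later.
my $have_time;
my $oktorun = 1;
my $clock   = 0;    # fake clock in seconds (assumption, replaces time())

# Guard added for this sketch: treat "no run time set" as unlimited.
my $oktorun_cb = sub {
    return $oktorun && ( !$have_time || $have_time->() );
};

# Set up the run time after the callback already exists, as in the tool.
my $run_time = 5;
my $end      = $clock + $run_time;
$have_time = sub { return $clock < $end };

my $result = wait_for_lag(
    get_lag => sub { return 3600 },    # replica never catches up
    max_lag => 1,
    oktorun => $oktorun_cb,
    sleep   => sub { $clock++ },       # each "sleep" advances the fake clock
);

printf "caught_up=%d polls=%d\n", $result->{caught_up}, $result->{polls};
```

With the original callback (sub { return $oktorun }) the loop above would never exit, which matches the behavior reported here.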

tags: added: wrong-behavior
Changed in percona-toolkit:
milestone: none → 2.1.5
status: New → Confirmed
Changed in percona-toolkit:
importance: Undecided → High
assignee: nobody → Daniel Nichter (daniel-nichter)
Changed in percona-toolkit:
status: Confirmed → In Progress
tags: added: run-time
removed: wrong-behavior
Daniel Nichter (daniel-nichter) wrote :

Baron, the attached branch is pretty much the same as your change. I just also did the same fix for --max-load. Want to try it on your end since testing this kind of thing is tricky?

Changed in percona-toolkit:
status: In Progress → Fix Committed
Daniel Nichter (daniel-nichter) wrote :

Baron has left the building, so I've just gone ahead and merged this.

Brian Fraser (fraserbn)
Changed in percona-toolkit:
status: Fix Committed → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PT-330
