--max-lag doesn't get defaulted

Bug #1115556 reported by Brandon Johnson on 2013-02-04
This bug affects 2 people
Percona Toolkit moved to https://jira.percona.com/projects/PT

Bug Description

The documentation says --max-lag defaults to 1, so pt-table-checksum should not let a slave lag by more than 1 second unless the option is set otherwise.

This doesn't appear to be true. Searching through the code (pt-table-checksum 2.1.8), I can't find anywhere that the max-lag value is set to 1.

[root@me]# grep -e "max_lag" -e "max-lag" -in /usr/bin/pt-table-checksum
7714: my @required_args = qw(oktorun get_lag sleep max_lag slaves);
7738: my $max_lag = $self->{max_lag};
7774: if ( !defined $lag || $lag > $max_lag ) {
9016: max_lag => $o->get('max-lag'),
10580: # for all slaves to catchup at least until --max-lag.
10581: $sleep_time += 0.25 if $sleep_time <= $o->get('max-lag');
11030:Sleep time between checks for L<"--max-lag">.
11095:Pause checksumming until this replica's lag is less than L<"--max-lag">. The
11385:=item --max-lag

I'm reporting this because I've had several instances lately where I sat and monitored replication during a checksum run. Every time, the checksum is the only thing running, yet replication delays by several hundred or even thousands of seconds during the checksum event, but catches right back up afterwards.

It appears that --max-lag is not applied by default, but is respected if set explicitly.

Brian Fraser (fraserbn) wrote :

I think that this is just a case of our documentation-as-code being confusing to people looking at the source.

=item --max-lag

type: time; default: 1s; group: Throttle

The default: 1s in the docs is parsed and that value is used. An easy enough way to check that the defaults are all working is to grep the output of --help:

$ pt-table-checksum --help | grep ' --max-lag'
  --max-lag=m Pause checksumming until all replicas' lag
  --max-lag 1
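
To illustrate why grepping the source for an assignment finds nothing, here is a minimal Python sketch of how a default could be read out of a spec line like the one in the POD. This is not the tool's actual (Perl) option parser; the function names and the unit table are hypothetical, and only the spec line itself comes from the docs quoted above.

```python
import re

# Spec line copied from the POD quoted above; everything else is illustrative.
SPEC_LINE = "type: time; default: 1s; group: Throttle"

# Assumed suffix table for "time" values (seconds, minutes, hours, days).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_spec(line):
    """Split 'key: value; key: value' into a dict of attributes."""
    attrs = {}
    for part in line.split(";"):
        key, _, value = part.partition(":")
        if key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def time_to_seconds(value):
    """Convert a value such as '1s' or '5m' to seconds; bare numbers are seconds."""
    match = re.fullmatch(r"(\d+)([smhd]?)", value)
    if match is None:
        raise ValueError(f"not a time value: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_SECONDS[unit or "s"]


spec = parse_spec(SPEC_LINE)
default_max_lag = time_to_seconds(spec["default"])
print(spec["type"], default_max_lag)
```

The point of the sketch is that the default lives in the documentation text, so no line of code ever assigns 1 to max_lag directly.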

>replication delays by several hundred or even thousands of seconds during the checksum event, but catches right back up afterwards.

This is the expected behavior, according to the --max-lag docs:

Pause checksumming until all replicas' lag is less than this value. After each
checksum query (each chunk), pt-table-checksum looks at the replication lag of
all replicas to which it connects, using Seconds_Behind_Master. If any replica
is lagging more than the value of this option, then pt-table-checksum will sleep
for L<"--check-interval"> seconds, then check all replicas again. If you
specify L<"--check-slave-lag">, then the tool only examines that server for
lag, not all servers.
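
The check described in that passage can be sketched roughly as follows, in Python rather than the tool's Perl. The comparison mirrors the `!defined $lag || $lag > $max_lag` line from the grep output in the report; the function names, the `get_lag` callback, and the default values are assumptions made for illustration, not the tool's real interface.

```python
import time


def is_lagging(lag, max_lag):
    """A replica counts as lagging if its lag is unknown or above max_lag."""
    return lag is None or lag > max_lag


def wait_for_replicas(replicas, get_lag, max_lag=1, check_interval=1, sleep=time.sleep):
    """Sleep check_interval seconds at a time until no replica is lagging.

    Called once after each chunk; returns how many times it paused.
    """
    pauses = 0
    while any(is_lagging(get_lag(r), max_lag) for r in replicas):
        sleep(check_interval)
        pauses += 1
    return pauses


# Simulated Seconds_Behind_Master readings (made-up numbers, no real servers).
readings = {"replica1": iter([0, 0, 0]), "replica2": iter([5, 2, 1])}
pauses = wait_for_replicas(
    ["replica1", "replica2"],
    lambda name: next(readings[name]),
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(pauses)
```

Note that in this sketch the lag is only examined between chunks, as in the docs: the loop pauses further work, it does not interrupt a chunk that is already replicating.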

So I don't think that there's a bug here? Feel free to correct me, otherwise I'll close this in a week or so.

Changed in percona-toolkit:
assignee: nobody → Brian Fraser (fraserbn)
status: New → Incomplete

No, it is ignoring --max-lag unless set explicitly in version 2.1.8.

I tested with several backup servers (no load other than replication), and in each case 2.1.8 delayed replication significantly. In checking the binary logs, the only things occurring during these intervals were the checksums and a very few small write queries.

While I do see it set to 1 when I run --help | grep max-lag, I unfortunately still see significant replication delay caused specifically by pt-table-checksum.

The specific arguments we're using in our call to pt-table-checksum are:

--user (removed) --password (removed)
--replicate percona.checksums
(and, more recently, --no-check-binlog-format, because all of our servers use MIXED mode)

If you'd like, I can also see about grabbing base64-decoded binary logs for that interval, but it is indeed only the checksum queries causing the replication delay.

tags: added: pt-table-checksum
Changed in percona-toolkit:
assignee: Brian Fraser (fraserbn) → nobody
status: Incomplete → Triaged
milestone: none → 2.2.4
Changed in percona-toolkit:
importance: Undecided → Medium
Daniel Nichter (daniel-nichter) wrote :


Can you run it on your backup servers with PTDEBUG and either post the debug output here or email it to me? I've looked at the code and everything seems to be in order. The debug output will tell us what the tool is seeing and doing apropos slave lag.

Changed in percona-toolkit:
status: Triaged → In Progress
Changed in percona-toolkit:
assignee: nobody → Daniel Nichter (daniel-nichter)
Daniel Nichter (daniel-nichter) wrote :

Brandon, are you able to try that ^?

Changed in percona-toolkit:
milestone: 2.2.4 → none
assignee: Daniel Nichter (daniel-nichter) → nobody
importance: Medium → Undecided
status: In Progress → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for Percona Toolkit because there has been no activity for 60 days.]

Changed in percona-toolkit:
status: Incomplete → Expired

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1072
