pt-slave-delay incorrectly computes lag if started when slave is already lagging
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Percona Toolkit moved to https://jira.percona.com/projects/PT |
Fix Released
|
High
|
Brian Fraser |
Bug Description
If slave is already lagging when pt-slave-delay is started (for example, after restarting pt-slave-delay) , replication lag falls way behind the defined delay.
command line : /usr/bin/
pt-slave-delay version : 2.0.3
MySQL version : 5.1.61-rel13.2-log
How to reproduce:
a) run pt-slave-delay --delay 60m
b) let replication to delay 60 minutes
c) kill pt-slave-delay
d) start again pt-slave-delay --delay 60m
Replication will lag up to 120 minutes.
==
Example from log file:
2012-03-14T16:00:38 START SLAVE until master 2012-03-14T15:00:37 mysql-bin.
2012-03-14T16:00:53 START SLAVE until master 2012-03-14T15:00:52 mysql-bin.
2012-03-14T16:01:08 START SLAVE until master 2012-03-14T15:01:07 mysql-bin.
2012-03-14T16:01:23 START SLAVE until master 2012-03-14T15:01:22 mysql-bin.
2012-03-14T16:01:38 START SLAVE until master 2012-03-14T15:01:37 mysql-bin.
2012-03-14T16:01:53 START SLAVE until master 2012-03-14T15:01:52 mysql-bin.
2012-03-14T16:02:08 START SLAVE until master 2012-03-14T15:02:07 mysql-bin.
2012-03-14T16:02:08 Setting slave to run normally
Exiting on SIGTERM.
2012-03-14T16:02:24 slave running 3563 seconds behind
2012-03-14T16:02:24 STOP SLAVE until 2012-03-14T16:03:01 at master position mysql-bin.
2012-03-14T16:02:39 slave stopped at master position mysql-bin.
2012-03-14T16:02:54 slave stopped at master position mysql-bin.
2012-03-14T16:03:09 slave stopped at master position mysql-bin.
[[[ output omitted ]]]
2012-03-14T17:01:40 slave stopped at master position mysql-bin.
2012-03-14T17:01:55 slave stopped at master position mysql-bin.
2012-03-14T17:02:10 slave stopped at master position mysql-bin.
2012-03-14T17:02:25 START SLAVE until master 2012-03-14T16:02:24 mysql-bin.
2012-03-14T17:02:40 slave running 7008 seconds behind
2012-03-14T17:02:40 slave running 7008 seconds behind at master position mysql-bin.
2012-03-14T17:02:55 slave running 6853 seconds behind
2012-03-14T17:02:55 slave running 6853 seconds behind at master position mysql-bin.
===
I think the behavior is caused by the fact that in @positions are pushed $now and the IO_THREAD coordinates:
push @positions,
[ $now, $status-
The position to run START SLAVE SQL_THREAD is chosen in :
while ( $i < @positions
&& $positions[
$pos = $i;
$i++;
}
Since $now is pushed in @positions, the condition below will never be true until pt-slave-delay is running for $o->get('delay') seconds:
$positions[
===
Suggested fix:
Process the initial replication lag in the formula above.
Related branches
- Brian Fraser (community): Approve
-
Diff: 3027 lines (+1337/-1066)25 files modifiedbin/pt-archiver (+54/-44)
bin/pt-config-diff (+54/-44)
bin/pt-deadlock-logger (+54/-44)
bin/pt-duplicate-key-checker (+54/-44)
bin/pt-find (+54/-44)
bin/pt-fk-error-logger (+54/-44)
bin/pt-heartbeat (+54/-44)
bin/pt-index-usage (+54/-44)
bin/pt-kill (+54/-44)
bin/pt-log-player (+54/-44)
bin/pt-online-schema-change (+54/-44)
bin/pt-query-advisor (+54/-44)
bin/pt-query-digest (+54/-44)
bin/pt-show-grants (+54/-44)
bin/pt-slave-delay (+64/-47)
bin/pt-slave-find (+54/-44)
bin/pt-slave-restart (+54/-44)
bin/pt-table-checksum (+54/-44)
bin/pt-table-sync (+54/-44)
bin/pt-table-usage (+54/-44)
bin/pt-upgrade (+54/-44)
bin/pt-variable-advisor (+54/-44)
bin/pt-visual-explain (+54/-44)
lib/DSNParser.pm (+60/-50)
t/lib/DSNParser.t (+25/-1)
tags: | added: pt-slave-delay wrong-output |
Changed in percona-toolkit: | |
status: | New → Triaged |
tags: | added: percona-24284 |
Changed in percona-toolkit: | |
assignee: | nobody → Brian Fraser (fraserbn) |
importance: | Undecided → High |
status: | Triaged → In Progress |
Changed in percona-toolkit: | |
milestone: | none → 2.1.3 |
Changed in percona-toolkit: | |
status: | In Progress → Fix Committed |
Changed in percona-toolkit: | |
status: | Fix Committed → Fix Released |
Thank you for the bug report. We will need to investigate and see what is the best way to fix this.