pt-slave-delay incorrectly computes lag if started when slave is already lagging

Bug #962330 reported by Rene' Cannao'
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Fix Released
High
Brian Fraser

Bug Description

If slave is already lagging when pt-slave-delay is started (for example, after restarting pt-slave-delay) , replication lag falls way behind the defined delay.

command line : /usr/bin/pt-slave-delay --delay 60m --interval 15s --pid=/var/run/pt-slave-delay.pid --daemonize --log /var/log/delay.log localhost
pt-slave-delay version : 2.0.3
MySQL version : 5.1.61-rel13.2-log

How to reproduce:
a) run pt-slave-delay --delay 60m
b) let replication to delay 60 minutes
c) kill pt-slave-delay
d) start again pt-slave-delay --delay 60m

Replication will lag up to 120 minutes.

==

Example from log file:
2012-03-14T16:00:38 START SLAVE until master 2012-03-14T15:00:37 mysql-bin.000557/77122111
2012-03-14T16:00:53 START SLAVE until master 2012-03-14T15:00:52 mysql-bin.000557/79246412
2012-03-14T16:01:08 START SLAVE until master 2012-03-14T15:01:07 mysql-bin.000557/82404644
2012-03-14T16:01:23 START SLAVE until master 2012-03-14T15:01:22 mysql-bin.000557/84278071
2012-03-14T16:01:38 START SLAVE until master 2012-03-14T15:01:37 mysql-bin.000557/86866326
2012-03-14T16:01:53 START SLAVE until master 2012-03-14T15:01:52 mysql-bin.000557/88490971
2012-03-14T16:02:08 START SLAVE until master 2012-03-14T15:02:07 mysql-bin.000557/90049756
2012-03-14T16:02:08 Setting slave to run normally
Exiting on SIGTERM.
2012-03-14T16:02:24 slave running 3563 seconds behind
2012-03-14T16:02:24 STOP SLAVE until 2012-03-14T16:03:01 at master position mysql-bin.000558/71837313
2012-03-14T16:02:39 slave stopped at master position mysql-bin.000558/74070901
2012-03-14T16:02:54 slave stopped at master position mysql-bin.000558/75261121
2012-03-14T16:03:09 slave stopped at master position mysql-bin.000558/76535697
[[[ output omitted ]]]
2012-03-14T17:01:40 slave stopped at master position mysql-bin.000558/517469513
2012-03-14T17:01:55 slave stopped at master position mysql-bin.000558/519802388
2012-03-14T17:02:10 slave stopped at master position mysql-bin.000558/521274178
2012-03-14T17:02:25 START SLAVE until master 2012-03-14T16:02:24 mysql-bin.000558/71837313
2012-03-14T17:02:40 slave running 7008 seconds behind
2012-03-14T17:02:40 slave running 7008 seconds behind at master position mysql-bin.000559/235280
2012-03-14T17:02:55 slave running 6853 seconds behind
2012-03-14T17:02:55 slave running 6853 seconds behind at master position mysql-bin.000559/2592232

===

I think the behavior is caused by the fact that in @positions are pushed $now and the IO_THREAD coordinates:
            push @positions,
               [ $now, $status->{master_log_file},
                 $status->{read_master_log_pos} ];

The position to run START SLAVE SQL_THREAD is chosen in :
         while ( $i < @positions
                 && $positions[$i]->[$TS] <= $now - $o->get('delay') ) {
            $pos = $i;
            $i++;
         }

Since $now is pushed in @positions, the condition below will never be true until pt-slave-delay is running for $o->get('delay') seconds:
$positions[$i]->[$TS] <= $now - $o->get('delay')

===

Suggested fix:
Process the initial replication lag in the formula above.

Related branches

Revision history for this message
Baron Schwartz (baron-xaprb) wrote :

Thank you for the bug report. We will need to investigate and see what is the best way to fix this.

tags: added: pt-slave-delay wrong-output
Changed in percona-toolkit:
status: New → Triaged
tags: added: percona-24284
Changed in percona-toolkit:
assignee: nobody → Brian Fraser (fraserbn)
importance: Undecided → High
status: Triaged → In Progress
Brian Fraser (fraserbn)
Changed in percona-toolkit:
milestone: none → 2.1.3
Brian Fraser (fraserbn)
Changed in percona-toolkit:
status: In Progress → Fix Committed
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Customer confirmed fix.

Changed in percona-toolkit:
status: Fix Committed → Fix Released
Revision history for this message
Stuart Cianos (scianos-s) wrote :

Looks like the issue still persists in some environments. We identified a corner case with the current patch for this bug. To reproduce the issue:

* Pick an arbitrary time to delay replication. For the purposes of this demonstration, a 1000 second delay line has been set.
* Start pt-slave-delay while the slave is behind. For the purposes of this demonstration, the server is 500 seconds behind when pt-slave-delay starts.
* Pt-slave-delay will catch the host up to the correct limit on the delay line, but will then fail to maintain the delay line at the correct value. The delay line will be increased by however many seconds the host was alreadyt behind its master in addition to the configured value. In this case, the total replication delay will be 1500.

A patch has been attached which corrects the behavior of the previously committed fix.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

I've created bug 1095476 for the corner case described by Stuart.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-303

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.