pt-slave-restart does not reset timer after restarting replication

Bug #1528931 reported by Matthew B
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Confirmed
Undecided
Unassigned

Bug Description

When pt-slave-restart checks slave status and everthing is OK, it sleeps for 2x the previous sleep time. If the slave status is bad and pt-slave-restart actually restarts replication, the sleep timer is not reset back to default, it continues sleeping the previous amount. This causes unnecessary delays in repeatable replication failure checking.

In my opinion, if pt-slave-restart actually restarts replication, the counter should go back to default and the 2x process started all over again.

[MYSQL INSTANCE 0]:root@ke-s2-01-mysql-instance:[/data/dbdata]: pt-slave-restart --error-numbers=1022,1677 --min-sleep=1 -vvv
 delayed 52709 sec
 sleeping 2.000000
 delayed 52697 sec
 sleeping 4.000000
 delayed 52667 sec
 sleeping 8.000000
 delayed 52613 sec
 sleeping 16.000000
 delayed 52515 sec
 sleeping 32.000000
 delayed 52295 sec
 sleeping 64.000000
 delayed 51899 sec
 sleeping 64.000000
 delayed 51519 sec
 sleeping 64.000000
 delayed 51167 sec
 sleeping 64.000000
2015-12-23T21:53:09 mysqld-relay-bin.000502 607106504 1022 Error 'Can't write; duplicate key in table ''' on query. Default database: ''. Query: 'COMMIT'
 sleeping 1.000000
 delayed 51103 sec
 sleeping 64.000000
2015-12-23T21:54:14 mysqld-relay-bin.000502 608754267 1022 Error 'Can't write; duplicate key in table ''' on query. Default database: ''. Query: 'COMMIT'
 sleeping 1.000000
 delayed 51120 sec
 sleeping 64.000000
2015-12-23T21:55:19 mysqld-relay-bin.000502 615136344 1022 Error 'Can't write; duplicate key in table ''' on query. Default database: ''. Query: 'COMMIT'
 sleeping 1.000000
 delayed 51020 sec
 sleeping 64.000000

As you can see above, each time replication was restarted, pt-slave-restart kept waiting 64 seconds to check again. It should not have waited so long. The likelihood of another error happening is very high and this should be detected earlier.

Revision history for this message
Nickolay Ihalainen (ihanick) wrote :

Installed pt-slave-restart 3.0.1
Setup replication,
Make a simple load:
CREATE TABLE test.`t` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB

while mysql -e 'insert into test.t values ();' ; do sleep 1; echo . ; done

pt-slave-restart --error-numbers=1022,1677,1062 --min-sleep=1 -vvv

Execute mysql -e 'insert into test.t values();' on slave to cause duplicate key error.

 delayed 0 sec
 sleeping 64.000000
 delayed 0 sec
 sleeping 64.000000

2017-03-18T06:30:41 mysqld-relay-bin.000002 138701 1062 Error 'Duplicate entry '583' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into test.t values ()'
 sleeping 1.000000
 delayed 51 sec
 sleeping 64.000000

If we have several errors or error happens right after one second sleep, sleep time decreased:
 sleeping 64.000000
 delayed 0 sec
 sleeping 64.000000
2017-03-18T06:26:55 mysqld-relay-bin.000002 95513 1062 Error 'Duplicate entry '400' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into test.t values ()'
 sleeping 1.000000
2017-03-18T06:26:56 mysqld-relay-bin.000002 95749 1062 Error 'Duplicate entry '401' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into test.t values ()'
 sleeping 1.000000
 delayed 10 sec
 sleeping 32.000000

This is explained as a side effect in source code:
      # Errors are very likely to follow each other in quick succession. NOTE:
      # this policy has a side effect with respect to $sleep. Suppose $sleep is
      # 512 and pt-slave-restart finds an error; now $sleep is 256, but
      # pt-slave-restart sleeps only 1 (the initial value of --sleep). Suppose
      # there is no error when it wakes up after 1 second, because 1 was too
      # short. Now it doubles $sleep, back to 512. $sleep has the same value
      # it did before the error was ever found.

But it could be a good idea to decrease sleep time after fault even if there is no immediate error (e.g. if we have a duplicate key error every 5 second, avoid sleeping for 64 seconds).

Changed in percona-toolkit:
status: New → Confirmed
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1319

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.