pt-slave-restart does not reset timer after restarting replication
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Percona Toolkit moved to https://jira.percona.com/projects/PT |
Confirmed
|
Undecided
|
Unassigned |
Bug Description
When pt-slave-restart checks slave status and everthing is OK, it sleeps for 2x the previous sleep time. If the slave status is bad and pt-slave-restart actually restarts replication, the sleep timer is not reset back to default, it continues sleeping the previous amount. This causes unnecessary delays in repeatable replication failure checking.
In my opinion, if pt-slave-restart actually restarts replication, the counter should go back to default and the 2x process started all over again.
[MYSQL INSTANCE 0]:root@
delayed 52709 sec
sleeping 2.000000
delayed 52697 sec
sleeping 4.000000
delayed 52667 sec
sleeping 8.000000
delayed 52613 sec
sleeping 16.000000
delayed 52515 sec
sleeping 32.000000
delayed 52295 sec
sleeping 64.000000
delayed 51899 sec
sleeping 64.000000
delayed 51519 sec
sleeping 64.000000
delayed 51167 sec
sleeping 64.000000
2015-12-23T21:53:09 mysqld-
sleeping 1.000000
delayed 51103 sec
sleeping 64.000000
2015-12-23T21:54:14 mysqld-
sleeping 1.000000
delayed 51120 sec
sleeping 64.000000
2015-12-23T21:55:19 mysqld-
sleeping 1.000000
delayed 51020 sec
sleeping 64.000000
As you can see above, each time replication was restarted, pt-slave-restart kept waiting 64 seconds to check again. It should not have waited so long. The likelihood of another error happening is very high and this should be detected earlier.
Installed pt-slave-restart 3.0.1
Setup replication,
Make a simple load:
CREATE TABLE test.`t` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
while mysql -e 'insert into test.t values ();' ; do sleep 1; echo . ; done
pt-slave-restart --error- numbers= 1022,1677, 1062 --min-sleep=1 -vvv
Execute mysql -e 'insert into test.t values();' on slave to cause duplicate key error.
delayed 0 sec
sleeping 64.000000
delayed 0 sec
sleeping 64.000000
2017-03-18T06:30:41 mysqld- relay-bin. 000002 138701 1062 Error 'Duplicate entry '583' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into test.t values ()'
sleeping 1.000000
delayed 51 sec
sleeping 64.000000
If we have several errors or error happens right after one second sleep, sleep time decreased: relay-bin. 000002 95513 1062 Error 'Duplicate entry '400' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into test.t values ()' relay-bin. 000002 95749 1062 Error 'Duplicate entry '401' for key 'PRIMARY'' on query. Default database: ''. Query: 'insert into test.t values ()'
sleeping 64.000000
delayed 0 sec
sleeping 64.000000
2017-03-18T06:26:55 mysqld-
sleeping 1.000000
2017-03-18T06:26:56 mysqld-
sleeping 1.000000
delayed 10 sec
sleeping 32.000000
This is explained as a side effect in source code:
# Errors are very likely to follow each other in quick succession. NOTE:
# this policy has a side effect with respect to $sleep. Suppose $sleep is
# 512 and pt-slave-restart finds an error; now $sleep is 256, but
# pt-slave-restart sleeps only 1 (the initial value of --sleep). Suppose
# there is no error when it wakes up after 1 second, because 1 was too
# short. Now it doubles $sleep, back to 512. $sleep has the same value
# it did before the error was ever found.
But it could be a good idea to decrease sleep time after fault even if there is no immediate error (e.g. if we have a duplicate key error every 5 second, avoid sleeping for 64 seconds).