pt-online-schema-change prints scary/misleading message while pausing for slave lag
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Percona Toolkit moved to https://jira.percona.com/projects/PT | Triaged | Undecided | Unassigned | |
Bug Description
pt-online-
How to reproduce:
1) use default values of max-lag=1, check-interval=1, progress=time,30
2) pt-osc ALTER a master with a slave
3) FLUSH TABLES WITH READ LOCK; on the slave
4) get a message like "Replica lag is 32 seconds on qa.example.com. Waiting."
5) become worried that max-lag and check-interval are malfunctioning somehow
6) set progress=time,1
7) repeat steps 2-3
8) get a message like "Replica lag is 2 seconds on qa.example.com. Waiting."
In reality, the max-lag and check-interval options are working as intended. There is occasionally a lag of up to 1 second (which I don't *think* is caused by an off-by-one error, though it theoretically could be), but this lag is much less than the 30+ seconds implied by the message printed in step 4 above.
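To illustrate how the first message can report about 30 seconds of lag even though the tool starts pausing as soon as lag exceeds max-lag, here is a toy simulation in Python. It is not pt-online-schema-change's actual code. The function name, the assumption that lag grows by one second per second once the slave is locked, and the rule that a message is printed only after the progress interval has elapsed since pausing began are all assumptions made for illustration.

```python
def simulate_wait_messages(max_lag, check_interval, progress_interval, duration):
    """Toy model (hypothetical, not the tool's real logic).

    Returns (time at which pausing began, list of lag values that would be
    printed in "Replica lag is N seconds ... Waiting." messages).
    """
    reported = []
    wait_started = None  # time at which the tool first saw lag > max_lag
    last_report = None   # time of the last printed message (or start of the wait)
    t = 0
    while t <= duration:
        # Assumption: the slave is locked at t=0, so lag equals elapsed time.
        lag = t
        if lag > max_lag:
            if wait_started is None:
                wait_started = t
                last_report = t
            # A message is printed only when the progress interval has elapsed.
            if t - last_report >= progress_interval:
                reported.append(lag)
                last_report = t
        t += check_interval
    return wait_started, reported


if __name__ == "__main__":
    # Defaults from the report: max-lag=1, check-interval=1, progress=time,30
    print(simulate_wait_messages(1, 1, 30, 70))
    # Same run with progress=time,1
    print(simulate_wait_messages(1, 1, 1, 5))
```

In this model the pause begins at the same moment in both runs; only the time of the first printed message changes with the progress interval, so the lag it reports is larger when the interval is longer.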
People who don't have the time or skill to set PTDEBUG=1 and then use the debug output to verify the tool's behavior might quite reasonably be freaked out by this message. It would be ideal if the "Replica lag is ... Waiting" messages were printed without regard for the --progress setting.
Of course, given that the underlying feature should prevent slave lag values larger than 1 or 2 outside a test case (that is, with no FLUSH TABLES WITH READ LOCK on the slave), this bug will only bite people who do have the skill and time to design this test case.
tags: | added: ambiguity progress pt-online-schema-change |
Changed in percona-toolkit: | |
status: | New → Triaged |
Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PT-1056