pt-online-schema-change prints scary/misleading message while pausing for slave lag

Reported by rcoli on 2012-12-12
This bug affects 2 people
Affects: Percona Toolkit
Importance: Undecided
Assigned to: Unassigned

Bug Description

pt-online-schema-change version 2.1.4

How to reproduce:

1) use default values of max-lag=1, check-interval=1, progress=time,30
2) run a pt-osc ALTER against a master that has a slave
3) FLUSH TABLES WITH READ LOCK; on the slave
4) get a message like "Replica lag is 32 seconds on qa.example.com. Waiting."
5) become worried that max-lag and check-interval are malfunctioning somehow
6) set progress=time,1
7) repeat steps 2-3
8) get a message like "Replica lag is 2 seconds on qa.example.com. Waiting."

In reality, the max-lag and check-interval options are working as intended. There is occasionally a lag of up to 1 second (which I don't *think* is caused by an off-by-one error, but theoretically could be), but this lag is much less than the 30+ seconds implied by the message printed in step 4 above.
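To illustrate why the first message can show a large number even though the tool paused promptly, here is a rough Python model of the test case. It is only a sketch under stated assumptions, not the tool's actual code: the function name simulate_wait is made up for illustration, the blocked slave is assumed to fall behind by exactly one second per second, and the wait message is assumed to be printed only once the progress interval has elapsed since the pause began.

```python
def simulate_wait(progress_interval, max_lag=1, check_interval=1):
    """Return (lag_when_pause_began, lag_in_first_message), in seconds.

    Hypothetical model of the test case: the slave is blocked, so its lag
    is assumed to grow by one second per second of wall-clock time.
    """
    wait_started_at = None
    t = 0
    while True:
        t += check_interval
        lag = t  # assumption: blocked slave, lag equals elapsed time
        if lag <= max_lag:
            continue  # lag still acceptable, the tool keeps copying rows
        if wait_started_at is None:
            wait_started_at = t  # the tool begins pausing here
        # Assumption: the message is only emitted once the progress
        # interval has elapsed since the pause began.
        if t - wait_started_at >= progress_interval:
            return wait_started_at, lag


for interval in (30, 1):
    paused_at, reported = simulate_wait(interval)
    print(f"progress=time,{interval}: paused at lag {paused_at}s, "
          f"first message reports {reported}s")
```

In this model the pause begins at the same small lag for both settings; only the number shown in the first message changes with the progress interval.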

People who don't have the time or skill to set PTDEBUG=1 and use the debug output to verify the tool's behavior might quite reasonably be freaked out by this message. It would be ideal if the "Replica lag is ... Waiting" messages were printed without regard for the --progress setting.

Of course, given that the underlying feature should keep slave lag from exceeding 1 or 2 seconds outside of a test case (that is, with no FLUSH TABLES WITH READ LOCK on the slave), this bug will only bite people who do have the skill and time to design this test case.

tags: added: ambiguity progress pt-online-schema-change
Changed in percona-toolkit:
status: New → Triaged