pt-heartbeat 2.1.8 reports big time drift with UTC_TIMESTAMP

Bug #1099665 reported by Jervin R
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Fix Released
High
Daniel Nichter

Bug Description

I just noticed pt-heartbeat is now using UTC_TIMESTAMP with the check/monitor query, for updates by older version and check by 2.1.8 or vice versa, there will be a large time delay reported. See a snippet of debug output below:

This is 2.1.8:

# $DBD::mysql::VERSION: 4.013 $DBI::VERSION: 1.609
# pt_heartbeat:4904 16322 Heartbeat row primary key: server_id = 1
# pt_heartbeat:4910 16322 SELECT 1 FROM `test`.`heartbeat` WHERE server_id='1' LIMIT 1
# pt_heartbeat:5048 16322 SELECT SQL: SELECT UNIX_TIMESTAMP(UTC_TIMESTAMP()), UNIX_TIMESTAMP(ts)/*!50038, @@hostname AS host*/, server_id FROM `test`.`heartbeat` WHERE server_id='1' LIMIT 1
# pt_heartbeat:5055 16322 SELECT UNIX_TIMESTAMP(UTC_TIMESTAMP()), UNIX_TIMESTAMP(ts)/*!50038, @@hostname AS host*/, server_id FROM `test`.`heartbeat` WHERE server_id='1' LIMIT 1
# pt_heartbeat:5057 16322 Heartbeat from server 1
# now: 1358235876
# ts: 1358217876
# skew: 0.5
# pt_heartbeat:5062 16322 Delay 17999.500000 on forge.dotmanila.com
# pt_heartbeat:5103 16322 --check and exit
# pt_heartbeat:5234 16322 Checking slave delay
# pt_heartbeat:5274 16322 Checking slave D=test,S=/tmp/mysql_sandbox5091.sock,h=localhost,p=...,u=msandbox
# pt_heartbeat:5284 16322 Woke up at 2013-01-14T21:44:37.501070
# pt_heartbeat:5055 16322 SELECT UNIX_TIMESTAMP(UTC_TIMESTAMP()), UNIX_TIMESTAMP(ts)/*!50038, @@hostname AS host*/, server_id FROM `test`.`heartbeat` WHERE server_id='1' LIMIT 1
# pt_heartbeat:5057 16322 Heartbeat from server 1
# now: 1358235877
# ts: 1358217877
# skew: 0.5
# pt_heartbeat:5062 16322 Delay 17999.500000 on forge.dotmanila.com
17999.50
# pt_heartbeat:5329 16322 Disconnecting

This is 2.1.7

# $DBD::mysql::VERSION: 4.013 $DBI::VERSION: 1.609
# pt_heartbeat:4822 16328 Heartbeat row primary key: server_id = 1
# pt_heartbeat:4828 16328 SELECT 1 FROM `test`.`heartbeat` WHERE server_id='1' LIMIT 1
# pt_heartbeat:4940 16328 SELECT SQL: SELECT ts/*!50038, @@hostname AS host*/, server_id FROM `test`.`heartbeat` WHERE server_id='1' LIMIT 1
# pt_heartbeat:4947 16328 SELECT ts/*!50038, @@hostname AS host*/, server_id FROM `test`.`heartbeat` WHERE server_id='1' LIMIT 1
# pt_heartbeat:4950 16328 Heartbeat from server 1
# now: 2013-01-14T21:45:28.657450
# ts: 2013-01-14T21:45:28.000600
# skew: 0.5
# pt_heartbeat:4955 16328 Delay 0.156854 on forge.dotmanila.com
# pt_heartbeat:4996 16328 --check and exit
# pt_heartbeat:5127 16328 Checking slave delay
# pt_heartbeat:5167 16328 Checking slave D=test,S=/tmp/mysql_sandbox5091.sock,h=localhost,p=...,u=msandbox
# pt_heartbeat:5177 16328 Woke up at 2013-01-14T21:45:29.500810
# pt_heartbeat:4947 16328 SELECT ts/*!50038, @@hostname AS host*/, server_id FROM `test`.`heartbeat` WHERE server_id='1' LIMIT 1
# pt_heartbeat:4950 16328 Heartbeat from server 1
# now: 2013-01-14T21:45:29.501140
# ts: 2013-01-14T21:45:29.001320
# skew: 0.5
# pt_heartbeat:4955 16328 Delay -0.000180 on forge.dotmanila.com
0.00
# pt_heartbeat:5222 16328 Disconnecting

This should at least be documented if not made aware of the ts offset that generated the ts on the master.

Related branches

Revision history for this message
David Bigham (david-bigham) wrote :

Yes this bug just bit me. This new version is no longer compatible with older version. Servers started reporting massive replication delays after update.

Instead of writting "2013-01-16T11:58:35.002250" to the ts column, it now writes "2013-01-16 00:58:35"

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Everything was changed to UTC in 2.1.8 to avoid time zone issues, and all the tool's tests still pass, but apparently we missed something not tested. We'll look into this soon. In the meantime, the workaround is to use 2.1.7.

tags: added: pt-heartbeat
Changed in percona-toolkit:
status: New → Triaged
importance: Undecided → High
Changed in percona-toolkit:
milestone: none → 2.1.9
assignee: nobody → Daniel Nichter (daniel-nichter)
Changed in percona-toolkit:
status: Triaged → In Progress
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Yeah, 2.1.8 broke compat with previous 2.1 releases. :-( We do intend to keep compat within a series, so this shouldn't have happened. I think we got so caught up in eliminate time zone issues that we forgot about backwards-compat (iirc, we did discuss it, but obviously not well enough). I'll fix this, and bug 1103198, and we'll just have to mark pt-heartbeat 2.1.8 as an anomaly that should be avoided by 2.1 users.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Bug 1103198 isn't a bug, just another 2.1.7 vs. 2.1.8 oddity.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

But bug 1103221 is real: 2.1.8 broke high-res timestamps; it has only 1s resolution instead of sub-second.

tags: added: backwards-incompat
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Jervin, David, et al., the linked branch has a fixed version of the tool. Would you try it too? I'm pretty certain this version is backwards-compat with 2.1.7 and older because now UTC requires --utc.

Changed in percona-toolkit:
status: In Progress → Fix Committed
Changed in percona-toolkit:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-344

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.