pt-table-checksum reports false errors and misses real errors

Bug #1080765 reported by Rob Wagner
42
This bug affects 7 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Incomplete
Undecided
Daniel Nichter

Bug Description

Tool version = pt-table-checksum 2.1.5
Command line = pt-table-checksum --user=$USER --password=$PASS --chunk-size-limit=50.0 --replicate-check --pid=pt-table-checksum_ssodb03.pid --recursion-method=DSN=D=percona,t=dsns_ssodb03 --databases=
vsp_jira_current --no-check-replication-filters --replicate=percona.checksums_ssodb03

MySQL versions:
Master = Server version: 5.5.9-enterprise-commercial-advanced-log MySQL Enterprise Server - Advanced Edition (Commercial)
Slave = Server version: 5.5.28-log MySQL Community Server (GPL)

Output:
            TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
11-19T10:58:01 0 0 0 1 0 0.265 vsp_jira_current.AO_563AEE_ACTIVITY_ENTITY
11-19T10:58:02 0 0 0 1 0 0.264 vsp_jira_current.AO_563AEE_ACTOR_ENTITY
11-19T10:58:02 0 0 0 1 0 0.261 vsp_jira_current.AO_563AEE_MEDIA_LINK_ENTITY
11-19T10:58:02 0 0 0 1 0 0.263 vsp_jira_current.AO_563AEE_OBJECT_ENTITY
11-19T10:58:02 0 0 0 1 0 0.298 vsp_jira_current.AO_563AEE_TARGET_ENTITY
11-19T10:58:03 0 0 134732 4 0 0.418 vsp_jira_current.OS_CURRENTSTEP
11-19T10:58:03 0 0 254319 1 0 0.157 vsp_jira_current.OS_CURRENTSTEP_PREV
11-19T10:58:04 0 0 298507 1 0 0.688 vsp_jira_current.OS_HISTORYSTEP
11-19T10:58:04 0 0 168484 1 0 0.119 vsp_jira_current.OS_HISTORYSTEP_PREV
11-19T10:58:04 0 0 180960 1 0 0.189 vsp_jira_current.OS_WFENTRY
11-19T10:58:04 0 0 87 1 0 0.010 vsp_jira_current.SEQUENCE_VALUE_ITEM
11-19T10:58:04 0 0 491 1 0 0.011 vsp_jira_current.avatar
11-19T10:58:05 0 0 907274 1 0 1.079 vsp_jira_current.changegroup
11-19T10:58:08 0 0 1244909 1 0 2.878 vsp_jira_current.changeitem
11-19T10:58:08 0 0 180 1 0 0.009 vsp_jira_current.columnlayout
11-19T10:58:08 0 0 2216 1 0 0.013 vsp_jira_current.columnlayoutitem
11-19T10:58:08 0 0 252 1 0 0.009 vsp_jira_current.component
11-19T10:58:08 0 0 1777 1 0 0.013 vsp_jira_current.configurationcontext
11-19T10:58:08 0 0 636 1 0 0.013 vsp_jira_current.customfield
11-19T10:58:08 0 0 1721 1 0 0.013 vsp_jira_current.customfieldoption
11-19T10:58:09 0 0 405238 1 0 0.826 vsp_jira_current.customfieldvalue
11-19T10:58:09 0 0 1 1 0 0.010 vsp_jira_current.cwd_application
11-19T10:58:09 0 0 0 1 0 0.009 vsp_jira_current.cwd_application_address
11-19T10:58:09 0 0 2 1 0 0.011 vsp_jira_current.cwd_directory
11-19T10:58:09 0 0 48 1 0 0.010 vsp_jira_current.cwd_directory_attribute
11-19T10:58:09 0 0 14 1 0 0.011 vsp_jira_current.cwd_directory_operation
11-19T10:58:09 0 0 1561 1 0 0.018 vsp_jira_current.cwd_group
11-19T10:58:09 0 0 0 1 0 0.010 vsp_jira_current.cwd_group_attributes
11-19T10:58:09 0 0 38285 1 0 0.106 vsp_jira_current.cwd_membership
11-19T10:58:09 0 0 6331 1 0 0.041 vsp_jira_current.cwd_user
11-19T10:58:09 0 0 18889 1 0 0.049 vsp_jira_current.cwd_user_attributes
11-19T10:58:09 0 0 5834 1 0 0.017 vsp_jira_current.external_entities
11-19T10:58:09 0 0 0 1 0 0.010 vsp_jira_current.externalgadget
11-19T10:58:09 0 0 3475 1 0 0.016 vsp_jira_current.favouriteassociations
11-19T10:58:09 0 0 719 1 0 0.012 vsp_jira_current.fieldconfigscheme
11-19T10:58:09 0 0 1301 1 0 0.011 vsp_jira_current.fieldconfigschemeissuetype
11-19T10:58:09 0 0 721 1 0 0.009 vsp_jira_current.fieldconfiguration
11-19T10:58:09 0 0 41 1 0 0.008 vsp_jira_current.fieldlayout
11-19T10:58:09 0 0 25081 1 0 0.081 vsp_jira_current.fieldlayoutitem
11-19T10:58:09 0 0 23 1 0 0.011 vsp_jira_current.fieldlayoutscheme
11-19T10:58:09 0 0 0 1 0 0.011 vsp_jira_current.fieldlayoutschemeassociation
11-19T10:58:09 0 0 245 1 0 0.009 vsp_jira_current.fieldlayoutschemeentity
11-19T10:58:09 0 0 9 1 0 0.010 vsp_jira_current.fieldscreen
11-19T10:58:09 0 0 1039 1 0 0.011 vsp_jira_current.fieldscreenlayoutitem
11-19T10:58:09 0 0 12 1 0 0.010 vsp_jira_current.fieldscreenscheme
11-19T10:58:09 0 0 35 1 0 0.012 vsp_jira_current.fieldscreenschemeitem
11-19T10:58:09 0 0 9 1 0 0.010 vsp_jira_current.fieldscreentab
11-19T10:58:15 0 0 53065 1 0 5.154 vsp_jira_current.fileattachment
11-19T10:58:15 0 0 105 1 0 0.010 vsp_jira_current.filtersubscription
11-19T10:58:15 0 0 13882 1 0 0.035 vsp_jira_current.gadgetuserpreference
11-19T10:58:15 0 0 97 1 0 0.009 vsp_jira_current.genericconfiguration
11-19T10:58:15 0 0 0 1 0 0.011 vsp_jira_current.groupbase
11-19T10:58:15 0 0 59163 1 0 0.089 vsp_jira_current.issuelink
11-19T10:58:15 0 0 7 1 0 0.011 vsp_jira_current.issuelinktype
11-19T10:58:15 0 0 2 1 0 0.010 vsp_jira_current.issuesecurityscheme
11-19T10:58:15 0 0 55 1 0 0.008 vsp_jira_current.issuestatus
11-19T10:58:15 0 0 85 1 0 0.008 vsp_jira_current.issuetype
11-19T10:58:15 0 0 9 1 0 0.010 vsp_jira_current.issuetypescreenscheme
11-19T10:58:15 0 0 50 1 0 0.011 vsp_jira_current.issuetypescreenschemeentity
11-19T10:58:22 0 0 805395 1 0 7.054 vsp_jira_current.jiraaction
11-19T10:58:22 0 0 3 1 0 0.265 vsp_jira_current.jiradraftworkflows
11-19T10:58:22 0 0 16 1 0 0.262 vsp_jira_current.jiraeventtype
11-19T10:58:23 0 0 134754 1 0 0.976 vsp_jira_current.jiraissue
11-19T10:58:23 0 0 0 1 0 0.011 vsp_jira_current.jiraperms
11-19T10:58:23 0 0 11 1 0 0.013 vsp_jira_current.jiraworkflows
11-19T10:58:23 0 0 2158 1 0 0.014 vsp_jira_current.label
11-19T10:58:23 0 0 3 1 0 0.009 vsp_jira_current.listenerconfig
11-19T10:58:23 0 0 9 1 0 0.012 vsp_jira_current.mailserver
11-19T10:58:24 0 0 0 1 0 0.010 vsp_jira_current.membershipbase
11-19T10:58:26 0 0 25566 1 0 2.090 vsp_jira_current.nodeassociation
11-19T10:58:26 0 0 166 1 0 0.267 vsp_jira_current.notification
11-19T10:58:39 0 0 3914219 1 0 13.089 vsp_jira_current.notificationinstance
11-19T10:58:39 0 0 3 1 0 0.265 vsp_jira_current.notificationscheme
11-19T10:58:39 0 0 1 1 0 0.011 vsp_jira_current.oauthconsumer
11-19T10:58:39 0 0 0 1 0 0.264 vsp_jira_current.oauthconsumertoken
11-19T10:58:40 0 0 1 1 0 0.264 vsp_jira_current.oauthspconsumer
11-19T10:58:40 0 0 0 1 0 0.012 vsp_jira_current.oauthsptoken
11-19T10:58:40 0 0 624 1 0 0.264 vsp_jira_current.optionconfiguration
11-19T10:58:40 0 0 287 1 0 0.263 vsp_jira_current.permissionscheme
11-19T10:58:41 0 0 107 1 0 0.265 vsp_jira_current.pluginversion
11-19T10:58:41 0 0 620 1 0 0.266 vsp_jira_current.portalpage
11-19T10:58:41 0 0 3862 1 0 0.022 vsp_jira_current.portletconfiguration
11-19T10:58:41 0 0 5 1 0 0.011 vsp_jira_current.priority
11-19T10:58:41 0 0 402 1 0 0.011 vsp_jira_current.project
11-19T10:58:41 0 0 19 1 0 0.011 vsp_jira_current.projectcategory
11-19T10:58:41 0 0 7 1 0 0.010 vsp_jira_current.projectrole
11-19T10:58:41 0 1 1207 1 0 0.012 vsp_jira_current.projectroleactor
11-19T10:58:41 0 0 715 1 0 0.267 vsp_jira_current.projectversion
11-19T10:58:41 0 0 0 1 0 0.012 vsp_jira_current.propertydata
11-19T10:58:41 0 0 1 1 0 0.010 vsp_jira_current.propertydate
11-19T10:58:41 0 0 0 1 0 0.010 vsp_jira_current.propertydecimal
11-19T10:58:41 0 0 6216 1 0 0.021 vsp_jira_current.propertyentry
11-19T10:58:41 0 0 3110 1 0 0.013 vsp_jira_current.propertynumber
11-19T10:58:41 0 1 1186 1 0 0.013 vsp_jira_current.propertystring
11-19T10:58:42 0 0 1919 1 0 0.269 vsp_jira_current.propertytext
11-19T10:58:42 0 0 0 1 0 0.011 vsp_jira_current.qrtz_calendars
11-19T10:58:42 0 0 105 1 0 0.011 vsp_jira_current.qrtz_cron_triggers
11-19T10:58:42 0 0 0 1 0 0.264 vsp_jira_current.qrtz_fired_triggers
11-19T10:58:42 0 0 1 1 0 0.266 vsp_jira_current.qrtz_job_details
11-19T10:58:42 0 0 0 1 0 0.265 vsp_jira_current.qrtz_job_listeners
11-19T10:58:43 0 0 0 1 0 0.264 vsp_jira_current.qrtz_simple_triggers
11-19T10:58:43 0 0 0 1 0 0.260 vsp_jira_current.qrtz_trigger_listeners
11-19T10:58:43 0 0 105 1 0 0.010 vsp_jira_current.qrtz_triggers
11-19T10:58:43 0 0 4438 1 0 0.017 vsp_jira_current.remembermetoken
11-19T10:58:43 0 0 9 1 0 0.264 vsp_jira_current.remotelink
11-19T10:58:43 0 0 7 1 0 0.264 vsp_jira_current.resolution
11-19T10:58:44 0 0 9 1 0 0.263 vsp_jira_current.schemeissuesecurities
11-19T10:58:44 0 0 6 1 0 0.011 vsp_jira_current.schemeissuesecuritylevels
11-19T10:58:44 0 0 15430 1 0 0.034 vsp_jira_current.schemepermissions
11-19T10:58:44 0 0 2769 1 0 0.019 vsp_jira_current.searchrequest
11-19T10:58:44 0 0 15 1 0 0.010 vsp_jira_current.serviceconfig
11-19T10:58:44 0 0 934 1 0 0.011 vsp_jira_current.sharepermissions
11-19T10:58:44 0 0 0 1 0 0.010 vsp_jira_current.trackback_ping
11-19T10:58:44 0 0 1 1 0 0.009 vsp_jira_current.trustedapp
11-19T10:58:44 0 0 100 1 0 0.011 vsp_jira_current.upgradehistory
11-19T10:58:44 0 0 7 1 0 0.010 vsp_jira_current.upgradeversionhistory
11-19T10:58:44 0 0 209552 1 0 0.327 vsp_jira_current.userassociation
11-19T10:58:44 0 0 0 1 0 0.010 vsp_jira_current.userbase
11-19T10:58:44 0 0 74590 1 0 0.131 vsp_jira_current.userhistoryitem
11-19T10:58:44 0 0 0 1 0 0.010 vsp_jira_current.versioncontrol
11-19T10:58:44 0 0 0 1 0 0.009 vsp_jira_current.votehistory
11-19T10:58:44 0 0 9 1 0 0.010 vsp_jira_current.workflowscheme
11-19T10:58:44 0 0 44 1 0 0.010 vsp_jira_current.workflowschemeentity
11-19T10:58:44 0 0 2 1 0 0.010 vsp_jira_current.worklog

However, there are no differences between master and slave for vsp_jira_current.projectroleactor or vsp_jira_current.propertystring, as evidenced by the contents of the checksum table. In addition, there is a difference between master and slave for vsp_jira_current.votehistory, that is NOT reflected in the output, above.

Relevant contents of checksum table:
mysql> select * from checksums_ssodb03 where tbl in ('projectroleactor', 'propertystring', 'votehistory');
+------------------+------------------+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+
| db | tbl | chunk | chunk_time | chunk_index | lower_boundary | upper_boundary | this_crc | this_cnt | master_crc | master_cnt | ts |
+------------------+------------------+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+
| vsp_jira_current | projectroleactor | 1 | 0.002638 | NULL | NULL | NULL | 4cea971 | 1207 | 4cea971 | 1207 | 2012-11-19 10:58:41 |
| vsp_jira_current | propertystring | 1 | 0.002029 | NULL | NULL | NULL | d4d91c61 | 1186 | d4d91c61 | 1186 | 2012-11-19 10:58:41 |
| vsp_jira_current | votehistory | 1 | 0.000531 | NULL | NULL | NULL | bf83ed3d | 1 | 0 | 0 | 2012-11-19 10:58:44 |
+------------------+------------------+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+
3 rows in set (0.00 sec)

I will try to reproduce with PTDEBUG=1.

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Reproduced part of the bug with PTDEBUG=1. This time, pt-table-checksum outputs differences for vsp_jira_current.external_entities, vsp_jira_current.fieldscreenscheme, and vsp_jira_current.remembermetoken, but no differences are actually present in the checksum table. pt-table-checksum also detected a difference in vsp_jira_current.votehistory, which is correct.

Attaching debug output and contents of checksum table from slave.

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :
Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Problem also occurs with pt-table-checksum 2.1.7.

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

I have a simpler test case that may help. I ran pt-table-checksum against 5 empty tables that are persistently empty (i.e., no DML occurs against these tables). I repeatedly ran pt-table-checksum until a difference was identified by the tool. I will upload the PTDEBUG output shortly (note line 953 identifies a difference in table AO_563AEE_ACTIVITY_ENTITY, though this table is empty on both the master and slave.

The following is relevant to this new test:

Tool version = pt-table-checksum 2.1.7
Command line = /sso/sfw/percona/bin/pt-table-checksum --user=root --password=hidden --chunk-size-limit=50.0 --replicate-check --recursion-method=DSN=D=percona,t=dsns_ssodb03 --databases=vsp_jira_current --tables=AO_563AEE_ACTIVITY_ENTITY,AO_563AEE_ACTOR_ENTITY,AO_563AEE_MEDIA_LINK_ENTITY,AO_563AEE_OBJECT_ENTITY,AO_563AEE_TARGET_ENTITY --no-check-replication-filters --no-check-binlog-format --replicate=percona.checksums_ssodb03 > bug2.txt 2>&1

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Sorry, the above should say "note line 996 identifies a difference in table AO_563AEE_ACTIVITY_ENTITY, though this table is empty on both the master and slave."

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Thanks for working on this Rob. We'll look into it too as soon as possible.

Changed in percona-toolkit:
status: New → Confirmed
milestone: none → 2.1.8
importance: Undecided → High
tags: added: pt-table-checksum wrong-output
Changed in percona-toolkit:
assignee: nobody → Daniel Nichter (daniel-nichter)
Changed in percona-toolkit:
status: Confirmed → In Progress
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

# NibbleIterator:5800 28000 0 rows in nibble 1

# ReplicaLagWaiter:7780 28000 All slaves caught up

# pt_table_checksum:10483 28000 ssodb03 max chunk: 1

# RowChecksum:5537 28000 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `percona`.`checksums_ssodb03` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='vsp_jira_current' AND tbl='AO_563AEE_ACTIVITY_ENTITY')

# pt_table_checksum:9539 28000 1 checksum diffs on ssodb03

That is strange. Unfortunately, the debug output doesn't help me understand why. I ran the same experiment (ptc in a while true loop), but couldn't reproduce the results.

Rob, could you reproduce this and once it happens, run that ^ SELECT query on master and slave and paste the results? (You may need to change tbl='...' for whatever table has the diff.) You may also want to run it multiple times to see if the results change. If they do, then it could be a MySQL bug (same deterministic query, same data, but different results = MySQL bug). If you can reproduce the false-positive diff with that SELECT query, a whole dump of the --replicate table WHERE db=... AND tbl=<the affected table> would be helpful too.

Otherwise, there's nothing in the debug output to help explain this. Then again, the debug output doesn't say much along these lines because this has never been wrong before.

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :
Download full text (4.1 KiB)

I'm not sure how to test this. The problem isn't reflected in that select query. Whenever ptc finds an error, it exits with a non-zero exit code. But if I immediately query the replicate table, no differences are present. Here's the latest:

1. Ran the following in a while loop:
do
   /sso/sfw/percona/bin/pt-table-checksum --user=root --password=blah --chunk-size-limit=50.0 --replicate-check --recursion-method=DSN=D=percona,t=dsns_ssodb03 --databases=vsp_jira_current --tables=customfieldoption --no-check-replication-filters --replicate=percona.checksums_ssodb03
   if [ "$?" != "0" ]; then
      break;
   fi
done

2. Loop ran 9 times with a 0 exit code, and on the 10th try, it exited with the following output:
            TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
12-12T13:51:14 0 1 1734 1 0 0.018 vsp_jira_current.customfieldoption

So it thinks there is a difference in vsp_jira_current.customfieldoption.

3. Examine replicate table on master:
mysql> select * from checksums_ssodb03 where tbl='customfieldoption';
+------------------+-------------------+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+
| db | tbl | chunk | chunk_time | chunk_index | lower_boundary | upper_boundary | this_crc | this_cnt | master_crc | master_cnt | ts |
+------------------+-------------------+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+
| vsp_jira_current | customfieldoption | 1 | 0.006716 | NULL | NULL | NULL | a83c0464 | 1734 | a83c0464 | 1734 | 2012-12-12 13:51:14 |
+------------------+-------------------+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+
1 row in set (0.00 sec)
mysql> SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `percona`.`checksums_ssodb03` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='vsp_jira_current' AND tbl='customfieldoption');
Empty set (0.00 sec)

4. Examine replicate on slave:
mysql> select * from checksums_ssodb03 where tbl='customfieldoption';
+------------------+-------------------+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+
| db | tbl | chunk | chunk_time | chunk_index | lower_boundary | upper_boundary | this_crc | this_cnt | master_crc | master_cnt | ts |
+------------------+-------------------+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+
| vsp_jira_current | customfieldoption | 1 | 0.006716 | NULL ...

Read more...

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Not sure if this is a red herring or not, but the only place I've run into this issue is when mixing MySQL versions. In the above example, our master is 5.5.9, and our slave is 5.5.28. We have long-run ptc in another solely 5.5.9 master/slave environment and in another solely 5.5.28 master/slave environment without ever running into this. Give me a few days to set up a 5.5.9 slave and see if the problem reproduces in that environment.

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Well, scratch that theory. I reverted the slave back to 5.5.9 (same as master) and still get these intermittent errors.

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Trying another adjustment. Only one database is being replicated in this particular master/slave pair (i.e., my.cnf on slave includes replicate-do-db=vsp_jira_current). I have tried adding the replicate database, also, to the set of databases being replicated (i.e., also adding replicate-do-db=percona_ssodb03 to the slave's my.cnf). So, far, I have not been able to reproduce this issue since making that change. Will continue to test for a few more days.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Hm quite weird. To hep you Rob, I've attached a special copy of pt-table-checksum that will print the differing checksum results it supposedly sees as well as all checksum results for the table in question. You don't need to run with PTDEBUG since we're already established that that doesn't have the info we need. Just run as usual and anytime there's a diff, the tool will dump rows to STDERR.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :
Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Will give it a try and let you know when the problem recurs. Thanks.

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :
Download full text (12.5 KiB)

Well, that didn't take long. I removed replicate-do-db=percona_ssodb03 from the slave's my.cnf and restarted it. Ran pt-table-checksum-bug-1080765 four times and then encountered the following:

            TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
12-17T14:35:25 0 0 0 1 0 0.025 vsp_jira_current.AO_563AEE_ACTIVITY_ENTITY
12-17T14:35:25 0 0 0 1 0 0.021 vsp_jira_current.AO_563AEE_ACTOR_ENTITY
12-17T14:35:25 0 0 0 1 0 0.022 vsp_jira_current.AO_563AEE_MEDIA_LINK_ENTITY
12-17T14:35:25 0 0 0 1 0 0.022 vsp_jira_current.AO_563AEE_OBJECT_ENTITY
12-17T14:35:25 0 0 0 1 0 0.021 vsp_jira_current.AO_563AEE_TARGET_ENTITY
12-17T14:35:26 0 0 139008 5 0 0.545 vsp_jira_current.OS_CURRENTSTEP
12-17T14:35:26 0 0 263797 1 0 0.186 vsp_jira_current.OS_CURRENTSTEP_PREV
12-17T14:35:27 0 0 308038 1 0 0.739 vsp_jira_current.OS_HISTORYSTEP
12-17T14:35:27 0 0 173870 1 0 0.243 vsp_jira_current.OS_HISTORYSTEP_PREV
12-17T14:35:27 0 0 185293 1 0 0.199 vsp_jira_current.OS_WFENTRY
12-17T14:35:27 0 0 87 1 0 0.018 vsp_jira_current.SEQUENCE_VALUE_ITEM
12-17T14:35:27 0 0 501 1 0 0.020 vsp_jira_current.avatar
12-17T14:35:28 0 0 935138 1 0 1.159 vsp_jira_current.changegroup
12-17T14:35:31 0 0 1283476 1 0 2.884 vsp_jira_current.changeitem
12-17T14:35:31 0 0 183 1 0 0.020 vsp_jira_current.columnlayout
12-17T14:35:31 0 0 2250 1 0 0.028 vsp_jira_current.columnlayoutitem
12-17T14:35:31 0 0 314 1 0 0.020 vsp_jira_current.component
12-17T14:35:31 0 0 1801 1 0 0.022 vsp_jira_current.configurationcontext
12-17T14:35:31 0 0 655 1 0 0.025 vsp_jira_current.customfield
12-17T14:35:36 0 0 1734 1 0 4.051 vsp_jira_current.customfieldoption
12-17T14:35:36 0 0 417352 1 0 0.835 vsp_jira_current.customfieldvalue
12-17T14:35:36 0 0 1 1 0 0.018 vsp_jira_current.cwd_application
12-17T14:35:36 0 0 0 1 0 0.020 vsp_jira_current.cwd_application_address
12-17T14:35:36 0 0 2 1 0 0.020 vsp_jira_current.cwd_directory
12-17T14:35:36 0 0 48 1 0 0.019 vsp_jira_current.cwd_directory_attribute
12-17T14:35:36 0 0 14 1 0 0.013 vsp_jira_current.cwd_directory_operation
12-17T14:35:37 0 0 1577 1 0 0.033 vsp_jira_current.cwd_group
12-17T14:35:37 0 0 0 1 0 0.018 vsp_jira_current.cwd_group_attributes
12-17T14:35:37 0 0 38568 1 0 0.195 vsp_jira_current.cwd_membership
12-17T14:35:37 0 0 6420 1 0 0.055 vsp_jira_current.cwd_user
12-17T14:35:37 0 0 19113 ...

Revision history for this message
Daniel Nichter (daniel-nichter) wrote : Re: [Bug 1080765] Re: pt-table-checksum reports false errors and misses real errors

On Dec 17, 2012, at 12:41 PM, Rob Wagner wrote:
> TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
> 12-17T14:36:02 0 0 19 1 0 0.021 vsp_jira_current.prodiffs:$VAR1 = [
> {
> chunk => '1',
> chunk_index => undef,
> cnt_diff => '0',
> crc_diff => '1',
> lower_boundary => undef,
> master_cnt => undef,
> master_crc => undef,
> table => 'vsp_jira_current.propertytext',
> this_cnt => '1987',
> this_crc => '128460a9',
> upper_boundary => undef
> }
> ];
> checksum:$VAR1 = [
> {
> chunk => '1',
> chunk_index => undef,
> chunk_time => '0.00435',
> db => 'vsp_jira_current',
> lower_boundary => undef,
> master_cnt => '1987',
> master_crc => '128460a9',
> tbl => 'propertytext',
> this_cnt => '1987',
> this_crc => '128460a9',
> ts => '2012-12-17 14:36:03',
> upper_boundary => undef
> }
> ];
> jectcategory

That is helpful. It shows us that the tool is failing to wait for updates on the master to replicate before checking the slave. The tool should do

1. INSERT..SELECT /* checksum query */
2. UPDATE /* master_* columns */
3. Wait for slave lag
4. Wait for last chunk to appear on all slaves
5. SELECT /* find diffs */

It seems that #5 is happening before #2 has replicated, probably because #3 is wrong. This can be seen from the fact that the diff the tool sees (first struct) has NULL values for master_cnt and master_crc, indicating that #2 hasn't happened yet. But a moment later (second struct), those columns have values and we see there's really no diff.

I think this could happen if replication is very fast and/or Seconds_Behind_Master is almost always zero, and the server on which the tool is running is also very fast, so #1 and #2 happen, then #3 returns 0 slave lag even though #2 hasn't actually been applied on the slave yet, then #4 and #5 happen very quickly, resulting in this case. -- I'll have to think how to prove this is the case or not (difficult with two systems and two programs involved). It would be nice to reproduce this reliably, but even if we can't, we can probably fix it by making #4 also wait for the last chunk to have defined master_cnt and master_crc values.

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Interesting theory. I'm surprised no one else has run into this before. I've witnessed inaccuracies in the slave's Seconds_Behind_Master metric, so I'm curious how you obtain #3? If MySQL docs are to be believed, then it seems like you should be able to compare Exec_Master_Log_Pos on the slave (i.e., show slave status) with the master's log position (maybe from show master status). If the difference is 0, then the slave should be fully caught up.

But you're correct; it seems like making #4 wait for the last chunk would fix it.

Changed in percona-toolkit:
milestone: 2.1.8 → 2.2.1
Revision history for this message
Joe (joegrasse) wrote :

I have actually ran into this before and I agree with the assessment in comment #17. I could not reliably reproduce it though.

Revision history for this message
Dan Farrell (dfarrell) wrote :

I have been running pt-table-checksum periodically in a test environment every 15 minutes. This environment has master-slave replication, but the data is not changing. Nevertheless, I occasionally (once every few days or so) get a false positive on the checksums as well.

My ears perked up when you hypothesized that ptc might check the slave before it was done replicating the checksum from the master. Since my databases never change in my test environment it seems as if you've hit the head here.

Mostly I just wanted to speak up that your theory is very much in line with my experience. However if you're bereft of ideas, may I suggest the tool be modified to use the master log position / file and the slave log position/file instead to ensure that the checksum tables have replicated? A timestamp in the checksums table could also be used, but it seems like it makes as much sense to get the master position off the master and then wait until that position is reached on the slave.

I also theorize that the option to only check, but not compute, the checksums might facilitate a work-around in which the end user could run the checksum, wait for the master pos to be reached on the slave externally, and then run in check-only mode to verify the checksum accuracy.

I look forward to seeing the fix! Thanks for all your hard work!

Revision history for this message
Dan Farrell (dfarrell) wrote :

Along those same lines, dumping the master log and slave exec position when comparing the checksums might be a good way to verify your theory from #17.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Since this isn't a wide-spread issue, I need to move this to 2.2.2 so we can finalize and get 2.2 out of the door. I needed some quality time to sit down with this, but that didn't happen while creating the new 2.2 series. Once 2.2 is out of the door, then I'll have more time.

Changed in percona-toolkit:
milestone: 2.2.1 → 2.2.2
status: In Progress → Confirmed
Changed in percona-toolkit:
status: Confirmed → In Progress
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Taking a fresh look at this, I remembered having already dealt with this issue:

   # Requiring "AND master_crc IS NOT NULL" avoids a race condition
   # when the system is fast but replication is slow. In such cases,
   # we can select on the slave before the update for $update_sth
   # replicates; this causes a false-positive diff.
   my $sql = "SELECT MAX(chunk) FROM $repl_table "
           . "WHERE db='$tbl->{db}' AND tbl='$tbl->{tbl}' "
           . "AND master_crc IS NOT NULL";

That was put in in 2.0. So #1-#4 (comment #17) shouldn't be able to happen before #5. Maybe something else is wrong. I'm taking another look.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Rob, Joe, Dan,

Going back to the output from comment #16 and my analysis in comment #17, although my analysis seems to explain the problem, it also strikes me now as a little too convient, my previous comment #23 notwithstanding, because there's something very weird about seeing:

    master_cnt => undef,
    master_crc => undef,

one microsecond, then:

    master_cnt => '1987',
    master_crc => '128460a9',

another few microseconds later. That output is from my code hack (comment #14):

> if ( scalar @$diffs ) {
> warn "diffs:", Dumper($diffs);
> my $sql = "SELECT * FROM $repl_table"
> . ($args{where} ? " WHERE $args{where}" : "")
> . " ORDER BY db, tbl, chunk";
> my $rows = $dbh->selectall_arrayref($sql, { Slice => {} });
> warn "checksum:", Dumper($rows);
> }

I say that's "too convenient" because what are the odds that the UPDATE statement happens/is replicated/is applied in that window of microseconds? Granted, that is what we see, and perhaps that explains why this issue only happens very rarely, if ever.

Dan says he can reproduce this one every few days, but Rob has reproduced after 4 consecutive runs. Maybe that's just the nature of randomness, but that this happens at all is still confusing given my previous comment, and it leads me to believe there's something else we're not seeing or accounting for.

So Rob, one thing I've never asked explicitly: what are all the replication filters and whatnot on your master and slave? And: InnoDB on all tables? And do you ever manually clear the replicate table? Are any other programs running? And are you *really* sure about all that? :-) The last important is actually perhaps the most important because my experience in cases like this has often been that the mystery is just some unknown, outside event (i.e. outside the tool), like some other DBA or dev had setup a script on their laptop but forgot to disable it, giving rise to a heisenbug.

Or, you can still easily reproduce this, Rob, and want to give me access to your systems, I'll debug it locally.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Since there has been no reply to my last comment #24, I'm going to remove this from the 2.2.2 milestone which is scheduled for release very soon. We'll keep working on this nonetheless as more information is made available.

Changed in percona-toolkit:
milestone: 2.2.2 → none
importance: High → Undecided
Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

Sorry for the delay. The most reproducible (aka most frequent) case happens in a particular master/slave relationship where we are only replicating one database from master to slave. That is, the following is present in /etc/my.cnf on the slave:

replicate-do-db=vsp_jira_current

There aren't any pt-table-checksum replication filters in play, though I do only check the one database; the command line is:
${PERCONA_DIRECTORY}/pt-table-checksum-bug-1080765 --no-check-binlog-format --user=$USER --password=$PASS --chunk-size-limit=50.0 --replicate-check --pid=pt-table-checksum_ssodb03.pid --recursion-method=DSN=D=percona_ssodb03,t=dsns --databases=vsp_jira_current --no-check-replication-filters --replicate=percona_ssodb03.checksums

Within this database, all tables are InnoDB. For testing purposes, I am not clearing the percona_ssodb03.checksums table.

As to your last question:

Are any other programs running? And are you *really* sure about all that? :-)

I am absolutely sure there ARE other programs running. We don't shut down our web apps whenever we run pt-table-checksum; it's certainly possible for the apps to be hitting this database with DML. However, we have another slave that is replicating ALL databases from this particular master, and it's not seeing the problem.

Unfortunately, these DBs contain private info, so I cannot grant outside access to them. But I'm happy to help out wherever I can (understanding that it'll slow down the process).

Revision history for this message
Rob Wagner (rjwagner-dba) wrote :

FYI, the issue is still very much reproducible. I just ran the above command twice and witnessed the error:

diffs:$VAR1 = [
  {
    chunk => '1',
    chunk_index => undef,
    cnt_diff => '0',
    crc_diff => '1',
    lower_boundary => undef,
    master_cnt => undef,
    master_crc => undef,
    table => 'vsp_jira_current.cwd_group',
    this_cnt => '1700',
    this_crc => '24cc282c',
    upper_boundary => undef
  }
];
checksum:$VAR1 = [
  {
    chunk => '1',
    chunk_index => undef,
    chunk_time => '0.004979',
    db => 'vsp_jira_current',
    lower_boundary => undef,
    master_cnt => '1700',
    master_crc => '24cc282c',
    tbl => 'cwd_group',
    this_cnt => '1700',
    this_crc => '24cc282c',
    ts => '2013-05-01 16:37:19',
    upper_boundary => undef
  }
];

-- AND --

diffs:$VAR1 = [
  {
    chunk => '1',
    chunk_index => undef,
    cnt_diff => '0',
    crc_diff => '1',
    lower_boundary => undef,
    master_cnt => undef,
    master_crc => undef,
    table => 'vsp_jira_current.propertytext',
    this_cnt => '2295',
    this_crc => '375ddbc',
    upper_boundary => undef
  }
];
checksum:$VAR1 = [
  {
    chunk => '1',
    chunk_index => undef,
    chunk_time => undef,
    db => 'vsp_jira_current',
    lower_boundary => undef,
    master_cnt => undef,
    master_crc => undef,
    tbl => 'propertytext',
    this_cnt => '2295',
    this_crc => '375ddbc',
    ts => '2013-05-01 16:37:50',
    upper_boundary => undef
  }
];

Revision history for this message
Roland Hager (roland-hager) wrote :

I'm experiencing the same problem here. The daily run pt-table-checksum job nearly always returns one or two tables with a DIFF. But there is no difference in the checksum table and pt-table-sync finds nothing to repair. The problem also occurs when I manually start pt-table-checksum so it is reproduceable.

If I can help in any way to produce some debug output, let me know.

pt-table-checksum version 2.2.2
mysql version 5.1.61 on both master and slave
We are using ROW based replication here.
We are ignoring some cache tables both within replication and pt-table-checksum.

Best regards
Roland Hager

Revision history for this message
Rowan Wookey (rwky) wrote :

I experienced this issue shortly after DST changed, in my case is was because the time on the servers was out of sync. Others that are experiencing this issue check your server times.

tags: added: 54898
tags: added: i54898
removed: 54898
Changed in percona-toolkit:
status: In Progress → Incomplete
Revision history for this message
Josh Heidenreich (josh-sickmate) wrote :

We've been seeing this issue on our production server with a MariaDB master / MariaDB slave which has really low replication lag. We've been running pt-table-checksum nightly on a cron job, and have recently been seeing false positives.

I found that the initial run was giving false positives on one or two tables, but running --replicate-check-only afterwards does not report any false positives.

I'm guessing that as per previous comments, the slave is slightly behind even though it reports not being behind at all, and is mid-way-though execution of the UPDATE query of the master_* columns, so the SELECT statement to check the difference is checking wrong values.

Adding the argument --replicate-check-retries=2 causes pt-table-checksum to check the checksum a second time if it finds a difference.

In my testing this has eliminated the false positives.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1048

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.