pt-table-checksum + PXC inconsistent results upon --resume
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Percona Toolkit moved to https://jira.percona.com/projects/PT | Fix Released | Medium | Frank Cizmich | | |
Bug Description
If I interrupt and then resume a pt-table-checksum run that is checking two PXC nodes, roughly 20-30% of the time I get an incorrect result: a checksum mismatch. This is easily reproducible with small tables. Here's the command I am running:
/usr/bin/
The PTDEBUG output contains sensitive customer information, so it will be sent privately.
Daniel's hack, which adds an extra 1.5s delay before checking for the last chunk, eliminated the mismatches in our tests. However, we were testing with very small tables, so these waits added a lot of overhead, and I am guessing that in most cases I interrupted pt-table-checksum while it was waiting.
Tested with pt-table-checksum 2.2.7 and Percona XtraDB Cluster, Release 31.1, wsrep_25.9.r3928 (5.5.34-31.1)
Related branches
- Daniel Nichter: Approve
Diff: 210 lines (+74/-32), 7 files modified
bin/pt-table-checksum (+34/-13)
lib/RowChecksum.pm (+34/-13)
t/pt-table-checksum/basics.t (+2/-2)
t/pt-table-checksum/samples/default-results-5.5.txt (+1/-1)
t/pt-table-checksum/samples/default-results-5.6.txt (+1/-1)
t/pt-table-checksum/samples/static-chunk-size-results-5.5.txt (+1/-1)
t/pt-table-checksum/samples/static-chunk-size-results-5.6.txt (+1/-1)
tags: | added: pt-table-checksum |
Changed in percona-toolkit: | |
importance: | Undecided → Medium |
status: | New → Incomplete |
status: | Incomplete → Fix Committed |
milestone: | none → 2.2.10 |
assignee: | nobody → Frank Cizmich (frank-cizmich) |
Changed in percona-toolkit: | |
status: | Fix Committed → Fix Released |
Discrepant table checksums are now re-checked a number of times at short intervals before they are reported as real differences.
This strategy does not add significant time to the overall run, since differences are usually rare and the re-check is done at most once per table.
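For readers who want to see the shape of the re-check strategy, here is a minimal sketch in Python. It is not the actual pt-table-checksum code (the tool is written in Perl), and every name in it (`confirm_mismatch`, `fetch_source_crc`, `fetch_replica_crc`, `max_rechecks`, `interval_s`) is hypothetical and chosen only for illustration. The assumption is that a difference caused by a node that has not yet applied the checksum write will disappear within a few short polls, while a real difference will persist.

```python
import time
from typing import Callable


def confirm_mismatch(
    fetch_source_crc: Callable[[], str],
    fetch_replica_crc: Callable[[], str],
    max_rechecks: int = 3,        # hypothetical default, not the tool's value
    interval_s: float = 0.5,      # hypothetical default, not the tool's value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the checksums still differ after every re-check."""
    # Fast path: most tables match on the first comparison, so no waiting occurs.
    if fetch_source_crc() == fetch_replica_crc():
        return False

    # A difference was seen. Poll again a few times before trusting it.
    for _ in range(max_rechecks):
        sleep(interval_s)
        if fetch_source_crc() == fetch_replica_crc():
            return False  # the difference was transient

    return True  # the difference persisted through all re-checks
```

Because the waiting only happens after a difference has been observed, a run with no discrepancies pays no extra cost, which is in line with the comment above that the strategy adds little time overall.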