pt-table-checksum + PXC inconsistent results upon --resume

Bug #1311654 reported by Aurimas Mikalauskas on 2014-04-23
Affects: Percona Toolkit
Assigned to: Frank Cizmich

Bug Description

If I interrupt and then resume a pt-table-checksum run that is checking two PXC nodes, about 20-30% of the time I get an incorrect result: a checksum mismatch. This is easily reproducible with small tables. Here is the command I am running:

/usr/bin/pt-table-checksum \
                        --recursion-method cluster \
                        --user $USER \
                        --password $PASSWORD \
                        --max-load Threads_running=$MAXTHREADS \
                        --progress time,3600 \
                        --chunk-size-limit 4 \
                        --pid $PID \
                        --databases db1,db2

The PTDEBUG output contains sensitive customer information, so it will be sent privately.

Daniel's hack, which adds an extra 1.5s delay before checking for the last chunk, reduced the mismatches to zero. However, we were testing with very small tables, so the waits added a lot of overhead, and I am guessing that in most cases I would have been interrupting pt-table-checksum while it was waiting.

Tested with pt-table-checksum 2.2.7 and Percona XtraDB Cluster, Release 31.1, wsrep_25.9.r3928 (5.5.34-31.1)

Related branches

Daniel Nichter: Approve on 2014-08-05
tags: added: pt-table-checksum
Changed in percona-toolkit:
importance: Undecided → Medium
status: New → Incomplete
status: Incomplete → Fix Committed
milestone: none → 2.2.10
assignee: nobody → Frank Cizmich (frank-cizmich)
Frank Cizmich (frank-cizmich) wrote :

Discrepant table checksums are now re-checked a number of times at short intervals before they are reported as real differences.
This strategy does not add significant time to the overall run since differences are usually rare, and this is done at most once per table.
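
The re-check strategy described above can be illustrated with a small shell sketch. This is not the pt-table-checksum implementation (the tool is written in Perl); `recheck_diff` and its arguments are hypothetical names used only to show the idea of repeating a check a few times before accepting a difference as real.

```shell
# Illustrative only: NOT the pt-table-checksum implementation.
# recheck_diff RETRIES INTERVAL CMD...
#   Runs CMD up to RETRIES times, sleeping INTERVAL seconds between attempts.
#   Returns 0 as soon as CMD succeeds (the checksums now match),
#   or 1 if every attempt still reports a difference.
# Variables are global because POSIX sh has no `local`.
recheck_diff() {
  retries=$1
  interval=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$@"; then
      return 0
    fi
    # Only wait if another attempt is still to come.
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

With RETRIES set to 1 the check runs once and is never repeated, which mirrors the "no retries" default described in the next comment.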

Frank Cizmich (frank-cizmich) wrote :

To preserve the default behavior a new command line parameter was added.

If you are having resume problems, you can now set --replicate-check-retries N, where N is the number of times to retry a discrepant checksum (default = 1, no retries).

Setting a value of 3 is enough to completely eliminate spurious differences.
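
As a rough sketch, a resumed run with the new option might look like the command assembled below. It assumes the same cluster and databases as the original report and leaves out the credentials and load options; the snippet only builds and prints the command so it can be reviewed before being run.

```shell
# Sketch: resume an interrupted run and re-check discrepant checksums
# up to 3 times. db1,db2 are the placeholder database names from the report;
# add --user/--password and other options as in the original command.
retries=3
cmd="/usr/bin/pt-table-checksum --recursion-method cluster --resume --replicate-check-retries $retries --databases db1,db2"

# Print the command for review; run it with: eval "$cmd"
echo "$cmd"
```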

Changed in percona-toolkit:
status: Fix Committed → Fix Released