Comment 17 for bug 1080765

Daniel Nichter (daniel-nichter) wrote : Re: [Bug 1080765] Re: pt-table-checksum reports false errors and misses real errors

On Dec 17, 2012, at 12:41 PM, Rob Wagner wrote:
> TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
> 12-17T14:36:02 0 0 19 1 0 0.021 vsp_jira_current.prodiffs:$VAR1 = [
> {
> chunk => '1',
> chunk_index => undef,
> cnt_diff => '0',
> crc_diff => '1',
> lower_boundary => undef,
> master_cnt => undef,
> master_crc => undef,
> table => 'vsp_jira_current.propertytext',
> this_cnt => '1987',
> this_crc => '128460a9',
> upper_boundary => undef
> }
> ];
> checksum:$VAR1 = [
> {
> chunk => '1',
> chunk_index => undef,
> chunk_time => '0.00435',
> db => 'vsp_jira_current',
> lower_boundary => undef,
> master_cnt => '1987',
> master_crc => '128460a9',
> tbl => 'propertytext',
> this_cnt => '1987',
> this_crc => '128460a9',
> ts => '2012-12-17 14:36:03',
> upper_boundary => undef
> }
> ];
> jectcategory

That is helpful. It shows us that the tool is failing to wait for updates on the master to replicate before checking the slave. The tool should do the following, in order:

1. INSERT..SELECT /* checksum query */
2. UPDATE /* master_* columns */
3. Wait for slave lag
4. Wait for last chunk to appear on all slaves
5. SELECT /* find diffs */

It seems that #5 is happening before #2 has replicated, probably because #3 is wrong. This can be seen from the fact that the diff the tool sees (first struct) has NULL values for master_cnt and master_crc, indicating that #2 hasn't happened yet. But a moment later (second struct), those columns have values and we see there's really no diff.
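To make the two structs easier to compare, here is a minimal Python sketch of how the diff flags could come out the way they do. It is not the tool's actual query. The function name `diff_flags` is hypothetical, `None` stands in for SQL NULL, and the NULL handling (a count difference against NULL collapses to 0, a CRC is flagged when only one side is NULL) is an assumption chosen because it is consistent with the cnt_diff and crc_diff values in the first struct.

```python
from typing import Optional, Tuple


def diff_flags(this_cnt: int, this_crc: str,
               master_cnt: Optional[int],
               master_crc: Optional[str]) -> Tuple[int, int]:
    """Return (cnt_diff, crc_diff) for one chunk row on a slave.

    None stands in for SQL NULL. Assumed semantics (not taken from the
    tool's source): a count difference against NULL becomes 0, and the
    CRC is flagged when exactly one side is NULL or both differ.
    """
    cnt_diff = 0 if master_cnt is None else this_cnt - master_cnt
    if master_crc is None or this_crc is None:
        crc_diff = int((master_crc is None) != (this_crc is None))
    else:
        crc_diff = int(this_crc != master_crc)
    return cnt_diff, crc_diff


# First struct: the UPDATE from step 2 has not replicated yet.
before = diff_flags(1987, "128460a9", None, None)
# Second struct: a moment later the master_* columns are filled in.
after = diff_flags(1987, "128460a9", 1987, "128460a9")
print(before, after)
```

Under those assumptions the first call flags a CRC diff purely because master_crc is still NULL, and the second call reports no diff at all, which is the pattern in the quoted output.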

I think this could happen if replication is very fast and/or Seconds_Behind_Master is almost always zero, and the server on which the tool is running is also very fast. In that case #1 and #2 happen, then #3 sees 0 slave lag even though #2 hasn't actually been applied on the slave yet, then #4 and #5 happen very quickly, producing the false diff shown in the first struct.

I'll have to think about how to prove whether this is the case (difficult with two systems and two programs involved). It would be nice to reproduce this reliably, but even if we can't, we can probably fix it by making #4 also wait for the last chunk to have defined master_cnt and master_crc values.
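A rough sketch of that stricter wait, in Python rather than the tool's own code, might look like the following. Everything here is hypothetical: `wait_for_master_values`, the `fetch_last_chunk` callback, and the timeout and interval values are illustration only, and the fake slave at the bottom just replays the sequence seen in the debug output (row missing, row present with NULL master values, row complete).

```python
import time
from typing import Callable, Optional


def wait_for_master_values(fetch_last_chunk: Callable[[], Optional[dict]],
                           timeout: float = 5.0,
                           interval: float = 0.01) -> bool:
    """Poll a slave until the last chunk row has non-NULL master values.

    fetch_last_chunk is a hypothetical callback that returns the last
    chunk's row from the slave's checksum table as a dict, or None if
    the row has not replicated yet. Returns False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        row = fetch_last_chunk()
        if (row is not None
                and row.get("master_cnt") is not None
                and row.get("master_crc") is not None):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated slave: step 1 not replicated, then step 1 only, then step 2.
polls = iter([
    None,
    {"chunk": 1, "this_cnt": 1987, "this_crc": "128460a9",
     "master_cnt": None, "master_crc": None},
    {"chunk": 1, "this_cnt": 1987, "this_crc": "128460a9",
     "master_cnt": 1987, "master_crc": "128460a9"},
])
seen = []


def fake_fetch() -> Optional[dict]:
    row = next(polls)
    seen.append(row)
    return row


ready = wait_for_master_values(fake_fetch)
print(ready, len(seen))
```

The point of the design is that the wait no longer trusts the reported slave lag at all. It keys off the data itself, so a slave that reports zero lag but has not applied the UPDATE yet simply gets polled again.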