pt-table-checksum has a high likelihood of skipping a table when its row count is around chunk-size * chunk-size-limit

Bug #1389041 reported by Peiran Song
This bug affects 2 people
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Fix Released
Importance: Medium
Assigned to: Frank Cizmich

Bug Description

pt-table-checksum decides whether a table can be checksummed in one chunk based on:

estimated rows <= --chunk-size * --chunk-size-limit

This makes no allowance for inaccurate row estimates when the table size is close to --chunk-size * --chunk-size-limit. For example, with the default settings of --chunk-size=1000 and --chunk-size-limit=2, a table’s row estimate can quite legitimately be 1999 on the master and 2001 on a slave. In that case the tool reports “Skipping table ... because on the master it would be checksummed in one chunk but on these replicas it has too many rows”.
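The edge case can be reproduced with a small model of the rule above. This is only a sketch of the behaviour as described in this report, not the tool's actual Perl code; the function name, parameter names, and return values are hypothetical.

```python
def one_chunk_decision(master_est, replica_ests, chunk_size=1000, chunk_size_limit=2.0):
    """Simplified, hypothetical model of the one-chunk check described in this report."""
    max_rows = chunk_size * chunk_size_limit

    # The master estimate decides whether the table is a one-chunk candidate.
    if master_est > max_rows:
        return "chunk"

    # Each replica estimate is then compared against the same limit.
    oversized = {name: rows for name, rows in replica_ests.items() if rows > max_rows}
    if oversized:
        return "skip"
    return "one-chunk"


# Estimates that differ by only two rows but straddle the 2000-row limit.
print(one_chunk_decision(1999, {"replica1": 2001}))  # skip
print(one_chunk_decision(1999, {"replica1": 1999}))  # one-chunk
```

In this model the table is skipped even though the two estimates differ by about 0.1%, because they happen to fall on opposite sides of the limit.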

One fix is to remove this one-chunk logic and always use --chunk-size or a heuristic to do the chunking.

This is related to support issue #43754.


summary: - pt-table-checksum has high likelyhood to skip a table when rows around
- chunk-size * chunk-size-limit
+ pt-table-checksum has high likelyhood to skip a table when row count is
+ around chunk-size * chunk-size-limit
tags: added: i43754
tags: added: pt-table-checksum
Revision history for this message
Nilnandan Joshi (nilnandan-joshi) wrote :

Verified with PS 5.6 and pt-table-checksum 2.2.11.

The test case is very simple: create a table (here named test) in a master-slave environment, with fewer than 2000 rows on the master and more than 2000 rows on the slave.

On master:

mysql> select count(*) from test;
+----------+
| count(*) |
+----------+
|     1948 |
+----------+
1 row in set (0.01 sec)

On slave:

mysql> select count(*) from test;
+----------+
| count(*) |
+----------+
|     2048 |
+----------+
1 row in set (0.00 sec)

Then run the checksum with the --chunk-size option:

nilnandan@Dell-XPS:~$ pt-table-checksum --chunk-size=1000 --user=root --password=msandbox --socket=/tmp/mysql_sandbox20886.sock --recursion-method dsn=D=percona,t=dsns
            TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
11-12T14:21:48 0 0 0 1 0 0.041 mysql.columns_priv
11-12T14:21:48 0 0 0 1 0 0.032 mysql.db
11-12T14:21:48 0 0 0 1 0 0.037 mysql.event
11-12T14:21:48 0 0 0 1 0 0.035 mysql.func
11-12T14:21:48 0 0 40 1 0 0.031 mysql.help_category
11-12T14:21:48 0 0 485 1 0 0.033 mysql.help_keyword
11-12T14:21:48 0 0 1090 1 0 0.037 mysql.help_relation
11-12T14:21:49 0 0 533 1 0 0.042 mysql.help_topic
11-12T14:21:49 0 0 0 1 0 0.030 mysql.ndb_binlog_index
11-12T14:21:49 0 0 0 1 0 0.035 mysql.plugin
11-12T14:21:49 0 0 0 1 0 0.032 mysql.proc
11-12T14:21:49 0 0 0 1 0 0.034 mysql.procs_priv
11-12T14:21:49 0 0 2 1 0 0.039 mysql.proxies_priv
11-12T14:21:49 0 0 0 1 0 0.027 mysql.servers
11-12T14:21:49 0 0 0 1 0 0.031 mysql.tables_priv
11-12T14:21:49 0 0 0 1 0 0.036 mysql.time_zone
11-12T14:21:49 0 0 0 1 0 0.032 mysql.time_zone_leap_second
11-12T14:21:49 0 0 0 1 0 0.036 mysql.time_zone_name
11-12T14:21:49 0 0 0 1 0 0.021 mysql.time_zone_transition
11-12T14:21:49 0 0 0 1 0 0.031 mysql.time_zone_transition_type
11-12T14:21:49 0 0 8 1 0 0.022 mysql.user
11-12T14:21:49 Skipping table nil.test because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
  2031 rows on Dell-XPS
The current chunk size limit is 2000 rows (chunk size=1000 * chunk size limit=2.0).
11-12T14:21:49 0 0 1 1 0 0.026 percona.dsns
11-12T14:21:49 0 0 1 1 0 0.040 test.nil
11-12T14:21:49 Cannot checksum table test.nil_test: There is no good index and the table is oversized. at /usr/bin/pt-table-checksum line 6417.

nilnandan@Dell-XPS:~$

Changed in percona-toolkit:
status: New → Confirmed
Changed in percona-toolkit:
status: Confirmed → In Progress
importance: Undecided → Medium
assignee: nobody → Frank Cizmich (frank-cizmich)
milestone: none → 2.2.14
Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Very simple fix: adding a 20% tolerance to the row discrepancy.
2.2.13

Changed in percona-toolkit:
status: In Progress → Fix Committed
Changed in percona-toolkit:
milestone: 2.2.14 → none
Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Removed from milestone 2.2.14 because of backwards compatibility concerns.
The patch provided in comment #2 is still highly reliable for those having this problem, though.
Other workarounds include running the tool again against the skipped tables using --max-chunk-limit=3 (three) instead of the default 2.
This is a multiplier for the tolerance of max-chunk-size.

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

errata:

previous comment should have read --chunk-size-limit=3 instead of --max-chunk-limit=3

Revision history for this message
Jervin R (revin) wrote :

I think a better fix here is to apply only the tolerance patch for the edge cases as originally described on this bug instead of applying to all cases.

Revision history for this message
Jervin R (revin) wrote :

More on my comment above, after discussing with the customer: the --tolerance option has merits here, as we are trying to solve a valid edge case. Either we implement the feature, or we remove the heuristics for this edge case as Peiran suggested in the report.

Changed in percona-toolkit:
status: Fix Committed → Fix Released
status: Fix Released → In Progress
Changed in percona-toolkit:
milestone: none → 2.3.1
Changed in percona-toolkit:
status: In Progress → Fix Committed
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Setting as "Won't Fix" because the real problem is a fundamental design “flaw” of the tool. First of all, the current solution is a hard-coded +20% to --chunk-size-limit, like: --chunk-size-limit * 1.2. This is unnecessary because if --chunk-size-limit=2 then the user can already accomplish this by specifying --chunk-size-limit=2.4. So a new hard-coded threshold is simply not needed; the same effect can be achieved with existing options.

The real problem is that the tool can’t fall back to chunking a table if a single chunk is too large. When the table is too large on a slave (even if by 1 row, because we can’t do much about EXPLAIN estimates other than specify a higher value for --chunk-size-limit), instead of skipping the table, the tool should fall back to chunking the table. The way the tool is currently written precludes this.

I looked into adding an option to simply prevent single chunks in the first place, i.e. force chunking every table, but the tool’s design precludes this, too. The best solution, imho, is simply to remove single chunking altogether. It’s a special case of normal chunking that’s nice in theory but makes the code complex and difficult to change. I don’t see this happening soon, though, because it’d take a lot of work.

For now, the solution is: specify a larger --chunk-size-limit.

I’ll document this edge case behavior more clearly in the tool.

Changed in percona-toolkit:
status: Fix Committed → Won't Fix
milestone: 2.3.1 → none
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :
Revision history for this message
Anders Kaseorg (andersk) wrote :

Wait, no, the proposed patch is not equivalent to multiplying --chunk-size-limit by 1.2.

--chunk-size-limit is used in several places, but the patch only adds a multiplier on one of them, to increase the tolerance for the slaves but not for the master. No single value of the existing --chunk-size-limit option has this effect. This is a critical difference: if you have many tables of different sizes, then raising --chunk-size-limit to work around this problem on some tables will probably just cause the same problem to be triggered on other tables.

The threshold is needed to tolerate a relative difference between the number of rows on the master and the slave, not some absolute limit.
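The difference can be illustrated with a simplified model. This is a sketch under assumptions, not the patch itself: `decide` and the `replica_tolerance` parameter are hypothetical names, and the model only reflects the behaviour described in this thread (one limit for the master check, the same limit times a multiplier for the replica check).

```python
def decide(master_est, replica_est, chunk_size=1000,
           chunk_size_limit=2.0, replica_tolerance=1.0):
    """Hypothetical model: the tolerance multiplier applies to the replica check only."""
    master_max = chunk_size * chunk_size_limit
    if master_est > master_max:
        return "chunk"      # too big for one chunk on the master: chunk normally
    if replica_est > master_max * replica_tolerance:
        return "skip"       # one chunk on the master, too many rows on the replica
    return "one-chunk"


# Raising the limit to 2.4 fixes the table near 2000 rows...
print(decide(1999, 2001, chunk_size_limit=2.4))    # one-chunk
# ...but a table near 2400 rows now hits the same problem.
print(decide(2399, 2401, chunk_size_limit=2.4))    # skip

# A replica-only tolerance of 1.2 handles both tables.
print(decide(1999, 2001, replica_tolerance=1.2))   # one-chunk
print(decide(2399, 2401, replica_tolerance=1.2))   # chunk
```

With the replica-only multiplier, a table is skipped in this model only when the replica estimate exceeds the master's limit by more than 20%, which is a relative tolerance rather than a shifted absolute limit.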

Revision history for this message
Anders Kaseorg (andersk) wrote :
Changed in percona-toolkit:
status: Won't Fix → In Progress
milestone: none → 2.2.17
Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

I'm falling back again to adding the tolerance option because, at the very least, it will get this issue out of the way with low risk.

Note: I acknowledge that removing the single chunk code altogether might be the best way in the long run for the reasons Daniel mentioned.

Changed in percona-toolkit:
status: In Progress → Fix Committed
Changed in percona-toolkit:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-662
