pt-table-checksum deadlock

Bug #1287253 reported by Daniel Durand on 2014-03-03
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit
Medium
Frank Cizmich

Bug Description

Percona Toolkik 2.2.6

We currently using pt-table-checksum to check data integrity on all the oltp cluster, we have 6 servers there with 8k qps and we start to get deadlocks on one of the biggest tables when this tool try to get the checksum on some chunks. I already drop chunk time, but it doesn't fix completly the problem :-/.

Command Line used:

pt-table-checksum --host=sql1v-ci.ci.prd.******** --user=pt_checksum --password=******* --nocheck-binlog-format --check-interval=1 --check-plan --check-replication-filters --check-slave-tables --chunk-size-limit=1 --chunk-time="0.025" --create-replicate-table --databases="tewn" --tables="landing_page_analytics" --empty-replicate-table --max-lag="1s" --max-load="Threads_running=40" --progress="time,30" --replicate="pt_data.checksums" --retries=5

Checksumming tewn.landing_page_analytics: 1% 25:28 remain
Checksumming tewn.landing_page_analytics: 4% 22:42 remain
Checksumming tewn.landing_page_analytics: 6% 20:59 remain
Checksumming tewn.landing_page_analytics: 8% 20:35 remain
Checksumming tewn.landing_page_analytics: 10% 21:59 remain
Checksumming tewn.landing_page_analytics: 12% 20:42 remain
Checksumming tewn.landing_page_analytics: 14% 20:39 remain
Checksumming tewn.landing_page_analytics: 16% 19:50 remain
Checksumming tewn.landing_page_analytics: 18% 20:04 remain
Checksumming tewn.landing_page_analytics: 19% 21:14 remain
Checksumming tewn.landing_page_analytics: 21% 20:21 remain
Checksumming tewn.landing_page_analytics: 23% 19:27 remain
Checksumming tewn.landing_page_analytics: 25% 18:53 remain
Checksumming tewn.landing_page_analytics: 28% 18:15 remain
Checksumming tewn.landing_page_analytics: 30% 17:41 remain
Checksumming tewn.landing_page_analytics: 32% 16:52 remain
Checksumming tewn.landing_page_analytics: 34% 16:30 remain
Checksumming tewn.landing_page_analytics: 36% 15:42 remain
Checksumming tewn.landing_page_analytics: 39% 15:29 remain
Checksumming tewn.landing_page_analytics: 41% 14:53 remain
Checksumming tewn.landing_page_analytics: 43% 14:15 remain
Checksumming tewn.landing_page_analytics: 45% 13:33 remain
Checksumming tewn.landing_page_analytics: 48% 12:50 remain
Checksumming tewn.landing_page_analytics: 50% 12:26 remain
Checksumming tewn.landing_page_analytics: 52% 11:43 remain
Checksumming tewn.landing_page_analytics: 54% 11:15 remain
Checksumming tewn.landing_page_analytics: 55% 11:04 remain
Checksumming tewn.landing_page_analytics: 57% 10:53 remain
02-28T15:30:44 Error checksumming table tewn.landing_page_analytics: Error executing checksum query: DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction [for Statement "REPLACE INTO `pt_data`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `id`, `domain_bitfield`, `page_id`, `category`, `demographic_sex`, `demographic_age`, `content_rating`, `num_ctr`, `num_display`, `weight`, `weight_vw`, `created`, `visible`, `type`, `key`, `updated` + 0, CONCAT(ISNULL(`domain_bitfield`), ISNULL(`page_id`), ISNULL(`num_ctr`), ISNULL(`num_display`), ISNULL(`weight`), ISNULL(`created`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `tewn`.`landing_page_analytics` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) AND ((`id` <= ?)) /*checksum chunk*/" with ParamValues: 0='tewn', 1='landing_page_analytics', 2=16297, 3='PRIMARY', 4='159650735', 5='159662859', 6='159650735', 7='159662859'] at /usr/bin/pt-table-checksum line 10459.

            TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-28T15:30:44 1 0 80558617 16297 0 886.019 tewn.landing_page_analytics

May suggest something, I was checking this script and MySQL suggest when you have this kind of deadlock you must to retry to execute again this sentence, but in pt-table-checksum script line 10510 we have "fail" function which is evaluating kind of error so, retry argument option for the tool will retry to execute the last sql statement, like I said MySQL suggest to retry to execute, so with this modification this could be fixed.

      fail => sub {
         my (%args) = @_;
         my $error = $args{error};

         if ( $error =~ m/Lock wait timeout exceeded/
             || $error =~ m/Query execution was interrupted/
             || $error =~ m/Deadlock found when trying to get lock/
         ) {
             return 1;
         }

Daniel Durand (daniel-durand) wrote :
Daniel Durand (daniel-durand) wrote :

Patch for this problem

Changed in percona-toolkit:
milestone: none → 2.2.10
status: New → Fix Committed
assignee: nobody → Frank Cizmich (frank-cizmich)
importance: Undecided → Medium
Frank Cizmich (frank-cizmich) wrote :

Fixed using Durand's patch.

Daniel Durand (daniel-durand) wrote :

Thank you Frank, I saw this problem is also happening with the other tools, it should have the same condition when this error happens " || $error =~ m/Deadlock found when trying to get lock/ "

Frank Cizmich (frank-cizmich) wrote :

Hi Daniel D,

Can you point out which other tools miss this, and what version of PT are you using?
It seems to be already covered in pt-online-schema-change and others in the current version.
By the way , pt-online-schema-change matches the shortened /Deadlock found/ , so we shortened it in pt-table-checksum too.

Regards

Frank Cizmich (frank-cizmich) wrote :

Sorry, I now see you mentioned 2.2.6 right of the bat.

Still, pt-osc, pt-archiver and pt-heartbeat catch that error in that version.

Download full text (5.5 KiB)

Hi Frank, let me check to be sure which one I saw it, what I remember it
was pt_table_sync but let me take a look, don't remember in this moment
when I check this was 3 or 4 months ago... thanks

2014-07-30 9:22 GMT-07:00 Frank Cizmich <email address hidden>:

> Sorry, I now see you mentioned 2.2.6 right of the bat.
>
> Still, pt-osc, pt-archiver and pt-heartbeat catch that error in that
> version.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1287253
>
> Title:
> pt-table-checksum deadlock
>
> Status in Percona Toolkit:
> Fix Committed
>
> Bug description:
> Percona Toolkik 2.2.6
>
> We currently using pt-table-checksum to check data integrity on all
> the oltp cluster, we have 6 servers there with 8k qps and we start to
> get deadlocks on one of the biggest tables when this tool try to get
> the checksum on some chunks. I already drop chunk time, but it doesn't
> fix completly the problem :-/.
>
> Command Line used:
>
> pt-table-checksum --host=sql1v-ci.ci.prd.********
> --user=pt_checksum --password=******* --nocheck-binlog-format
> --check-interval=1 --check-plan --check-replication-filters
> --check-slave-tables --chunk-size-limit=1 --chunk-time="0.025"
> --create-replicate-table --databases="tewn"
> --tables="landing_page_analytics" --empty-replicate-table --max-
> lag="1s" --max-load="Threads_running=40" --progress="time,30"
> --replicate="pt_data.checksums" --retries=5
>
> Checksumming tewn.landing_page_analytics: 1% 25:28 remain
> Checksumming tewn.landing_page_analytics: 4% 22:42 remain
> Checksumming tewn.landing_page_analytics: 6% 20:59 remain
> Checksumming tewn.landing_page_analytics: 8% 20:35 remain
> Checksumming tewn.landing_page_analytics: 10% 21:59 remain
> Checksumming tewn.landing_page_analytics: 12% 20:42 remain
> Checksumming tewn.landing_page_analytics: 14% 20:39 remain
> Checksumming tewn.landing_page_analytics: 16% 19:50 remain
> Checksumming tewn.landing_page_analytics: 18% 20:04 remain
> Checksumming tewn.landing_page_analytics: 19% 21:14 remain
> Checksumming tewn.landing_page_analytics: 21% 20:21 remain
> Checksumming tewn.landing_page_analytics: 23% 19:27 remain
> Checksumming tewn.landing_page_analytics: 25% 18:53 remain
> Checksumming tewn.landing_page_analytics: 28% 18:15 remain
> Checksumming tewn.landing_page_analytics: 30% 17:41 remain
> Checksumming tewn.landing_page_analytics: 32% 16:52 remain
> Checksumming tewn.landing_page_analytics: 34% 16:30 remain
> Checksumming tewn.landing_page_analytics: 36% 15:42 remain
> Checksumming tewn.landing_page_analytics: 39% 15:29 remain
> Checksumming tewn.landing_page_analytics: 41% 14:53 remain
> Checksumming tewn.landing_page_analytics: 43% 14:15 remain
> Checksumming tewn.landing_page_analytics: 45% 13:33 remain
> Checksumming tewn.landing_page_analytics: 48% 12:50 remain
> Checksumming tewn.landing_page_analytics: 50% 12:26 remain
> Checksumming tewn.landing_page_analytics: 52% 11:43 remain
> Checksumming tewn.landing_pag...

Read more...

Changed in percona-toolkit:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers