Comment 7 for bug 1287253

Revision history for this message
Daniel Durand (daniel-durand) wrote : Re: [Bug 1287253] Re: pt-table-checksum deadlock

Hi Frank, let me check to be sure which one I saw it, what I remember it
was pt_table_sync but let me take a look, don't remember in this moment
when I check this was 3 or 4 months ago... thanks

2014-07-30 9:22 GMT-07:00 Frank Cizmich <email address hidden>:

> Sorry, I now see you mentioned 2.2.6 right of the bat.
>
> Still, pt-osc, pt-archiver and pt-heartbeat catch that error in that
> version.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1287253
>
> Title:
> pt-table-checksum deadlock
>
> Status in Percona Toolkit:
> Fix Committed
>
> Bug description:
> Percona Toolkik 2.2.6
>
> We currently using pt-table-checksum to check data integrity on all
> the oltp cluster, we have 6 servers there with 8k qps and we start to
> get deadlocks on one of the biggest tables when this tool try to get
> the checksum on some chunks. I already drop chunk time, but it doesn't
> fix completly the problem :-/.
>
> Command Line used:
>
> pt-table-checksum --host=sql1v-ci.ci.prd.********
> --user=pt_checksum --password=******* --nocheck-binlog-format
> --check-interval=1 --check-plan --check-replication-filters
> --check-slave-tables --chunk-size-limit=1 --chunk-time="0.025"
> --create-replicate-table --databases="tewn"
> --tables="landing_page_analytics" --empty-replicate-table --max-
> lag="1s" --max-load="Threads_running=40" --progress="time,30"
> --replicate="pt_data.checksums" --retries=5
>
> Checksumming tewn.landing_page_analytics: 1% 25:28 remain
> Checksumming tewn.landing_page_analytics: 4% 22:42 remain
> Checksumming tewn.landing_page_analytics: 6% 20:59 remain
> Checksumming tewn.landing_page_analytics: 8% 20:35 remain
> Checksumming tewn.landing_page_analytics: 10% 21:59 remain
> Checksumming tewn.landing_page_analytics: 12% 20:42 remain
> Checksumming tewn.landing_page_analytics: 14% 20:39 remain
> Checksumming tewn.landing_page_analytics: 16% 19:50 remain
> Checksumming tewn.landing_page_analytics: 18% 20:04 remain
> Checksumming tewn.landing_page_analytics: 19% 21:14 remain
> Checksumming tewn.landing_page_analytics: 21% 20:21 remain
> Checksumming tewn.landing_page_analytics: 23% 19:27 remain
> Checksumming tewn.landing_page_analytics: 25% 18:53 remain
> Checksumming tewn.landing_page_analytics: 28% 18:15 remain
> Checksumming tewn.landing_page_analytics: 30% 17:41 remain
> Checksumming tewn.landing_page_analytics: 32% 16:52 remain
> Checksumming tewn.landing_page_analytics: 34% 16:30 remain
> Checksumming tewn.landing_page_analytics: 36% 15:42 remain
> Checksumming tewn.landing_page_analytics: 39% 15:29 remain
> Checksumming tewn.landing_page_analytics: 41% 14:53 remain
> Checksumming tewn.landing_page_analytics: 43% 14:15 remain
> Checksumming tewn.landing_page_analytics: 45% 13:33 remain
> Checksumming tewn.landing_page_analytics: 48% 12:50 remain
> Checksumming tewn.landing_page_analytics: 50% 12:26 remain
> Checksumming tewn.landing_page_analytics: 52% 11:43 remain
> Checksumming tewn.landing_page_analytics: 54% 11:15 remain
> Checksumming tewn.landing_page_analytics: 55% 11:04 remain
> Checksumming tewn.landing_page_analytics: 57% 10:53 remain
> 02-28T15:30:44 Error checksumming table tewn.landing_page_analytics:
> Error executing checksum query: DBD::mysql::st execute failed: Deadlock
> found when trying to get lock; try restarting transaction [for Statement
> "REPLACE INTO `pt_data`.`checksums` (db, tbl, chunk, chunk_index,
> lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?,
> ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#',
> `id`, `domain_bitfield`, `page_id`, `category`, `demographic_sex`,
> `demographic_age`, `content_rating`, `num_ctr`, `num_display`, `weight`,
> `weight_vw`, `created`, `visible`, `type`, `key`, `updated` + 0,
> CONCAT(ISNULL(`domain_bitfield`), ISNULL(`page_id`), ISNULL(`num_ctr`),
> ISNULL(`num_display`), ISNULL(`weight`), ISNULL(`created`)))) AS
> UNSIGNED)), 10, 16)), 0) AS crc FROM `tewn`.`landing_page_analytics` FORCE
> INDEX(`PRIMARY`) WHERE ((`id` >= ?)) AND ((`id` <= ?)) /*checksum chunk*/"
> with ParamValues: 0='tewn', 1='landing_page_analytics', 2=16297,
> 3='PRIMARY', 4='159650735', 5='159662859', 6='159650735', 7='159662859'] at
> /usr/bin/pt-table-checksum line 10459.
>
>
> TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
> 02-28T15:30:44 1 0 80558617 16297 0 886.019
> tewn.landing_page_analytics
>
>
> May suggest something, I was checking this script and MySQL suggest when
> you have this kind of deadlock you must to retry to execute again this
> sentence, but in pt-table-checksum script line 10510 we have "fail"
> function which is evaluating kind of error so, retry argument option for
> the tool will retry to execute the last sql statement, like I said MySQL
> suggest to retry to execute, so with this modification this could be fixed.
>
> fail => sub {
> my (%args) = @_;
> my $error = $args{error};
>
> if ( $error =~ m/Lock wait timeout exceeded/
> || $error =~ m/Query execution was interrupted/
> || $error =~ m/Deadlock found when trying to get lock/
> ) {
> return 1;
> }
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/percona-toolkit/+bug/1287253/+subscriptions
>