pt-table-sync can't sync rows due to deadlocks

Bug #1263168 reported by Valerii Kravchuk on 2013-12-20
24
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Percona Toolkit
Undecided
Daniel Nichter

Bug Description

pt-table-sync 2.2.5 on CentOS 6.4, called like

pt-table-sync --config /home/customer/testussupp.cfg h=host1 h=host2

where the .cfg contains:
user=admin
password=<REDACTED>
print
no-foreign-key-checks
databases=db1
tables=us,us_excludes,us_extra_attributes,us_includes,us_payload_type_format,us_recon_policy,us_sensitivity,us_sharing_restrictions,us_site,usage
no-unique-checks
where=u_id>(select min(u_id) from us where u_time + 600 >= unix_timestamp())

Throws the following error:

Called not_in_left in state 0 at /usr/bin/pt-table-sync line 5575. while doing db1.us on host2

PTDEBUG capture shows deadlocks right before the error:

Capture #1:
==========
# RowDiff:3177 5343 Fetching row from left
# RowDiff:3179 5343 DBD::mysql::st fetchrow_hashref failed: Deadlock found when trying to get lock; try restarting transaction [for Statement "SELECT /*db1.us:1/1*/ 0 AS chunk_num, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `u_id`, `u_server_id`, `unq_id`, `u_time`, `proc_time`, `sha1_hash`, `u_data`, `u_envelope`, CONCAT(ISNULL(`u_time`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`us` FORCE INDEX (`PRIMARY`) WHERE (1=1) AND ((u_id>(select min(u_id) from us where u_time + 600 >= unix_timestamp()))) LOCK IN SHARE MODE"] at /usr/bin/pt-table-sync line 3178.
#
# RowDiff:3187 5343 Fetching row from right
# RowDiff:3217 5343 Right is not in left
Called not_in_left in state 0 at /usr/bin/pt-table-sync line 5575. while doing db1.us on host2

Capture #2:
==========
# RowDiff:3177 22395 Fetching row from left
# RowDiff:3179 22395 DBD::mysql::st fetchrow_hashref failed: Deadlock found when trying to get lock; try restarting transaction [for Statement "SELECT /*db1.us:1/1*/ 0 AS chunk_num, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `u_id`, `u_server_id`, `unq_id`, `u_time`, `proc_time`, `sha1_hash`, `u_data`, `u_envelope`, CONCAT(ISNULL(`u_time`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`us` FORCE INDEX (`PRIMARY`) WHERE (1=1) AND ((u_id>(select min(u_id) from us where u_time + 600 >= unix_timestamp()))) LOCK IN SHARE MODE"] at /usr/bin/pt-table-sync line 3178.
#
# RowDiff:3187 22395 Fetching row from right
# RowDiff:3217 22395 Right is not in left
Called not_in_left in state 0 at /usr/bin/pt-table-sync line 5575. while doing db1.us on host2

"I did notice that once, while running pt-table-sync, a client did encounter a Lock Wait Timeout Exceeded error and I can also see that Row Lock Waits and Row Lock Time were spiking when I was running the pt-table-sync" (customer words)

And we can see the deadlock here:
------------------------
LATEST DETECTED DEADLOCK
------------------------
131218 11:49:12
*** (1) TRANSACTION:
TRANSACTION 3213997, ACTIVE 2 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 25 lock struct(s), heap size 3112, 10 row lock(s), undo log entries 8
MySQL thread id 143451, OS thread handle 0x7fe82d531700, query id 66882879 update
insert into usage (site_id,use_type_id,payload_type_id,payload_format_id,server_time,u_id,u_server_id) values(9,1,1,7,1387388950,665348,2) ON DUPLICATE KEY UPDATE u_id = 665348, server_time=1387388950
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 238 page no 10 n bits 352 index `PRIMARY` of table `db1`.`usage` trx id 3213997 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 3213995, ACTIVE 4 sec fetching rows
mysql tables in use 2, locked 1
10 lock struct(s), heap size 1248, 475 row lock(s)
MySQL thread id 668647, OS thread handle 0x7fe82d900700, query id 66882933 localhost admin Sending data
SELECT /*db1.us:1/1*/ 0 AS chunk_num, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `u_id`, `u_server_id`, `unq_id`, `u_time`, `proc_time`, `sha1_hash`, `u_data`, `u_envelope`, CONCAT(ISNULL(`u_time`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`us` FORCE INDEX (`PRIMARY`) WHERE (1=1) AND ((u_id>(select min(u_id) from us where u_time + 600 >= unix_timestamp()))) LOCK IN SHARE MODE
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 238 page no 10 n bits 352 index `PRIMARY` of table `db1`.`usage` trx id 3213995 lock mode S
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 224 page no 1354467 n bits 80 index `PRIMARY` of table `db1`.`us` trx id 3213995 lock mode S waiting
*** WE ROLL BACK TRANSACTION (2)

Txn Isolation Level | REPEATABLE-READ

tags: added: i37826
Mike Diehn (mike-diehn) wrote :

I googled "The total number of locks exceeds the lock table size," which found in the debug output. Here's links to articles saying to increase the innodb_buffer_pool_size. My tables are 55+ million rows, so this seems plausible.

http://major.io/2010/02/16/mysql-the-total-number-of-locks-exceeds-the-lock-table-size-2/
http://stackoverflow.com/questions/6901108/the-total-number-of-locks-exceeds-the-lock-table-size

Didn't work with 2GB.
Didn't work with --lock=3 nor with --lock=0

Using --no-transaction and --lock=2 seems to be working. It seems slower, though.

tags: added: dead-lock percona-37826 pt-table-sync
removed: i37826
Changed in percona-toolkit:
assignee: nobody → Daniel Nichter (daniel-nichter)
tags: added: i37826
removed: percona-37826
Frank Cizmich (frank-cizmich) wrote :

Hi Valerii,

Revisiting this old "heisenbug" ...

I see the original intent was to archive data of a master table to another (non replication) server. This explains the unusual where clause. Maybe pt-archiver is better suited for this since it also deletes old data from master, and can deal with deadlocks (--retries option).

Since the future version of pt-table-sync will be restricted to normal replication, and will work on top of pt-table-checksum results (which is better prepared to deal with deadlocks), this issue can be considered moot.

Although this can be considered a true bug, for the above reasons I'm marking this as won't fix.

Valerii, feel free to reopen or comment if you have other considerations.

Changed in percona-toolkit:
status: New → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers