Percona Toolkit moved to https://jira.percona.com/projects/PT

pt-table-checksum doesn't reconnect the slave $dbh

Bug #1042727 reported by Baron Schwartz on 2012-08-28

This bug affects 4 people

Affects	Status	Importance	Assigned to	Milestone
Percona Toolkit	Fix Released	High	Daniel Nichter	2.2.15

Bug Description

When replication is very delayed, pt-table-checksum will not keep its connection to the replica [was:master] alive, and when the replica catches up or if it dies for some reason, we get an error. It looks like this:

================

08-27T09:44:10 Error waiting for the last checksum of table <...> to replicate to replica <...>: DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement "SELECT MAX(chunk) FROM `percona`.`checksum` WHERE ... at pt-table-checksum line 8581.

Check that the replica is running and has the replicate table `percona`.`checksum`. Checking the replica for checksum differences will probably cause another error.
08-27T09:44:10 Error checking for checksum differences of table <...> on replica <...>: DBD::mysql::db selectall_arrayref failed: MySQL server has gone away [for Statement "SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `rkdb`.`archivechecksum` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='...' AND tbl='...')"] at pt-table-checksum line 4118.

Check that the replica is running and has the replicate table `percona`.`checksum`.

================

I think the tool needs to reconnect to replicas.

[redacted: I think the tool needs to do a keepalive SELECT 1 or something like that.]
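
A minimal sketch of the reconnect idea, not the fix that was eventually shipped: before using a replica handle, ping it and open a new connection if the ping fails. `live_dbh`, `StubDbh` and the `$connect` callback are hypothetical names introduced here. Only `ping` is a real DBI handle method, and in the tool the callback would wrap a real `DBI->connect(...)` call.

```perl
use strict;
use warnings;

# Return a usable handle: reuse $dbh if its ping succeeds, otherwise call
# the caller-supplied $connect coderef to open a fresh connection.
# (Hypothetical helper; with real DBI, $connect would wrap DBI->connect.)
sub live_dbh {
   my ($dbh, $connect) = @_;
   return $dbh if $dbh && eval { $dbh->ping };
   return $connect->();
}

# Stub exposing the same ping() interface as a DBI database handle, so the
# sketch runs without a MySQL server. Not part of pt-table-checksum.
package StubDbh {
   sub new  { my ($class, %args) = @_; return bless {%args}, $class }
   sub ping { return $_[0]{alive} }
}

# A stale handle (ping fails) is replaced by a newly "connected" one.
my $stale = StubDbh->new(alive => 0, id => 1);
my $dbh   = live_dbh($stale, sub { StubDbh->new(alive => 1, id => 2) });
print "using handle $dbh->{id}\n";
```

Any statement handles prepared on the old connection would still have to be prepared again on the new one.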

Brian Fraser (fraserbn) wrote on 2012-08-29:

I wonder what would happen if, instead of keeping the connection alive, we used $dbh->{mysql_auto_reconnect} = 1. Does anyone have any experience with that?

Changed in percona-toolkit:
status:	New → Confirmed

Baron Schwartz (baron-xaprb) wrote on 2012-09-07:

I am skeptical. Statement handles would be invalidated, I assume. But it may work.

In the meantime I am changing my local copy to do two things:

1. Don't print those warnings if --quiet =1
2. Wrap "$diffs = $rc->find_replication_differences(...)" in an eval{} block so that the whole thing doesn't get aborted if only one slave's connection has died.
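
The second change can be sketched roughly as follows, assuming a loop over replicas. `find_diffs_stub` is a hypothetical stand-in for `$rc->find_replication_differences(...)`, whose arguments are elided above, and the replica hashes are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $rc->find_replication_differences(...).
# It dies for a replica whose connection is gone, as DBI does when
# RaiseError is enabled.
sub find_diffs_stub {
   my ($replica) = @_;
   die "MySQL server has gone away\n" if $replica->{dead};
   return $replica->{diffs};
}

# Check every replica. A failure on one replica is recorded and the loop
# moves on, instead of aborting the whole run.
sub check_replicas {
   my (@replicas) = @_;
   my (%diffs_for, %error_for);
   foreach my $replica (@replicas) {
      my $diffs;
      my $ok = eval { $diffs = find_diffs_stub($replica); 1 };
      if ( !$ok ) {
         my $err = $@ || 'unknown error';
         chomp $err;
         $error_for{ $replica->{name} } = $err;
         next;
      }
      $diffs_for{ $replica->{name} } = $diffs;
   }
   return (\%diffs_for, \%error_for);
}

my ($diffs_for, $error_for) = check_replicas(
   { name => 'replica1', dead => 0, diffs => [] },
   { name => 'replica2', dead => 1 },
   { name => 'replica3', dead => 0, diffs => [ { chunk => 7 } ] },
);

printf "%s: %d diff(s)\n", $_, scalar @{ $diffs_for->{$_} }
   for sort keys %$diffs_for;
printf "%s: skipped (%s)\n", $_, $error_for->{$_}
   for sort keys %$error_for;
```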

Baron Schwartz (baron-xaprb) wrote on 2012-09-07:

By the way, it seems that every time I get the above messages, it's because checking on one slave failed. The tool aborts checksumming and/or never checks anything on that replica again, then tries to check for differences before exiting, but it tries to use a $dbh it has been ignoring because it was dead. I never get just one of the error messages; I always get both.

Baron Schwartz (baron-xaprb) wrote on 2012-09-10:

I'm trying this to see what happens. I'll let you know:

Index: utils/pt/pt-table-checksum
===================================================================
--- utils/pt/pt-table-checksum (revision 29726)
+++ utils/pt/pt-table-checksum (working copy)
@@ -216,6 +216,9 @@
       mysql_enable_utf8 => ($cxn_string =~ m/charset=utf8/i ? 1 : 0),
    };
    @{$defaults}{ keys %$opts } = values %$opts;
+   if ( $opts{AutoCommit} ) {
+      $opts{mysql_auto_reconnect} = 1;
+   }

    if ( $opts->{mysql_use_result} ) {
       $defaults->{mysql_use_result} = 1;

summary:	- pt-table-checksum doesn't keep the master DBH alive
	+ pt-table-checksum doesn't reconnect the slave $dbh
description:	updated

Daniel Nichter (daniel-nichter) wrote on 2012-10-01:

Another case of connection resiliency à la bug 1046966.

tags:	added: error-recovery

Tibor Korocz (tkorocz) wrote on 2014-11-19:

Hi,

I'm using the newest pt-table-checksum but I got the same error:

11-18T18:09:35 Error waiting for the last checksum of table db.tbl to replicate to replica HOST : DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement "SELECT MAX(chunk) FROM `db`.`checksums` WHERE db='db' AND tbl='tbl' AND master_crc IS NOT NULL"] at /usr/bin/pt-table-checksum line 11230.

Check that the replica is running and has the replicate table `db`.`checksums`. Checking the replica for checksum differences will probably cause another error.

Does anybody have a solution for this?

Thanks.

Daniel Nichter (daniel-nichter) on 2015-06-24

Changed in percona-toolkit:
status:	Confirmed → In Progress
assignee:	nobody → Daniel Nichter (daniel-nichter)
importance:	Undecided → High

Daniel Nichter (daniel-nichter) wrote on 2015-06-25:

See https://github.com/percona/percona-toolkit/pull/21

Will be released in 2.2.15.

Changed in percona-toolkit:
status:	In Progress → Fix Committed

Frank Cizmich (frank-cizmich) on 2015-07-10

Changed in percona-toolkit:
milestone:	none → 2.2.15

Hrvoje Matijakovic (hrvojem) on 2015-08-28

Changed in percona-toolkit:
status:	Fix Committed → Fix Released

Frank Cizmich (frank-cizmich) on 2015-09-17

Changed in percona-toolkit:
importance:	High → Medium
importance:	Medium → High

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-24:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-329

Duplicates of this bug

Bug #1443847
