pt-table-checksum doesn't reconnect the slave $dbh

Reported by Baron Schwartz on 2012-08-28
This bug affects 1 person
Affects: Percona Toolkit
Importance: Undecided
Assigned to: Unassigned

Bug Description

When replication is very delayed, pt-table-checksum will not keep its connection to the replica [was:master] alive, and when the replica catches up or if it dies for some reason, we get an error. It looks like this:

================

08-27T09:44:10 Error waiting for the last checksum of table <...> to replicate to replica <...>: DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement "SELECT MAX(chunk) FROM `percona`.`checksum` WHERE ... at pt-table-checksum line 8581.

Check that the replica is running and has the replicate table `percona`.`checksum`. Checking the replica for checksum differences will probably cause another error.
08-27T09:44:10 Error checking for checksum differences of table <...> on replica <...>: DBD::mysql::db selectall_arrayref failed: MySQL server has gone away [for Statement "SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `rkdb`.`archivechecksum` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='...' AND tbl='...')"] at pt-table-checksum line 4118.

Check that the replica is running and has the replicate table `percona`.`checksum`.

================

I think the tool needs to reconnect to replicas.

[redacted: I think the tool needs to do a keepalive SELECT 1 or something like that.]
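A minimal sketch of the reconnect idea, assuming each replica is tracked as a hashref holding its current handle and a coderef that opens a new one. The names `live_dbh`, `dbh`, and `connect` are hypothetical and not taken from pt-table-checksum; only `ping` and `disconnect` are real DBI methods.

```perl
use strict;
use warnings;

# Hypothetical helper: return a usable handle for a replica, reconnecting
# when the current one no longer answers ping.
# $cxn is assumed to look like:
#   { dbh => $dbh_or_undef, connect => sub { ...returns a new $dbh... } }
sub live_dbh {
   my ($cxn) = @_;
   my $dbh = $cxn->{dbh};

   # ping is a standard DBI method; the eval guards against drivers
   # that die instead of returning false on a dead connection.
   return $dbh if $dbh && eval { $dbh->ping };

   # Drop the dead handle (ignoring errors) and open a new connection.
   eval { $dbh->disconnect } if $dbh;
   $cxn->{dbh} = $cxn->{connect}->();
   return $cxn->{dbh};
}
```

Calling `live_dbh($cxn)` right before each query against a replica would replace the dead handle instead of reusing it. Session state set on the old connection is not carried over, so the `connect` coderef would have to reapply it.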

Brian Fraser (fraserbn) wrote :

I wonder what would happen if, instead of keeping the connection alive, we used $dbh->{mysql_auto_reconnect} = 1. Does anyone have any experience with that?
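For reference, a small sketch of how the attribute could be requested at connect time. `connect_attrs` is a hypothetical helper, and the DSN in the usage comment is a placeholder. The DBD::mysql documentation states that `mysql_auto_reconnect` is ignored when AutoCommit is off, so the sketch only sets it when AutoCommit is on.

```perl
use strict;
use warnings;

# Hypothetical helper: build the attribute hash passed as the fourth
# argument to DBI->connect. Caller-supplied overrides win over defaults.
sub connect_attrs {
   my (%overrides) = @_;
   my %attrs = (
      AutoCommit => 1,
      RaiseError => 1,
      PrintError => 0,
      %overrides,
   );
   # DBD::mysql will not reconnect automatically when AutoCommit is off,
   # so only ask for it when AutoCommit is on.
   $attrs{mysql_auto_reconnect} = 1 if $attrs{AutoCommit};
   return \%attrs;
}

# Usage (needs DBD::mysql and a reachable server; the DSN is a placeholder):
#   my $dbh = DBI->connect('DBI:mysql:host=replica1', $user, $pass,
#                          connect_attrs());
```

A silent reconnect gives a brand-new session, so session variables, temporary tables, and table locks from the old connection are gone. That is worth checking before relying on this in a tool that configures its session after connecting.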

Changed in percona-toolkit:
status: New → Confirmed
Baron Schwartz (baron-xaprb) wrote :

I am skeptical. Statement handles would be invalidated, I assume. But it may work.

In the meantime I am changing my local copy to do two things:

1. Don't print those warnings if --quiet=1
2. Wrap "$diffs = $rc->find_replication_differences(...)" in an eval{} block so that the whole thing doesn't get aborted if only one slave's connection has died.
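The two local changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's code: `check_all_replicas` is a hypothetical function, `$check` is a coderef standing in for the `$rc->find_replication_differences(...)` call, and each replica is assumed to be a hashref with a `name` key.

```perl
use strict;
use warnings;

# Hypothetical sketch: check every replica for differences, isolating
# failures so one dead connection does not abort the whole run.
sub check_all_replicas {
   my ($replicas, $check, $quiet) = @_;
   my (%diffs_for, @failed);

   foreach my $replica ( @$replicas ) {
      my $diffs;
      # The trailing 1 makes success explicit, so a false return value
      # from $check is not mistaken for an exception.
      my $ok = eval { $diffs = $check->($replica); 1 };
      if ( !$ok ) {
         my $err = $@ || 'unknown error';
         chomp $err;
         # Change 1: stay silent when quiet mode is requested.
         warn "Error checking replica $replica->{name}: $err\n"
            unless $quiet;
         push @failed, $replica->{name};
         next;   # Change 2: move on to the next replica.
      }
      $diffs_for{ $replica->{name} } = $diffs;
   }
   return (\%diffs_for, \@failed);
}
```

Returning the list of failed replicas separately lets the caller decide whether to retry, reconnect, or report them at exit, instead of querying a handle that is already known to be dead.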

Baron Schwartz (baron-xaprb) wrote :

By the way, it seems that every time I get the above messages, it's because checking on one slave failed, the tool aborts checksumming and/or never checks anything on that replica again, then tries to check for differences before exiting -- but it tries to use a $dbh it has been ignoring because it was dead. I never get one or the other error message, I always get both.

Baron Schwartz (baron-xaprb) wrote :

I'm trying this to see what happens. I'll let you know:

Index: utils/pt/pt-table-checksum
===================================================================
--- utils/pt/pt-table-checksum (revision 29726)
+++ utils/pt/pt-table-checksum (working copy)
@@ -216,6 +216,9 @@
       mysql_enable_utf8 => ($cxn_string =~ m/charset=utf8/i ? 1 : 0),
    };
    @{$defaults}{ keys %$opts } = values %$opts;
+   if ( $opts->{AutoCommit} ) {
+      $defaults->{mysql_auto_reconnect} = 1;
+   }

    if ( $opts->{mysql_use_result} ) {
       $defaults->{mysql_use_result} = 1;

summary: - pt-table-checksum doesn't keep the master DBH alive
+ pt-table-checksum doesn't reconnect the slave $dbh
description: updated
Daniel Nichter (daniel-nichter) wrote :

Another case of connection resiliency à la bug 1046966.

tags: added: error-recovery