pt-table-checksum gets stuck in "Waiting to check replicas for differences: 0% 00:00 remain"

Reported by Walter Heck on 2012-03-27
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit
Undecided
Unassigned

Bug Description

pt-table-checksum ends up in an infinite loop after a bunch of tables:

<pre>
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_checksum:7367 18166 Sleep 44 waiting for chunks
# pt_table_checksum:7340 18166 server01 max chunk: undef
# pt_table_checksum:7340 18166 server02 max chunk: undef
# pt_table_checksum:7340 18166 server03 max chunk: undef
# pt_table_checksum:7340 18166 server04 max chunk: undef
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_checksum:7367 18166 Sleep 44.25 waiting for chunks
# pt_table_checksum:7340 18166 server01 max chunk: undef
# pt_table_checksum:7340 18166 server02 max chunk: undef
# pt_table_checksum:7340 18166 server03 max chunk: undef
# pt_table_checksum:7340 18166 server04 max chunk: undef
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_checksum:7367 18166 Sleep 44.5 waiting for chunks
# pt_table_checksum:7340 18166 server01 max chunk: undef
# pt_table_checksum:7340 18166 server02 max chunk: undef
# pt_table_checksum:7340 18166 server03 max chunk: undef
# pt_table_checksum:7340 18166 server04 max chunk: undef
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_checksum:7367 18166 Sleep 44.75 waiting for chunks
# pt_table_checksum:7340 18166 server01 max chunk: undef
# pt_table_checksum:7340 18166 server02 max chunk: undef
# pt_table_checksum:7340 18166 server03 max chunk: undef
# pt_table_checksum:7340 18166 server04 max chunk: undef
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_checksum:7367 18166 Sleep 45 waiting for chunks
# pt_table_checksum:7340 18166 server01 max chunk: undef
# pt_table_checksum:7340 18166 server02 max chunk: undef
# pt_table_checksum:7340 18166 server03 max chunk: undef
# pt_table_checksum:7340 18166 server04 max chunk: undef
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_checksum:7367 18166 Sleep 45.25 waiting for chunks
</pre>

Then when breaking with CTRL+C:
<pre>
# Caught SIGINT.
# RowChecksum:3483 18166 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `test`.`checksum` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='mysql' AND tbl='columns_priv')
# pt_table_checksum:6667 18166 0 checksum diffs on server01
# RowChecksum:3483 18166 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `test`.`checksum` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='mysql' AND tbl='columns_priv')
# pt_table_checksum:6667 18166 0 checksum diffs on server02
# RowChecksum:3483 18166 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `test`.`checksum` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='mysql' AND tbl='columns_priv')
# pt_table_checksum:6667 18166 0 checksum diffs on server03
# RowChecksum:3483 18166 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `test`.`checksum` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='mysql' AND tbl='columns_priv')
# pt_table_checksum:6667 18166 0 checksum diffs on server04
03-27T08:34:15 0 0 0 1 0 8120.737 mysql.columns_priv
# OobNibbleIterator:4302 18166 Finish nibble_sth
# OobNibbleIterator:4302 18166 Finish explain_nibble_sth
# pt_table_checksum:6805 18166 Exit status 1 oktorun 0
# Cxn:1514 18166 Disconnecting dbh DBI::db=HASH(0x1b4426a0) undef
# Cxn:1514 18166 Disconnecting dbh DBI::db=HASH(0x1b442360) undef
# Cxn:1514 18166 Disconnecting dbh DBI::db=HASH(0x1afe0bf0) undef
# Cxn:1514 18166 Disconnecting dbh DBI::db=HASH(0x1b3fba50) undef
# Cxn:1514 18166 Disconnecting dbh DBI::db=HASH(0x1afe0c10) undef
</pre>

The original command line is:
# PTDEBUG=1 pt-table-checksum --empty-replicate-table -umaatkit --ask-pass --replicate=test.checksum --recurse=1 --max-lag=1h --no-check-replication-filters 192.168.1.1

some more info:
<pre>
[~] $ uname -a
Linux server0 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[~] $ date
Tue Mar 27 08:36:38 BST 2012
[~] $ lsb_release -a
LSB Version: :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 5.8 (Final)
Release: 5.8
Codename: Final
[~] $ yum list installed | grep perl
perl.x86_64 4:5.8.8-38.el5 installed
perl-Algorithm-Diff.noarch 1.1902-1.el5.rf installed
perl-Class-Singleton.noarch 1.4-1.el5.rf installed
perl-DBD-MySQL.x86_64 3.0007-2.el5 installed
perl-DBI.x86_64 1.52-2.el5 installed
perl-Git.x86_64 1.7.8.2-2.el5.rf installed
perl-Log-Log4perl.noarch 1.26-1.el5.rf installed
perl-Proc-Daemon.noarch 0.03-1.2.el5.rf installed
perl-String-CRC32.x86_64 1.4-2.fc6 installed
perl-TermReadKey.x86_64 2.30-3.el5.rf installed
[~] $ yum list installed | grep percona
percona-release.x86_64 0.0-1 installed
percona-toolkit.noarch 2.0.4-1 installed
</pre>

Baron Schwartz (baron-xaprb) wrote :

Infinite loop, or one-hour loop? You've set the tool to tolerate a max replication lag of an hour, and it looks to me like we're just waiting for the checksums to actually appear on the replicas. I can't see evidence that it's a bug. Perhaps we should report replication lag in the message as well so we have more information on what's happening.

Download full text (8.0 KiB)

I'm pretty sure that it's a bug since I have the same without the 1
hour param. Also, it was running like this for a long time, since it
increases the sleep time by 0.25 seconds every iteration. The log was
just an excerpt. I'm relatively sure I let it run for more then an
hour too.
Besides that, the cluster was quiet at that time, not doing a lot of load.

Do you have any ideas of what I could do to gather more info?

Thanks!

On Tue, Mar 27, 2012 at 21:27, Baron Schwartz <email address hidden> wrote:
> Infinite loop, or one-hour loop? You've set the tool to tolerate a max
> replication lag of an hour, and it looks to me like we're just waiting
> for the checksums to actually appear on the replicas.  I can't see
> evidence that it's a bug. Perhaps we should report replication lag in
> the message as well so we have more information on what's happening.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/965987
>
> Title:
>  pt-table-checksum gets stuck in "Waiting to check replicas for
>  differences:   0% 00:00 remain"
>
> Status in Percona Toolkit:
>  New
>
> Bug description:
>  pt-table-checksum ends up in an infinite loop after a bunch of tables:
>
>  <pre>
>  Waiting to check replicas for differences:   0% 00:00 remain
>  # pt_table_checksum:7367 18166 Sleep 44 waiting for chunks
>  # pt_table_checksum:7340 18166 server01 max chunk: undef
>  # pt_table_checksum:7340 18166 server02 max chunk: undef
>  # pt_table_checksum:7340 18166 server03 max chunk: undef
>  # pt_table_checksum:7340 18166 server04 max chunk: undef
>  Waiting to check replicas for differences:   0% 00:00 remain
>  # pt_table_checksum:7367 18166 Sleep 44.25 waiting for chunks
>  # pt_table_checksum:7340 18166 server01 max chunk: undef
>  # pt_table_checksum:7340 18166 server02 max chunk: undef
>  # pt_table_checksum:7340 18166 server03 max chunk: undef
>  # pt_table_checksum:7340 18166 server04 max chunk: undef
>  Waiting to check replicas for differences:   0% 00:00 remain
>  # pt_table_checksum:7367 18166 Sleep 44.5 waiting for chunks
>  # pt_table_checksum:7340 18166 server01 max chunk: undef
>  # pt_table_checksum:7340 18166 server02 max chunk: undef
>  # pt_table_checksum:7340 18166 server03 max chunk: undef
>  # pt_table_checksum:7340 18166 server04 max chunk: undef
>  Waiting to check replicas for differences:   0% 00:00 remain
>  # pt_table_checksum:7367 18166 Sleep 44.75 waiting for chunks
>  # pt_table_checksum:7340 18166 server01 max chunk: undef
>  # pt_table_checksum:7340 18166 server02 max chunk: undef
>  # pt_table_checksum:7340 18166 server03 max chunk: undef
>  # pt_table_checksum:7340 18166 server04 max chunk: undef
>  Waiting to check replicas for differences:   0% 00:00 remain
>  # pt_table_checksum:7367 18166 Sleep 45 waiting for chunks
>  # pt_table_checksum:7340 18166 server01 max chunk: undef
>  # pt_table_checksum:7340 18166 server02 max chunk: undef
>  # pt_table_checksum:7340 18166 server03 max chunk: undef
>  # pt_table_checksum:7340 18166 server04 max chunk: undef
>  Waiting to check replicas for differences:   0% 00:00 remain
>  # pt_table_checksum:7367 18166 Sleep 45.25 ...

Read more...

Baron Schwartz (baron-xaprb) wrote :

I'm not sure what might be happening here. If there is any chance you can create a reproducible case, that's ideal. The tool is waiting for a chunk to appear on a slave, and isn't finding it. I would try to figure out whether one of the following cases is happening:

* The chunk is there but the tool doesn't see it. (Transaction isolation level, maybe? A bug in the query?)
* The chunk isn't there, but it will be eventually if we keep waiting.
* The chunk should be there, but something happened to it -- replication is broken silently, or something deleted the chunk, etc
* The chunk doesn't exist on the master, so the tool is waiting for something that'll never happen.
* The tool isn't actually looking for the existence of a chunk (it thinks it is, but it doesn't have valid chunk keys to look for?)

You might look for the _d() call that says "waiting for chunks" and modify that code to print out which db, tbl, and chunk it's waiting for.

tags: added: infinite-loop pt-table-checksum
Changed in percona-toolkit:
status: New → Triaged
Brian Fraser (fraserbn) on 2012-10-31
Changed in percona-toolkit:
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for Percona Toolkit because there has been no activity for 60 days.]

Changed in percona-toolkit:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers