pt-table-checksum gets stuck in "Waiting to check replicas for differences: 0% 00:00 remain"
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Bug Description
pt-table-checksum ends up in an infinite loop after a bunch of tables:
<pre>
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_
# pt_table_
# pt_table_
# pt_table_
# pt_table_
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_
# pt_table_
# pt_table_
# pt_table_
# pt_table_
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_
# pt_table_
# pt_table_
# pt_table_
# pt_table_
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_
# pt_table_
# pt_table_
# pt_table_
# pt_table_
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_
# pt_table_
# pt_table_
# pt_table_
# pt_table_
Waiting to check replicas for differences: 0% 00:00 remain
# pt_table_
</pre>
Then, when interrupting it with Ctrl+C:
<pre>
# Caught SIGINT.
# RowChecksum:3483 18166 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(
# pt_table_
# RowChecksum:3483 18166 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(
# pt_table_
# RowChecksum:3483 18166 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(
# pt_table_
# RowChecksum:3483 18166 SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(
# pt_table_
03-27T08:34:15 0 0 0 1 0 8120.737 mysql.columns_priv
# OobNibbleIterat
# OobNibbleIterat
# pt_table_
# Cxn:1514 18166 Disconnecting dbh DBI::db=
# Cxn:1514 18166 Disconnecting dbh DBI::db=
# Cxn:1514 18166 Disconnecting dbh DBI::db=
# Cxn:1514 18166 Disconnecting dbh DBI::db=
# Cxn:1514 18166 Disconnecting dbh DBI::db=
</pre>
The original command line is:
# PTDEBUG=1 pt-table-checksum --empty-
some more info:
<pre>
[~] $ uname -a
Linux server0 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[~] $ date
Tue Mar 27 08:36:38 BST 2012
[~] $ lsb_release -a
LSB Version: :core-4.
Distributor ID: CentOS
Description: CentOS release 5.8 (Final)
Release: 5.8
Codename: Final
[~] $ yum list installed | grep perl
perl.x86_64 4:5.8.8-38.el5 installed
perl-Algorithm-
perl-Class-
perl-DBD-
perl-DBI.x86_64 1.52-2.el5 installed
perl-Git.x86_64 1.7.8.2-2.el5.rf installed
perl-Log-
perl-Proc-
perl-String-
perl-TermReadKe
[~] $ yum list installed | grep percona
percona-
percona-
</pre>
Baron Schwartz (baron-xaprb) wrote : | #1 |
Infinite loop, or one-hour loop? You've set the tool to tolerate a max replication lag of an hour, and it looks to me like we're just waiting for the checksums to actually appear on the replicas. I can't see evidence that it's a bug. Perhaps we should report replication lag in the message as well so we have more information on what's happening.
Walter Heck (walterheck) wrote : Re: [Bug 965987] Re: pt-table-checksum gets stuck in "Waiting to check replicas for differences: 0% 00:00 remain" | #2 |
I'm pretty sure it's a bug, since I see the same behavior without the 1-hour
parameter. Also, it had been running like this for a long time, since it
increases the sleep time by 0.25 seconds every iteration; the log was
just an excerpt. I'm relatively sure I let it run for more than an
hour too.
Besides that, the cluster was quiet at that time, not doing a lot of load.
Do you have any ideas of what I could do to gather more info?
Thanks!
On Tue, Mar 27, 2012 at 21:27, Baron Schwartz <email address hidden> wrote:
> Infinite loop, or one-hour loop? You've set the tool to tolerate a max
> replication lag of an hour, and it looks to me like we're just waiting
> for the checksums to actually appear on the replicas. I can't see
> evidence that it's a bug. Perhaps we should report replication lag in
> the message as well so we have more information on what's happening.
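The wait behavior Walter describes (a sleep that grows by 0.25 seconds per iteration) can be modeled with a short sketch. This is an illustrative Python model, not pt-table-checksum's actual Perl code; the names `chunk_exists` and `max_wait` are hypothetical:

```python
import time

def wait_for_chunk_on_replica(chunk_exists, max_wait=3600):
    """Model of a polling loop whose sleep grows by 0.25 s per iteration.

    chunk_exists: callable returning True once the checksum chunk is
    visible on the replica. If it never returns True, the loop ends
    only when max_wait (seconds) is exhausted, which from the outside
    looks like an infinite "Waiting to check replicas" loop.
    """
    sleep = 0.25
    waited = 0.0
    while waited < max_wait:
        if chunk_exists():
            return True
        time.sleep(sleep)
        waited += sleep
        sleep += 0.25          # back off a little more each iteration
    return False               # gave up waiting
```

With a one-hour `max_wait` and a chunk that never appears, this loop would keep printing its progress line for the full hour, matching the log above.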
Baron Schwartz (baron-xaprb) wrote : | #3 |
I'm not sure what might be happening here. If there is any chance you can create a reproducible case, that's ideal. The tool is waiting for a chunk to appear on a slave, and isn't finding it. I would try to figure out whether one of the following cases is happening:
* The chunk is there but the tool doesn't see it. (Transaction isolation level, maybe? A bug in the query?)
* The chunk isn't there, but it will be eventually if we keep waiting.
* The chunk should be there, but something happened to it: replication is silently broken, something deleted the chunk, etc.
* The chunk doesn't exist on the master, so the tool is waiting for something that'll never happen.
* The tool isn't actually looking for the existence of a chunk (it thinks it is, but it doesn't have valid chunk keys to look for?)
You might look for the _d() call that says "waiting for chunks" and modify that code to print out which db, tbl, and chunk it's waiting for.
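The kind of instrumentation suggested above might look like the following sketch. It is written in Python rather than the tool's Perl, and the function and field names are hypothetical, not pt-table-checksum's actual internals:

```python
def debug_waiting_for_chunk(db, tbl, chunk, lower, upper):
    """Format a debug line naming the exact chunk being waited for.

    Emitting the (db, tbl, chunk) triple plus its boundaries makes it
    possible to tell the cases above apart: chunk missing on the
    master, chunk lost in replication, or invalid boundary keys.
    """
    return ("# waiting for chunk: db=%s tbl=%s chunk=%s "
            "lower_boundary=%s upper_boundary=%s"
            % (db, tbl, chunk, lower, upper))

# Example with values shaped like those in comment #5 below:
print(debug_waiting_for_chunk("xxx", "yyy", 4792,
                              1378101485777874957, None))
```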
tags: added: infinite-loop pt-table-checksum
Changed in percona-toolkit:
status: New → Triaged
Changed in percona-toolkit:
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote : | #4 |
[Expired for Percona Toolkit because there has been no activity for 60 days.]
Changed in percona-toolkit:
status: Incomplete → Expired
Aleksandr Kuzminsky (akuzminsky) wrote : | #5 |
I hit this bug too.
The last four chunks in percona.checksums:
*******
db: xxx
tbl: yyy
chunk: 4789
chunk_time: NULL
chunk_index: PRIMARY
lower_boundary: 1378097957772929022
upper_boundary: 1378099845562112096
this_crc: 8aecc046
this_cnt: 10000
master_crc: NULL
master_cnt: NULL
ts: 2014-12-01 21:44:46
*******
db: xxx
tbl: yyy
chunk: 4790
chunk_time: NULL
chunk_index: PRIMARY
lower_boundary: 1378099845941382247
upper_boundary: 1378101485777874957
this_crc: de3400dd
this_cnt: 8765
master_crc: NULL
master_cnt: NULL
ts: 2014-12-01 21:44:46
*******
db: xxx
tbl: yyy
chunk: 4791
chunk_time: NULL
chunk_index: PRIMARY
lower_boundary: NULL
upper_boundary: 1369094286749407470
this_crc: 0
this_cnt: 0
master_crc: NULL
master_cnt: NULL
ts: 2014-12-01 21:44:46
*******
db: xxx
tbl: yyy
chunk: 4792
chunk_time: NULL
chunk_index: PRIMARY
lower_boundary: 1378101485777874957
upper_boundary: NULL
this_crc: 0
this_cnt: 0
master_crc: NULL
master_cnt: NULL
ts: 2014-12-01 21:44:46
The replicas are running; they execute this query indefinitely:
SELECT MAX(chunk) FROM `percona`
+------------+
| MAX(chunk) |
+------------+
| NULL |
+------------+
1 row in set (0.01 sec)
I guess NULL is not what it expects to get.
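The failure mode this suggests can be modeled simply: if the replica-side query returns NULL (None) for MAX(chunk), a wait condition that compares the master's last chunk against it never becomes true. A hypothetical Python illustration, not the tool's actual code:

```python
def replica_caught_up(master_max_chunk, replica_max_chunk):
    """True once the replica has checksummed up to the master's last chunk.

    If replica_max_chunk is None (SQL NULL, e.g. MAX(chunk) over an
    empty or unexpected result), the comparison can never succeed, and
    a caller polling this condition waits forever unless NULL is
    handled explicitly.
    """
    if replica_max_chunk is None:   # the case observed in this comment
        return False                # -> indefinite "0% 00:00 remain"
    return replica_max_chunk >= master_max_chunk

print(replica_caught_up(4792, None))   # False: the wait loop never exits
```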
Shahriyar Rzayev (rzayev-sehriyar) wrote : | #6 |
Percona now uses JIRA for bug reports, so this bug report has been migrated to: https:/