pt-table-checksum has ambiguous exit status

Reported by Marco on 2012-03-01
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit
High
Daniel Nichter

Bug Description

From pt-table-checksum manual: "The tool’s exit status is nonzero if any differences are found, or if any warnings or errors occur."
It would be nice to distinguish, with different status codes, errors (e.g. a skipped table) from diffs (tables whose checksums differ). Errors may be temporary and don't break replica integrity, while diffs do.

tags: added: ambiguity pt-table-checksum
Brian Fraser (fraserbn) on 2012-03-08
Changed in percona-toolkit:
importance: Undecided → Wishlist
Changed in percona-toolkit:
status: New → Triaged
Ryan Lowe (ryan-a-lowe) wrote :

pt-table-sync has the following:

STATUS  MEANING
======  =======================================================
0       Success.
1       Internal error.
2       At least one table differed on the destination.
3       Combination of 1 and 2.

I'd love to see something similar on pt-table-checksum along the lines of:

STATUS  MEANING
======  =======================================================
0       Success.
1       Could not start due to PID
2       Internal error.
3       At least one table differed on the destination.

Changed in percona-toolkit:
milestone: none → 2.2.5
assignee: nobody → Daniel Nichter (daniel-nichter)
importance: Wishlist → High
status: Triaged → In Progress
summary: - pt-table-checksum: exit status ambiguous
+ pt-table-checksum has ambiguous exit status
Daniel Nichter (daniel-nichter) wrote :

pt-table-checksum has three kinds of exit status: zero, 255, and any other
value, which is a bitmask with flags for different problems.

A zero exit status indicates no errors, warnings, or checksum differences,
or skipped chunks or tables.

A 255 exit status indicates a fatal error. In other words: the tool died
or crashed. The error is printed to C<STDERR>.

If the exit status is not zero or 255, then its value functions as a bitmask
with these flags:

   FLAG              BIT VALUE  MEANING
   ================  =========  ==========================================
   ALREADY_RUNNING   4          --pid file exists and the PID is running
   NO_SLAVES_FOUND   8          No replicas or cluster nodes were found
   CAUGHT_SIGNAL     16         Caught SIGHUP, SIGINT, SIGPIPE, or SIGTERM
   ERROR             32         A non-fatal error occurred
   TABLE_DIFF        512        At least one diff was found
   SKIP_CHUNK        1024       At least one chunk was skipped
   SKIP_TABLE        2048       At least one table was skipped

If any flag is set, the exit status will be non-zero. Use the bitwise C<AND>
operation to check for a particular flag. For example, if C<$exit_status & 512>
is true, then at least one diff was found.

Changed in percona-toolkit:
status: In Progress → Fix Committed
status: Fix Committed → In Progress
Daniel Nichter (daniel-nichter) wrote :

I conflated Perl exit with standard Unix exit, and the latter is limited to a single byte. So the new list is:

   ERROR             1    A non-fatal error occurred
   ALREADY_RUNNING   2    --pid file exists and the PID is running
   CAUGHT_SIGNAL     4    Caught SIGHUP, SIGINT, SIGPIPE, or SIGTERM
   NO_SLAVES_FOUND   8    No replicas or cluster nodes were found
   TABLE_DIFF        16   At least one diff was found
   SKIP_CHUNK        32   At least one chunk was skipped
   SKIP_TABLE        64   At least one table was skipped
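
As a rough illustration of how a wrapper script might interpret these values, here is a minimal Python sketch. The flag values are the ones listed above; the names ExitFlag and describe_exit are made up for this example and are not part of the tool. It also shows why the earlier values above 255 could not work: only the low byte of an exit status is reported on Unix.

```python
from enum import IntFlag


class ExitFlag(IntFlag):
    """Flag values copied from the list above (hypothetical class name)."""
    ERROR = 1
    ALREADY_RUNNING = 2
    CAUGHT_SIGNAL = 4
    NO_SLAVES_FOUND = 8
    TABLE_DIFF = 16
    SKIP_CHUNK = 32
    SKIP_TABLE = 64


def describe_exit(status: int) -> list[str]:
    """Translate an exit status into a list of human-readable names."""
    if status == 0:
        return ["OK"]
    if status == 255:
        # 255 is reserved for a fatal error, so it is not decoded as a bitmask.
        return ["FATAL"]
    return [flag.name for flag in ExitFlag if status & flag]


# A Unix exit status keeps only the low 8 bits, so a flag worth 512
# would have been reported as 0.
truncated = 512 & 0xFF

print(describe_exit(48))  # TABLE_DIFF and SKIP_CHUNK are both set
print(truncated)
```

With this sketch, a status of 48 decodes to TABLE_DIFF plus SKIP_CHUNK, and a status of 16 decodes to TABLE_DIFF alone.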

Daniel Nichter (daniel-nichter) wrote :

For the record, we had a debate about this: some people say skipped chunks or tables should not cause a non-zero exit, and others say they should. More people, including myself, think the latter, so we'll stay with the previous comment. My thinking is: zero exit should be a true, total "AOK": everything worked as expected. People who commonly have skipped chunks may find this change to be a pita, as it does break backwards-compat a little, but a skipped chunk really is an indication that something didn't work right, e.g. MySQL didn't use the index for a chunk, or the chunk was too large on the slave, etc. And since it's easy enough to isolate this exit status, people can still filter it out: "zero" exit == 0 || 32 || 96 (32 | 64).

Daniel Nichter (daniel-nichter) wrote :

Correction to previous comment: "zero" exit == 0 || 32, because in versions before 2.2.5 only skipped *chunks*, not skipped tables, did not cause a non-zero exit.

I have mentioned this change in the docs: As of pt-table-checksum 2.2.5, skipped chunks cause a non-zero exit status.
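
For anyone who wants the old behaviour back in a wrapper script, one possible approach is to mask out the SKIP_CHUNK bit before testing the status. This is only a sketch: the value 32 comes from the flag list earlier in this report, and the function name is invented for the example.

```python
SKIP_CHUNK = 32  # flag value from the list earlier in this report


def is_ok_ignoring_skipped_chunks(status: int) -> bool:
    """Return True if the only thing that happened was skipped chunks.

    Hypothetical helper: treats 0 and 32 as success, everything else
    (including the fatal status 255) as failure.
    """
    if status == 255:
        return False
    # Clear the SKIP_CHUNK bit, then require that nothing else is set.
    return (status & ~SKIP_CHUNK) == 0


for status in (0, 32, 16, 48, 255):
    print(status, is_ok_ignoring_skipped_chunks(status))
```

Masking the bit out rather than comparing against a fixed list of values keeps the check correct if other flags are set together with SKIP_CHUNK, e.g. 48 (TABLE_DIFF plus SKIP_CHUNK) is still reported as a failure.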

Changed in percona-toolkit:
status: In Progress → Fix Committed
Kenny Gryp (gryp) wrote :

Daniel, I totally agree with your decision. Thanks for 'fixing'/'adding that feature'!

Changed in percona-toolkit:
status: Fix Committed → Fix Released