pt-heartbeat --monitor --file sometimes results in an empty file

Bug #1133249 reported by Callum Macdonald
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Confirmed
Undecided
Unassigned

Bug Description

I'm using pt-heartbeat to monitor replication delay between 1 master and 2 slaves (1 SSD, 1 disk). Running on the slaves like so:
/usr/bin/pt-heartbeat --config /etc/pt-heartbeat.conf

$ cat /etc/pt-heartbeat.conf
monitor
file=/var/spool/pt-heartbeat
master-server-id=13
daemonize
pid=/tmp/pt-heartbeat.pid
host=localhost
user=heartbeat
password=pass
database=db
table=heartbeat

Then I'm using monit and a custom rolled bash script to verify the contents of /var/spool/pt-heartbeat once a minute and send an error if the value is above a threshold.

Several times a day, maybe 5-10 times per day, the script finds the /var/spool/pt-heartbeat file to be empty. Running some tests[1] suggests that in some cases the file is left empty for several seconds. This is not true on the other slave (nearly identical server setup, thanks puppet) with an SSD.

I tried adding a delay and a second check into the script. I used a sleep of 2.7s to offset against the 1s interval in case the script was hitting at precisely the time when pt-heartbeat had truncated the file. This change appears to have reduced the frequency of the issue, but it still persists.

Just let me know if any additional information would be useful in tracking down this issue.

[1] $ while true; do ls --full-time /var/spool/pt-heartbeat; sleep 0.1; done

Callum Macdonald (chmac)
tags: added: pt-heartbeat
Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

This doesn't seem to be a bug in the tool. It sounds like a race condition, i.e. that your script is reading the file at the exact time pt-heartbeat has truncated the file but before it has written the new info and/or that new info has been flushed. The evidence seems to be the 2.7s offset which reduced the frequency of the issue.

One solution would be locking the file. We could do other more exotic things like writing to tmp file the moving the tmp to the real file. But a 1s, that's a lot of tmp files and moves.

This is is not really a bug in the tool, I'm not sure when or if we'll have time to implement a solution. In your script, can you just retry reading the file if it's empty? Or just ignore if empty because at 1s interval, missing a few seconds here and there shouldn't matter much?

tags: added: race-condition
Revision history for this message
Callum Macdonald (chmac) wrote :

I don't think it is a race condition. I did consider that and I ran some tests. I specifically chose a delay of 2.7s because the file is written every 1s, so a delay (twice) of 2.7s means we try to read at 0s, 2.7s, and 5.4s +/- 10ms. It seems highly improbable that the script is writing at those same intervals, when it's set to write every 1s.

I also ran a test as I described above where I check the size of the file every 50ms or 100ms. The file remains 0 bytes for more than 30 checks in some instances. Using these commands:
$ while true; do ls --full-time /var/spool/pt-heartbeat >> pt-heartbeat.log; sleep 0.05; done
$ cat pt-heartbeat.log | grep " 0 " | uniq -c

I can see some instances where for 40 x 50ms, the file remains at 0 bytes. As I understand it, that's not a race condition, that's something wonky with the tool on this machine. While running the same test on another machine shows 0 results with a filesize of 0 bytes.

I don't know enough about filesystems to hazard a guess as to what's going on. I can say that in the week we've been running the script on two separate machines, we've never had a single failure on the SSD based machine, but it happens frequently on this disk based machine.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

I'm not sure what to do about this, but I'm 99% certain it's not a bug or problem in the tool, at least not one that we can fix. The relevant code is:

            if ( my $file = $o->get('file') ) {
               open my $file, '>', $file
                  or die "Can't open $file: $OS_ERROR";
               print $file $output
                  or die "Can't print to $file: $OS_ERROR";
               close $file
                  or die "Can't close $file: $OS_ERROR";
            }

There's nothing more than that: open file, write, close. The output (print) is buffered, but flushed on close, and there's probably only a few microseconds between those two. There's nothing else we could do in the tool. Maybe there's some caching or filesystem-level issue on the disk-based machines. In any case, the code is already as simple as possible, i.e. no chance that it's doing something weird or superfluous that's causing this issue, at least not that we can do anything about (could be a Perl bug but that's extremely unlikely in such core functions like print() and close()).

The only other way the output could be zero bytes is if

            my $output = sprintf $format, $delay, @vals, $pk_val;

$output was an empty string, but that can't be the case either because,

   # 2.00s [ 0.05s, 0.01s, 0.00s ]
   my $format = ($hires_ts ? '%.2f' : '%4d') . "s "
              . "[ " . join(", ", map { "%5.2fs" } @$frames) . " ]"
              . ($o->get('print-master-server-id') ? " %d" : '')
              . "\n";

$format always has at least some static characters (or the new-line if nothing else).

So given all that, I'll close this bug. I hate to do that to you while it's still unresolved, but I think the evidence points to a deeper, non-tool-related, perhaps system- or storage- or filesystem-related problem. If you find new evidence one way or another, please reply to this issue and we'll take a look again.

Changed in percona-toolkit:
status: New → Invalid
Revision history for this message
Callum Macdonald (chmac) wrote :

Fair enough, I can believe it's an issue outside pt-heartbeat somewhere. I can work around it easily enough by requiring several sequential failures before sending an alert. Thanks for looking into it, I appreciate it.

Revision history for this message
W. Andrew Loe III (loe) wrote :

I also have the same issue as Callum, it appears to empty for extended periods of time. I use the default interval of 1.0 seconds and I check the file every 10 seconds. These machines are on EBS volumes, so perhaps the network latency on the truncate/write phase causes it to be empty for a while.

Revision history for this message
W. Andrew Loe III (loe) wrote :

I was able to fix this by changing from a truncate write, to writing one file and then renaming it into place.

            my $output = sprintf $format, $delay, @vals, $pk_val;
            if ( my $filename = $o->get('file') ) {
               my $tempfilename = $filename . ".tmp";
               open my $file, '>', $tempfilename
                  or die "Can't open $tempfilename: $OS_ERROR";
               print $file $output
                  or die "Can't print to $tempfilename: $OS_ERROR";
               close $file
                  or die "Can't close $tempfilename: $OS_ERROR";
                rename $tempfilename, $filename
                  or die "Can't rename $tempfilename to $filename: $OS_ERROR";

            }
            else {
               print $output;
            }

Changed in percona-toolkit:
status: Invalid → Confirmed
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1079

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.