Percona Toolkit moved to https://jira.percona.com/projects/PT

pt-heartbeat --monitor --file sometimes results in an empty file

Bug #1133249 reported by Callum Macdonald on 2013-02-26

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Percona Toolkit moved to https://jira.percona.com/projects/PT	Confirmed	Undecided	Unassigned

Bug Description

I'm using pt-heartbeat to monitor replication delay between 1 master and 2 slaves (1 SSD, 1 disk). Running on the slaves like so:
/usr/bin/pt-heartbeat --config /etc/pt-heartbeat.conf

$ cat /etc/pt-heartbeat.conf
monitor
file=/var/spool/pt-heartbeat
master-server-id=13
daemonize
pid=/tmp/pt-heartbeat.pid
host=localhost
user=heartbeat
password=pass
database=db
table=heartbeat

Then I'm using monit and a custom rolled bash script to verify the contents of /var/spool/pt-heartbeat once a minute and send an error if the value is above a threshold.

Several times a day, maybe 5-10 times per day, the script finds the /var/spool/pt-heartbeat file to be empty. Running some tests[1] suggests that in some cases the file is left empty for several seconds. This is not true on the other slave (nearly identical server setup, thanks puppet) with an SSD.

I tried adding a delay and a second check into the script. I used a sleep of 2.7s to offset against the 1s interval in case the script was hitting at precisely the time when pt-heartbeat had truncated the file. This change appears to have reduced the frequency of the issue, but it still persists.

Just let me know if any additional information would be useful in tracking down this issue.

[1] $ while true; do ls --full-time /var/spool/pt-heartbeat; sleep 0.1; done

Tags:

Callum Macdonald (chmac) on 2013-02-26

tags:

added: pt-heartbeat

Revision history for this message

Daniel Nichter (daniel-nichter) wrote on 2013-03-01:

This doesn't seem to be a bug in the tool. It sounds like a race condition, i.e. that your script is reading the file at the exact time pt-heartbeat has truncated the file but before it has written the new info and/or that new info has been flushed. The evidence seems to be the 2.7s offset which reduced the frequency of the issue.

One solution would be locking the file. We could do other more exotic things like writing to tmp file the moving the tmp to the real file. But a 1s, that's a lot of tmp files and moves.

This is is not really a bug in the tool, I'm not sure when or if we'll have time to implement a solution. In your script, can you just retry reading the file if it's empty? Or just ignore if empty because at 1s interval, missing a few seconds here and there shouldn't matter much?

tags:

added: race-condition

Revision history for this message

Callum Macdonald (chmac) wrote on 2013-03-02:

I don't think it is a race condition. I did consider that and I ran some tests. I specifically chose a delay of 2.7s because the file is written every 1s, so a delay (twice) of 2.7s means we try to read at 0s, 2.7s, and 5.4s +/- 10ms. It seems highly improbable that the script is writing at those same intervals, when it's set to write every 1s.

I also ran a test as I described above where I check the size of the file every 50ms or 100ms. The file remains 0 bytes for more than 30 checks in some instances. Using these commands:
$ while true; do ls --full-time /var/spool/pt-heartbeat >> pt-heartbeat.log; sleep 0.05; done
$ cat pt-heartbeat.log | grep " 0 " | uniq -c

I can see some instances where for 40 x 50ms, the file remains at 0 bytes. As I understand it, that's not a race condition, that's something wonky with the tool on this machine. While running the same test on another machine shows 0 results with a filesize of 0 bytes.

I don't know enough about filesystems to hazard a guess as to what's going on. I can say that in the week we've been running the script on two separate machines, we've never had a single failure on the SSD based machine, but it happens frequently on this disk based machine.

Revision history for this message

Daniel Nichter (daniel-nichter) wrote on 2013-03-13:

I'm not sure what to do about this, but I'm 99% certain it's not a bug or problem in the tool, at least not one that we can fix. The relevant code is:

            if ( my $file = $o->get('file') ) {
               open my $file, '>', $file
                  or die "Can't open $file: $OS_ERROR";
               print $file $output
                  or die "Can't print to $file: $OS_ERROR";
               close $file
                  or die "Can't close $file: $OS_ERROR";
            }

There's nothing more than that: open file, write, close. The output (print) is buffered, but flushed on close, and there's probably only a few microseconds between those two. There's nothing else we could do in the tool. Maybe there's some caching or filesystem-level issue on the disk-based machines. In any case, the code is already as simple as possible, i.e. no chance that it's doing something weird or superfluous that's causing this issue, at least not that we can do anything about (could be a Perl bug but that's extremely unlikely in such core functions like print() and close()).

The only other way the output could be zero bytes is if

my $output = sprintf $format, $delay, @vals, $pk_val;

$output was an empty string, but that can't be the case either because,

   # 2.00s [ 0.05s, 0.01s, 0.00s ]
   my $format = ($hires_ts ? '%.2f' : '%4d') . "s "
              . "[ " . join(", ", map { "%5.2fs" } @$frames) . " ]"
              . ($o->get('print-master-server-id') ? " %d" : '')
              . "\n";

$format always has at least some static characters (or the new-line if nothing else).

So given all that, I'll close this bug. I hate to do that to you while it's still unresolved, but I think the evidence points to a deeper, non-tool-related, perhaps system- or storage- or filesystem-related problem. If you find new evidence one way or another, please reply to this issue and we'll take a look again.

Changed in percona-toolkit:
status:	New → Invalid

Revision history for this message

Callum Macdonald (chmac) wrote on 2013-03-14:

Fair enough, I can believe it's an issue outside pt-heartbeat somewhere. I can work around it easily enough by requiring several sequential failures before sending an alert. Thanks for looking into it, I appreciate it.

Revision history for this message

W. Andrew Loe III (loe) wrote on 2013-12-16:

I also have the same issue as Callum, it appears to empty for extended periods of time. I use the default interval of 1.0 seconds and I check the file every 10 seconds. These machines are on EBS volumes, so perhaps the network latency on the truncate/write phase causes it to be empty for a while.

Revision history for this message

W. Andrew Loe III (loe) wrote on 2013-12-20:

rename.patch Edit (1.1 KiB, text/plain)

I was able to fix this by changing from a truncate write, to writing one file and then renaming it into place.

            my $output = sprintf $format, $delay, @vals, $pk_val;
            if ( my $filename = $o->get('file') ) {
               my $tempfilename = $filename . ".tmp";
               open my $file, '>', $tempfilename
                  or die "Can't open $tempfilename: $OS_ERROR";
               print $file $output
                  or die "Can't print to $tempfilename: $OS_ERROR";
               close $file
                  or die "Can't close $tempfilename: $OS_ERROR";
                rename $tempfilename, $filename
                  or die "Can't rename $tempfilename to $filename: $OS_ERROR";