pt-heartbeat --monitor --file sometimes results in an empty file
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona Toolkit moved to https://jira.percona.com/projects/PT | Confirmed | Undecided | Unassigned |
Bug Description
I'm using pt-heartbeat to monitor replication delay between 1 master and 2 slaves (1 SSD, 1 disk). Running on the slaves like so:
/usr/bin/
$ cat /etc/pt-
monitor
file=/var/
master-server-id=13
daemonize
pid=/tmp/
host=localhost
user=heartbeat
password=pass
database=db
table=heartbeat
Then I'm using monit and a custom rolled bash script to verify the contents of /var/spool/
Several times a day, maybe 5-10 times per day, the script finds the /var/spool/
I tried adding a delay and a second check into the script. I used a sleep of 2.7s to offset against the 1s interval in case the script was hitting at precisely the time when pt-heartbeat had truncated the file. This change appears to have reduced the frequency of the issue, but it still persists.
Just let me know if any additional information would be useful in tracking down this issue.
[1] $ while true; do ls --full-time /var/spool/
tags: | added: pt-heartbeat |
This doesn't seem to be a bug in the tool. It sounds like a race condition, i.e. that your script is reading the file at the exact time pt-heartbeat has truncated the file but before it has written the new info and/or that new info has been flushed. The evidence seems to be the 2.7s offset which reduced the frequency of the issue.
One solution would be locking the file. We could do other more exotic things, like writing to a tmp file and then moving the tmp file over the real file. But at a 1s interval, that's a lot of tmp files and moves.
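For illustration only, the tmp-file-and-move idea can be sketched in bash as below. This is not how pt-heartbeat writes its file; `write_atomically` is a hypothetical helper, and the sketch assumes the temp file lives in the same directory as the target so that `mv` is a rename on one filesystem, which replaces the file in a single step and never exposes an empty one to readers.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pt-heartbeat): replace FILE with CONTENT
# without ever leaving FILE truncated.
# Usage: write_atomically FILE CONTENT
write_atomically() {
    local file=$1 content=$2 tmp
    # Create the temp file next to the target so the rename stays on one filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    # Write the full content first, then move it over the target.
    if printf '%s\n' "$content" > "$tmp" && mv -f -- "$tmp" "$file"; then
        return 0
    fi
    # Clean up the temp file if the write or the move failed.
    rm -f -- "$tmp"
    return 1
}
```

Note that `mktemp` creates the file readable only by its owner, so a real writer would also need to set permissions the reader can use.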
Since this is not really a bug in the tool, I'm not sure when or if we'll have time to implement a solution. In your script, can you just retry reading the file if it's empty? Or just ignore an empty read, because at a 1s interval, missing a few seconds here and there shouldn't matter much?
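A minimal sketch of the retry approach in bash, assuming the checking script only needs the file's contents. The function name `read_heartbeat` and its default of 5 attempts with a 0.2s pause are illustrative choices, not anything pt-heartbeat defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the contents of a heartbeat file, retrying
# briefly if it is empty (for example, caught between truncate and write).
# Usage: read_heartbeat FILE [ATTEMPTS] [DELAY_SECONDS]
read_heartbeat() {
    local file=$1 attempts=${2:-5} delay=${3:-0.2}
    local content i
    for ((i = 0; i < attempts; i++)); do
        # A missing or unreadable file is treated the same as an empty one.
        content=$(cat -- "$file" 2>/dev/null)
        if [[ -n $content ]]; then
            printf '%s\n' "$content"
            return 0
        fi
        # Pause before the next attempt, but not after the last one.
        if ((i + 1 < attempts)); then
            sleep "$delay"
        fi
    done
    # Every attempt saw an empty file; let the caller decide what to do.
    return 1
}
```

A monit check could then call something like `delay=$(read_heartbeat "$file") || exit 0` to ignore a persistently empty file, or exit non-zero to alert on it. Keeping the pause well under the 1s write interval makes it unlikely that consecutive reads all land in the same truncation window.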