pt-stalk deletes non-empty files

Reported by Gavin Towey on 2012-09-08
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit
Medium
Daniel Nichter

Bug Description

Actually that's only half true. It's too aggressive about when it checks for empty files -- some important files that are slow to get their data are deleted, even though they would return data if the script waited longer:

   $CMD_MYSQL $EXT_ARGV -e "$innostat" >> "$d/$p-innodbstatus2" &
   $CMD_MYSQL $EXT_ARGV -e "$mutex" >> "$d/$p-mutex-status2" &
   open_tables >> "$d/$p-opentables2" &

   kill $mysqladmin_pid
   [ "$tail_error_log_pid" ] && kill $tail_error_log_pid
   [ "$tcpdump_pid" ] && kill $tcpdump_pid

   hostname > "$d/$p-hostname"

   for file in "$d/$p-"*; do
      if [ -z "$(grep -v '^TS ' --max-count 1 "$file")" ]; then
         log "Removing empty file $file";
         rm "$file"
      fi
   done

In particular I never get the mutex-status2 output, because it takes several seconds for the command to return any output. The for loop at the end happens before that, and deletes the file before it's written to, resulting in data loss.

I'm not sure the best solution to this; inserting a sleep before the file cleanup is pretty arbitrary. Maybe checking for no more child processes (with a timeout) would be a better approach. My bash-fu isn't quite good enough for this one I think. My work around is to simply comment out the file cleanup.

Brian Fraser (fraserbn) wrote :

I don't think that's quite right; Even if the file is deleted, wouldn't it just be recreated if there was any output from the mutex command?

#!/bin/sh

rm yadda
touch yadda
(sleep 1; echo something >> yadda) &
rm yadda
if [ -f yadda ]; then
   echo "file exists"
else
   echo "file doesn't exist"
fi
sleep 1
cat yadda

Gavin Towey (gtowey) wrote :

Try a simple script like this, which backgrounds a query that directs stdout to a file, then immediately checks for the file and deletes it. You can see that the file is created immediately by the mysql command, so the subsequent if block deletes it.

However the deleted filehandle is still in use by the previous command, so it happily writes to nothing.

#!/usr/bin/env bash
mysql -h localhost -e "select sleep(3)" > query.out &
if [[ -e "query.out" ]];
then
        echo "File exists"
        rm query.out
fi

sleep 4
if [[ ! -e "query.out" ]];
then
        echo "File doesn't exist"
fi

Brian Fraser (fraserbn) wrote :

Ah, you're absolutely right. I'll tag this to be fixed for the next release, although I too am a bit fuzzy on how to fix it at the moment.

Changed in percona-toolkit:
milestone: none → 2.1.5
status: New → Triaged
tags: added: pt-stalk
Changed in percona-toolkit:
importance: Undecided → Medium
Daniel Nichter (daniel-nichter) wrote :

Reported again as bug 1061296.

Daniel Nichter (daniel-nichter) wrote :

The problem boils down to needing to wait for subprocesses, but as Gavin noted, sleeping is arbitrary. Then again, we should only wait some amount of time in case a subprocess is stalled/dead/or just way too slow. Using the jobs command the tool can easily determine if there are any subprocesses running, but then the question becomes: how long should it wait for subprocs to finish running?

There's one perhaps less arbitrary sleep time: --sleep, i.e. the sleep time between collections, which defaults to 5 minutes. We could move the "delete empty files" task to after this --sleep? I think 5min, or whatever --sleep is set to, is a pretty generous time. Plus, if a subproc hasn't finished by the time the script starts to collect again, we surely want to kill it so we don't have 2 procs collecting the same thing (even though they'd be logging to different files, doing double work seems bad).

What do you all think?

Changed in percona-toolkit:
status: Triaged → In Progress
assignee: nobody → Daniel Nichter (daniel-nichter)
Baron Schwartz (baron-xaprb) wrote :

I'd suggest waiting for as long as the collect process runs for (30 sec). There is a "wait" builtin that waits for background jobs to finish. http://en.wikipedia.org/wiki/Wait_(command) You could do a "sleep $sleep; kill $PID" just before waiting, or something like that.

Daniel Nichter (daniel-nichter) wrote :

Agreed. It occurred to me that waiting for --sleep may be too long. If the system is in a deteriorated state (hence pt-stalk being triggered), then we shouldn't burden it further with potentially long-running processes trying to collect data. In other words, when things break, we're going to suffer --run-time seconds to collect data, then just another --run-time seconds for processes we launched at the start to finish. Any more may be adding fuel to the fire.

Daniel Nichter (daniel-nichter) wrote :

This is a difficult thing to test since it's timing related, but if nothing else the changes that have been merged should help. If anyone consistently has this problem, please get the latest pt-stalk from the 2.1 branch, try again, and let us know if these changes help. Even if some empty files are still deleted, I think this is about all we can do because we don't want to wait forever, and afaik there's no better solution.

Changed in percona-toolkit:
status: In Progress → Fix Committed
Brian Fraser (fraserbn) on 2012-11-16
Changed in percona-toolkit:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers