ERROR: Could not find '/tmp/review-tools-XXXX' due to race condition with concurrent runs.
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Click Reviewers tools (obsolete) |
Fix Released
|
Medium
|
Jamie Strandboge |
Bug Description
We've been seeing this error sporadically when the store runs automated reviews. We've speculated it's due to a race condition. I've reproduced the issue locally with 2 concurrent tools runs and we know now why this happens.
The key is this: I noticed that running the tools on a snap initially creates the dir with today's date:
drwx------ 9 roadmr roadmr 4096 Jan 25 18:20 review-
But then at some point during the run it changes the date:
drwxr-xr-x 10 roadmr roadmr 4096 Sep 28 03:26 review-
If the snap is very large, the review process goes on for a relatively long time (minutes), which opens a window of opportunity for another tools run against another, smaller snap, on the same system, to finish, notice the "stale" directory (see this one with a "Sep 28" date, clearly older than the reap threshold) and reap it. At this point, the first process will exit as we've seen:
ERROR: Could not find '/tmp/review-
Since this happens mid-run, the review is considered incomplete and reported as failed to the store.
so I think what's happening is:
1- snaps that were *created* more than 3 hours ago and which are largish, create a temporary directory with a timestamp older than 3 hours ago, and stay around for a few minutes while the review processes.
2- Another small snap review comes along, runs, and then when finishing, notices the "stale" directory and reaps it.
3- The first run fails with the error we've been seeing because its dir was indeed removed from under it mid-run.
How to repro:
1- Get a large snap and a small snap. I used a "large" snap which is about 300 MB and I picked uploads that were done more than 3 hours ago. The small snap can be the minimal "empty" snap.
2- Open terminal A, run the tools on the large snap. PYTHONPATH=$PWD ./bin/click-review large-snap.snap
3- Open terminal B, start doing "ls -la /tmp", checking the review-tools-* directory's timestamp.
4- When you see the timestamp change to something older than 3 hours ago, open terminal C, and run the tools on the small snap, PYTHONPATH=$PWD ./bin/click-review small-snap.snap
5- Almost immediately you'll see the run on terminal A fail with "ERROR: Could not find '/tmp/review-
There are several reasons why a snap could be uploaded and not processed by the tools in at least 3 hours. Perhaps a prior upload of that snap failed review, creating a "queue" of uploads. Once the stuck upload is processed successfully (rejected, or the automated review is rerun) it starts processing the uploads again, but by this time they're older than 3 hours and thus vulnerable to this race condition.
A possible solution is to check if any of the 3 timestamps (mtime, ctime, atime) is not altered, and use that. A quick check shows that os.path.getmtime is what we're using right now, but ctime does not get changed and could be used instead:
$ stat -c "%x %z %y" review-
2018-01-25 18:34:37.248566133 -0500 2018-01-25 18:34:37.216565947 -0500 2017-09-28 03:26:08.000000000 -0400
%x is atime, %z is ctime, %y is mtime.
Changed in click-reviewers-tools: | |
assignee: | nobody → Jamie Strandboge (jdstrand) |
importance: | Undecided → Medium |
status: | New → Triaged |
unsquasfs sets the mtime to be the mtime of the squashfs-root directory in the snap, and it does this after the unpack (which explains why the mtime changes), so checking the mtime is wrong. Checking the ctime resolves this.