ERROR: Could not find '/tmp/review-tools-XXXX' due to race condition with concurrent runs.

Bug #1745506 reported by Daniel Manrique
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Click Reviewers tools (obsolete)
Fix Released
Medium
Jamie Strandboge

Bug Description

We've been seeing this error sporadically when the store runs automated reviews. We've speculated it's due to a race condition. I've reproduced the issue locally with 2 concurrent tools runs and we know now why this happens.

The key is this: I noticed that running the tools on a snap initially creates the dir with today's date:

drwx------ 9 roadmr roadmr 4096 Jan 25 18:20 review-tools-pl8iff65

But then at some point during the run it changes the date:

drwxr-xr-x 10 roadmr roadmr 4096 Sep 28 03:26 review-tools-pl8iff65

If the snap is very large, the review process goes on for a relatively long time (minutes), which opens a window of opportunity for another tools run against another, smaller snap, on the same system, to finish, notice the "stale" directory (see this one with a "Sep 28" date, clearly older than the reap threshold) and reap it. At this point, the first process will exit as we've seen:

ERROR: Could not find '/tmp/review-tools-pl8iff65'

Since this happens mid-run, the review is considered incomplete and reported as failed to the store.

so I think what's happening is:

1- snaps that were *created* more than 3 hours ago and which are largish, create a temporary directory with a timestamp older than 3 hours ago, and stay around for a few minutes while the review processes.
2- Another small snap review comes along, runs, and then when finishing, notices the "stale" directory and reaps it.
3- The first run fails with the error we've been seeing because its dir was indeed removed from under it mid-run.

How to repro:

1- Get a large snap and a small snap. I used a "large" snap which is about 300 MB and I picked uploads that were done more than 3 hours ago. The small snap can be the minimal "empty" snap.

2- Open terminal A, run the tools on the large snap. PYTHONPATH=$PWD ./bin/click-review large-snap.snap

3- Open terminal B, start doing "ls -la /tmp", checking the review-tools-* directory's timestamp.

4- When you see the timestamp change to something older than 3 hours ago, open terminal C, and run the tools on the small snap, PYTHONPATH=$PWD ./bin/click-review small-snap.snap

5- Almost immediately you'll see the run on terminal A fail with "ERROR: Could not find '/tmp/review-tools-pl8iff65'"

There are several reasons why a snap could be uploaded and not processed by the tools in at least 3 hours. Perhaps a prior upload of that snap failed review, creating a "queue" of uploads. Once the stuck upload is processed successfully (rejected, or the automated review is rerun) it starts processing the uploads again, but by this time they're older than 3 hours and thus vulnerable to this race condition.

A possible solution is to check if any of the 3 timestamps (mtime, ctime, atime) is not altered, and use that. A quick check shows that os.path.getmtime is what we're using right now, but ctime does not get changed and could be used instead:

$ stat -c "%x %z %y" review-tools-4dxtf42a/
2018-01-25 18:34:37.248566133 -0500 2018-01-25 18:34:37.216565947 -0500 2017-09-28 03:26:08.000000000 -0400

%x is atime, %z is ctime, %y is mtime.

Changed in click-reviewers-tools:
assignee: nobody → Jamie Strandboge (jdstrand)
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Jamie Strandboge (jdstrand) wrote :

unsquasfs sets the mtime to be the mtime of the squashfs-root directory in the snap, and it does this after the unpack (which explains why the mtime changes), so checking the mtime is wrong. Checking the ctime resolves this.

Changed in click-reviewers-tools:
status: Triaged → Fix Committed
Revision history for this message
Jamie Strandboge (jdstrand) wrote :

This was fixed in 0.47 which I just released and which is available in the edge channel of the snap.

Changed in click-reviewers-tools:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.