Backup scripts fail due to concurrent snapshots
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Linaro AWS Tools | Fix Released | Medium | Loïc Minier | | |
Bug Description
Hi,
Last week's weekly backup and today's daily backup both failed because the backup jobs try to create snapshots at the same time.
This is the weekly job and its error:
Subject: Cron <linaro@peony> bin/run-boto linaro-
2012-04-22 00:00:05,213 ERROR 400 Bad Request
2012-04-22 00:00:05,213 ERROR <?xml version="1.0" encoding="UTF-8"?>
<Response>
volume may only be made once that call has completed.
Traceback (most recent call last):
File "linaro-
sys.
File "linaro-
snapshot_
File "linaro-
snapshot = volume.
File "/home/
return self.connection
File "/home/
Snapshot, verb='POST')
File "/home/
raise self.ResponseEr
boto.exception.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
volume may only be made once that call has completed.
and that's the daily one:
Subject: Cron <linaro@peony> bin/run-boto linaro-
2012-04-29 00:00:10,304 ERROR 400 Bad Request
2012-04-29 00:00:10,304 ERROR <?xml version="1.0" encoding="UTF-8"?>
<Response>
volume may only be made once that call has completed.
Traceback (most recent call last):
File "linaro-
sys.
File "linaro-
snapshot_
File "linaro-
snapshot = volume.
File "/home/
return self.connection
File "/home/
Snapshot, verb='POST')
File "/home/
raise self.ResponseEr
boto.exception.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
volume may only be made once that call has completed.
I believe this is only a problem over the short duration of an AWS request (a couple of seconds), not during the actual snapshotting process (which might last for many minutes).
There are many ways to fix this. A simple one is to take the weekly/monthly snapshots at a different time. A nicer one is to take a lock so that a backup script waits while another one is already running.
Ideally, we'd also change weekly and monthly backups so that they are idempotent and can all be run daily, but that's a bit trickier.
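To illustrate what "idempotent" could mean here, below is a minimal sketch, not the actual backup scripts. It assumes the script can obtain the dates of the existing weekly or monthly snapshots; the function names and the choice of ISO weeks are hypothetical. Each job would run daily and only create a snapshot when none exists yet for the current period.

```python
from datetime import date


def weekly_snapshot_due(today, existing_snapshot_dates):
    """Return True if no snapshot was taken yet in today's ISO week.

    Hypothetical helper: existing_snapshot_dates is assumed to be the
    list of dates of the weekly snapshots that already exist.
    """
    this_week = tuple(today.isocalendar())[:2]  # (ISO year, ISO week number)
    return all(
        tuple(d.isocalendar())[:2] != this_week
        for d in existing_snapshot_dates
    )


def monthly_snapshot_due(today, existing_snapshot_dates):
    """Return True if no snapshot was taken yet in today's calendar month."""
    this_month = (today.year, today.month)
    return all(
        (d.year, d.month) != this_month
        for d in existing_snapshot_dates
    )
```

With checks like these, running the weekly or monthly job twice in the same period would be harmless, since the second run would find the existing snapshot and do nothing.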
Cheers,
Changed in linaro-aws-tools: | |
status: | New → Triaged |
importance: | Undecided → Medium |
Patched ~linaro/bin/run-boto on people.l.o to take an exclusive lock (via flock) on ~/.run-boto.lck with a 30-second timeout; this means that all scripts running under run-boto should complete in less than 30 seconds, or they might break another script trying to run concurrently. We can bump the timeout to something higher if 30 seconds is a problem.
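For reference, the same idea can be sketched in Python; the actual change to run-boto is not shown in this report, so this is only an illustration of an exclusive lock with a timeout. The helper names `acquire_lock` and `release_lock` and the polling interval are assumptions; only `fcntl.flock` and its flags are real standard-library API (Unix only).

```python
import errno
import fcntl
import os
import time


def acquire_lock(path, timeout=30.0, poll_interval=0.1):
    """Take an exclusive flock on path, waiting at most timeout seconds.

    Returns the open lock file; keep it open for as long as the lock is
    needed. Raises TimeoutError if another process still holds the lock
    when the timeout expires.
    """
    lock_file = open(os.path.expanduser(path), "a")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking attempt, retried in a loop so we can enforce a timeout.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return lock_file
        except OSError as exc:
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                lock_file.close()
                raise
            if time.monotonic() >= deadline:
                lock_file.close()
                raise TimeoutError("could not lock %s within %.1fs" % (path, timeout))
            time.sleep(poll_interval)


def release_lock(lock_file):
    """Release the lock and close the underlying file."""
    fcntl.flock(lock_file, fcntl.LOCK_UN)
    lock_file.close()
```

A script would call `acquire_lock("~/.run-boto.lck")` before talking to AWS and `release_lock(...)` afterwards. As noted above, a job that holds the lock for longer than the timeout would make a concurrent job fail rather than wait.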