Backup scripts fail due to concurrent snapshots

Bug #991046 reported by Loïc Minier
This bug affects 1 person
Affects: Linaro AWS Tools
Status: Fix Released
Importance: Medium
Assigned to: Loïc Minier

Bug Description

Hi,

The weekly backup from a week ago and the daily backup from today failed because they try to create a snapshot at the same time.

This is the weekly job and its error:
Subject: Cron <linaro@peony> bin/run-boto linaro-aws-tools/aws-backuper --backup-set weekly --create flexlm.linaro.org cards.linaro.org

2012-04-22 00:00:05,213 ERROR 400 Bad Request
2012-04-22 00:00:05,213 ERROR <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidState</Code><Message>A CreateSnapshot request is already in progress for volume 'vol-3fac3253'. Another CreateSnapshot request for this
volume may only be made once that call has completed.</Message></Error></Errors><RequestID>08411da6-c612-4ba3-bba1-636f9cb10762</RequestID></Response>
Traceback (most recent call last):
  File "linaro-aws-tools/aws-backuper", line 137, in <module>
    sys.exit(main(sys.argv[1:]))
  File "linaro-aws-tools/aws-backuper", line 129, in main
    snapshot_instance_by_name(name, opts.backup_set)
  File "linaro-aws-tools/aws-backuper", line 70, in snapshot_instance_by_name
    snapshot = volume.create_snapshot(description)
  File "/home/linaro/local/boto-latest/boto/ec2/volume.py", line 156, in create_snapshot
    return self.connection.create_snapshot(self.id, description)
  File "/home/linaro/local/boto-latest/boto/ec2/connection.py", line 1578, in create_snapshot
    Snapshot, verb='POST')
  File "/home/linaro/local/boto-latest/boto/connection.py", line 916, in get_object
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidState</Code><Message>A CreateSnapshot request is already in progress for volume 'vol-3fac3253'. Another CreateSnapshot request for this
volume may only be made once that call has completed.</Message></Error></Errors><RequestID>08411da6-c612-4ba3-bba1-636f9cb10762</RequestID></Response>

And this is the daily job and its error:
Subject: Cron <linaro@peony> bin/run-boto linaro-aws-tools/aws-backuper --backup-set daily --create flexlm.linaro.org cards.linaro.org

2012-04-29 00:00:10,304 ERROR 400 Bad Request
2012-04-29 00:00:10,304 ERROR <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidState</Code><Message>A CreateSnapshot request is already in progress for volume 'vol-23a09b4e'. Another CreateSnapshot request for this
volume may only be made once that call has completed.</Message></Error></Errors><RequestID>cbf79b5e-b211-413d-bd48-dff123427a10</RequestID></Response>
Traceback (most recent call last):
  File "linaro-aws-tools/aws-backuper", line 137, in <module>
    sys.exit(main(sys.argv[1:]))
  File "linaro-aws-tools/aws-backuper", line 129, in main
    snapshot_instance_by_name(name, opts.backup_set)
  File "linaro-aws-tools/aws-backuper", line 70, in snapshot_instance_by_name
    snapshot = volume.create_snapshot(description)
  File "/home/linaro/local/boto-latest/boto/ec2/volume.py", line 156, in create_snapshot
    return self.connection.create_snapshot(self.id, description)
  File "/home/linaro/local/boto-latest/boto/ec2/connection.py", line 1578, in create_snapshot
    Snapshot, verb='POST')
  File "/home/linaro/local/boto-latest/boto/connection.py", line 916, in get_object
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidState</Code><Message>A CreateSnapshot request is already in progress for volume 'vol-23a09b4e'. Another CreateSnapshot request for this
volume may only be made once that call has completed.</Message></Error></Errors><RequestID>cbf79b5e-b211-413d-bd48-dff123427a10</RequestID></Response>

I believe this is only a problem over the short duration of an AWS request (a couple of seconds), not during the actual snapshotting process (which might last for many minutes).
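If the conflict really only lasts for the couple of seconds of the CreateSnapshot request, one possible workaround (not the fix that was applied to this bug) would be to retry the call after a short pause. The following is a minimal sketch under that assumption. The helper name `call_with_retry` and its parameters are hypothetical, and it assumes the raised exception exposes the AWS error code as an `error_code` attribute, as boto's `EC2ResponseError` is expected to do.

```python
import time


def call_with_retry(func, attempts=5, delay=2.0, sleep=time.sleep):
    """Call func(), retrying while AWS reports an InvalidState error.

    Hypothetical helper: func would wrap something like
    volume.create_snapshot(description). Any exception whose
    error_code attribute is not "InvalidState" is re-raised at once.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Assumption: the exception carries the AWS error code
            # in an attribute named error_code.
            retryable = getattr(exc, "error_code", None) == "InvalidState"
            if not retryable or attempt == attempts:
                raise
            # Give the in-flight CreateSnapshot request time to complete.
            sleep(delay)
```

In aws-backuper this would wrap the `volume.create_snapshot(description)` call seen in the traceback, for example `call_with_retry(lambda: volume.create_snapshot(description))`.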

There are many ways to fix this. A simple one is to take the weekly/monthly snapshots at a different time. A nicer one is to take a lock, so that a second backup script waits while another one is already running.

Ideally, we'd also change weekly and monthly backups so that they are idempotent and can all be run daily, but that's a bit trickier.
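To illustrate the idempotent variant, here is a rough sketch of the decision a backup set could make when it is run every day: take a snapshot only if the newest existing one in that set is old enough. Everything in it is an assumption made for illustration (the `snapshot_due` function, the `BACKUP_PERIODS` table, the one-hour slack); aws-backuper does not necessarily track snapshot timestamps this way.

```python
from datetime import datetime, timedelta

# Hypothetical minimum age of the newest snapshot before a new one
# is taken for each backup set.
BACKUP_PERIODS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
}


def snapshot_due(backup_set, existing, now, slack=timedelta(hours=1)):
    """Return True if backup_set needs a new snapshot at time now.

    existing is a list of datetimes of the snapshots already taken for
    this backup set. The slack absorbs the few seconds of jitter
    between cron runs, so a job started at 00:00:05 one week and at
    00:00:00 the next is still considered due.
    """
    if not existing:
        return True
    newest = max(existing)
    return now - newest >= BACKUP_PERIODS[backup_set] - slack
```

With such a check, running the weekly and monthly sets from the daily cron job is harmless: on most days they find a recent enough snapshot and do nothing.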

Cheers,

Loïc Minier (lool)
Changed in linaro-aws-tools:
status: New → Triaged
importance: Undecided → Medium
Loïc Minier (lool) wrote :

Patched ~linaro/bin/run-boto on people.l.o to take an exclusive lock (via flock) on ~/.run-boto.lck with a 30-second timeout. This means that every script running under run-boto should complete in less than 30 seconds, otherwise another script trying to run concurrently may time out waiting for the lock and fail. We can bump the timeout to something higher if 30 seconds is a problem.
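The actual change to run-boto is not shown in this report. As a rough illustration of the same idea in Python, the sketch below takes an exclusive flock on a lock file and gives up after a timeout. The names `acquire_lock` and `LockTimeout` and the polling interval are hypothetical; only the lock file path and the 30-second timeout come from the comment above.

```python
import errno
import fcntl
import os
import time


class LockTimeout(Exception):
    """Raised when the lock could not be taken within the timeout."""


def acquire_lock(path, timeout=30.0, poll_interval=0.1):
    """Take an exclusive flock on path, waiting at most timeout seconds.

    Returns the open file descriptor; closing it releases the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking,
            # so the loop can enforce its own deadline.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                os.close(fd)
                raise
            if time.monotonic() >= deadline:
                os.close(fd)
                raise LockTimeout("could not lock %s within %.1fs" % (path, timeout))
            time.sleep(poll_interval)
```

A wrapper would call `fd = acquire_lock(os.path.expanduser("~/.run-boto.lck"))` before starting the backup and `os.close(fd)` once it finishes. A job that holds the lock for longer than the timeout makes the waiting job fail with `LockTimeout`, which is the trade-off described above.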

Changed in linaro-aws-tools:
assignee: nobody → Loïc Minier (lool)
status: Triaged → Fix Released