ProvisioningAgent has to deal with eventual consistency

Bug #639888 reported by Gustavo Niemeyer
This bug affects 1 person
Affects: pyjuju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

The EC2 API is "eventually consistent", which means that a state retrieved from it may not yet reflect recent changes, so it's hard to base decisions on that state.

The ProvisioningAgent is in charge of firing new machines to cover requested machine states that have never been provisioned, and also to cover machine states whose machines were alive but died, for whatever reason, when they shouldn't have.

Now, imagine the following sequence of actions within the ProvisioningAgent:

1. Acquire the topology lock to ensure no one else attempts changes for now
2. Detect a machine state without an id (new machine requested by the admin)
3. Fire the new machine
4. Store the new machine id in the machine state in zookeeper
5. Release the topology lock
6. Acquire the topology lock again, and start over
7. Detect a machine state with an id (set in 4)
8. Observe that EC2 doesn't know about this id yet (eventual consistency FTW!)
9. Behave as if the machine had died, and fire another machine!
10. Repeat from 4.
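
The sequence above can be reproduced with a small simulation. This is only a sketch: LaggyProvider and naive_pass are hypothetical names, not pyjuju or EC2 APIs, and the topology lock and zookeeper storage are left out. The stand-in provider hides a newly launched instance until one listing call has gone by, which is enough to trigger the duplicate launch.

```python
class LaggyProvider:
    """Hypothetical stand-in for EC2: a launched instance only shows up
    in listings after `lag` earlier listing calls have gone by."""

    def __init__(self, lag=1):
        self.lag = lag
        self._pending = {}      # instance id -> listing calls left before visible
        self._visible = set()
        self.launch_count = 0

    def launch(self):
        self.launch_count += 1
        instance_id = "i-%04d" % self.launch_count
        self._pending[instance_id] = self.lag
        return instance_id

    def list_instances(self):
        for instance_id in list(self._pending):
            if self._pending[instance_id] == 0:
                del self._pending[instance_id]
                self._visible.add(instance_id)
            else:
                self._pending[instance_id] -= 1
        return set(self._visible)


def naive_pass(machine_state, provider):
    """One provisioning pass that trusts the listing completely."""
    instance_id = machine_state.get("instance_id")
    if instance_id is None or instance_id not in provider.list_instances():
        # Either a new request (step 2) or a machine presumed dead (step 9).
        machine_state["instance_id"] = provider.launch()


provider = LaggyProvider(lag=1)
state = {}
naive_pass(state, provider)   # steps 1-5: launches the first machine
naive_pass(state, provider)   # steps 6-9: id not listed yet, launches again
print(provider.launch_count)
```

Two machines end up running for a single requested machine state, and the first one is no longer referenced anywhere.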

This problem may be fixed by introducing a "started_time" parameter into the machine state, and ignoring machines which were acted upon recently.
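
A minimal sketch of that idea, assuming the machine state is a plain dict and that a fixed grace period is acceptable. Only "started_time" comes from this report; needs_launch, record_launch, GRACE_PERIOD and the 120 second value are hypothetical and would need tuning against how long EC2 actually takes to report a new instance.

```python
import time

# Assumed value: how long a freshly launched machine may be missing from
# the provider's listing before it is treated as dead.
GRACE_PERIOD = 120.0


def record_launch(machine_state, instance_id, now=None):
    """Store the instance id together with the time it was launched (step 4)."""
    machine_state["instance_id"] = instance_id
    machine_state["started_time"] = time.time() if now is None else now


def needs_launch(machine_state, visible_ids, now=None, grace=GRACE_PERIOD):
    """Decide whether a machine should be fired for this machine state."""
    now = time.time() if now is None else now
    instance_id = machine_state.get("instance_id")
    if instance_id is None:
        return True     # never provisioned
    if instance_id in visible_ids:
        return False    # the provider knows about it
    started = machine_state.get("started_time")
    if started is not None and now - started < grace:
        return False    # acted upon recently: give the provider time
    return True         # missing for longer than the grace period


state = {}
record_launch(state, "i-0001", now=1000.0)
print(needs_launch(state, visible_ids=set(), now=1010.0))
print(needs_launch(state, visible_ids=set(), now=2000.0))
```

The first check falls inside the grace period, so no second machine is fired; the second is well past it, so the machine is treated as dead. A state with no started_time is treated as old, which keeps the existing behaviour for machine states written before the field existed.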

Tags: agents
Changed in ensemble:
status: New → Confirmed
importance: Undecided → High
description: updated
Changed in ensemble:
milestone: none → 0.4
Changed in ensemble:
importance: High → Medium
Changed in ensemble:
milestone: 0.4 → budapest
tags: added: agents
Changed in ensemble:
milestone: budapest → dublin
Changed in ensemble:
milestone: dublin → none
Curtis Hovey (sinzui)
Changed in juju:
status: Confirmed → Triaged
Curtis Hovey (sinzui)
Changed in juju:
importance: Medium → Low