The EC2 API is "eventually consistent", which makes it hard to base decisions on a state that was just retrieved from it.
The ProvisioningAgent is in charge of firing new machines, both for requested machine states that were never seen before and for machine states whose machine was alive but died when it shouldn't have.
Now, imagine the following sequence of actions within the ProvisioningAgent:
1. Acquire the topology lock to ensure no one else attempts changes for now
2. Detect a machine state without an id (new machine requested by the admin)
3. Fire the new machine
4. Store the new machine id in the machine state in ZooKeeper
5. Release the topology lock
6. Acquire the topology lock again, and start over
7. Detect a machine state with an id (set in 4)
8. Observe that EC2 doesn't know about this id yet (eventual consistency FTW!)
9. Behave as if the machine had died, and fire another machine!
10. Repeat from 4.
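The sequence above can be reproduced with a small simulation. This is only a sketch: `FakeEC2`, its `lag` parameter and `naive_pass` are hypothetical names invented for illustration, not the real EC2 API or the actual ProvisioningAgent code, and the topology lock and ZooKeeper storage are reduced to a plain dict.

```python
class FakeEC2:
    """Hypothetical stand-in for EC2: a new instance only shows up in
    describe_instances() after `lag` earlier describe calls have missed it."""

    def __init__(self, lag):
        self.lag = lag
        self.pending = {}  # instance id -> describe calls left before visible
        self.counter = 0   # total number of instances launched

    def run_instance(self):
        self.counter += 1
        instance_id = "i-%04d" % self.counter
        self.pending[instance_id] = self.lag
        return instance_id

    def describe_instances(self):
        visible = set()
        for instance_id in self.pending:
            if self.pending[instance_id] == 0:
                visible.add(instance_id)
            else:
                self.pending[instance_id] -= 1
        return visible


def naive_pass(ec2, machine_state):
    """One locked pass of the agent: launch if there is no id yet,
    or if EC2 does not (yet) report the stored id."""
    instance_id = machine_state.get("instance_id")
    if instance_id is None or instance_id not in ec2.describe_instances():
        machine_state["instance_id"] = ec2.run_instance()


ec2 = FakeEC2(lag=2)
state = {}  # a single machine requested by the admin
for _ in range(4):
    naive_pass(ec2, state)

print("machines requested: 1, machines launched:", ec2.counter)
```

With these assumed numbers, every pass launches another machine, because the id stored in the state is always newer than what the fake EC2 is willing to report.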
This problem may be fixed by introducing a "started_time" parameter into the machine state, and skipping machines which were acted upon recently, so that EC2 has time to report a new instance before the agent concludes that it has died.
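A minimal sketch of the "started_time" idea, assuming the machine state behaves like a dict. The names `needs_launch`, `record_launch` and `GRACE_PERIOD`, and the 120 second value, are assumptions made for illustration and are not taken from the existing code.

```python
import time

# Assumed value: it has to be longer than the time EC2 needs to start
# reporting a freshly launched instance, which is not known here.
GRACE_PERIOD = 120  # seconds


def needs_launch(machine_state, visible_ids, now=None, grace=GRACE_PERIOD):
    """Decide whether the agent should fire a machine for this state."""
    now = time.time() if now is None else now
    instance_id = machine_state.get("instance_id")
    if instance_id is None:
        return True   # new machine requested by the admin
    if instance_id in visible_ids:
        return False  # EC2 knows about it, nothing to do
    started = machine_state.get("started_time")
    if started is not None and now - started < grace:
        return False  # acted upon recently, EC2 may simply be lagging
    return True       # unknown to EC2 for too long, treat as dead


def record_launch(machine_state, instance_id, now=None):
    """Store the id together with the time of the launch (step 4)."""
    machine_state["instance_id"] = instance_id
    machine_state["started_time"] = time.time() if now is None else now
```

The timestamp is written together with the id in step 4, so the check in step 8 can tell "not visible yet" apart from "gone". The trade-off is that a machine which really dies right after launch is only replaced once the grace period has expired.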