Comment 2 for bug 1893848

Ben Hoyt (benhoyt) wrote :

I'm giving up on this for now (I time-boxed the investigation to a day). Just leaving a few notes here for when this is picked back up:

* The failure is raised by the assertAvailabilityZoneMachinesDistribution() function: the three "good" machines (1, 3, and 4) are started on zones 1, 1, and 3, respectively, and that assertion function ensures there's not a delta of 2 between the "heaviest" zone (zone 1, with two machines) and the "lightest" (zones 2 and 4, with no machines).
* I noticed in the log that there's a message "got not provisioned error while waiting: machine 4 not provisioned", so for whatever reason there was a failure starting up machine 4. That's probably what triggers the uneven distribution.
* In worker/provisioner/provisioner_task.go there's a function machineAvailabilityZoneDistribution() that distributes the machines across the zones. I suspect what's happening is that in the case of the machine 4 failure, something is grabbing zone 2 in a racy way so that machine 3 is starting up on zone 1 (and doubling up with machine 1).
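
To make the failing condition concrete, here is a minimal Go sketch of the kind of check described in the first note. It is not the code of assertAvailabilityZoneMachinesDistribution(); the zoneDelta helper and the zone names are hypothetical, and it assumes the check boils down to comparing the machine counts of the most and least populated zones.

```go
package main

import "fmt"

// zoneDelta returns the difference between the machine counts of the most
// and least populated zones. Hypothetical helper for illustration only;
// it is not taken from the Juju source.
func zoneDelta(machinesByZone map[string][]string) int {
	first := true
	var lo, hi int
	for _, machines := range machinesByZone {
		n := len(machines)
		if first {
			lo, hi = n, n
			first = false
			continue
		}
		if n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	// An empty map yields 0 because lo and hi keep their zero values.
	return hi - lo
}

func main() {
	// Placement observed in the failing run: machines 1 and 3 on zone 1,
	// machine 4 on zone 3, nothing on zones 2 and 4 (zone names assumed).
	observed := map[string][]string{
		"zone1": {"1", "3"},
		"zone2": {},
		"zone3": {"4"},
		"zone4": {},
	}
	fmt.Println("delta:", zoneDelta(observed)) // prints "delta: 2"
}
```

With the placement from the failing run the delta is 2, which is the value the assertion rejects. If machine 3 had landed on zone 2 instead, the delta would be 1.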