provisioner stops instances one at a time

Bug #1622813 reported by Andrew Wilkins on 2016-09-13
This bug affects 1 person
Affects: juju
Importance: High
Assigned to: Unassigned

Bug Description

When testing parallel VM deletion changes in the Azure provider, I have found that the provisioner worker is calling Environ.StopInstances with one instance.Id at a time, and making those calls in series. If I restart the controller agent, StopInstances is then called with the instance IDs of all "Dead" machines at once.

Andrew Wilkins (axwalk) wrote :

So what I did was:
 1. bootstrap azure
 2. deploy observable-kubernetes
 3. destroy-model -y default

Tailing the logs, I found that the StopInstances method was only being called with one instance ID... sometimes. All of the machines are "Dead".

Changed in juju:
milestone: 2.0-rc1 → 2.0.0
Changed in juju:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
tags: added: ateam
Curtis Hovey (sinzui) on 2016-10-06
Changed in juju:
milestone: 2.0-rc3 → 2.0.0
Andrew Wilkins (axwalk) on 2016-10-07
Changed in juju:
status: Triaged → In Progress
assignee: Alexis Bruemmer (alexis-bruemmer) → Andrew Wilkins (axwalk)
Andrew Wilkins (axwalk) wrote :

I'm not entirely sure if it's the provisioner's fault yet.

I tested it a little differently, by using "add-machine -n 10" and then destroying the model. That time, all of the machines were stopped in one go.

I then did what I originally did, and deployed a bundle. This time, canonical-kubernetes. The units/applications get torn down gracefully, so not *all* the machines are in the "dead" state by the time the provisioner finds the first one is. That's fine. But once it's done stopping that instance, they are all in the dead state... and it's still stopping them one at a time.

Andrew Wilkins (axwalk) wrote :

OK, I *think* I see what's going on now, finally.

The state lifecycle watcher is sending a change when it sees the first machine become Dead. The API client is ready for it, and pulls it over immediately. The provisioner is ready for that, and pulls that down immediately.

The state lifecycle watcher then notices the second machine become Dead, and sends that across to the API client, which pulls it into memory. The provisioner is busy destroying the first instance, so doesn't grab it yet.

At this point, the state lifecycle watcher gradually sees that each of the remaining machines is Dead, and coalesces their IDs into a single change. So it's not until the *third* call that they all get destroyed.
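
To make the sequence above concrete, here is a minimal Go sketch of the coalescing pattern: a forwarding loop that keeps accumulating IDs while its consumer is busy, and hands over everything pending in one batch once the consumer is ready. The function name `coalesce` and the use of plain `[]string` IDs are assumptions for illustration; this is not the actual juju watcher code.

```go
package main

import "fmt"

// coalesce is a hypothetical forwarding loop. It reads batches of IDs
// from in and delivers them on out. While the consumer of out is busy,
// newly arriving IDs are appended to a pending batch instead of being
// queued as separate changes.
func coalesce(in <-chan []string, out chan<- []string) {
	var pending []string
	// send is nil while there is nothing to deliver, which disables
	// the send case in the select below.
	var send chan<- []string
	for {
		select {
		case ids, ok := <-in:
			if !ok {
				// Input closed: flush what is left, then stop.
				if len(pending) > 0 {
					out <- pending
				}
				close(out)
				return
			}
			pending = append(pending, ids...)
			send = out
		case send <- pending:
			pending = nil
			send = nil
		}
	}
}

func main() {
	in := make(chan []string)
	out := make(chan []string)
	go coalesce(in, out)

	// Three machines become Dead while the consumer is still busy.
	in <- []string{"1"}
	in <- []string{"2"}
	in <- []string{"3"}

	// The consumer becomes ready and receives all three IDs at once.
	fmt.Println(<-out) // prints [1 2 3]
	close(in)
}
```

Without such a loop in every layer, a layer that has already pulled one change into memory holds it as a separate batch, which would explain the one-then-one-then-the-rest behaviour described here.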

In the immediate term, I think we should update api/watcher to coalesce. Long term, I think we want to change the provisioner to permit multiple concurrent operations. If a provider can't terminate multiple instances concurrently, then it needs to serialise those operations. But we shouldn't hamstring them all.
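
As a rough illustration of the long-term idea, the following Go sketch shows a caller issuing stop requests concurrently, while a provider that cannot handle concurrency serialises them itself behind a mutex. The `stopper` interface, the `serialised` wrapper, the `recorder` fake and `stopConcurrently` are all hypothetical names invented for this example, and the `StopInstances(ids ...string) error` signature is a simplified stand-in rather than juju's real API.

```go
package main

import (
	"fmt"
	"sync"
)

// stopper is a hypothetical stand-in for the part of a provider
// that stops instances.
type stopper interface {
	StopInstances(ids ...string) error
}

// serialised wraps a stopper so that only one StopInstances call
// runs at a time, whatever the caller does.
type serialised struct {
	mu    sync.Mutex
	inner stopper
}

func (s *serialised) StopInstances(ids ...string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.inner.StopInstances(ids...)
}

// recorder is a fake provider that records which IDs were stopped
// and the highest number of calls it saw in flight at once.
type recorder struct {
	mu        sync.Mutex
	active    int
	maxActive int
	stopped   []string
}

func (r *recorder) StopInstances(ids ...string) error {
	r.mu.Lock()
	r.active++
	if r.active > r.maxActive {
		r.maxActive = r.active
	}
	r.mu.Unlock()

	r.mu.Lock()
	r.stopped = append(r.stopped, ids...)
	r.active--
	r.mu.Unlock()
	return nil
}

// stopConcurrently issues one StopInstances call per batch, all at
// once, and returns the first error it finds, if any.
func stopConcurrently(s stopper, batches [][]string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(batches))
	for _, b := range batches {
		wg.Add(1)
		go func(b []string) {
			defer wg.Done()
			errs <- s.StopInstances(b...)
		}(b)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	rec := &recorder{}
	provider := &serialised{inner: rec}
	batches := [][]string{{"1"}, {"2"}, {"3", "4"}}
	if err := stopConcurrently(provider, batches); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(rec.stopped), rec.maxActive)
}
```

The point of this arrangement is that the decision to serialise lives in the provider that needs it, so providers that can stop instances in parallel are not held back.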

Andrew Wilkins (axwalk) wrote :

Given that this is much less of a problem than I originally thought, and how close we are to 2.0, I'm going to defer this until later.

Changed in juju:
status: In Progress → Triaged
milestone: 2.0.0 → 2.0.1
Curtis Hovey (sinzui) on 2016-10-28
Changed in juju:
milestone: 2.0.1 → none
Changed in juju:
assignee: Andrew Wilkins (axwalk) → nobody