Azure-arm leaves machine-0 from the admin model behind

Bug #1571687 reported by Curtis Hovey
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

Juju CI is finding tens of resource groups left behind in Azure each week. On Monday 2016-04-18, Azure had 26 resource groups/instances still running that were one or more days old. Most instances were from April 15; a few were older, from April 13, and some were from April 16. All but two were machine-0 from the admin model.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-rc1
Andrew Wilkins (axwalk) wrote :

That would be because CI is timing out on kill-controller:
    http://data.vapour.ws/juju-ci/products/version-3914/native-deploy-landscape-azure/build-51/consoleText

I've found deleting VMs in Azure to be considerably slower than on other clouds. Resource group deletion is also very slow, and this is necessary to destroy a model.

I do think we should stop trying to be "friendly" in kill-controller, and just talk directly to the cloud API like we used to with --force. That would probably speed things up a bit, because then we'd just delete everything at once by deleting the resource group.

Andrew Wilkins (axwalk) wrote :

kill-controller improvements may be in order, but the crux of this issue is that CI needs to wait for destruction to complete, or should expect leakage.

Changed in juju-core:
status: Triaged → Invalid
Curtis Hovey (sinzui) wrote :

When CI times out, the build is marked as a failure. The example shows a success. I do not see a KeyboardInterrupt or any Python exception raised in the example. http://data.vapour.ws/juju-ci/products/version-3914/native-deploy-landscape-azure/build-51/consoleText

What does CI need to see in the log to know it didn't wait long enough?

Changed in juju-core:
status: Invalid → Incomplete
Cheryl Jennings (cherylj) wrote :

CI would at least need to see the message "All hosted models reclaimed, cleaning up controller machines" to know that kill-controller is trying to take down admin/machine-0.

Azure is taking FOREVER to kill / destroy controllers. I timed the last one and it was just about 10 minutes: http://paste.ubuntu.com/16033544/
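
The log check described in this comment could be sketched roughly as follows. This is only an illustration: the marker string is the one quoted above, while the function name `reached_controller_cleanup` and the idea of scanning the saved console text line by line are assumptions, not how juju-ci-tools is known to work.

```python
# Marker text quoted from the comment above.
CONTROLLER_CLEANUP_MARKER = (
    "All hosted models reclaimed, cleaning up controller machines"
)


def reached_controller_cleanup(console_text):
    """Return True if the log shows kill-controller got as far as
    cleaning up the controller machines (hypothetical helper)."""
    return any(
        CONTROLLER_CLEANUP_MARKER in line
        for line in console_text.splitlines()
    )
```

If the marker never appears in the console text, CI probably stopped waiting before kill-controller started on admin/machine-0, so a leaked machine-0 would be expected.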

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta6 → 2.0-beta7
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Changed in juju-core:
milestone: 2.0-beta8 → none
Anastasia (anastasia-macmood) wrote :

Is this still an issue?

Curtis Hovey (sinzui) wrote :

I suspect that juju-ci-tools is not waiting long enough for kill-controller to complete. Azure can take 30 minutes to delete a large deployment, but the timeout is set to 10 minutes. I think this issue will go away when bug 1604102 is fixed.
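
A minimal sketch of the kind of wait discussed here, assuming the teardown is run as a subprocess from Python. The helper name `run_with_timeout` and its return shape are hypothetical, and the real juju-ci-tools code may be structured quite differently; the point is only that the timeout is a parameter and that a timeout is reported rather than silently passed over.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run command and return (finished, output).

    finished is False when the command was still running after
    timeout_seconds; in that case resources may have been left behind.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; keep whatever output was captured.
        output = exc.output or ""
        if isinstance(output, bytes):
            output = output.decode(errors="replace")
        return False, output
    return True, completed.stdout
```

With a helper like this, the kill-controller invocation (whatever exact arguments CI passes) could be given `30 * 60` seconds on Azure instead of `10 * 60`, and a `finished` value of False could be logged as a probable leak.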

Anastasia (anastasia-macmood) wrote :

@Curtis,

The dependent bug 1604102 has been "Fix committed" since 2016-08-05.

So is this still an issue or has it indeed been addressed?

Changed in juju-core:
importance: High → Undecided
affects: juju-core → juju
Curtis Hovey (sinzui)
Changed in juju:
status: Incomplete → Fix Released