manual provider leaves jujud behind

Bug #1642295 reported by Curtis Hovey
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
Canonical Juju   Fix Released   High         Mick Gregg
  2.0            Fix Released   High         Mick Gregg
  2.1            Fix Released   High         Mick Gregg
juju-core        Won't Fix      Undecided    Unassigned

Bug Description

As seen in the non-1.25 examples here
    http://reports.vapour.ws/releases/issue/57b1c1f9749a567693457040

Juju 2 occasionally fails to clean up the machines it provisioned, and jujud is left running. The call to destroy-controller reports:

   All hosted models reclaimed, cleaning up controller machines
   00:29:40 INFO cmd supercommand.go:465 command finished
   + EXITCODE=0

This most often happens on ppc64el. All manual tests are run using a triplet of lxd containers: container A is the bootstrap host, and B and C are hosts for the charms. Once juju reports it has destroyed the controller, a clean-up script is run. The script cleans up anything left behind (so that the next test run can provision the machine) and fails the juju run if it had to clean up.
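
For illustration only, the kind of check such a clean-up script performs could be sketched as below. This is not the actual CI script: the function name, the process names to look for, and the use of `ps -eo pid=,comm=` output as input are all assumptions made for the example.

```python
# Hypothetical helper: not part of Juju or its CI scripts.
# Assumed names of processes that should be gone after destroy-controller.
JUJU_PROCESS_NAMES = ("jujud", "mongod")


def leftover_juju_processes(ps_output, names=JUJU_PROCESS_NAMES):
    """Return (pid, command) pairs for processes that were left behind.

    ps_output is assumed to be the text printed by `ps -eo pid=,comm=`:
    one process per line, PID first, then the command name.
    """
    leftovers = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pid, command = parts[0], parts[1].strip()
        if pid.isdigit() and command in names:
            leftovers.append((int(pid), command))
    return leftovers


if __name__ == "__main__":
    sample = "    1 systemd\n  812 jujud\n 1204 sshd\n"
    found = leftover_juju_processes(sample)
    # A non-empty result means the host was not cleaned up, so the
    # test run should be marked as failed after the leftovers are removed.
    print(found)
```

A caller would treat a non-empty list as a test failure, then kill the listed processes so the host can be reused.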

Curtis Hovey (sinzui) wrote :

Per bug 1475212, Juju 1.25 won't be fixed. It can provision a clean host, but always leaves resources behind that prevent juju from reprovisioning the host.

Changed in juju-core:
status: New → Won't Fix
Changed in juju:
milestone: none → 2.2.0-beta1
Curtis Hovey (sinzui)
tags: added: manual-provider
removed: maas-provider
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2.0-beta1 → 2.1-beta2
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.1-beta2 → none
Curtis Hovey (sinzui)
Changed in juju:
milestone: none → 2.1-rc1
Mick Gregg (macgreagoir)
Changed in juju:
assignee: nobody → Mick Gregg (macgreagoir)
Mick Gregg (macgreagoir) wrote :

I can reproduce consistently (2.1-rc1) with manual machines added to the controller model. No attempt is made to clean up these non-controller, manual machines.

Using a non-controller model for the added machines, I have not been able to reproduce.

Changed in juju:
status: Triaged → In Progress
Mick Gregg (macgreagoir) wrote :

This seems to be a race between the call to destroy the model and the final clean-up of the controller, in which the controller goes away before the other machines have completed their clean-up.

Unlike other providers, the manual provider (understandably) has no final clean-up to terminate or shut down instances.
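
One general way to avoid this kind of race is to poll until the other machines are gone before removing the controller. The sketch below shows that pattern only; it is not the change that landed in Juju, and `list_machines` is a hypothetical callable standing in for whatever reports the machines still being cleaned up.

```python
import time


def wait_until_machines_gone(list_machines, timeout=60.0, interval=1.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll list_machines() until it returns an empty collection.

    Returns True if every machine went away before the timeout expired,
    otherwise False. list_machines is a hypothetical callable supplied
    by the caller; clock and sleep are injectable to ease testing.
    """
    deadline = clock() + timeout
    while True:
        if not list_machines():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulate two manual machines that finish cleaning up over time.
    pending = [["1", "2"], ["2"], []]

    def fake_list_machines():
        return pending.pop(0) if pending else []

    done = wait_until_machines_gone(fake_list_machines, timeout=5,
                                    interval=0, sleep=lambda _: None)
    print("safe to remove controller:", done)
```

Returning a boolean rather than raising leaves it to the caller to decide whether a timeout should abort the controller teardown or merely be reported.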

Mick Gregg (macgreagoir) wrote :
Mick Gregg (macgreagoir)
Changed in juju:
status: In Progress → Fix Committed
Martin Packman (gz) wrote :
Changed in juju:
milestone: 2.1-rc1 → 2.2.0-alpha1
Martin Packman (gz) wrote :

Also landed on 2.0 branch: https://github.com/juju/juju/pull/6699

Aaron Bentley (abentley) wrote :

We are still seeing this against 2.1 and develop:
http://reports.vapour.ws/releases/issue/57b1c1f9749a567693457040

Changed in juju:
status: Fix Committed → Triaged
Aaron Bentley (abentley) wrote :

And 2.0.

Mick Gregg (macgreagoir) wrote :

@abentley I think those logs are showing the issue only on the controller machine itself (jujud and mongod). I think I'm seeing, in different runs, no evidence of the controller clean-up script running, or the script failing to kill the processes.

I haven't been able to reproduce in my environment yet.

Curtis Hovey (sinzui) wrote :

I believe the issue happens when the machine is under load. The ppc64el machines are underpowered and thus often under load. I can provide access to the ppc64el machines.

Mick Gregg (macgreagoir) wrote :

This is not a fix, but an improvement: PR https://github.com/juju/juju/pull/6729

If the process is uninterruptible, we can't kill it without a reboot, but I think a reboot in real life should require the operator's intervention.
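
As a rough illustration of how an uninterruptible process can be recognised on Linux: the third field of `/proc/<pid>/stat` holds the process state, and `D` means uninterruptible sleep. The helper names below are made up for this sketch and are not taken from the PR.

```python
def process_state(stat_text):
    """Return the one-letter state field from /proc/<pid>/stat contents."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so take what follows the last ')'.
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def is_uninterruptible(stat_text):
    """True if the process is in uninterruptible sleep (state 'D')."""
    return process_state(stat_text) == "D"


if __name__ == "__main__":
    # Shortened sample lines; real stat files carry many more fields.
    print(is_uninterruptible("812 (jujud) D 1 812 812 0 -1"))
    print(is_uninterruptible("990 (mongod) S 1 990 990 0 -1"))
```

A process in state `D` does not act on signals, including SIGKILL, until it leaves that state, so a clean-up script can only report it.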

Mick Gregg (macgreagoir) wrote :

Raised bug 1651674 to track the stuck jujud process bug. Marking this one as fixed, since the bug resolved here is that non-bootstrap nodes in the controller model were not being cleaned up.

Updated http://reports.vapour.ws/releases/issue/57b1c1f9749a567693457040 accordingly.

Changed in juju:
status: Triaged → Fix Committed
tags: removed: intermittent-failure ppc64el
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released