Canonical Juju

manual provider leaves jujud behind

Bug #1642295 reported by Curtis Hovey on 2016-11-16

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	High	Mick Gregg	Canonical Juju 2.2-alpha1
2.0	Fix Released	High	Mick Gregg	Canonical Juju 2.0.3
2.1	Fix Released	High	Mick Gregg	Canonical Juju 2.1-beta4
juju-core	Won't Fix	Undecided	Unassigned

Bug Description

As seen in the non-1.25 examples here:
http://reports.vapour.ws/releases/issue/57b1c1f9749a567693457040

Juju 2 occasionally fails to clean up the machines it provisioned, leaving jujud running. The call to destroy-controller nevertheless reports success:

   All hosted models reclaimed, cleaning up controller machines
   00:29:40 INFO cmd supercommand.go:465 command finished
   + EXITCODE=0

This most often happens on ppc64el. All manual tests are run using a triplet of lxd containers: container A is the bootstrap host, and B and C are hosts for the charms. Once juju reports it has destroyed the controller, a clean-up script is run. The script will fail juju if it had to clean up (so that the next test run can provision the machine).
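
The clean-up check described above can be sketched roughly as follows. This is a minimal illustration and not the actual CI script: the function names, the `LEFTOVER_NAMES` list, and the use of `ps -eo pid=,comm=` are all assumptions made for the example.

```python
import subprocess

# Process names treated as Juju leftovers. Hypothetical list for illustration.
LEFTOVER_NAMES = ("jujud", "mongod")


def find_leftovers(ps_lines, names=LEFTOVER_NAMES):
    """Return (pid, command) pairs for lines whose command is in names.

    Each line is expected in `ps -eo pid=,comm=` form: "<pid> <command>".
    Lines that do not split into two fields are skipped.
    """
    found = []
    for line in ps_lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        pid, command = parts[0], parts[1].strip()
        if command in names:
            found.append((int(pid), command))
    return found


def check_host():
    """Inspect the local host; return 1 if leftovers exist, else 0.

    A real clean-up script would also kill the processes and remove files;
    this sketch only reports, so the test run can be marked as failed.
    """
    output = subprocess.check_output(["ps", "-eo", "pid=,comm="], text=True)
    leftovers = find_leftovers(output.splitlines())
    for pid, command in leftovers:
        print("leftover process: %d %s" % (pid, command))
    return 1 if leftovers else 0
```

Returning a non-zero exit code when anything was found mirrors the behaviour in the description: the run is failed even though destroy-controller itself exited with 0.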

Curtis Hovey (sinzui) wrote on 2016-11-16:

Per bug 1475212, Juju 1.25 won't be fixed. It can provision a clean host, but always leaves resources behind that prevent juju from reprovisioning the host.

Changed in juju-core:
status:	New → Won't Fix
Changed in juju:
milestone:	none → 2.2.0-beta1

Curtis Hovey (sinzui) on 2016-11-16

tags:

added: manual-provider
removed: maas-provider

Curtis Hovey (sinzui) on 2016-11-18

Changed in juju:
milestone:	2.2.0-beta1 → 2.1-beta2

Curtis Hovey (sinzui) on 2016-12-01

Changed in juju:
milestone:	2.1-beta2 → none

Curtis Hovey (sinzui) on 2016-12-02

Changed in juju:
milestone:	none → 2.1-rc1

Mick Gregg (macgreagoir) on 2016-12-05

Changed in juju:
assignee:	nobody → Mick Gregg (macgreagoir)

Mick Gregg (macgreagoir) wrote on 2016-12-06:

I can reproduce consistently (2.1-rc1) with manual machines added to the controller model. No attempt is made to clean up these non-controller, manual machines.

Using a non-controller model for the added machines, I have not been able to reproduce.

Changed in juju:
status:	Triaged → In Progress

Mick Gregg (macgreagoir) wrote on 2016-12-07:

This seems to be a race between the call to destroy the model and the final clean-up of the controller, in which the controller goes away before the other machines have completed their own clean-up.

Unlike other providers, the manual provider (understandably) has no final clean-up to terminate or shut down instances.
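
To illustrate the ordering problem, here is a minimal sketch of one way to avoid this kind of race: poll until the non-controller machines report that they are gone, and only then tear down the controller. It is not the change made in the PRs on this bug; `list_machines` and `teardown_controller` are hypothetical callables supplied by the caller.

```python
import time


def wait_until_machines_gone(list_machines, timeout=60.0, interval=1.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll list_machines() until it returns nothing or the timeout elapses.

    Returns True if every machine finished, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if not list_machines():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def destroy(list_machines, teardown_controller, timeout=60.0, **kwargs):
    """Tear down the controller only after waiting for the other machines.

    The controller is removed even on timeout; the return value tells the
    caller whether machines may have been left behind.
    """
    finished = wait_until_machines_gone(list_machines, timeout, **kwargs)
    teardown_controller()
    return finished
```

The `clock` and `sleep` parameters exist only so the wait can be exercised without real delays.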

Mick Gregg (macgreagoir) wrote on 2016-12-08:

PR https://github.com/juju/juju/pull/6674 is the proposed fix.
PR https://github.com/juju/juju/pull/6675 is a nicety.

Mick Gregg (macgreagoir) on 2016-12-08

Changed in juju:
status:	In Progress → Fix Committed

Martin Packman (gz) wrote on 2016-12-09:

Branch against 2.1:

<https://github.com/juju/juju/pull/6683>

Changed in juju:
milestone:	2.1-rc1 → 2.2.0-alpha1

Martin Packman (gz) wrote on 2016-12-13:

Also landed on 2.0 branch: https://github.com/juju/juju/pull/6699

Aaron Bentley (abentley) wrote on 2016-12-15:

We are still seeing this against 2.1 and develop:
http://reports.vapour.ws/releases/issue/57b1c1f9749a567693457040

Changed in juju:
status:	Fix Committed → Triaged

Aaron Bentley (abentley) wrote on 2016-12-15:

And 2.0.

Mick Gregg (macgreagoir) wrote on 2016-12-16:

@abentley I think those logs are showing the issue only on the controller machine itself (jujud and mongod). In different runs, I think I'm seeing either no evidence of the controller clean-up script running, or the script failing to kill the processes.

I haven't been able to reproduce in my environment yet.

Curtis Hovey (sinzui) wrote on 2016-12-16:

#10

I believe the issue happens when the machine is under load. The ppc64el machines are underpowered and thus often under load. I can provide access to the ppc64el machines.

Mick Gregg (macgreagoir) wrote on 2016-12-20:

#11

This is not a fix, but an improvement: PR https://github.com/juju/juju/pull/6729

If the process is uninterruptible, we can't kill it without a reboot, but I think a reboot in real life should require the operator's intervention.
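
For reference, a minimal sketch of how a Linux process in uninterruptible sleep can be recognised: the state field in `/proc/<pid>/stat` is `D`. The helper names below are hypothetical and this is not code from the PR.

```python
def process_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so the state is the first field after the last ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_uninterruptible(stat_line):
    """True if the process is in uninterruptible sleep (state 'D')."""
    return process_state(stat_line) == "D"
```

A clean-up script could use a check like this to report a stuck process to the operator instead of repeatedly sending signals that cannot be delivered.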

Mick Gregg (macgreagoir) wrote on 2016-12-21:

#12

Raised bug 1651674 to track the jujud stuck process bug. Marking this one as fixed, since it covers the now-resolved bug in which non-bootstrap nodes in the controller model were not being cleaned up.

Updated http://reports.vapour.ws/releases/issue/57b1c1f9749a567693457040 accordingly.