Canonical Juju

Juju does not clean up instances that stay in BUILD too long, then loops on retries

Bug #1914829 reported by Joshua Genet on 2021-02-05

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	High	Achilleas Anagnostopoulos	Canonical Juju 2.8.10

Bug Description

I'm deploying Kubernetes on top of OpenStack on a system whose host is under high load during the deploy (90-100% CPU usage). It appears that Juju spins up an instance, the instance stays in BUILD state for longer than Juju likes, and Juju schedules a retry.

The initial instance eventually makes it to ACTIVE state just fine, but Juju has already scheduled a retry and ends up launching a duplicate instance. This snowballs and eventually the deploy is using more resources than it actually needs.

I'm curious if this happens on something like AWS as well. This could in theory blow up a user's bill or use all the resources on a machine.

---

Each machine that stays in BUILD longer than Juju likes has this message in juju status:

failed to start machine 5 (cannot run instance: max duration exceeded: instance "39a21b8b-ac23-46d8-9139-50e7ab6fdc1b" has status BUILD), retrying in 10s (8 more attempts)

Here's relevant output from my list of instances in the middle of this Kubernetes deploy. In this example, the first juju-5824ed-kubernetes-14 instance took longer than Juju liked, so Juju scheduled a retry. Eventually the first instance was able to come up, but Juju continued with the second juju-5824ed-kubernetes-14. The second one also took longer than Juju liked so it scheduled a third juju-5824ed-kubernetes-14.
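
To make the failure mode described above concrete, here is a minimal Python sketch of a start-and-retry loop. It is not taken from Juju's source or from any fix; `FakeCloud`, `start_instance`, and all of their parameters are hypothetical names invented for this illustration, and the "cloud" is just an in-memory dictionary. It only shows the general idea: if a timed-out instance is not deleted before the retry, every retry leaves another instance behind.

```python
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API (not a real OpenStack client)."""

    def __init__(self, polls_until_active):
        # Number of status polls an instance needs before it leaves BUILD.
        self.polls_until_active = polls_until_active
        self.instances = {}  # instance id -> number of polls seen so far
        self._next_id = 0

    def create(self):
        self._next_id += 1
        instance_id = f"inst-{self._next_id}"
        self.instances[instance_id] = 0
        return instance_id

    def status(self, instance_id):
        self.instances[instance_id] += 1
        if self.instances[instance_id] >= self.polls_until_active:
            return "ACTIVE"
        return "BUILD"

    def delete(self, instance_id):
        self.instances.pop(instance_id, None)


def start_instance(cloud, max_polls, attempts, cleanup):
    """Try to start one machine, retrying when the instance stays in BUILD.

    Returns the id of the instance that became ACTIVE, or None if every
    attempt timed out. With cleanup=False a timed-out instance is left
    running, which is the behaviour reported in this bug.
    """
    for _ in range(attempts):
        instance_id = cloud.create()
        for _ in range(max_polls):
            if cloud.status(instance_id) == "ACTIVE":
                return instance_id
        # Max duration exceeded: the instance is still in BUILD.
        if cleanup:
            cloud.delete(instance_id)
    return None


if __name__ == "__main__":
    slow = FakeCloud(polls_until_active=5)
    start_instance(slow, max_polls=3, attempts=3, cleanup=False)
    print("instances left without cleanup:", len(slow.instances))

    slow = FakeCloud(polls_until_active=5)
    start_instance(slow, max_polls=3, attempts=3, cleanup=True)
    print("instances left with cleanup:", len(slow.instances))
```

In this toy model an instance never finishes building after its attempt is abandoned. In the real report the abandoned instances eventually reached ACTIVE, which is what makes the leak costly.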

Joseph Phillips (manadart) wrote on 2021-02-08:

It's quite clear that the StartInstance logic allows this to occur. It should be fixed.

Changed in juju:
status:	New → Triaged
importance:	Undecided → High

Achilleas Anagnostopoulos (achilleasa) on 2021-02-25

Changed in juju:
assignee:	nobody → Achilleas Anagnostopoulos (achilleasa)
status:	Triaged → In Progress

Achilleas Anagnostopoulos (achilleasa) wrote on 2021-02-25:

PR https://github.com/juju/juju/pull/12694 includes a fix for 2.8

Achilleas Anagnostopoulos (achilleasa) on 2021-02-26

Changed in juju:
status:	In Progress → Fix Committed

Achilleas Anagnostopoulos (achilleasa) wrote on 2021-02-26:

The fix has been forward-ported to 2.9.

Harry Pidcock (hpidcock) on 2021-10-18

Changed in juju:
milestone:	none → 2.8.10
status:	Fix Committed → Fix Released
