Juju does not clean up instances that stay in BUILD too long, then loops on retries
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Fix Released | High | Achilleas Anagnostopoulos | 2.8.10 |
Bug Description
I'm deploying Kubernetes on top of OpenStack on a host that is under heavy load during the deploy (90-100% CPU usage). Juju spins up an instance, the instance stays in the BUILD state for longer than Juju allows, so Juju schedules a retry.
The initial instance eventually makes it to ACTIVE state just fine, but Juju has already scheduled a retry and ends up launching a duplicate instance. This snowballs and eventually the deploy is using more resources than it actually needs.
I'm curious if this happens on something like AWS as well. This could in theory blow up a user's bill or use all the resources on a machine.
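The snowball described above can be illustrated with a small, self-contained simulation. This is not Juju's actual provider code: `fakeCloud`, `buggyProvision`, and the poll counts are invented stand-ins that only model the reported behavior (give up on a slow BUILD, retry, never delete the first instance).

```go
package main

import "fmt"

// fakeCloud stands in for OpenStack: every started server sits in BUILD
// for several poll cycles before it would go ACTIVE.
type fakeCloud struct {
	nextID int
	status map[string]int // server ID -> polls left in BUILD
}

func (c *fakeCloud) start() string {
	c.nextID++
	id := fmt.Sprintf("juju-inst-%d", c.nextID)
	c.status[id] = 5 // loaded host: 5 polls in BUILD
	return id
}

func (c *fakeCloud) poll(id string) string {
	if c.status[id] > 0 {
		c.status[id]--
		return "BUILD"
	}
	return "ACTIVE"
}

// buggyProvision mimics the reported behavior: give up after maxPolls and
// schedule a retry WITHOUT deleting the still-building instance.
func buggyProvision(c *fakeCloud, maxPolls, maxRetries int) {
	for attempt := 0; attempt < maxRetries; attempt++ {
		id := c.start()
		for p := 0; p < maxPolls; p++ {
			if c.poll(id) == "ACTIVE" {
				return // got a machine
			}
		}
		// "max duration exceeded": retry, leaking the BUILD instance
	}
}

func main() {
	cloud := &fakeCloud{status: map[string]int{}}
	buggyProvision(cloud, 3, 4)
	// One requested machine ends up as four cloud instances.
	fmt.Println("instances launched for one machine:", len(cloud.status))
}
```

With a BUILD phase longer than the poll budget, every retry leaks one more instance, which is exactly the duplicate-server growth shown in the listing below.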
---
Each machine that stays in BUILD longer than Juju allows has this message in `juju status`:

```
failed to start machine 5 (cannot run instance: max duration exceeded: instance "39a21b8b-
```
Here's the relevant portion of my instance list from the middle of this Kubernetes deploy. In this example, the first juju-5824ed-
```
$ openstack server list
+------
| Name | Status | Networks |
+------
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-5824ed-
| juju-72c0e9-
+------
```
Changed in juju:
assignee: nobody → Achilleas Anagnostopoulos (achilleasa)
status: Triaged → In Progress

Changed in juju:
status: In Progress → Fix Committed

Changed in juju:
milestone: none → 2.8.10
status: Fix Committed → Fix Released
It's quite clear that the StartInstance retry logic allows this to occur; it should be fixed.