Juju does not clean up instances that stay in BUILD too long, then loops on retries

Bug #1914829 reported by Joshua Genet
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Achilleas Anagnostopoulos
Milestone: 2.8.10

Bug Description

I'm deploying Kubernetes on top of OpenStack on a system whose host is under high load during the deploy (90-100% CPU usage). It appears that Juju spins up an instance, the instance stays in BUILD state for longer than Juju likes, and Juju schedules a retry.

The initial instance eventually makes it to ACTIVE state just fine, but Juju has already scheduled a retry and ends up launching a duplicate instance. This snowballs and eventually the deploy is using more resources than it actually needs.

I'm curious if this happens on something like AWS as well. This could in theory blow up a user's bill or use all the resources on a machine.
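
For illustration, here is a minimal Python sketch of the retry behaviour described above and of one way to avoid the leak. It is not Juju's code (Juju is written in Go) and not necessarily how any fix works: FakeCloud, start_instance, and all of their parameters are hypothetical names, and the cloud is simulated in memory. The only difference between the two runs is whether the timed-out instance is deleted before the next attempt.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, not an OpenStack client)."""

    build_polls: list  # polls each successive boot spends in BUILD
    instances: dict = field(default_factory=dict)  # instance id -> polls left in BUILD
    booted: int = 0

    def boot(self) -> int:
        # Reuse the last entry of build_polls once the list is exhausted.
        polls = self.build_polls[min(self.booted, len(self.build_polls) - 1)]
        self.booted += 1
        self.instances[self.booted] = polls
        return self.booted

    def status(self, instance_id: int) -> str:
        if self.instances[instance_id] > 0:
            self.instances[instance_id] -= 1
            return "BUILD"
        return "ACTIVE"

    def delete(self, instance_id: int) -> None:
        del self.instances[instance_id]


def start_instance(cloud: FakeCloud, max_polls: int, attempts: int, cleanup: bool):
    """Return the id of the first instance that turns ACTIVE in time, else None."""
    for _ in range(attempts):
        instance_id = cloud.boot()
        for _ in range(max_polls):
            if cloud.status(instance_id) == "ACTIVE":
                return instance_id
        # Still in BUILD after max_polls: give up on this instance.
        if cleanup:
            cloud.delete(instance_id)  # without this, the instance is leaked
    return None


# The first two boots are slow (5 polls), the third is fast (1 poll).
leaky = FakeCloud(build_polls=[5, 5, 1])
start_instance(leaky, max_polls=3, attempts=10, cleanup=False)
print("without cleanup:", len(leaky.instances), "instances left running")

tidy = FakeCloud(build_polls=[5, 5, 1])
start_instance(tidy, max_polls=3, attempts=10, cleanup=True)
print("with cleanup:", len(tidy.instances), "instance left running")
```

In this simulation the run without cleanup ends with three instances for one machine, which matches the triple juju-5824ed-kubernetes-14 entries I saw.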

---

Each machine that stays in BUILD longer than Juju likes has this message in juju status:

failed to start machine 5 (cannot run instance: max duration exceeded: instance "39a21b8b-ac23-46d8-9139-50e7ab6fdc1b" has status BUILD), retrying in 10s (8 more attempts)
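
When many machines are affected it can help to pull the fields out of these messages. The sketch below does that with a regular expression. The pattern is inferred from the single message above, so other Juju versions or providers may word it differently, and parse_retry_message is a hypothetical helper name.

```python
import re

# Pattern inferred from one sample message; treat it as an assumption.
RETRY_MESSAGE = re.compile(
    r"failed to start machine (?P<machine>\S+) "
    r"\(cannot run instance: max duration exceeded: "
    r'instance "(?P<instance>[0-9a-f-]+)" has status (?P<status>\w+)\), '
    r"retrying in (?P<delay>\w+) \((?P<attempts>\d+) more attempts\)"
)


def parse_retry_message(message: str):
    """Return the message fields as a dict of strings, or None if it does not match."""
    match = RETRY_MESSAGE.search(message)
    return match.groupdict() if match else None


sample = (
    "failed to start machine 5 (cannot run instance: max duration exceeded: "
    'instance "39a21b8b-ac23-46d8-9139-50e7ab6fdc1b" has status BUILD), '
    "retrying in 10s (8 more attempts)"
)
print(parse_retry_message(sample))
```

The instance ID extracted this way is the one that is still building, so it is the one to check for in the server list later.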

---

Here's relevant output from my list of instances in the middle of this Kubernetes deploy. In this example, the first juju-5824ed-kubernetes-14 instance took longer than Juju liked, so Juju scheduled a retry. Eventually the first instance was able to come up, but Juju continued with the second juju-5824ed-kubernetes-14. The second one also took longer than Juju liked so it scheduled a third juju-5824ed-kubernetes-14.

$ openstack server list
+---------------------------+--------+----------------------------------------+
| Name                      | Status | Networks                               |
+---------------------------+--------+----------------------------------------+
| juju-5824ed-kubernetes-7  | BUILD  |                                        |
| juju-5824ed-kubernetes-6  | BUILD  |                                        |
| juju-5824ed-kubernetes-13 | BUILD  |                                        |
| juju-5824ed-kubernetes-4  | BUILD  |                                        |
| juju-5824ed-kubernetes-3  | BUILD  |                                        |
| juju-5824ed-kubernetes-5  | BUILD  |                                        |
| juju-5824ed-kubernetes-14 | BUILD  |                                        |
| juju-5824ed-kubernetes-12 | ACTIVE | ubuntu-net=172.16.0.184, 10.244.32.16  |
| juju-5824ed-kubernetes-13 | ACTIVE | ubuntu-net=172.16.0.216                |
| juju-5824ed-kubernetes-5  | ACTIVE | ubuntu-net=172.16.0.210                |
| juju-5824ed-kubernetes-4  | ACTIVE | ubuntu-net=172.16.0.141                |
| juju-5824ed-kubernetes-14 | ACTIVE | ubuntu-net=172.16.0.117                |
| juju-5824ed-kubernetes-3  | ACTIVE | ubuntu-net=172.16.0.124                |
| juju-5824ed-kubernetes-9  | ACTIVE | ubuntu-net=172.16.0.200, 10.244.32.126 |
| juju-5824ed-kubernetes-4  | ACTIVE | ubuntu-net=172.16.0.241                |
| juju-5824ed-kubernetes-14 | ACTIVE | ubuntu-net=172.16.0.93, 172.16.0.138   |
| juju-5824ed-kubernetes-3  | ACTIVE | ubuntu-net=172.16.0.199                |
| juju-5824ed-kubernetes-11 | ACTIVE | ubuntu-net=172.16.0.63                 |
| juju-5824ed-kubernetes-10 | ACTIVE | ubuntu-net=172.16.0.50                 |
| juju-5824ed-kubernetes-7  | ACTIVE | ubuntu-net=172.16.0.41                 |
| juju-5824ed-kubernetes-9  | ACTIVE | ubuntu-net=172.16.0.221                |
| juju-5824ed-kubernetes-8  | ACTIVE | ubuntu-net=172.16.0.207, 10.244.32.39  |
| juju-5824ed-kubernetes-2  | ACTIVE | ubuntu-net=172.16.0.176, 10.244.32.80  |
| juju-5824ed-kubernetes-1  | ACTIVE | ubuntu-net=172.16.0.242, 10.244.32.112 |
| juju-5824ed-kubernetes-0  | ACTIVE | ubuntu-net=172.16.0.202, 10.244.32.45  |
| juju-72c0e9-controller-0  | ACTIVE | ubuntu-net=172.16.0.164, 10.244.32.60  |
+---------------------------+--------+----------------------------------------+
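
A listing this long is hard to scan for duplicates by eye. The small Python sketch below, assuming the default table output of the command above, counts how often each server name appears. duplicate_server_names is a hypothetical helper, not part of the OpenStack client, and the sample is a short excerpt of the listing. If your client supports it, `openstack server list -f json` should be easier to parse than the table.

```python
from collections import Counter


def duplicate_server_names(table: str) -> dict:
    """Map each server name that appears more than once to its row count."""
    names = []
    for line in table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and the shell prompt line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells[0] and cells[0] != "Name":  # skip the header row
            names.append(cells[0])
    return {name: count for name, count in Counter(names).items() if count > 1}


sample = """\
| Name | Status | Networks |
| juju-5824ed-kubernetes-14 | BUILD | |
| juju-5824ed-kubernetes-12 | ACTIVE | ubuntu-net=172.16.0.184, 10.244.32.16 |
| juju-5824ed-kubernetes-14 | ACTIVE | ubuntu-net=172.16.0.117 |
| juju-5824ed-kubernetes-14 | ACTIVE | ubuntu-net=172.16.0.93, 172.16.0.138 |
"""
print(duplicate_server_names(sample))
```

Any name it reports more than once is a machine for which Juju launched an extra instance after a retry.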

Joseph Phillips (manadart) wrote :

It's quite clear that the StartInstance logic allows this to occur. It should be fixed.

Changed in juju:
status: New → Triaged
importance: Undecided → High
Changed in juju:
assignee: nobody → Achilleas Anagnostopoulos (achilleasa)
status: Triaged → In Progress
Achilleas Anagnostopoulos (achilleasa) wrote :

PR https://github.com/juju/juju/pull/12694 includes a fix for 2.8

Changed in juju:
status: In Progress → Fix Committed
Achilleas Anagnostopoulos (achilleasa) wrote :

The fix has been forward-ported to 2.9.

Harry Pidcock (hpidcock)
Changed in juju:
milestone: none → 2.8.10
status: Fix Committed → Fix Released