[2.8-rc2] On vSphere juju attempts to launch vms and fails when that name is already taken
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Invalid | Low | Unassigned |
Bug Description
Juju 2.8-rc2 deploying CDK on vSphere. Test run can be found here: https:/
During bundle deployment, multiple machines enter a down state and never recover:
```
Machine State DNS Inst id Series AZ Message
0 started 10.245.201.135 juju-fc0715-0 bionic poweredOn
1 started 10.245.201.220 juju-fc0715-1 bionic poweredOn
2 started 10.245.201.162 juju-fc0715-2 bionic poweredOn
3 pending 10.245.201.172 juju-fc0715-3 bionic poweredOn
4 down pending bionic The name 'juju-fc0715-4' already exists.
5 pending pending bionic cloning VM: 45.00%
6 down pending bionic cloning VM: 23.00%
7 down pending bionic The name 'juju-fc0715-7' already exists.
8 pending pending bionic cloning VM: 45.00%
9 pending pending bionic cloning VM: 45.00%
10 pending pending bionic cloning VM: 45.00%
11 pending 10.245.201.175 juju-fc0715-11 bionic poweredOn
12 pending pending bionic cloning VM: 45.00%
13 pending pending bionic cloning VM: 45.00%
14 pending pending bionic cloning VM: 45.00%
15 pending 10.245.201.106 juju-fc0715-15 bionic poweredOn
16 pending pending bionic cloning VM: 44.00%
17 pending 10.245.201.170 juju-fc0715-17 bionic poweredOn
18 down pending bionic The name 'juju-fc0715-18' already exists.
```
It looks like the vSphere cluster ran out of storage space and Juju attempted to move the VM elsewhere; I assume it did not clean up the old VM properly and ran into a name collision:
2020-05-22 23:29:25 INFO juju.worker.
So the bug here is twofold:
1. The status message is fairly vague; it should surface that the provider ran out of storage (a rough sketch of what that could look like follows below).
2. The controller attempted to re-use the VM name during cloning, which caused the cluster to barf.
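As a rough illustration of the first point (this is a sketch, not Juju's actual provider code, and the helper names are hypothetical): if the low-level vSphere fault is wrapped with context before it reaches the machine status, `juju status` could show the root cause instead of a bare name collision or a stalled clone percentage.

```go
// Hypothetical sketch: wrap the underlying vSphere error so the machine
// status message carries the root cause (out of datastore space).
package main

import (
	"errors"
	"fmt"
)

// errInsufficientStorage stands in for the fault returned when the
// datastore runs out of space during a clone.
var errInsufficientStorage = errors.New("insufficient disk space on datastore")

func cloneVM(name string) error {
	// ... issue the clone task and wait on it ...
	return errInsufficientStorage // simulate the failure seen in this bug
}

func startInstance(name string) error {
	if err := cloneVM(name); err != nil {
		// Annotate with context the user can act on; this is what would
		// end up in the machine's status message.
		return fmt.Errorf("cloning VM %q: %w", name, err)
	}
	return nil
}

func main() {
	if err := startInstance("juju-fc0715-4"); err != nil {
		fmt.Println("machine status message:", err)
	}
}
```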
tags: removed: cdo-release-blocker

Changed in juju:
milestone: none → 2.8.1
importance: Undecided → High
status: New → Triaged

Changed in juju:
milestone: 2.8.1 → 2.8.2

Changed in juju:
milestone: 2.8.2 → 2.8.3

Changed in juju:
milestone: 2.8.4 → none
Agreed on the first point - we should surface that better.
I think the second point is happening because the provider isn't removing the VM after it fails to start, before it retries. I don't understand this, though - looking at the code, we do call cleanupVM if there's an error from powering the machine on.
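One reading consistent with the status output above: the failure happens during the clone itself (the datastore fills up mid-clone) rather than at power-on, so the cleanup tied to the power-on error path never runs and the partially-cloned VM keeps the name registered. A minimal sketch of that scenario, using hypothetical helper names (cloneVM, powerOn, cleanupVM) rather than the provider's actual control flow:

```go
package main

import "fmt"

// existing simulates names already registered in the vSphere inventory.
var existing = map[string]bool{}

// cloneVM simulates a clone that registers the target name and then fails
// partway (the datastore fills up), leaving a half-built VM behind.
func cloneVM(name string) error {
	if existing[name] {
		return fmt.Errorf("The name '%s' already exists.", name)
	}
	existing[name] = true // the partial clone keeps the name registered
	return fmt.Errorf("cloning VM %q: insufficient disk space on datastore", name)
}

func powerOn(name string) error { return nil }

// cleanupVM would destroy the VM and free its name.
func cleanupVM(name string) { delete(existing, name) }

// createVM only cleans up when power-on fails; a clone-stage failure
// returns early without cleanupVM, so the name stays taken.
func createVM(name string) error {
	if err := cloneVM(name); err != nil {
		return err // suspected gap: no cleanupVM on clone failure
	}
	if err := powerOn(name); err != nil {
		cleanupVM(name) // cleanup is tied to the power-on error path only
		return err
	}
	return nil
}

func main() {
	for attempt := 1; attempt <= 2; attempt++ {
		if err := createVM("juju-fc0715-4"); err != nil {
			fmt.Printf("attempt %d: %v\n", attempt, err)
		}
	}
	// attempt 1: cloning VM "juju-fc0715-4": insufficient disk space on datastore
	// attempt 2: The name 'juju-fc0715-4' already exists.
}
```

If that is indeed the sequence, extending the cleanup (or a pre-clone existence check) to cover clone-stage failures as well would avoid the collision on retry.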