[2.8-rc2] On vSphere juju attempts to launch vms and fails when that name is already taken

Bug #1880720 reported by Michael Skalka
This bug affects 1 person
Affects: Canonical Juju
Status: Invalid
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

Juju 2.8-rc2 deploying CDK on vSphere. Test run can be found here: https://solutions.qa.canonical.com/#/qa/testRun/cde295aa-eb05-4f52-a739-7f65593659d8

During bundle deployment, multiple machines enter a down state and never recover:

Machine  State    DNS             Inst id         Series  AZ  Message
0        started  10.245.201.135  juju-fc0715-0   bionic      poweredOn
1        started  10.245.201.220  juju-fc0715-1   bionic      poweredOn
2        started  10.245.201.162  juju-fc0715-2   bionic      poweredOn
3        pending  10.245.201.172  juju-fc0715-3   bionic      poweredOn
4        down                     pending         bionic      The name 'juju-fc0715-4' already exists.
5        pending                  pending         bionic      cloning VM: 45.00%
6        down                     pending         bionic      cloning VM: 23.00%
7        down                     pending         bionic      The name 'juju-fc0715-7' already exists.
8        pending                  pending         bionic      cloning VM: 45.00%
9        pending                  pending         bionic      cloning VM: 45.00%
10       pending                  pending         bionic      cloning VM: 45.00%
11       pending  10.245.201.175  juju-fc0715-11  bionic      poweredOn
12       pending                  pending         bionic      cloning VM: 45.00%
13       pending                  pending         bionic      cloning VM: 45.00%
14       pending                  pending         bionic      cloning VM: 45.00%
15       pending  10.245.201.106  juju-fc0715-15  bionic      poweredOn
16       pending                  pending         bionic      cloning VM: 44.00%
17       pending  10.245.201.170  juju-fc0715-17  bionic      poweredOn
18       down                     pending         bionic      The name 'juju-fc0715-18' already exists.

It looks like the vSphere cluster ran out of storage space and Juju attempted to provision the VM elsewhere; I assume it did not clean up the old VM properly and so ran into a name collision:

2020-05-22 23:29:25 INFO juju.worker.provisioner provisioner_task.go:1221 started machine 15 as instance juju-fc0715-15 with hardware "arch=amd64 root-disk=8192M root-disk-source=SpindleDisks2", network config [], volumes [], volume attachments map[], subnets to zones [], lxd profiles []
2020-05-22 23:29:26 INFO juju.worker.provisioner provisioner_task.go:1221 started machine 17 as instance juju-fc0715-17 with hardware "arch=amd64 root-disk=8192M root-disk-source=SpindleDisks2", network config [], volumes [], volume attachments map[], subnets to zones [], lxd profiles []
2020-05-22 23:29:47 INFO juju.worker.instancepoller worker.go:436 machine "15" (instance ID "juju-fc0715-15") instance status changed from {"allocating" "powering on VM: 56.00%"} to {"running" "poweredOn"}
2020-05-22 23:30:08 INFO juju.worker.instancepoller worker.go:436 machine "17" (instance ID "juju-fc0715-17") instance status changed from {"allocating" "powering on VM: 56.00%"} to {"running" "poweredOn"}
2020-05-22 23:30:14 WARNING juju.worker.provisioner provisioner_task.go:1131 machine 7 failed to start in availability zone Autopilot: Failed to extend swap file from 0 KB to 4194304 KB.
2020-05-22 23:30:14 WARNING juju.worker.provisioner provisioner_task.go:1173 failed to start machine 7 (Failed to extend swap file from 0 KB to 4194304 KB.), retrying in 10s (10 more attempts)
2020-05-22 23:30:18 WARNING juju.worker.provisioner provisioner_task.go:1131 machine 18 failed to start in availability zone Autopilot: Failed to extend swap file from 0 KB to 1048576 KB.
2020-05-22 23:30:18 WARNING juju.worker.provisioner provisioner_task.go:1173 failed to start machine 18 (Failed to extend swap file from 0 KB to 1048576 KB.), retrying in 10s (10 more attempts)
2020-05-22 23:30:27 WARNING juju.worker.provisioner provisioner_task.go:1131 machine 4 failed to start in availability zone Autopilot: Failed to extend swap file from 0 KB to 1048576 KB.
2020-05-22 23:30:27 WARNING juju.worker.provisioner provisioner_task.go:1173 failed to start machine 4 (Failed to extend swap file from 0 KB to 1048576 KB.), retrying in 10s (10 more attempts)
2020-05-22 23:30:28 INFO juju.worker.provisioner provisioner_task.go:1116 trying machine 7 StartInstance in availability zone Autopilot
2020-05-22 23:30:29 INFO juju.worker.provisioner provisioner_task.go:1116 trying machine 18 StartInstance in availability zone Autopilot
2020-05-22 23:30:38 INFO juju.worker.provisioner provisioner_task.go:1116 trying machine 4 StartInstance in availability zone Autopilot
2020-05-22 23:30:45 WARNING juju.worker.provisioner provisioner_task.go:1131 machine 18 failed to start in availability zone Autopilot: cloning template VM: waiting for task "CloneVM_Task": Insufficient disk space on datastore 'SpindleDisks2'.
2020-05-22 23:30:45 WARNING juju.worker.provisioner provisioner_task.go:1173 failed to start machine 18 (cloning template VM: waiting for task "CloneVM_Task": Insufficient disk space on datastore 'SpindleDisks2'.), retrying in 10s (9 more attempts)
2020-05-22 23:30:46 WARNING juju.worker.provisioner provisioner_task.go:1131 machine 7 failed to start in availability zone Autopilot: cloning template VM: waiting for task "CloneVM_Task": Insufficient disk space on datastore 'SpindleDisks2'.
2020-05-22 23:30:46 WARNING juju.worker.provisioner provisioner_task.go:1173 failed to start machine 7 (cloning template VM: waiting for task "CloneVM_Task": Insufficient disk space on datastore 'SpindleDisks2'.), retrying in 10s (9 more attempts)

So the bug here is twofold:
1. The status message is fairly vague; it should surface that the provider ran out of storage.
2. The controller attempted to re-use the VM name when retrying the clone, which caused a name collision on the cluster.
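
To make the second point concrete, here is a minimal, purely illustrative Go sketch; cloneVM, startInstance and the in-memory datastore map are hypothetical stand-ins, not real Juju or vSphere API calls. The point is the combination of a deterministic VM name, a clone that fails part-way without being cleaned up, and a retry that then collides with the leftover name:

// Purely hypothetical sketch: a deterministic VM name plus retry-without-cleanup
// reproduces the "name already exists" failure seen in the status output above.
// cloneVM and the datastore map stand in for vSphere calls; they are not real
// Juju or govmomi APIs.
package main

import (
	"errors"
	"fmt"
)

var datastore = map[string]bool{} // stand-in for the vSphere inventory

func cloneVM(name string) error {
	if datastore[name] {
		return fmt.Errorf("The name '%s' already exists.", name)
	}
	datastore[name] = true // the half-cloned VM is registered under the name...
	return errors.New("Insufficient disk space on datastore 'SpindleDisks2'.")
}

func startInstance(machineID int) error {
	name := fmt.Sprintf("juju-fc0715-%d", machineID) // deterministic machine name
	var err error
	for attempt := 0; attempt < 2; attempt++ {
		if err = cloneVM(name); err == nil {
			return nil
		}
		// The partially cloned VM is never deleted here, so the retry
		// collides with the leftover from the previous attempt.
	}
	return err
}

func main() {
	// First attempt fails on disk space; the retry fails on the name collision.
	fmt.Println(startInstance(4))
}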

Michael Skalka (mskalka)
tags: removed: cdo-release-blocker
Revision history for this message
Christian Muirhead (2-xtian) wrote :

Agreed on the first point; we should surface that better.
I think the second point is happening because the provider isn't removing the VM after a failed start before it retries. I don't understand this, though: looking at the code, we do call cleanupVM if there's an error from powering the machine on.
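
To illustrate the kind of gap I mean, a purely hypothetical sketch (none of these helpers are the real vsphere provider code): if cleanup is only wired to the power-on step, a failure during the clone itself leaves the half-created VM, and its name, behind for the next retry:

// Purely hypothetical sketch of the suspected gap; this is not the real
// vsphere provider code, it only illustrates the control flow in question.
package sketch

import "errors"

type vm struct{ name string }

// clone stands in for the CloneVM_Task call. Here it fails part-way (as with
// "Insufficient disk space on datastore"), but the half-cloned VM still exists
// in the vSphere inventory under the requested name.
func clone(name string) (*vm, error) {
	return nil, errors.New("insufficient disk space on datastore")
}

func powerOn(v *vm) error   { return nil }
func cleanupVM(v *vm) error { return nil } // would delete the VM and free its name

func createVM(name string) error {
	v, err := clone(name)
	if err != nil {
		return err // no cleanup on this path: the partial clone keeps the name
	}
	if err := powerOn(v); err != nil {
		_ = cleanupVM(v) // cleanup only runs after a power-on failure
		return err
	}
	return nil
}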

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.8.1
importance: Undecided → High
status: New → Triaged
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8.1 → 2.8.2
Changed in juju:
milestone: 2.8.2 → 2.8.3
Pen Gale (pengale)
Changed in juju:
milestone: 2.8.4 → none
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot
Revision history for this message
Konstantinos Kaskavelis (kaskavel) wrote :

Closing this due to inactivity (low number of occurrences, and no hits for more than one year).

Changed in juju:
status: Triaged → Invalid
tags: added: solutions-qa-expired