provisioner not handling restart while a machine is being provisioned

Bug #2011599 reported by Heather Lanigan
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

On an Ubuntu 22.04 machine with 6 GB of memory and 1 CPU in Oracle Cloud, a second controller was spun up to deploy the landscape-scalable bundle, and the bundle was deployed. However, the machine stopped responding and rebooted.
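
For context, the setup was roughly equivalent to the commands below. The bootstrap target and model name are assumptions inferred from the LXD instance names in the logs, not the exact commands that were run:

  # Bootstrap a second controller against the local LXD cloud on the VM (assumed)
  juju bootstrap localhost landscape-controller
  juju add-model landscape
  # Deploy the landscape-scalable bundle into the new model
  juju deploy landscape-scalable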

On reboot, the model's provisioning was stuck:

2023-03-07 18:09:50 INFO juju.worker.provisioner provisioner_task.go:504 found machine pending provisioning id:0, details:0
2023-03-07 18:09:55 INFO juju.worker.provisioner provisioner_task.go:1003 provisioning in zones: [landscape-ken]
2023-03-07 18:09:55 INFO juju.worker.provisioner provisioner_task.go:504 found machine pending provisioning id:1, details:1
2023-03-07 18:09:55 INFO juju.worker.provisioner provisioner_task.go:1382 trying machine 0 StartInstance in availability zone landscape-ken
2023-03-07 18:09:55 INFO juju.worker.provisioner provisioner_task.go:1003 provisioning in zones: [landscape-ken]
2023-03-07 18:09:55 INFO juju.worker.provisioner provisioner_task.go:504 found machine pending provisioning id:2, details:2
2023-03-07 18:09:56 INFO juju.worker.provisioner provisioner_task.go:1382 trying machine 1 StartInstance in availability zone landscape-ken
2023-03-07 18:09:56 INFO juju.worker.provisioner provisioner_task.go:1003 provisioning in zones: [landscape-ken]
2023-03-07 18:09:56 INFO juju.worker.provisioner provisioner_task.go:504 found machine pending provisioning id:3, details:3
2023-03-07 18:09:56 INFO juju.worker.provisioner provisioner_task.go:602 provisioner-harvest-mode is set to destroyed; unknown instances not stopped [juju-b8e760-0]
2023-03-07 18:09:56 INFO juju.worker.provisioner provisioner_task.go:1382 trying machine 2 StartInstance in availability zone landscape-ken
2023-03-07 18:09:56 INFO juju.worker.provisioner provisioner_task.go:1003 provisioning in zones: [landscape-ken]
2023-03-07 18:09:56 INFO juju.worker.provisioner provisioner_task.go:602 provisioner-harvest-mode is set to destroyed; unknown instances not stopped [juju-b8e760-0]
2023-03-07 18:09:56 INFO juju.worker.provisioner provisioner_task.go:1382 trying machine 3 StartInstance in availability zone landscape-ken
2023-03-07 18:09:58 INFO juju.worker.provisioner provisioner_task.go:1003 provisioning in zones: [landscape-ken]
2023-03-07 18:09:58 INFO juju.worker.provisioner provisioner_task.go:504 found machine pending provisioning id:4, details:4
2023-03-07 18:09:59 INFO juju.worker.provisioner provisioner_task.go:602 provisioner-harvest-mode is set to destroyed; unknown instances not stopped [juju-b8e760-2 juju-b8e760-3 juju-b8e760-0 juju-b8e760-1]
2023-03-07 18:09:59 INFO juju.worker.provisioner provisioner_task.go:1382 trying machine 4 StartInstance in availability zone landscape-ken
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1397 machine 1 failed to start in availability zone landscape-ken: write tcp 10.63.241.176:54180->10.63.241.1:8443: write: broken pipe
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1438 failed to start machine 1 (write tcp 10.63.241.176:54180->10.63.241.1:8443: write: broken pipe), retrying in 10s (10 more attempts)
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1443 failed to set instance status: connection is shut down
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1397 machine 2 failed to start in availability zone landscape-ken: write tcp 10.63.241.176:54180->10.63.241.1:8443: write: broken pipe
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1438 failed to start machine 2 (write tcp 10.63.241.176:54180->10.63.241.1:8443: write: broken pipe), retrying in 10s (10 more attempts)
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1443 failed to set instance status: connection is shut down
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1397 machine 3 failed to start in availability zone landscape-ken: write tcp 10.63.241.176:54180->10.63.241.1:8443: write: broken pipe
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1438 failed to start machine 3 (write tcp 10.63.241.176:54180->10.63.241.1:8443: write: broken pipe), retrying in 10s (10 more attempts)
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1443 failed to set instance status: connection is shut down
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1397 machine 4 failed to start in availability zone landscape-ken: write tcp 10.63.241.176:54180->10.63.241.1:8443: write: broken pipe
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1438 failed to start machine 4 (write tcp 10.63.241.176:54180->10.63.241.1:8443: write: broken pipe), retrying in 10s (10 more attempts)
2023-03-07 18:23:00 WARNING juju.worker.provisioner provisioner_task.go:1443 failed to set instance status: connection is shut down
2023-03-07 18:23:00 ERROR juju.worker.cleaner cleaner.go:88 cannot cleanup state: connection is shut down
2023-03-07 18:23:00 WARNING juju.worker.metricworker sender.go:23 failed to send metrics connection is shut down - will retry later
2023-03-07 18:23:00 INFO juju.worker.logger logger.go:136 logger worker stopped
2023-03-07 18:23:00 ERROR juju.worker.provisioner workerpool.go:140 worker 4: shutting down pool due to error while handling a "start-instance 4" task: catacomb 0x400074d8e0 is dying
2023-03-07 18:23:00 ERROR juju.worker.provisioner workerpool.go:140 worker 3: shutting down pool due to error while handling a "start-instance 3" task: catacomb 0x400074d8e0 is dying
2023-03-07 18:23:00 ERROR juju.worker.provisioner workerpool.go:140 worker 2: shutting down pool due to error while handling a "start-instance 2" task: catacomb 0x400074d8e0 is dying
2023-03-07 18:23:00 ERROR juju.worker.provisioner workerpool.go:140 worker 1: shutting down pool due to error while handling a "start-instance 1" task: catacomb 0x400074d8e0 is dying
2023-03-07 18:23:00 INFO juju.worker.machineundertaker undertaker.go:138 tearing down machine undertaker

LXD had provisioned the machines, and they had tried to connect to the controller, but failed because they were unknown to it.

The harvester should be able to either remove them and retry, or reuse the existing machines, though the latter would have to be kicked to attempt connecting to the controller again.
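
For illustration, one possible manual recovery path is sketched below; the instance names are taken from the log above, and whether retry-provisioning applies here (it only acts on machines in a provisioning error state) has not been verified in this environment:

  # Check which harvest mode the provisioner is running with
  juju model-config provisioner-harvest-mode
  # List the LXD containers the controller no longer recognises
  lxc list
  # Remove the orphaned containers left over from the interrupted provisioning
  lxc delete --force juju-b8e760-0 juju-b8e760-1 juju-b8e760-2 juju-b8e760-3
  # Ask Juju to provision the stuck machines again
  juju retry-provisioning 0 1 2 3 4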

See attached logs for more detail.

Tags: deploy