Cannot remove-machine or destroy-model on OpenStack cloud when machine is down

Bug #1814271 reported by Ed Stewart
This bug affects 24 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Incomplete    Low         Unassigned
2.6             Fix Released  High        Unassigned

Bug Description

Using a Juju 2.5 client and controller, we created several models against an OpenStack cloud. However, the machines within those models have not provisioned because we have run out of resources within OpenStack.

This leaves juju status in the following state:

ubuntu@juju-d8558f-0:~$ juju status
Model         Controller  Cloud/Region     Version  SLA          Timestamp  Notes
wavvkqkykaly  juju_main   dpcop/RegionOne  2.5.0    unsupported  14:07:30Z  attempt 9 to destroy model failed (will retry): model not empty, found 1 machine, 2 applications (model not empty)

App                Version  Status   Scale  Charm              Store       Rev  OS      Notes
etcd                        waiting  0/1    etcd               jujucharms  319  ubuntu
kubernetes-master           waiting  0/1    kubernetes-master  jujucharms  542  ubuntu  exposed

Unit                 Workload  Agent       Machine  Public address  Ports  Message
etcd/0               waiting   allocating  0                               waiting for machine
kubernetes-master/0  waiting   allocating  0                               waiting for machine

Machine  State  DNS  Inst id  Series  AZ  Message
0        down        pending  bionic      cannot run instance: with fault "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance d4e0e6ed-3c41-4e2a-85c3-2fe7306a29ea."
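
For anyone triaging a similar model, the stuck machines can be picked out programmatically rather than by eye. The following is a minimal sketch, not part of the original report. It assumes the JSON produced by `juju status --format=json` has a top-level `machines` map whose entries carry `juju-status` and `machine-status` blocks; those key names and the sample data are assumptions and should be checked against your Juju version.

```python
import json


def stuck_machines(status_json: str) -> dict:
    """Return {machine_id: message} for machines reported as down or in error.

    Assumed layout (verify against your Juju version):
    machines -> <id> -> "juju-status" / "machine-status".
    """
    status = json.loads(status_json)
    stuck = {}
    for machine_id, info in status.get("machines", {}).items():
        agent = info.get("juju-status", {})
        if agent.get("current") in ("down", "error"):
            # Prefer the provider-side message, fall back to the agent's.
            provider = info.get("machine-status", {})
            stuck[machine_id] = provider.get("message") or agent.get("message", "")
    return stuck


# Hypothetical sample shaped like the status shown above (values abbreviated).
SAMPLE = json.dumps({
    "machines": {
        "0": {
            "juju-status": {"current": "down"},
            "instance-id": "pending",
            "machine-status": {"message": "cannot run instance"},
        },
        "1": {"juju-status": {"current": "started"}},
    }
})

print(stuck_machines(SAMPLE))
```

In practice the input would be the captured output of `juju status --format=json` instead of `SAMPLE`.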

In this state, it appears impossible to remove the machine (even with --force) or the model.

ubuntu@juju-d8558f-0:~$ juju remove-machine 0 --force --debug
14:09:54 INFO juju.cmd supercommand.go:57 running juju [2.5.0 gc go1.10.4]
14:09:54 DEBUG juju.cmd supercommand.go:58 args: []string{"/snap/juju/6362/bin/juju", "remove-machine", "0", "--force", "--debug"}
14:09:54 INFO juju.juju api.go:67 connecting to API addresses: [172.16.20.50:17070 10.5.1.3:17070 252.3.0.1:17070]
14:09:54 DEBUG juju.api apiclient.go:883 successfully dialed "wss://172.16.20.50:17070/model/434a24d2-cdff-40d8-8023-299ad11c7e8d/api"
14:09:54 INFO juju.api apiclient.go:603 connection established to "wss://172.16.20.50:17070/model/434a24d2-cdff-40d8-8023-299ad11c7e8d/api"
14:09:54 INFO cmd remove.go:185 removing machine 0
14:09:54 INFO cmd remove.go:193 - will remove unit etcd/0
14:09:54 INFO cmd remove.go:193 - will remove unit kubernetes-master/0
14:09:54 DEBUG juju.api monitor.go:35 RPC connection died
14:09:54 INFO cmd supercommand.go:502 command finished

However, the machine remains in the model.

Ed Stewart (emcs2)
tags: added: atos
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.6-beta1
Tim Penhey (thumper)
tags: added: teardown
Anastasia (anastasia-macmood) wrote :

@Ed Stewart (emcs2),

Do you have controller and/or machine/unit logs? I think I know what's going on but I'd like to see what you got.

Changed in juju:
milestone: 2.6-beta1 → 2.6-beta2
Changed in juju:
milestone: 2.6-beta2 → 2.6-rc1
Changed in juju:
milestone: 2.6-rc1 → 2.6-rc2
Anastasia (anastasia-macmood) wrote :

We have done a lot of work in the removal/destruction area in Juju 2.6.

I can no longer reproduce the issue, although I tested on AWS rather than OpenStack.

I have:

* bootstrapped,
* deployed 2 units of ubuntu requesting a non-existent space (this left the machines in the same state as yours, i.e. with errors),
* remove-machine failed since it has a unit,
* remove-machine with --force succeeded,
* destroy-model and destroy-controller succeeded despite a machine in error.
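
The steps above can be sketched as a command sequence. This is only an illustration of the reproduction, not the exact commands that were run (those are in the pastebin). The controller name `repro`, the model name `default`, the space name `no-such-space`, and the use of a `spaces=` constraint to request the missing space are all assumptions. The script only prints the commands; it does not run them.

```python
import shlex


def repro_commands(cloud="aws", controller="repro", model="default",
                   space="no-such-space", machine="0"):
    """Build the assumed command sequence for the reproduction (dry run only)."""
    return [
        ["juju", "bootstrap", cloud, controller],
        # Assumption: the non-existent space is requested through a constraint.
        ["juju", "deploy", "ubuntu", "-n", "2", "--constraints", f"spaces={space}"],
        # Expected to fail while the machine still hosts a unit.
        ["juju", "remove-machine", machine],
        # Expected to succeed on 2.6 according to the steps above.
        ["juju", "remove-machine", machine, "--force"],
        ["juju", "destroy-model", "-y", model],
        ["juju", "destroy-controller", "-y", controller],
    ]


for command in repro_commands():
    print(shlex.join(command))
```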

https://pastebin.ubuntu.com/p/jstzstNmVT/

I am marking this as Fix Committed.

Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Ed Stewart (emcs2) wrote :

We have retested this on Juju 2.6-rc2 with a slightly different triggering error condition. In this case, Juju was creating a VM with a floating IP on an OpenStack network that has no route to the target external network. Here are the notes from our engineer:

Environment setup

* OpenStack Rocky
* create a private network with a subnet in project_a
* ensure there is an external shared network
* no routing from the private network to the external shared network
* deploy a model with an application that uses both networks

After 20-30 minutes, juju status reports:

cannot assign public address 172.16.20.53 to instance "eb98fa8f-773b-42d2-9426-16eb478a23a9": failed to add floating ip 172.16.20.53 to server with id: eb98fa8f-773b-42d2-9426-16eb478a23a9
caused by: request (https://xxxxxxxx.net:8774/v2.1/servers/eb98fa8f-773b-42d2-9426-16eb478a23a9/action) returned unexpected status: 400; error info: {"badRequest": {"code": 400, "message": "Unable to associate floating IP 172.16.20.53 to fixed IP 192.168.2.17 for instance eb98fa8f-773b-42d2-9426-16eb478a23a9. Error: External network a4e7ace0-102e-4836-9143-5bd175ea26cc is not reachable from subnet 7baa2429-daf9-4fcb-9042-5ef5485c58a5. Therefore, cannot associate Port c643a9ee-39d2-4e19-8ca0-a324b09cd128 with a Floating IP.\nNeutron server returns request_ids: ['req-edfe24df-c3e5-410e-9337-9caa8d859ac5']"}}
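
The useful part of this status message is the Neutron error buried in the JSON after `error info:`. A minimal sketch for pulling it out is below. It assumes the message always contains the literal marker `error info: ` followed by a JSON object with a single wrapper key such as `badRequest`; that is inferred from this one example, not from a documented format, and the sample string is an abbreviated stand-in for the full message above.

```python
import json

MARKER = "error info: "


def provider_error_message(status_message: str):
    """Return the provider's "message" field, or None if it cannot be found."""
    _, found, payload = status_message.partition(MARKER)
    if not found:
        return None
    try:
        # raw_decode tolerates trailing text after the JSON object.
        body, _ = json.JSONDecoder().raw_decode(payload.lstrip())
    except json.JSONDecodeError:
        return None
    if not isinstance(body, dict):
        return None
    # Assumed shape: {"badRequest": {"code": 400, "message": "..."}}
    for detail in body.values():
        if isinstance(detail, dict) and "message" in detail:
            return detail["message"]
    return None


# Abbreviated sample based on the message above.
SAMPLE = (
    'returned unexpected status: 400; error info: '
    '{"badRequest": {"code": 400, "message": "External network is not reachable from subnet."}}'
)

print(provider_error_message(SAMPLE))
```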

Test case 1
=================
Juju controller runs 2.5.4
set up the environment as described above
after we see juju is stuck, we try to remove the model with juju destroy-model
destroy never completes, stuck at above error
juju upgrade-model -m controller --agent-version 2.6-rc2 --agent-stream devel
wait until the upgrade completes
observe the model that was previously stuck is now gone from juju models --all

Test case 2
===================
Juju controller runs 2.6-rc2
set up the environment as described above
after we see juju is stuck, we try to remove the model with juju destroy-model
destroy never completes, stuck at above error and the output of juju destroy-model is stuck in a loop:
Destroying model
Waiting on model to be removed, 4 machine(s)...
Waiting on model to be removed, 4 machine(s)...
...

Conclusion
================
The bottom line is that 2.6-rc2 still does not cleanly remove a model that got stuck creating a VM and attaching a public address to it because the internal network is misconfigured (no route to the external network).

However, the model was cleaned up correctly during the upgrade from 2.5.4 to 2.6-rc2. I suspect that is an unexpected side effect, which could be an interesting clue.

Anastasia (anastasia-macmood) wrote :

@Ed Stewart (emcs2),

This is very curious... Could you please retry this scenario with 2.6 (2.6.2 has been released), but use 'destroy-model --force'?

I'll mark this report as Incomplete until we hear back from you :) There may well be something that we need to address.

Please attach controller logs and model dump. It will be very helpful to see what is happening...
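
The retry suggested above can be sketched as a small wrapper: attempt a normal destroy, and only fall back to `--force` if it does not finish in time. This is an illustration, not an official procedure. It assumes a client where `juju destroy-model` accepts `-y` and `--force`, and the 30-minute timeout is an arbitrary choice. The `run` parameter exists only so the logic can be exercised without a controller.

```python
import subprocess


def destroy_model(model, timeout_s=1800, run=subprocess.run):
    """Destroy a model, escalating to --force if the normal attempt times out."""
    base = ["juju", "destroy-model", "-y", model]
    try:
        # The plain command waits until the model is removed, which is
        # where the "Waiting on model to be removed" loop was observed.
        run(base, check=True, timeout=timeout_s)
        return "destroyed"
    except subprocess.TimeoutExpired:
        pass
    # Assumption: --force is available in the installed client.
    run(base + ["--force"], check=True, timeout=timeout_s)
    return "force-destroyed"
```

Calling `destroy_model("mymodel")` (a hypothetical model name) would run the real commands. Capturing the controller logs before forcing is still worthwhile, since a forced removal can hide the underlying failure.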

Changed in juju:
status: Fix Released → Incomplete
status: Incomplete → Fix Released
milestone: 2.6-rc2 → none
status: Fix Released → Incomplete
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot