Juju failing to remove unit due to attached storage stuck dying

Bug #1950928 reported by Haw Loeung
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth
Milestone: 2.9.21

Bug Description

Hi,

Noticed a lot of noise from juju failing to remove machines, with logging such as this:

| 2021-11-15 00:22:15 WARNING juju.state cleanup.go:213 cleanup failed in model 5e38a904-8ee0-48db-8ff1-7e2feee0835a for machine("19"): machine 19 has attachments [volume-3]

It looks like at some point there was a volume attached, but it's now stuck being cleaned up:

| volumes:
|   "3":
|     provider-id: dd052012-96c3-4df3-b0e7-87e5b0507788
|     attachments:
|       machines:
|         "19":
|           device: vdb
|           read-only: false
|           life: alive
|     pool: cinder
|     size: 51200
|     persistent: true
|     life: dying
|     status:
|       current: attaching
|       message: |-
|         failed to list volume attachments
|         caused by: Resource at http://...:8774/v2/.../servers/.../os-volume_attachments not found
|         caused by: request (http://...:8774/v2/.../servers/.../os-volume_attachments) returned unexpected status: 404; error info: {"itemNotFound": {"message": "The resource could not be found.", "code": 404}}
|       since: 03 Mar 2020 06:19:06Z

See https://pastebin.canonical.com/p/fNVxjn7SDm/ and https://pastebin.canonical.com/p/nXVKHSJcgc/
(sorry, company private)

Not having a name associated with this storage means we can't try force removing it with 'juju remove-storage'.
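For reference, if this volume did have a named storage instance, the usual escape hatch would be something like the below (the storage name here is hypothetical, and I'm assuming remove-storage's --force flag applies; a sketch only):

juju storage --format yaml
juju remove-storage --force pgdata/3

But with no storage name to pass in, that's a dead end.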

This is on a recently upgraded 2.9.18 controller (was 2.8.7).

Haw Loeung (hloeung)
description: updated
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.9.20
importance: Undecided → High
status: New → Triaged
Revision history for this message
Ian Booth (wallyworld) wrote (last edit):

What happens if you:

set debug logging on the storage provisioner worker, i.e. add to logging-config:

juju.worker.storageprovisioner=DEBUG
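On 2.9 that can be set with model-config, e.g. something like this (keeping <root> at WARNING is an assumption; use whatever root level you already have):

juju model-config logging-config="<root>=WARNING;juju.worker.storageprovisioner=DEBUG"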

juju remove-machine 19 --force

(it might take a few minutes to time out and run the force)

Can you do that and attach pastebins of juju dump-db output for:
- the cleanups collection
- machine 19 from the machines collection
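
If dump-db is awkward, the same data can be read straight from mongo on the controller; a sketch, assuming the usual model-uuid-prefixed _id scheme:

| juju:PRIMARY> db.cleanups.find().pretty()
| juju:PRIMARY> db.machines.find({_id: "5e38a904-8ee0-48db-8ff1-7e2feee0835a:19"}).pretty()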

Also attach the last 5 or 10 minutes of logs, i.e. starting from before the remove --force was run.

Then turn off the extra debugging.

Thanks

Revision history for this message
Ian Booth (wallyworld) wrote :

From the logs

2021-11-18 01:38:49 WARNING juju.state cleanup.go:213 cleanup failed in model 5e38a904-8ee0-48db-8ff1-7e2feee0835a for forceRemoveMachine("19"): removing attachment plan of volume 3 from machine 19: state changing too quickly; try again soon

--force should have worked, but there's an issue with the logic and it's failing to handle the bad/incomplete data it is finding.
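
The document the cleanup is tripping over can also be inspected directly in mongo to confirm the stale life value; a sketch, using the _id format shown in the next comment:

| juju:PRIMARY> db.volumeattachments.find({_id: "5e38a904-8ee0-48db-8ff1-7e2feee0835a:19:3"}).pretty()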

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Revision history for this message
Haw Loeung (hloeung) wrote (last edit):

Per Ian, this DB update helped unstick things:

| juju:PRIMARY> db.volumeattachments.update(
| ... {_id: "5e38a904-8ee0-48db-8ff1-7e2feee0835a:19:3"},
| ... { $set:{life: 1}}
| ... )

where the _id is model-uuid:machine-num:volume-num, and life: 1 marks the attachment as dying (Juju stores life as 0=alive, 1=dying, 2=dead). All of that is provided by the controller log:

| 2021-11-15 00:22:15 WARNING juju.state cleanup.go:213 cleanup failed in model 5e38a904-8ee0-48db-8ff1-7e2feee0835a for machine("19"): machine 19 has attachments [volume-3]
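
To double-check before retrying, the same document can be queried back and should now show life: 1 (a sketch, same _id as above):

| juju:PRIMARY> db.volumeattachments.find({_id: "5e38a904-8ee0-48db-8ff1-7e2feee0835a:19:3"}, {life: 1})

After that, the remove-machine --force cleanup was able to proceed.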

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9.20 → 2.9.21
Changed in juju:
status: Fix Committed → Fix Released