Azure provider: Can't remove machine (with storage)

Bug #1900789 reported by Haw Loeung
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Triaged  Low         Unassigned

Bug Description

Hi,

I can't seem to remove this machine: the unit is gone, but the machine is not. It seems stuck in 'dying', possibly because of the attached storage?

| Machine  State    DNS           Inst id    Series  AZ  Message
| 0        started  20.193.47.26  machine-0  focal
| 1        down     20.53.72.175  machine-1  focal
| 3        started  20.53.109.34  machine-3  focal

| $ juju remove-machine 1 --force --no-wait
| removing machine 1

| $ juju list-storage
| Unit                       Storage id                 Type        Pool   Size    Status    Message
| ubuntu-repository-cache/0  ubuntu-repository-cache/0  filesystem  azure  100GiB  attached
| ubuntu-repository-cache/2  ubuntu-repository-cache/2  filesystem  azure  100GiB  attached

| $ juju status --format=yaml
|   "1":
|     juju-status:
|       current: down
|       message: agent is not communicating with the server
|       since: 30 Sep 2020 09:37:28Z
|       version: 2.8.3
|       life: dying
|     dns-name: 20.53.72.175
|     ip-addresses:
|     - 20.53.72.175
|     - 192.168.0.5
|     instance-id: machine-1
|     machine-status:
|       current: running
|       since: 14 Jul 2020 04:39:50Z
|     modification-status:
|       current: idle
|       since: 14 Jul 2020 04:37:08Z
|     series: focal
|     network-interfaces:
|       eth0:
|         ip-addresses:
|         - 192.168.0.5
|         mac-address: 00:0d:3a:cb:08:65
|         gateway: 192.168.0.1
|         is-up: true
|     constraints: instance-type=Standard_D2_v2
|     hardware: arch=amd64 cores=2 mem=7168M root-disk=30720M
| ...
| volumes:
|   "1":
|     provider-id: volume-1
|     attachments:
|       machines:
|         "1":
|           device-link: /dev/disk/azure/scsi1/lun0
|           read-only: false
|           life: alive
|     pool: azure
|     size: 102400
|     persistent: true
|     life: dying
|     status:
|       current: attached
|       since: 25 Aug 2020 06:44:23Z

Full 'juju storage --format=yaml' output - https://paste.ubuntu.com/p/HxYhJQCZSP/

Model is Juju 2.8.3.
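
For anyone checking whether a model is in the same state, here is a minimal Python sketch that looks for the two symptoms shown above: a machine whose agent is down while its life is 'dying', and a volume that is 'dying' while its machine attachment is still 'alive'. It assumes the JSON output of 'juju status --format=json' and 'juju storage --format=json' uses the same key names as the YAML excerpts in this report; that layout is an assumption, not something confirmed here. The function names and the sample data are hypothetical, and whether the live attachment is what actually blocks removal is only a guess.

```python
def stuck_machines(status):
    """Return IDs of machines whose agent is down while their life is 'dying'.

    `status` is assumed to be the parsed output of `juju status --format=json`.
    """
    stuck = []
    for machine_id, machine in status.get("machines", {}).items():
        agent = machine.get("juju-status", {})
        if agent.get("life") == "dying" and agent.get("current") == "down":
            stuck.append(machine_id)
    return stuck


def live_attachments_on_dying_volumes(storage):
    """Return (volume_id, machine_id) pairs for dying volumes still attached.

    `storage` is assumed to be the parsed output of `juju storage --format=json`,
    with a top-level "volumes" key as in the excerpt above.
    """
    pairs = []
    for volume_id, volume in storage.get("volumes", {}).items():
        if volume.get("life") != "dying":
            continue
        machines = volume.get("attachments", {}).get("machines", {})
        for machine_id, attachment in machines.items():
            if attachment.get("life") == "alive":
                pairs.append((volume_id, machine_id))
    return pairs


# Hypothetical sample data, trimmed from the values in this report.
sample_status = {
    "machines": {
        "0": {"juju-status": {"current": "started"}},
        "1": {"juju-status": {"current": "down", "life": "dying"}},
    }
}
sample_storage = {
    "volumes": {
        "1": {
            "life": "dying",
            "attachments": {"machines": {"1": {"life": "alive"}}},
        }
    }
}

print(stuck_machines(sample_status))
print(live_attachments_on_dying_volumes(sample_storage))
```

On a real model, the two dictionaries could be built by passing the output of the two commands through json.loads.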

Haw Loeung (hloeung)
description: updated
Pen Gale (pengale) wrote :

I tried to reproduce this on juju 2.8.5 by adding a machine w/ storage on azure. I wasn't able to reproduce -- the machine got removed cleanly.

```
juju deploy postgresql --storage pgdata=10G
# Deploys the machine as expected
juju remove-unit postgresql/0
# Removes the machine as expected
# I can remove storage with:
juju remove-storage pgdata/0
```

What is the history of this cluster? Does `juju remove-storage` unstick you? Did you attempt remove-unit before attempting to remove-machine?

Changed in juju:
status: New → Incomplete
Haw Loeung (hloeung) wrote :

Controllers were originally deployed/provisioned with 2.8.1 and upgraded recently to 2.8.3 (Sept 30th). The model itself was also deployed with 2.8.1 and upgraded to 2.8.3.

I originally used remove-unit so that I could spin up a new one. This was because the unit itself failed to boot due to the grub calloc bug, LP:1889556. I think this is the important missing piece here: the unit was left in an unbootable state, so Juju is unable to properly (or fully) remove it?

The storage doesn't exist though:

| $ juju list-storage
| Unit                       Storage id                 Type        Pool   Size    Status    Message
| ubuntu-repository-cache/0  ubuntu-repository-cache/0  filesystem  azure  100GiB  attached
| ubuntu-repository-cache/2  ubuntu-repository-cache/2  filesystem  azure  100GiB  attached

Changed in juju:
status: Incomplete → New
John A Meinel (jameinel) wrote :

Have you done a remove-machine --force? The normal destroy model should be shutting things down cleanly, but 'remove-machine --force' should delete the machine without waiting for units, etc. to clean up.

Haw Loeung (hloeung) wrote :

On Tue, Oct 27, 2020 at 04:52:38PM -0000, John A Meinel wrote:
> Have you done a remove-machine --force? The normal destroy model should be
> shutting things down cleanly, but 'remove-machine --force' should delete
> the machine not waiting for units, etc to clean up.

Yes, it's in the original bug report:

| $ juju remove-machine 1 --force --no-wait
| removing machine 1

Tried both --force and --force with --no-wait.

Regards,

Haw

Pen Gale (pengale) wrote :

@hloeung: are you currently in a state where a production model is stuck with a machine that won't go away? If so, please grab us in the #Juju channel and we can provide some support on unsticking it.

If not, I think that we can triage this bug as medium and loop back to it later. It sounds like we don't have a general issue w/ removing Azure instances, though we do have an edge case where things can get wedged, which is documented by this bug, for a later fix.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot