Juju doesn't remove KVM virtual machines on MaaS nodes when using "juju remove-unit"

Bug #1982960 reported by Marco Marino
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips (manadart)
Milestone: 2.9.34

Bug Description

Hello,
Let me know if you need additional information. I'm not sure whether this bug only happens in a nested virtualized environment, so please double-check.

Environment description:
Juju version: 2.9.32
MaaS version: 3.2

I used three virtual machines on my Ubuntu 20.04 workstation: one is the MaaS server, one is the Juju controller, and the third is "node1", a node used for deploying Juju applications. I also used my local workstation as the Juju client.

I installed MaaS 3.2 as a snap on the first node, then added two machines using the MAC addresses of the VMs I created on my workstation. I used the "virsh" power type, which works without any problem.
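
For reference, the MaaS side of the setup was along these lines (a sketch only; the snap channel, the "admin" profile name, the MAC address, and the power addresses below are placeholders, not taken from this report):

$ sudo snap install maas --channel=3.2/stable
$ maas admin machines create \
    architecture=amd64 \
    mac_addresses=52:54:00:aa:bb:cc \
    power_type=virsh \
    power_parameters_power_address=qemu+ssh://user@192.168.100.1/system \
    power_parameters_power_id=node1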

At this point, I created the environment from my workstation:

$ juju clouds --local # Using MaaS

$ juju add-credential maas-cloud

$ juju bootstrap maas-cloud --to juju-controller.maas
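
For completeness, the local cloud definition that "juju clouds --local" lists would look roughly like this (a sketch; the endpoint IP is a placeholder):

clouds:
  maas-cloud:
    type: maas
    auth-types: [oauth1]
    endpoint: http://192.168.100.52:5240/MAAS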

When it completed, I deployed a new application with the following command:

juju deploy ubuntu --series focal

The third VM has been deployed by MaaS without any problem.

Then, I did the following:

juju deploy ubuntu --series bionic testubuntu --to kvm:0

The previous command creates a new VM inside machine 0 (I know, machine 0 is itself a VM, so we have nested virtualization here).
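
(If in doubt, a quick way to confirm that machine 0 exposes hardware virtualization to its guests, purely as an illustration:)

$ juju ssh 0 "egrep -c '(vmx|svm)' /proc/cpuinfo"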

When it finished, the output of "juju status" was the following:

marino-mrc@discovery:~$ juju status
Model    Controller         Cloud/Region       Version  SLA          Timestamp
default  maascloud-default  maascloud/default  2.9.32   unsupported  13:10:46+02:00

App         Version  Status  Scale  Charm   Channel  Rev  Exposed  Message
testubuntu  18.04    active      1  ubuntu  stable    20  no
ubuntu      20.04    active      1  ubuntu  stable    20  no

Unit           Workload  Agent  Machine  Public address  Ports  Message
testubuntu/5*  active    idle   0/kvm/3  192.168.100.60
ubuntu/0*      active    idle   0        192.168.100.54

Machine  State    DNS             Inst id              Series  AZ       Message
0        started  192.168.100.54  node1                focal   default  Deployed
0/kvm/3  started  192.168.100.60  juju-a7ae43-0-kvm-3  bionic           Container started

Also, let's check the status of running VMs on machine 0:

marino-mrc@discovery:~$ juju ssh 0 "sudo virsh list --all"
setlocale: No such file or directory
 Id   Name                  State
--------------------------------------
 4    juju-a7ae43-0-kvm-3   running

Now, let's try to remove the unit:

marino-mrc@discovery:~$ juju remove-unit testubuntu/5
removing unit testubuntu/5

Juju status again:
marino-mrc@discovery:~$ juju status
Model    Controller         Cloud/Region       Version  SLA          Timestamp
default  maascloud-default  maascloud/default  2.9.32   unsupported  13:12:58+02:00

App         Version  Status   Scale  Charm   Channel  Rev  Exposed  Message
testubuntu           unknown      0  ubuntu  stable    20  no
ubuntu      20.04    active       1  ubuntu  stable    20  no

Unit       Workload  Agent  Machine  Public address  Ports  Message
ubuntu/0*  active    idle   0        192.168.100.54

Machine  State    DNS             Inst id  Series  AZ       Message
0        started  192.168.100.54  node1    focal   default  Deployed

Looks good, but if you check the running VMs again, 0/kvm/3 is still running:

marino-mrc@discovery:~$ juju ssh 0 "sudo virsh list --all"
setlocale: No such file or directory
 Id   Name                  State
--------------------------------------
 4    juju-a7ae43-0-kvm-3   running

Also, I can still ping the IP:
marino-mrc@discovery:~$ ping -c 2 192.168.100.60
PING 192.168.100.60 (192.168.100.60) 56(84) bytes of data.
64 bytes from 192.168.100.60: icmp_seq=1 ttl=64 time=1.01 ms
64 bytes from 192.168.100.60: icmp_seq=2 ttl=64 time=0.958 ms

--- 192.168.100.60 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.958/0.981/1.005/0.023 ms

Also, if you check the MaaS web interface and the database, the IP, the interface, and the instance have all been removed. So we have a "zombie" VM that is still running, with a reachable IP that can cause duplicate addresses on the network.
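
As a manual workaround, the leftover domain can be torn down by hand with standard virsh commands (a sketch, using the domain name from "virsh list" above):

$ juju ssh 0 "sudo virsh destroy juju-a7ae43-0-kvm-3"
$ juju ssh 0 "sudo virsh undefine juju-a7ae43-0-kvm-3"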

Note that this doesn't happen if I use an LXD container instead of a KVM machine (the container is removed, as confirmed by the output of "sudo lxc list").
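
For comparison, the equivalent check in the LXD case is:

$ juju ssh 0 "sudo lxc list"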

Thank you.
Regards,
Marco

Ian Booth (wallyworld) wrote:

Do the juju logs show any errors or other indication that there was an issue?
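
For example, something along these lines would replay the relevant history (the module filter is just a suggestion):

$ juju debug-log --replay --include-module juju.container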

Marco Marino (marino-mrc) wrote:

Hello Ian,
I don't see any relevant log:

machine-0-kvm-9: 14:52:03 INFO juju.worker.deployer checking unit "testubuntu/14"
unit-testubuntu-14: 14:52:03 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-testubuntu-14: 14:52:05 INFO juju.worker.uniter.operation ran "stop" hook (via hook dispatching script: dispatch)
unit-testubuntu-14: 14:52:07 INFO juju.worker.uniter.operation ran "remove" hook (via hook dispatching script: dispatch)
machine-0-kvm-9: 14:52:07 INFO juju.worker.deployer checking unit "testubuntu/14"
machine-0-kvm-9: 14:52:07 INFO juju.worker.deployer recalling unit "testubuntu/14"
machine-0-kvm-9: 14:52:07 INFO juju.worker.deployer removing unit "testubuntu/14"
machine-0: 14:52:07 INFO juju.container-setup initial container setup with ids: [0/kvm/9]
machine-0: 14:52:08 INFO juju.container-setup initial container setup with ids: [0/kvm/9]
machine-0: 14:52:08 INFO juju.worker.provisioner stopping known instances [juju-a7ae43-0-kvm-9]
machine-0: 14:52:08 INFO juju.container.broker.kvm stopping kvm container for instance: juju-a7ae43-0-kvm-9
machine-0: 14:52:08 INFO juju.container.broker.kvm released all addresses for container "0/kvm/9"
machine-0: 14:52:08 INFO juju.worker.provisioner removing dead machine "0/kvm/9"

It seems that everything is OK according to the logs, but machine 0/kvm/9 is still running (the IP address is still reachable, even though the MaaS DB has been updated).
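
If more detail is needed, the model's log verbosity can be raised with something like the following (the module names are a best guess based on the log output above):

$ juju model-config logging-config="<root>=INFO;juju.container=TRACE"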

Regards,
Marco

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Joseph Phillips (manadart)
Joseph Phillips (manadart) wrote:

Reproduced, with logs showing the issue: Juju thinks the container is already stopped.

machine-1: 11:19:22 DEBUG juju.worker.provisioner worker 1: processing task "stop-instances"
machine-1: 11:19:22 INFO juju.worker.provisioner stopping known instances [juju-bc971f-1-kvm-0]
machine-1: 11:19:22 INFO juju.container.broker.kvm stopping kvm container for instance: juju-bc971f-1-kvm-0
machine-1: 11:19:22 DEBUG juju.container.kvm juju-bc971f-1-kvm-0 is already stopped
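
A quick way to cross-check what libvirt itself reports for that domain at this point (illustrative):

$ juju ssh 1 "sudo virsh domstate juju-bc971f-1-kvm-0"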

Changed in juju:
status: Triaged → In Progress
importance: Medium → High
milestone: none → 2.9.34
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
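
With the fix released in the 2.9.34 milestone, affected deployments can pick it up with an agent upgrade, e.g.:

$ juju upgrade-controller --agent-version 2.9.34
$ juju upgrade-model --agent-version 2.9.34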