Orphaned LXD VM left behind after creation failure, requires manual cleanup

Bug #2055252 reported by Trent Lloyd
This bug affects 1 person
Affects  Status                 Importance  Assigned to  Milestone
MAAS     Status tracked in 3.6
  3.5    Won't Fix              Medium      Unassigned
  3.6    Triaged                Medium      Unassigned

Bug Description

When MAAS fails to create an LXD VM (in this case specifically because the root-disk size was requested as 0 due to Juju Bug #1983084), an orphaned LXD instance is left behind on the VM host, but no corresponding machine exists in MAAS. The LXD VM requires manual cleanup.

How to reproduce
1. Deploy MAAS (v3.4)
2. Register an LXD VM host to MAAS
3. Bootstrap a juju controller against the MAAS (Juju v3.3.1)
4. Deploy a charm with storage support, without specifying a "root-disk" constraint - which causes Bug #1983084 to request a VM with 0 storage:
juju add-model test1
juju deploy ceph-osd -n1 --channel quincy/stable --storage osd-devices=maas,8G

I can't seem to easily recreate this failure using the MAAS CLI; e.g. "maas admin vm-host compose 1 hostname=test3 cores=3 memory=2048 storage=0.0" manages to create and boot an LXD VM with no disk, rather than trying to create a disk with size 0. Perhaps someone can figure out an easier way to reproduce the failure without the Juju complexity.

From a brief look at the code, it seems there may be no cleanup error handling at all: errors during creation are not caught, and I don't see any other obvious cleanup code. But I only looked briefly and may have missed something at a higher level:
https://github.com/maas/maas/blob/dbd701455fa1045d7fbab45e4fc1daa139e4c6cb/src/provisioningserver/drivers/pod/lxd.py#L450
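For illustration only, here is a minimal sketch of the kind of compensating cleanup the compose path could perform, assuming a pylxd-style client; compose_with_cleanup and the exact calls are hypothetical stand-ins, not MAAS's actual code:

def compose_with_cleanup(client, definition):
    # Hypothetical sketch, not the real MAAS compose() path: if creating or
    # starting the VM fails part-way through, tear down whatever was created
    # so no orphaned instance is left behind on the LXD host.
    instance = None
    try:
        instance = client.virtual_machines.create(definition, wait=True)
        instance.start(wait=True)
        return instance
    except Exception:
        if instance is not None:
            try:
                instance.delete(wait=True)  # best-effort rollback of the LXD side
            except Exception:
                pass  # a real implementation would log this secondary failure
        raise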

The secondary storage volume is also left attached. Juju also retries 10 times, so every time this happens you end up with 10 VMs left behind.
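Until that is fixed, the orphans have to be removed by hand on the VM host, e.g. with "lxc delete <name> --force" and "lxc storage volume delete <pool> <volume>". Below is a minimal pylxd sketch of that cleanup, assuming the orphan names are already known (the name used here is made up):

from pylxd import Client

# Illustrative cleanup script (not MAAS code): delete LXD VMs that exist on
# the host but are absent from MAAS. Assumes pylxd is installed and LXD is
# reachable over the local unix socket.
client = Client()

orphans = ["juju-orphan-1"]  # hypothetical names of VMs MAAS no longer knows about
for name in orphans:
    vm = client.virtual_machines.get(name)
    if vm.status == "Running":
        vm.stop(force=True, wait=True)
    vm.delete(wait=True)  # roughly equivalent to "lxc delete <name> --force"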

Revision history for this message
Trent Lloyd (lathiat) wrote (last edit):

Looking at this a bit further, the LXD VM is actually created, but is perhaps not functional or fails to start? There is no logging from the LXD client even with snap debugging turned on, and I can't actually see any obvious specific error or cause for why the transaction is ultimately rolled back.

Possibly a variant of Bug #2028284?

2024-02-28 07:39:27 django.db.backends: [debug] (0.000) ROLLBACK TO SAVEPOINT "s140124733044288_x26"; args=None
2024-02-28 07:39:27 django.db.backends: [debug] (0.000) RELEASE SAVEPOINT "s140124733044288_x26"; args=None
2024-02-28 07:39:27 maasserver: [error] ################################ Exception: No available machine matches constraints: [('agent_name', ['20d71182-29d5-4b3e-8125-7789ad7a60b8']), ('arch', ['amd64']), ('interfaces', ['1:space=1']), ('storage', ['root:0,0:8']), ('zone', ['default'])] (resolved to "arch=amd64/generic interfaces=1:space=1 storage=root:0,0:8 zone=default") ################################
2024-02-28 07:39:27 maasserver: [error] Traceback (most recent call last):
  File "/snap/maas/32469/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/utils/views.py", line 298, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/api/support.py", line 62, in __call__
    response = super().__call__(request, *args, **kwargs)
  File "/snap/maas/32469/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
    response = func(*args, **kwargs)
  File "/snap/maas/32469/usr/lib/python3.10/dist-packages/piston3/resource.py", line 197, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/snap/maas/32469/usr/lib/python3.10/dist-packages/piston3/resource.py", line 195, in __call__
    result = meth(request, *args, **kwargs)
  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/api/support.py", line 371, in dispatch
    return function(self, request, *args, **kwargs)
  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/api/machines.py", line 2608, in allocate
    raise NodesNotAvailable(message)
maasserver.exceptions.NodesNotAvailable: No available machine matches constraints: [('agent_name', ['20d71182-29d5-4b3e-8125-7789ad7a60b8']), ('arch', ['amd64']), ('interfaces', ['1:space=1']), ('storage', ['root:0,0:8']), ('zone', ['default'])] (resolved to "arch=amd64/generic interfaces=1:space=1 storage=root:0,0:8 zone=default")

Revision history for this message
Jones Ogolo (jonesogolo) wrote :

Hi Trent,

Thanks for pointing this out. From our findings, Juju appears to be running the following command:
`maas admin machines allocate arch=amd64/generic interfaces=1:space=1 storage=root:0,0:10 zone=default`
which creates orphan containers, triggering the issue you mentioned. We'll have a look.

Changed in maas:
milestone: none → 3.5.x
importance: Undecided → Medium
status: New → Triaged