LXD vm compose fails with - This "instances" entry already exists

Bug #2028284 reported by Marian Gasparovic
This bug affects 1 person
Affects   Status         Importance   Assigned to           Milestone
MAAS      Fix Released   High         Alexsander de Souza
3.3       Fix Released   High         Alexsander de Souza
3.4       Fix Released   High         Alexsander de Souza

Bug Description

Using LXD VMs, we are hitting

2023-07-19-22:17:36 root DEBUG [localhost]: maas root vm-host compose 1 hostname=landscapeamqp-1 cores=2 memory=4096 storage=40.0 zone=2
2023-07-19-22:17:42 root ERROR [localhost] Command failed: maas root vm-host compose 1 hostname=landscapeamqp-1 cores=2 memory=4096 storage=40.0 zone=2
2023-07-19-22:17:42 root ERROR 1[localhost] STDOUT follows:
Unable to compose machine because: Failed talking to pod: Failed creating instance record: Add instance info to the database: This "instances" entry already exists

MAAS logs

https://oil-jenkins.canonical.com/artifacts/b1301c81-3e45-4743-b2e4-2466a84dbdf7/generated/generated/maas/logs-2023-07-19-22.18.38.tgz

Revision history for this message
Adam Collard (adam-collard) wrote :

That error comes from LXD when you try to create an instance with a name that already exists.

for i in 1 2; do lxc launch --empty foo; done

Revision history for this message
Bill Wear (billwear) wrote :

so how did these duplicate instance names come about -- is it something in the MAAS code that causes this, or is it something related to the test rigging?

Bill Wear (billwear)
Changed in maas:
status: New → Incomplete
Revision history for this message
Marian Gasparovic (marosg) wrote :

There is no test rigging; it all comes directly from the deployment.

Changed in maas:
status: Incomplete → New
Revision history for this message
Marian Gasparovic (marosg) wrote :

I did several experiments. First I made sure there were no machines defined:

$ for i in 30 31 32; do ssh 10.244.40.$i lxc list ; done
Warning: Permanently added '10.244.40.30' (ED25519) to the list of known hosts.
+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+
Warning: Permanently added '10.244.40.31' (ED25519) to the list of known hosts.
To start your first container, try: lxc launch ubuntu:22.04
Or for a virtual machine: lxc launch ubuntu:22.04 --vm

+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+
Warning: Permanently added '10.244.40.32' (ED25519) to the list of known hosts.
To start your first container, try: lxc launch ubuntu:22.04
Or for a virtual machine: lxc launch ubuntu:22.04 --vm

+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+

Then ran the deployment

$ fce_wrap build --layer maas --steps compose_vms
2023-08-09-13:10:11 root DEBUG fce --debug build --layer maas --steps compose_vms
2023-08-09-13:10:11 root DEBUG FCE version: 2.21+git.10.g0acb2a1d
2023-08-09-13:10:11 root DEBUG Running 'zone' project check
2023-08-09-13:10:11 fce.build INFO Started building layer: maas
Warning: Permanently added '10.244.40.30' (ED25519) to the list of known hosts.
2023-08-09-13:10:12 fce.maas INFO Starting step: maas:compose_vms
2023-08-09-13:10:12 root DEBUG [localhost]: maas root vm-hosts read
2023-08-09-13:10:16 root DEBUG [localhost]: maas root version read
2023-08-09-13:10:19 root DEBUG [localhost]: maas root rack-controllers read hostname=leafeon
2023-08-09-13:10:24 root DEBUG [localhost]: maas root vm-host update 1 memory_over_commit_ratio=10 cpu_over_commit_ratio=10
2023-08-09-13:10:29 root DEBUG [localhost]: maas root machines read
2023-08-09-13:10:34 foundationcloudengine.layers.configuremaas INFO Creating elastic-1 in leafeon
2023-08-09-13:10:34 root DEBUG [localhost]: maas root vm-host compose 1 hostname=elastic-1 cores=2 memory=24576 storage=500.0 zone=2
2023-08-09-13:10:41 root DEBUG [localhost]: maas root tags create name=elastic
2023-08-09-13:10:44 root ERROR [localhost] Command failed: maas root tags create name=elastic
2023-08-09-13:10:44 root ERROR 1[localhost] STDOUT follows:
{"name": ["Tag with this Name already exists."]}
2023-08-09-13:10:44 root ERROR 2[localhost] STDERR follows:
b''
2023-08-09-13:10:44 root DEBUG [localhost]: maas root tag update-nodes elastic add=np6tyw
2023-08-09-13:10:48 foundationcloudengine.layers.configuremaas INFO Creating grafana-1 in leafeon
2023-08-09-13:10:48 root DEBUG [localhost]: maas root vm-host compose 1 hostname=grafana-1 cores=2 memory=3072 storage=40.0 zone=2
2023-08-09-13:10:54 root DEBUG [localhost]: maas root tags create name=grafana
2023-08-09-13:10:58 root ERROR [localhost] Command failed: maas root tags create name=grafana
2023-08-09-13:10:58 root ERROR 1[localhost] STDOUT follows:
{"name": ["Tag with this Name already exists."]}
2023-08-09-13:10:58 root ERROR 2[localhost] STDERR follows...

Revision history for this message
Björn Tillenius (bjornt) wrote :

I suspect what happens is that in the API handler we first create the instance in LXD and then do the DB work. If the DB work fails due to a serialization error, which is expected now and then, the whole request is retried and the retry tries to create a new instance with the same name in LXD.

It's quite hard to handle this properly without significant changes, but I'll take a look to see what we can do. I'm also going to see whether we changed something recently so that you see more serialization errors than before.
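
A minimal sketch of that failure mode, assuming the retry wraps both the LXD call and the DB work (the names below are placeholders, not the actual MAAS code):

# Illustrative only: create_lxd_instance / create_machine_record stand in for the
# real MAAS and LXD calls; SerializationError stands in for PostgreSQL's
# "could not serialize access" error.
class SerializationError(Exception):
    pass

def compose_with_retry(hostname, create_lxd_instance, create_machine_record, retries=3):
    for _attempt in range(retries):
        try:
            create_lxd_instance(hostname)    # external side effect, not rolled back
            create_machine_record(hostname)  # DB work, may raise SerializationError
            return
        except SerializationError:
            # The DB transaction is rolled back and the request is retried, but the
            # LXD instance from the previous attempt still exists, so the next
            # create_lxd_instance(hostname) fails with "entry already exists".
            continue
    raise RuntimeError("compose failed after retries")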

Revision history for this message
Björn Tillenius (bjornt) wrote :

BTW, I did reproduce this by raising a serialization error at the end of BMC.create_machine().

Revision history for this message
Björn Tillenius (bjornt) wrote :

There are three serialization errors between the time the compose command was executed and the time the 'instance already exists' error happened. The most likely culprit is this:

2023-07-19 22:17:41.506 UTC [2508416] maas@maasdb ERROR: could not serialize access due to concurrent update
2023-07-19 22:17:41.506 UTC [2508416] maas@maasdb STATEMENT: UPDATE "maasserver_podhints" SET "pod_id" = 1, "cores" = 20, "memory" = 131072, "cpu_speed" = 2600, "local_storage" = 0, "cluster_id" = NULL WHERE "maasserver_podhints"."id" = 1

The way the compose form works is that it creates an instance in LXD and then updates the pod hints in the database. After that it adds a post-commit hook to refresh the commissioning data for the pod, which also syncs the hints to the database.

So what can happen when composing multiple VMs in a row is that the pod hints sync in the form and the one in the post-commit hook from the previous compose command will conflict.

Still trying to figure out how to solve this.
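
Roughly, the flow has two writers of the same maasserver_podhints row; a sketch with illustrative names (not the actual MAAS code):

# Illustrative only: writer A is the in-transaction hints update, writer B is the
# post-commit refresh, which syncs the hints again.
def compose(pod, params, create_lxd_instance, sync_pod_hints, post_commit_do,
            refresh_commissioning_data):
    create_lxd_instance(params["hostname"])
    sync_pod_hints(pod)                              # writer A: UPDATE maasserver_podhints ...
    post_commit_do(refresh_commissioning_data, pod)  # writer B: runs after commit
    # When two compose calls run back to back, writer B from the first compose can
    # overlap with writer A of the second one, and one transaction fails to serialize.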

Revision history for this message
Björn Tillenius (bjornt) wrote :

In the long term, we should make the compose form handle conflict errors. One idea is to add a user.* instance property containing the MAAS system id, so that we know that MAAS created the instance and can handle deleting it if necessary.

In the short term, we can make BMC.sync_hints() a bit smarter and not save anything if nothing changed. Currently only virsh actually returns any hints when composing a machine, and only LXD sends commissioning results in the post-commit hook.
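
A minimal sketch of that short-term idea, assuming the sync compares the discovered values against the stored row before writing (field names are illustrative, not the exact MAAS model):

# Illustrative only: skip the UPDATE when nothing changed, so back-to-back
# compose calls don't contend on the same maasserver_podhints row.
HINT_FIELDS = ("cores", "memory", "cpu_speed", "local_storage")

def sync_hints(stored_hints, discovered_hints):
    changed = False
    for field in HINT_FIELDS:
        new_value = getattr(discovered_hints, field)
        if getattr(stored_hints, field) != new_value:
            setattr(stored_hints, field, new_value)
            changed = True
    if changed:
        stored_hints.save()  # only write (and risk a serialization conflict) when needed
    return changed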

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → 3.5.0
summary: - vm compose fails with - This "instances" entry already exists
+ LXD vm compose fails with - This "instances" entry already exists
Changed in maas:
assignee: nobody → Alexsander de Souza (alexsander-souza)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Trent Lloyd (lathiat) wrote :

Wanted to note that I've just filed Bug #2055252, which may be sort of related, but I don't see a serialisation error in my case (and it's not obvious exactly why the VM creation fails and gets rolled back).

The LXD VM creation fails specifically with a storage constraint of ('storage', ['root:0,0:8']), which Juju generates when a second disk comes from Juju storage but no root-disk size is given via a root-disk constraint, so it passes in 0 (this is Juju Bug #1983084).

In my bug it seems the DB entries are rolled back but the LXD VM is left behind and requires manual cleanup. Since this also comes from Juju, it retries 10 times, so we get 10 VMs left behind when this happens.

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
Changed in maas:
status: Fix Committed → Fix Released