LXD container fails to start due to UNIQUE constraint on container.name

Bug #1945813 reported by Moises Emilio Benzan Mora
38
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Heather Lanigan

Bug Description

It seems like juju implicitly retried creating an lxd machine after a failed configuration, leading to `machine 5/lxd/6 failed to start: UNIQUE constraint failed: containers.name`.

We expected the LXD container to boot up and proceed with the layer deployment, however it failed to start and blocked the rest of the process.

Crashdump:
```
672078e3-269d-4513-a881-57e30b79a8f3/bcache-tuning_5/var/log/juju/machine-5.log
116:2021-10-01 05:21:50 INFO juju.container-setup container_initialisation.go:110 initial container setup with ids: [5/lxd/6 5/lxd/0 5/lxd/1 5/lxd/2 5/lxd/3 5/lxd/4 5/lxd/5]
204:2021-10-01 05:22:06 INFO juju.worker.provisioner provisioner_task.go:405 found machine pending provisioning id:5/lxd/6, details:5/lxd/6
368:2021-10-01 05:29:58 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
675:2021-10-01 05:35:30 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: UNIQUE constraint failed: containers.name
676:2021-10-01 05:35:30 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (UNIQUE constraint failed: containers.name), retrying in 10s (10 more attempts)
794:2021-10-01 05:48:34 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
826:2021-10-01 05:48:56 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
827:2021-10-01 05:48:56 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (9 more attempts)
940:2021-10-01 05:54:15 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
971:2021-10-01 05:54:25 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
972:2021-10-01 05:54:25 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (8 more attempts)
1036:2021-10-01 05:56:00 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1078:2021-10-01 05:56:22 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1079:2021-10-01 05:56:22 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (7 more attempts)
1137:2021-10-01 05:57:45 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1200:2021-10-01 05:57:58 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1201:2021-10-01 05:57:58 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (6 more attempts)
1339:2021-10-01 05:59:13 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1390:2021-10-01 05:59:32 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1391:2021-10-01 05:59:32 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (5 more attempts)
1495:2021-10-01 06:00:48 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1560:2021-10-01 06:01:04 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1561:2021-10-01 06:01:04 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (4 more attempts)
1666:2021-10-01 06:02:11 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1725:2021-10-01 06:02:25 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1726:2021-10-01 06:02:25 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (3 more attempts)
1866:2021-10-01 06:03:39 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1912:2021-10-01 06:03:58 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1913:2021-10-01 06:03:58 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (2 more attempts)
2165:2021-10-01 06:05:47 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
2271:2021-10-01 06:06:11 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
2272:2021-10-01 06:06:11 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (1 more attempts)
2520:2021-10-01 06:08:44 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
2595:2021-10-01 06:09:07 ERROR juju.worker.provisioner provisioner_task.go:977 cannot start instance for machine "5/lxd/6": Container 'juju-5bcbde-5-lxd-6' already exists
```

Path to the run: https://solutions.qa.canonical.com/testruns/testRun/7ba33504-e4a3-4970-9071-c4968b0f3407

Path to the artifacts: https://oil-jenkins.canonical.com/artifacts/7ba33504-e4a3-4970-9071-c4968b0f3407/generated/generated/openstack/juju-crashdump-openstack-2021-10-01-06.39.57.tar.gz

Path to build: https://oil-jenkins.canonical.com/job/fce_build/12561//console

Failed LXD machine: Machine5/lxd/6

All occurrences can be found here:
https://solutions.qa.canonical.com/bugs/bugs/bug/1945813

description: updated
tags: added: foundations-engine-cdo-qa
tags: added: cdo-qa foundations-engine
removed: foundations-engine-cdo-qa
description: updated
Revision history for this message
Harry Pidcock (hpidcock) wrote :

LXD appears to have been left in a dirty state since its trying to either create a container when one was already created (but perhaps LXD error'd on the first attempt) or there was a container leftover that wasn't cleaned up from a previous run. Will have to dig deeper.

Changed in juju:
importance: Undecided → High
milestone: none → 2.9.18
status: New → Triaged
Revision history for this message
John A Meinel (jameinel) wrote :

So my understanding is that this isn't because there was already a 'juju-5bcbde-5-lxd-6' before the environment was deployed. But that when we went to create juju-5bcbde-5-lxd-6 we initially got some sort of failure. But that failure wasn't strictly fatal, so there were pieces that *did* get created. And when we come back to retry we aren't handling the fact that there are artifacts already in place.

It would be good to understand what the original failure was, and know how we should be handling it better.

Changed in juju:
milestone: 2.9.18 → 2.9.19
Revision history for this message
Harry Pidcock (hpidcock) wrote :

Still occurring on Solutions QA runs.

Revision history for this message
Marian Gasparovic (marosg) wrote :

When we deploy in virtual environment (which has constrained resources), we hit it every single time. It started to appear about two weeks ago.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

This is reproducing with a simple config in:
https://bugs.launchpad.net/juju/+bug/1950771

juju 2.9.18
juju bootstrap google
juju deploy ubuntu -n 3
juju deploy mysql --to kvm:0
juju add-unit mysql --to lxd:1
juju add-unit mysql --to lxd:2

machine-2: 19:29:45 INFO juju.container.lxd starting new container "juju-780be5-2-lxd-0" (image "ubuntu-16.04-server-cloudimg-amd64-lxd.tar.xz")
machine-2: 19:29:45 ERROR juju.worker.provisioner cannot start instance for machine "2/lxd/0": Failed creating instance record: Add instance info to the database: This instance already exists

| juju-780be5-2-lxd-0 | STOPPED | | | CONTAINER | 0 |

More investigation is required. Without extra lxd debugging enabled there were no useful logs. I was able to ssh to machine 2 and successfully create a container.

Changed in juju:
milestone: 2.9.19 → 2.9.20
Changed in juju:
milestone: 2.9.20 → 2.9.21
Changed in juju:
milestone: 2.9.21 → 2.9.22
Revision history for this message
Heather Lanigan (hmlanigan) wrote (last edit ):

I am unable to reproduce with juju 2.9.22, currently 2.9/edge, and lxd 4.20 (21902), nor juju 2.9.18 with lxd 4.20 (21902). lxd 4.20/stable updated on 2021-11-18, after I reproduced prior. Will try some more.

Revision history for this message
Heather Lanigan (hmlanigan) wrote (last edit ):

It looks like juju is retrying a failed container provisioning with the same container name. Produced with the reproducer of 1950771. This is not the same as the logs provided a above.

2021-12-08 21:47:18 INFO juju.container.lxd container.go:255 starting new container "juju-d3dd2f-1-lxd-0" (image "ubuntu-16.04-server-cloudimg-amd64-lxd.tar.xz")
2021-12-08 21:49:32 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 1/lxd/0 failed to start: websocket: close 1006 (abnormal closure): unexpected EOF
2021-12-08 21:49:32 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 1/lxd/0 (websocket: close 1006 (abnormal closure): unexpected EOF), retrying in 10s (10 more attempts)
2021-12-08 21:49:42 WARNING juju.container.broker broker.go:76 no name servers supplied by provider, using host's name servers.
2021-12-08 21:49:42 WARNING juju.container.broker broker.go:76 no search domains supplied by provider, using host's search domains.
2021-12-08 21:49:42 WARNING juju.container.broker broker.go:84 incomplete DNS config found, discovering host's DNS config
2021-12-08 21:49:45 WARNING juju.container.lxd manager.go:200 no CIDR was detected for the following networks: [fan-253]
2021-12-08 21:49:46 INFO juju.cloudconfig userdatacfg_unix.go:575 Fetching agent: curl -sSfw 'agent binaries from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' --connect-timeout 20 --noproxy "*" --insecure -o $bin/tools.tar.gz <[https://10.240.0.58:17070/model/228ded55-1a5a-4d21-8f6a-2be9c5d3dd2f/tools/2.9.18-ubuntu-amd64 https://253.0.58.1:17070/model/228ded55-1a5a-4d21-8f6a-2be9c5d3dd2f/tools/2.9.18-ubuntu-amd64 https://35.237.242.119:17070/model/228ded55-1a5a-4d21-8f6a-2be9c5d3dd2f/tools/2.9.18-ubuntu-amd64]>
2021-12-08 21:49:47 INFO juju.container.lxd container.go:255 starting new container "juju-d3dd2f-1-lxd-0" (image "ubuntu-16.04-server-cloudimg-amd64-lxd.tar.xz")
2021-12-08 21:49:49 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 1/lxd/0 failed to start: Failed creating instance record: Add instance info to the database: This instance already exists

New info: "websocket: close 1006 (abnormal closure): unexpected EOF" tends to happen if the host machine is created with only 1G of memory. As do other container problems. Using a memory constraint of 4G does not reproduce the websocket failure.

Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
Changed in juju:
milestone: 2.9.22 → 2.9.23
Changed in juju:
milestone: 2.9.23 → 2.9.24
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

Is there any progress on this bug? It is stopping the majority of testing on sqa's weaker hardware.
One manual fix seems to be to ssh into the host machine and delete the lxd container that juju is failing to create, then on juju's next attempt to create the container is works most of the time.

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

Interestingly, we only see this bug on focal. Bionic containers never show this problem.

Changed in juju:
milestone: 2.9.24 → 2.9.25
Revision history for this message
John A Meinel (jameinel) wrote :

This seems to be the original error:

 2021-12-08 21:49:32 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 1/lxd/0 failed to start: websocket: close 1006 (abnormal closure): unexpected EOF

And what has actually happened is we issued a Start request to LXD, which failed with 1006. At which point, the container has been registered in LXD's database, but Juju doesn't know that it started successfully. At which point we then try again, and get rejected by LXD.

If we get a 'container already exists' failure, we could query LXD about that container and see if it fits the constraint of the container we were trying to create (with the right network devices, lxd profile, etc). And then treat it as though our start request succeeded instead of failed.

or we could delete it directly and then issue another Start request. That seems more dangerous because if we did have some other blip where we got confused and swapped containers, etc we might be stopping a container that was healthy and doing good work.

Changed in juju:
milestone: 2.9.25 → 2.9.26
Changed in juju:
milestone: 2.9.26 → 2.9.27
Revision history for this message
Simon Richardson (simonrichardson) wrote :
Changed in juju:
milestone: 2.9.27 → 2.9.28
Changed in juju:
milestone: 2.9.28 → 2.9.29
Revision history for this message
Simon Richardson (simonrichardson) wrote :
Changed in juju:
assignee: Heather Lanigan (hmlanigan) → Simon Richardson (simonrichardson)
status: Triaged → In Progress
Revision history for this message
Heather Lanigan (hmlanigan) wrote (last edit ):

We're seeing different invocations of the UNIQUE constraint also:

Failed creating instance from image: Error inserting volume "juju-8d5c96-5-lxd-5" for project "default" in pool "default" of type "containers" into database "UNIQUE constraint failed: storage_volumes.storage_pool_id, storage_volumes.node_id, storage_volumes.project_id, storage_volumes.name, storage_volumes.type"

Revision history for this message
Heather Lanigan (hmlanigan) wrote (last edit ):

Pondering if we should have a different `num-provision-workers` for the container provider. Currently the default is 16.

Some the machines reproducing this problem are running high load will very little free memory at the end (200M). Machine unhappiness is also seen with "failed to begin transaction: context deadline exceeded" from LXD. But not all, seen on bigger machines as well.

Ian Booth (wallyworld)
Changed in juju:
assignee: Simon Richardson (simonrichardson) → nobody
status: In Progress → Triaged
Changed in juju:
milestone: 2.9.29 → 2.9.30
Revision history for this message
John A Meinel (jameinel) wrote :

Looking over the solutions QA tests, we haven't seen a recurrence of this issue since April 11th. I think that means we could consider this fixed, unless Heather has some other specific changes that she wants to land.

Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
status: Triaged → In Progress
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Released
Changed in juju:
milestone: 2.9.30 → 2.9.31
Changed in juju:
milestone: 2.9.31 → 3.0-rc1
Ian Booth (wallyworld)
Changed in juju:
milestone: 3.0-rc1 → 3.0-beta4
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.