Canonical Juju

LXD container fails to start due to UNIQUE constraint on container.name

Bug #1945813 reported by Moises Emilio Benzan Mora on 2021-10-01

This bug affects 6 people

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Fix Released	High	Heather Lanigan	Canonical Juju 3.0-beta4

Bug Description

It seems like juju implicitly retried creating an lxd machine after a failed configuration, leading to `machine 5/lxd/6 failed to start: UNIQUE constraint failed: containers.name`.

We expected the LXD container to boot up and proceed with the layer deployment, however it failed to start and blocked the rest of the process.

Crashdump:
```
672078e3-269d-4513-a881-57e30b79a8f3/bcache-tuning_5/var/log/juju/machine-5.log
116:2021-10-01 05:21:50 INFO juju.container-setup container_initialisation.go:110 initial container setup with ids: [5/lxd/6 5/lxd/0 5/lxd/1 5/lxd/2 5/lxd/3 5/lxd/4 5/lxd/5]
204:2021-10-01 05:22:06 INFO juju.worker.provisioner provisioner_task.go:405 found machine pending provisioning id:5/lxd/6, details:5/lxd/6
368:2021-10-01 05:29:58 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
675:2021-10-01 05:35:30 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: UNIQUE constraint failed: containers.name
676:2021-10-01 05:35:30 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (UNIQUE constraint failed: containers.name), retrying in 10s (10 more attempts)
794:2021-10-01 05:48:34 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
826:2021-10-01 05:48:56 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
827:2021-10-01 05:48:56 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (9 more attempts)
940:2021-10-01 05:54:15 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
971:2021-10-01 05:54:25 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
972:2021-10-01 05:54:25 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (8 more attempts)
1036:2021-10-01 05:56:00 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1078:2021-10-01 05:56:22 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1079:2021-10-01 05:56:22 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (7 more attempts)
1137:2021-10-01 05:57:45 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1200:2021-10-01 05:57:58 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1201:2021-10-01 05:57:58 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (6 more attempts)
1339:2021-10-01 05:59:13 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1390:2021-10-01 05:59:32 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1391:2021-10-01 05:59:32 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (5 more attempts)
1495:2021-10-01 06:00:48 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1560:2021-10-01 06:01:04 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1561:2021-10-01 06:01:04 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (4 more attempts)
1666:2021-10-01 06:02:11 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1725:2021-10-01 06:02:25 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1726:2021-10-01 06:02:25 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (3 more attempts)
1866:2021-10-01 06:03:39 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
1912:2021-10-01 06:03:58 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
1913:2021-10-01 06:03:58 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (2 more attempts)
2165:2021-10-01 06:05:47 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
2271:2021-10-01 06:06:11 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 5/lxd/6 failed to start: Container 'juju-5bcbde-5-lxd-6' already exists
2272:2021-10-01 06:06:11 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 5/lxd/6 (Container 'juju-5bcbde-5-lxd-6' already exists), retrying in 10s (1 more attempts)
2520:2021-10-01 06:08:44 DEBUG juju.container.broker.lxd broker.go:45 using multi-bridge networking for container "5/lxd/6"
2595:2021-10-01 06:09:07 ERROR juju.worker.provisioner provisioner_task.go:977 cannot start instance for machine "5/lxd/6": Container 'juju-5bcbde-5-lxd-6' already exists
```

Path to the run: https://solutions.qa.canonical.com/testruns/testRun/7ba33504-e4a3-4970-9071-c4968b0f3407

Path to the artifacts: https://oil-jenkins.canonical.com/artifacts/7ba33504-e4a3-4970-9071-c4968b0f3407/generated/generated/openstack/juju-crashdump-openstack-2021-10-01-06.39.57.tar.gz

Path to build: https://oil-jenkins.canonical.com/job/fce_build/12561//console

Failed LXD machine: Machine5/lxd/6

All occurrences can be found here:
https://solutions.qa.canonical.com/bugs/bugs/bug/1945813

See original description

Tags:

Moises Emilio Benzan Mora (moisesbenzan) on 2021-10-01

description:	updated
tags:	added: foundations-engine-cdo-qa
tags:	added: cdo-qa foundations-engine removed: foundations-engine-cdo-qa
description:	updated

Revision history for this message

Harry Pidcock (hpidcock) wrote on 2021-10-11:

LXD appears to have been left in a dirty state since its trying to either create a container when one was already created (but perhaps LXD error'd on the first attempt) or there was a container leftover that wasn't cleaned up from a previous run. Will have to dig deeper.

Changed in juju:
importance:	Undecided → High
milestone:	none → 2.9.18
status:	New → Triaged

Revision history for this message

John A Meinel (jameinel) wrote on 2021-11-03:

So my understanding is that this isn't because there was already a 'juju-5bcbde-5-lxd-6' before the environment was deployed. But that when we went to create juju-5bcbde-5-lxd-6 we initially got some sort of failure. But that failure wasn't strictly fatal, so there were pieces that *did* get created. And when we come back to retry we aren't handling the fact that there are artifacts already in place.

It would be good to understand what the original failure was, and know how we should be handling it better.

Canonical Juju QA Bot (juju-qa-bot) on 2021-11-04

Changed in juju:
milestone:	2.9.18 → 2.9.19

Revision history for this message

Harry Pidcock (hpidcock) wrote on 2021-11-08:

Still occurring on Solutions QA runs.

Revision history for this message

Marian Gasparovic (marosg) wrote on 2021-11-12:

When we deploy in virtual environment (which has constrained resources), we hit it every single time. It started to appear about two weeks ago.

Revision history for this message

Heather Lanigan (hmlanigan) wrote on 2021-11-12:

This is reproducing with a simple config in:
https://bugs.launchpad.net/juju/+bug/1950771

juju 2.9.18
juju bootstrap google
juju deploy ubuntu -n 3
juju deploy mysql --to kvm:0
juju add-unit mysql --to lxd:1
juju add-unit mysql --to lxd:2

machine-2: 19:29:45 INFO juju.container.lxd starting new container "juju-780be5-2-lxd-0" (image "ubuntu-16.04-server-cloudimg-amd64-lxd.tar.xz")
machine-2: 19:29:45 ERROR juju.worker.provisioner cannot start instance for machine "2/lxd/0": Failed creating instance record: Add instance info to the database: This instance already exists

More investigation is required. Without extra lxd debugging enabled there were no useful logs. I was able to ssh to machine 2 and successfully create a container.

Canonical Juju QA Bot (juju-qa-bot) on 2021-11-17

Changed in juju:
milestone:	2.9.19 → 2.9.20

Canonical Juju QA Bot (juju-qa-bot) on 2021-11-26

Changed in juju:
milestone:	2.9.20 → 2.9.21

Canonical Juju QA Bot (juju-qa-bot) on 2021-12-01

Changed in juju:
milestone:	2.9.21 → 2.9.22

Revision history for this message

Heather Lanigan (hmlanigan) wrote on 2021-12-08 (last edit on 2021-12-08):

I am unable to reproduce with juju 2.9.22, currently 2.9/edge, and lxd 4.20 (21902), nor juju 2.9.18 with lxd 4.20 (21902). lxd 4.20/stable updated on 2021-11-18, after I reproduced prior. Will try some more.

Revision history for this message

Heather Lanigan (hmlanigan) wrote on 2021-12-08 (last edit on 2021-12-10):

It looks like juju is retrying a failed container provisioning with the same container name. Produced with the reproducer of 1950771. This is not the same as the logs provided a above.

2021-12-08 21:47:18 INFO juju.container.lxd container.go:255 starting new container "juju-d3dd2f-1-lxd-0" (image "ubuntu-16.04-server-cloudimg-amd64-lxd.tar.xz")
2021-12-08 21:49:32 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 1/lxd/0 failed to start: websocket: close 1006 (abnormal closure): unexpected EOF
2021-12-08 21:49:32 WARNING juju.worker.provisioner provisioner_task.go:1153 failed to start machine 1/lxd/0 (websocket: close 1006 (abnormal closure): unexpected EOF), retrying in 10s (10 more attempts)
2021-12-08 21:49:42 WARNING juju.container.broker broker.go:76 no name servers supplied by provider, using host's name servers.
2021-12-08 21:49:42 WARNING juju.container.broker broker.go:76 no search domains supplied by provider, using host's search domains.
2021-12-08 21:49:42 WARNING juju.container.broker broker.go:84 incomplete DNS config found, discovering host's DNS config
2021-12-08 21:49:45 WARNING juju.container.lxd manager.go:200 no CIDR was detected for the following networks: [fan-253]
2021-12-08 21:49:46 INFO juju.cloudconfig userdatacfg_unix.go:575 Fetching agent: curl -sSfw 'agent binaries from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' --connect-timeout 20 --noproxy "*" --insecure -o $bin/tools.tar.gz <[https://10.240.0.58:17070/model/228ded55-1a5a-4d21-8f6a-2be9c5d3dd2f/tools/2.9.18-ubuntu-amd64 https://253.0.58.1:17070/model/228ded55-1a5a-4d21-8f6a-2be9c5d3dd2f/tools/2.9.18-ubuntu-amd64 https://35.237.242.119:17070/model/228ded55-1a5a-4d21-8f6a-2be9c5d3dd2f/tools/2.9.18-ubuntu-amd64]>
2021-12-08 21:49:47 INFO juju.container.lxd container.go:255 starting new container "juju-d3dd2f-1-lxd-0" (image "ubuntu-16.04-server-cloudimg-amd64-lxd.tar.xz")
2021-12-08 21:49:49 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 1/lxd/0 failed to start: Failed creating instance record: Add instance info to the database: This instance already exists

New info: "websocket: close 1006 (abnormal closure): unexpected EOF" tends to happen if the host machine is created with only 1G of memory. As do other container problems. Using a memory constraint of 4G does not reproduce the websocket failure.

It looks like juju is retrying a failed container provisioning with the same container name.  Produced with the reproducer of 1950771.  This is not the same as the logs provided a above.

New info: "websocket: close 1006 (abnormal closure): unexpected EOF" tends to happen if the host machine is created with only 1G of memory.  As do other container problems.  Using a memory constraint of 4G does not reproduce the websocket failure.

Changed in juju:
assignee:	nobody → Heather Lanigan (hmlanigan)

Canonical Juju QA Bot (juju-qa-bot) on 2021-12-10

Changed in juju:
milestone:	2.9.22 → 2.9.23

Canonical Juju QA Bot (juju-qa-bot) on 2022-01-12

Changed in juju:
milestone:	2.9.23 → 2.9.24

Revision history for this message

Bas de Bruijne (basdbruijne) wrote on 2022-01-12:

Is there any progress on this bug? It is stopping the majority of testing on sqa's weaker hardware.
One manual fix seems to be to ssh into the host machine and delete the lxd container that juju is failing to create, then on juju's next attempt to create the container is works most of the time.

Revision history for this message

Bas de Bruijne (basdbruijne) wrote on 2022-01-18:

Interestingly, we only see this bug on focal. Bionic containers never show this problem.

Canonical Juju QA Bot (juju-qa-bot) on 2022-02-01

Changed in juju:
milestone:	2.9.24 → 2.9.25

Revision history for this message

John A Meinel (jameinel) wrote on 2022-02-07:

#10

This seems to be the original error:

2021-12-08 21:49:32 WARNING juju.worker.provisioner provisioner_task.go:1114 machine 1/lxd/0 failed to start: websocket: close 1006 (abnormal closure): unexpected EOF

And what has actually happened is we issued a Start request to LXD, which failed with 1006. At which point, the container has been registered in LXD's database, but Juju doesn't know that it started successfully. At which point we then try again, and get rejected by LXD.

If we get a 'container already exists' failure, we could query LXD about that container and see if it fits the constraint of the container we were trying to create (with the right network devices, lxd profile, etc). And then treat it as though our start request succeeded instead of failed.

or we could delete it directly and then issue another Start request. That seems more dangerous because if we did have some other blip where we got confused and swapped containers, etc we might be stopping a container that was healthy and doing good work.

Canonical Juju QA Bot (juju-qa-bot) on 2022-02-15

Changed in juju:
milestone:	2.9.25 → 2.9.26

Canonical Juju QA Bot (juju-qa-bot) on 2022-03-09

Changed in juju:
milestone:	2.9.26 → 2.9.27

Revision history for this message

Simon Richardson (simonrichardson) wrote on 2022-03-17:

#11

PR: https://github.com/juju/juju/pull/13848

Canonical Juju QA Bot (juju-qa-bot) on 2022-03-18

Changed in juju:
milestone:	2.9.27 → 2.9.28

Canonical Juju QA Bot (juju-qa-bot) on 2022-03-30

Changed in juju:
milestone:	2.9.28 → 2.9.29

Revision history for this message

Simon Richardson (simonrichardson) wrote on 2022-03-31:

#12

Follow up PR: https://github.com/juju/juju/pull/13920

Changed in juju:
assignee:	Heather Lanigan (hmlanigan) → Simon Richardson (simonrichardson)
status:	Triaged → In Progress

Revision history for this message

Heather Lanigan (hmlanigan) wrote on 2022-04-08 (last edit on 2022-04-08):

#14

We're seeing different invocations of the UNIQUE constraint also:

Failed creating instance from image: Error inserting volume "juju-8d5c96-5-lxd-5" for project "default" in pool "default" of type "containers" into database "UNIQUE constraint failed: storage_volumes.storage_pool_id, storage_volumes.node_id, storage_volumes.project_id, storage_volumes.name, storage_volumes.type"

Revision history for this message

Heather Lanigan (hmlanigan) wrote on 2022-04-08 (last edit on 2022-04-08):

#15

Pondering if we should have a different `num-provision-workers` for the container provider. Currently the default is 16.

Some the machines reproducing this problem are running high load will very little free memory at the end (200M). Machine unhappiness is also seen with "failed to begin transaction: context deadline exceeded" from LXD. But not all, seen on bigger machines as well.

Ian Booth (wallyworld) on 2022-04-19

Changed in juju:
assignee:	Simon Richardson (simonrichardson) → nobody
status:	In Progress → Triaged

Canonical Juju QA Bot (juju-qa-bot) on 2022-04-27

Changed in juju:
milestone:	2.9.29 → 2.9.30

Revision history for this message

John A Meinel (jameinel) wrote on 2022-05-11:

#16

Looking over the solutions QA tests, we haven't seen a recurrence of this issue since April 11th. I think that means we could consider this fixed, unless Heather has some other specific changes that she wants to land.

Changed in juju:
assignee:	nobody → Heather Lanigan (hmlanigan)
status:	Triaged → In Progress

John A Meinel (jameinel) on 2022-05-11

Changed in juju:
status:	In Progress → Fix Released

Canonical Juju QA Bot (juju-qa-bot) on 2022-05-26

Changed in juju:
milestone:	2.9.30 → 2.9.31

Juan M. Tirado (tiradojm) on 2022-08-02

Changed in juju:
milestone:	2.9.31 → 3.0-rc1

Ian Booth (wallyworld) on 2022-08-24

Changed in juju:
milestone:	3.0-rc1 → 3.0-beta4

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.