[2.9.4] lxd container doesn't start because it can't start eth0 and can't create veth: file exists

Bug #1932180 reported by Alexander Balderson
This bug affects 2 people

Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Simon Richardson

Bug Description

On a deployment using Juju 2.9.4 we had an issue with one container failing to start:

Failed preparing container for start: Failed to start device "eth0": Failed to create the veth interfaces 0lxd0-0 and veth326b5d08: Failed to run: ip link add 0lxd0-0 type veth peer name veth326b5d08: RTNETLINK answers: File exists

The container doesn't have any spaces unique to it compared to the other containers on the host. Juju makes all 10 attempts to start the container, but each one fails and the container never starts.

I can't seem to find anything else helpful in the logs other than syslog reporting that it failed to start the container.

The testrun can be found at
https://solutions.qa.canonical.com/testruns/testRun/da1c1739-7cda-43b0-9c44-e00bb84ea0e9

with crashdump at:
https://oil-jenkins.canonical.com/artifacts/da1c1739-7cda-43b0-9c44-e00bb84ea0e9/generated/generated/openstack/juju-crashdump-openstack-2021-06-15-21.46.21.tar.gz

Machine 0 has the errors, and the error posted above comes from machine-0.log

Revision history for this message
Marian Gasparovic (marosg) wrote :

I see this a lot lately with 2.9.4 in our test environment; during an OpenStack deployment I can see 20 containers showing this error.

Revision history for this message
Michael Skalka (mskalka) wrote :

Controller crashdump

Revision history for this message
Michael Skalka (mskalka) wrote :

Model crashdump

Revision history for this message
Michael Skalka (mskalka) wrote :

Still seeing this bug on 2.9.9: https://solutions.qa.canonical.com/testruns/testRun/e8c0deb1-8dc8-4b26-a455-176e618ec1a5

The LXD provisioner brings up 10 of the 11 containers on that host; however, 3/lxd/3 is stuck in pending with:

3/lxd/3 down pending focal Failed preparing container for start: Failed to start device "eth0": Failed to create the veth interfaces 3lxd3-0 and vethdb320926: Failed to run: ip link add 3lxd3-0 type veth peer name vethdb320926: RTNETLINK answers: File exists

Model crashdump here: https://oil-jenkins.canonical.com/artifacts/e8c0deb1-8dc8-4b26-a455-176e618ec1a5/generated/generated/openstack/juju-crashdump-openstack-2021-07-30-00.44.38.tar.gz
Controller crashdump: https://oil-jenkins.canonical.com/artifacts/e8c0deb1-8dc8-4b26-a455-176e618ec1a5/generated/generated/juju_maas_controller/juju-crashdump-controller-2021-07-30-00.44.06.tar.gz

Full model status at the end of: https://oil-jenkins.canonical.com/job/fce_build/11801/console

Revision history for this message
Stéphane Graber (stgraber) wrote :

Looking into this a bit, it appears that Juju is defining network devices on the container with a host_name property set.

What this property does is tell LXD to use that specific name for the host-side device of the container.

The problem with doing that is that the kernel has a bit of a tendency to hold on to those devices, which is why LXD by default generates a random name every time we allocate a network interface.
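For illustration only (a sketch, not taken from the crashdump: the bridge name is an assumption, the veth name is the one from the error above), a container NIC device with and without host_name looks roughly like this when expressed as the string map the LXD API takes for devices:

package main

import "fmt"

func main() {
	// What Juju was sending: host_name pins the host-side veth name, so a
	// leftover device with that name triggers the "File exists" error.
	pinned := map[string]string{
		"type":      "nic",
		"nictype":   "bridged",
		"parent":    "br-eth0", // assumed bridge name
		"host_name": "0lxd0-0", // pinned host-side veth name from the error above
	}

	// The LXD default: no host_name, so LXD generates a fresh random
	// vethXXXXXXXX name on every start and never collides with a device the
	// kernel has not yet garbage collected.
	unpinned := map[string]string{
		"type":    "nic",
		"nictype": "bridged",
		"parent":  "br-eth0",
	}

	fmt.Println("pinned:  ", pinned)
	fmt.Println("unpinned:", unpinned)
}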

The issue you're running into is that for some reason the kernel is taking a while to garbage collect the network namespace and devices of the container as it exits. The kernel tells userspace that the container is gone (no more processes) but garbage collection of the network namespace (and mount namespace) may still be ongoing and can be for up to a few minutes in some cases.

If the container is started again during that time, you'll get an interface name conflict as reported above.

My recommendation here would be quite simple: don't use host_name unless you're going to make sure it's available prior to starting a container. I don't know what the rationale is for having Juju generate those names, but in general I'd far sooner recommend letting LXD figure out a unique available name; if you need to query or interact with the host-side device, just ask LXD what it ended up naming it (GetInstanceState will tell you that).
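A minimal sketch of that last suggestion, assuming the LXD Go client (github.com/lxc/lxd/client) and a hypothetical container name; GetInstanceState is the call mentioned above, and its per-interface state includes the host-side device name LXD chose:

package main

import (
	"fmt"
	"log"

	lxd "github.com/lxc/lxd/client"
)

func main() {
	// Connect to the local LXD daemon over its unix socket.
	c, err := lxd.ConnectLXDUnix("", nil)
	if err != nil {
		log.Fatal(err)
	}

	// "juju-0-lxd-0" is an assumed container name for illustration.
	state, _, err := c.GetInstanceState("juju-0-lxd-0")
	if err != nil {
		log.Fatal(err)
	}

	// Each network entry reports the host-side interface LXD created for it,
	// so there is no need to pre-pick a name with host_name.
	for name, iface := range state.Network {
		fmt.Printf("container device %s -> host-side device %s\n", name, iface.HostName)
	}
}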

Revision history for this message
Simon Richardson (simonrichardson) wrote :

Can we get you to try the following PR: https://github.com/juju/juju/pull/13274? We believe there is a bug inside of Juju here.

This PR should unblock you; it essentially stops creating a host_name for eth0 if you already have a device in the config.
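In rough terms the described behaviour amounts to something like the sketch below; the function and names are hypothetical illustrations, not the actual code in the PR:

package main

import "fmt"

// buildNIC is a hypothetical helper: it only pins a host-side veth name when
// no device already exists in the config, otherwise LXD picks a random name.
func buildNIC(deviceInConfig bool, pinnedName string) map[string]string {
	dev := map[string]string{
		"type":    "nic",
		"nictype": "bridged",
		"parent":  "br-eth0", // assumed bridge name
	}
	if !deviceInConfig && pinnedName != "" {
		dev["host_name"] = pinnedName
	}
	return dev
}

func main() {
	fmt.Println("device already in config: ", buildNIC(true, "0lxd0-0"))
	fmt.Println("no device in config:      ", buildNIC(false, "0lxd0-0"))
}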

Changed in juju:
assignee: nobody → Simon Richardson (simonrichardson)
milestone: none → 2.9.12
importance: Undecided → High
status: New → Triaged
Changed in juju:
milestone: 2.9.12 → 2.9.13
Changed in juju:
milestone: 2.9.13 → 2.9.14
Changed in juju:
milestone: 2.9.14 → 2.9.15
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Marian Gasparovic (marosg) wrote :

Thank you, the edge snap 2.9.15-7a069ff solves it. I ran ten deployments in a loop over the weekend and all went without any issue.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released