[2.9.4] lxd container doesn't start because it can't start eth0 and can't create veth: file exists

Bug #1932180 reported by Alexander Balderson
This bug affects 2 people

Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Simon Richardson

Bug Description

On a deployment using Juju 2.9.4 we had an issue with one container failing to start:

Failed preparing container for start: Failed to start device "eth0": Failed to create the veth interfaces 0lxd0-0 and veth326b5d08: Failed to run: ip link add 0lxd0-0 type veth peer name veth326b5d08: RTNETLINK answers: File exists

The container doesn't have any spaces unique to it compared to the other containers on the host. Juju makes all 10 attempts to start the container, but each one fails and the container never starts.

I can't seem to find anything else helpful in the logs other than syslog reporting that it failed to start the container.

The testrun can be found at
https://solutions.qa.canonical.com/testruns/testRun/da1c1739-7cda-43b0-9c44-e00bb84ea0e9

with crashdump at:
https://oil-jenkins.canonical.com/artifacts/da1c1739-7cda-43b0-9c44-e00bb84ea0e9/generated/generated/openstack/juju-crashdump-openstack-2021-06-15-21.46.21.tar.gz

Machine 0 has the errors, and the error posted above comes from machine-0.log

Revision history for this message
Marian Gasparovic (marosg) wrote :

I see this a lot lately with 2.9.4 in our test environment; during an OpenStack deployment I can see 20 containers showing this error.

Revision history for this message
Michael Skalka (mskalka) wrote :

Controller crashdump

Revision history for this message
Michael Skalka (mskalka) wrote :

Model crashdump

Revision history for this message
Michael Skalka (mskalka) wrote :

Still seeing this bug on 2.9.9: https://solutions.qa.canonical.com/testruns/testRun/e8c0deb1-8dc8-4b26-a455-176e618ec1a5

The LXD provisioner brings up 10 of the 11 containers on that host; however, 3/lxd/3 is stuck in pending with:

3/lxd/3 down pending focal Failed preparing container for start: Failed to start device "eth0": Failed to create the veth interfaces 3lxd3-0 and vethdb320926: Failed to run: ip link add 3lxd3-0 type veth peer name vethdb320926: RTNETLINK answers: File exists

Model crashdump here: https://oil-jenkins.canonical.com/artifacts/e8c0deb1-8dc8-4b26-a455-176e618ec1a5/generated/generated/openstack/juju-crashdump-openstack-2021-07-30-00.44.38.tar.gz
Controller crashdump: https://oil-jenkins.canonical.com/artifacts/e8c0deb1-8dc8-4b26-a455-176e618ec1a5/generated/generated/juju_maas_controller/juju-crashdump-controller-2021-07-30-00.44.06.tar.gz

Full model status at the end of: https://oil-jenkins.canonical.com/job/fce_build/11801/console

Revision history for this message
Stéphane Graber (stgraber) wrote :

Looking into this a bit, it appears that Juju is defining network devices on the container with a host_name property set.

What this property does is tell LXD to use that specific name for the host-side device of the container.

The problem with doing that is that the kernel has a bit of a tendency to hold on to those devices, which is why LXD by default generates a random name every time we allocate a network interface.
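For illustration only (a sketch, not taken from the crashdump: the bridge name is an assumption, the veth name is the one from the error above), a container NIC device with and without host_name looks roughly like this when expressed as the string map the LXD API takes for devices:

package main

import "fmt"

func main() {
	// What Juju was sending: host_name pins the host-side veth name, so a
	// leftover device with that name triggers the "File exists" error.
	pinned := map[string]string{
		"type":      "nic",
		"nictype":   "bridged",
		"parent":    "br-eth0", // assumed bridge name
		"host_name": "0lxd0-0", // pinned host-side veth name from the error above
	}

	// The LXD default: no host_name, so LXD generates a fresh random
	// vethXXXXXXXX name on every start and never collides with a device the
	// kernel has not yet garbage collected.
	unpinned := map[string]string{
		"type":    "nic",
		"nictype": "bridged",
		"parent":  "br-eth0",
	}

	fmt.Println("pinned:  ", pinned)
	fmt.Println("unpinned:", unpinned)
}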

The issue you're running into is that for some reason the kernel is taking a while to garbage collect the network namespace and devices of the container as it exits. The kernel tells userspace that the container is gone (no more processes) but garbage collection of the network namespace (and mount namespace) may still be ongoing and can be for up to a few minutes in some cases.

If the container is started again during that time, you'll get an interface name conflict as reported above.

My recommendation here would be quite simple: don't use host_name unless you're going to make sure it's available prior to starting a container. I don't know what the rationale is for having Juju generate those names, but in general I'd far sooner recommend letting LXD figure out a unique available name; if you need to query or interact with the host-side device, just ask LXD what it ended up naming it (GetInstanceState will tell you that).
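A minimal sketch of that last suggestion, assuming the LXD Go client (github.com/lxc/lxd/client) and a hypothetical container name; GetInstanceState is the call mentioned above, and its per-interface state includes the host-side device name LXD chose:

package main

import (
	"fmt"
	"log"

	lxd "github.com/lxc/lxd/client"
)

func main() {
	// Connect to the local LXD daemon over its unix socket.
	c, err := lxd.ConnectLXDUnix("", nil)
	if err != nil {
		log.Fatal(err)
	}

	// "juju-0-lxd-0" is an assumed container name for illustration.
	state, _, err := c.GetInstanceState("juju-0-lxd-0")
	if err != nil {
		log.Fatal(err)
	}

	// Each network entry reports the host-side interface LXD created for it,
	// so there is no need to pre-pick a name with host_name.
	for name, iface := range state.Network {
		fmt.Printf("container device %s -> host-side device %s\n", name, iface.HostName)
	}
}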

Revision history for this message
Simon Richardson (simonrichardson) wrote :

Can we get you to try the following PR: https://github.com/juju/juju/pull/13274? We believe there is a bug inside of Juju here.

This PR should unblock you; it essentially stops creating a host_name for eth0 if you already have a device in the config.
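In rough terms the described behaviour amounts to something like the sketch below; the function and names are hypothetical illustrations, not the actual code in the PR:

package main

import "fmt"

// buildNIC is a hypothetical helper: it only pins a host-side veth name when
// no device already exists in the config, otherwise LXD picks a random name.
func buildNIC(deviceInConfig bool, pinnedName string) map[string]string {
	dev := map[string]string{
		"type":    "nic",
		"nictype": "bridged",
		"parent":  "br-eth0", // assumed bridge name
	}
	if !deviceInConfig && pinnedName != "" {
		dev["host_name"] = pinnedName
	}
	return dev
}

func main() {
	fmt.Println("device already in config: ", buildNIC(true, "0lxd0-0"))
	fmt.Println("no device in config:      ", buildNIC(false, "0lxd0-0"))
}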

Changed in juju:
assignee: nobody → Simon Richardson (simonrichardson)
milestone: none → 2.9.12
importance: Undecided → High
status: New → Triaged
Changed in juju:
milestone: 2.9.12 → 2.9.13
Changed in juju:
milestone: 2.9.13 → 2.9.14
Changed in juju:
milestone: 2.9.14 → 2.9.15
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Marian Gasparovic (marosg) wrote :

Thank you, the edge snap 2.9.15-7a069ff solves it. I ran ten deployments in a loop over the weekend and all went without any issue.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released