[2.8.2-candidate] "unable to setup network" when deploying lxd containers

Bug #1893629 reported by Marian Gasparovic
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Joseph Phillips
Milestone: 2.8.2

Bug Description

Juju 2.8.2 candidate
All lxd containers fail to deploy complaining about network spaces

failed to start machine 3/lxd/0 (unable to setup network: host machine "3" has no available device in space(s) "internal-space", "oam-space", "public-space"), retrying in 10s (10 more attempts)

The same bundle works fine with 2.8.1

One of the failed test runs https://solutions.qa.canonical.com/openstack/testRun/9ae5e4fb-9c7c-4fd3-8b67-142afb59c74a
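
The provisioner's retries and the underlying error can usually be surfaced with standard Juju commands, e.g.:

    juju status --format yaml                    # per-machine provisioning status
    juju debug-log --replay --include machine-3  # log history for the host machine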

tags: added: cdo-release-blocker
summary: - "unable to setup network" when deploying lxd containers in 2.8.2
- candidate
+ [2.8.2-candidate] "unable to setup network" when deploying lxd
+ containers
Revision history for this message
Pen Gale (pengale) wrote :

This seems to have slipped through our spaces tests. I'm currently working on getting a reproducer up on guimaas.

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.8.2
Revision history for this message
Pen Gale (pengale) wrote :

I've been unable to reproduce this so far in our own MAAS setup, though I'm working with a much simpler configuration (lxd containers bound to two spaces, in a test charm).

@marosg are you 100% certain that the machine in question has all three spaces set up and properly configured? It's possible that 2.8.2 is being more strict about something, and is catching a genuine error that was slipping through before.
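
For reference, a couple of standard Juju commands should show what the controller thinks the machine's spaces and devices look like (machine number taken from the error in the description):

    juju spaces                        # known spaces and their subnets
    juju show-machine 3 --format yaml  # machine detail, including network interfaces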

I'll continue to poke at things from this end.

Revision history for this message
Pen Gale (pengale) wrote :

Looking through the output from the test ...

In buckets.yaml, it looks like each machine has four physical NICs, bridged in pairs to two VLANs. How does that map to the three spaces? Is the physical network something like internal-space, with the VLANs mapping to oam-space and public-space?

If so, that kind of torpedoes the idea that Juju is catching a legitimate error. All the spaces are accounted for ...
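
One way to sanity-check that mapping from both sides (the MAAS invocation is hypothetical and assumes a CLI profile named "admin"):

    juju subnets --format yaml    # subnet-to-space mapping as Juju sees it
    maas admin subnets read | jq '.[] | {cidr, vlan: .vlan.name, space}'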

Revision history for this message
Pen Gale (pengale) wrote :

After some discussion, our working theory is that this is a race condition when deploying multiple lxd containers to a machine.

It is possible that the machine needs to have multiple NICs.

There may need to be a fair amount of load on the machine in order to reproduce.

The 2.8.2 RC contains some fixes specifically meant to address races that pop up in the above situation, by moving the timing of a lock. It's possible that we've uncovered (or caused) a different race in doing so.
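
A sketch of the kind of concurrent placement that could exercise such a race, assuming an application with space bindings is already deployed (names are illustrative):

    # Overlap several container provisions on the same host machine.
    for i in $(seq 1 5); do
        juju add-unit space-defender --to lxd:0 &
    done
    wait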

Revision history for this message
Michael Skalka (mskalka) wrote :

Pete, regarding the network setup, there are four ports, two of which are bonded. eth0 has the native VLAN (oam-space); eth1 has eth1.2696 (internal-space) and eth1.2678 (public-space). Then eth3 & eth4 form bond0, which has bond0.2735 and bond0.2736, carrying ceph-replication and ceph-access respectively. So it's actually five spaces in total.
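
If it helps with cross-checking, that layout should be visible on the host with standard tooling (device names taken from the comment above):

    cat /proc/net/bonding/bond0   # should list eth3 and eth4 as slaves
    ip -d link show eth1.2696     # VLAN device for internal-space
    ip -d link show bond0.2735    # VLAN device for ceph-replication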

Revision history for this message
Pen Gale (pengale) wrote :

Given a machine with two network interfaces and some spaces, I can make deploying to an lxd container break. Unfortunately, it's not breaking in exactly the same way as seen in this bug.

I'm going to outline my steps here. We may want to split this into a separate bug (or fixing the one may, in fact, fix the other):

    juju bootstrap guimaas test-spaces-stable --credential guimaas --no-gui
    juju add-machine --constraints spaces=default,space-alt1
    juju deploy cs:~juju-qa/bionic/space-defender-0 --to lxd:0 --bind "default defend-b=space-alt1"
    juju add-unit -n 10 --to lxd:0 space-defender

In the above case, the first two lxd containers will deploy successfully. The rest will fail with:

    failed to start machine 5 (failed to acquire node: No available machine matches constraints

Deploying up to 10 units one by one, waiting for each to deploy, works just fine, which suggests that this is a race.

Unfortunately, the above steps break in the same way on a controller running 2.8.1, so I don't think that they count as a repro of the bug.

Revision history for this message
Pen Gale (pengale) wrote :

Darn it. This was just me misunderstanding how --to works on the CLI: with -n 10 and a single lxd:0 placement directive, only the first unit is placed in a container on machine 0, and the remaining units fall back to new machines, which MAAS could not provide.

The following (correct) command works, without a hitch:

    juju add-unit -n 10 --to lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0 space-defender

We're back to the drawing board as far as having a small reproducer for this :-/

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Joseph Phillips (manadart) wrote :

I have reproduced this and am working to fix. It appears to manifest specifically when using bonds.
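
A hypothetical minimal reproducer along those lines, assuming a MAAS node whose only device in the target space is a VLAN on a bond (space and charm names reused from earlier comments):

    juju add-machine --constraints spaces=ceph-replication
    juju deploy cs:~juju-qa/bionic/space-defender-0 --to lxd:0 \
        --bind "defend-b=ceph-replication"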

Revision history for this message
Joseph Phillips (manadart) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released