[2.8.2-candidate] "unable to setup network" when deploying lxd containers

Bug #1893629 reported by Marian Gasparovic
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Joseph Phillips
Milestone: 2.8.2

Bug Description

Juju 2.8.2 candidate
All lxd containers fail to deploy complaining about network spaces

failed to start machine 3/lxd/0 (unable to setup network: host machine "3" has no available device in space(s) "internal-space", "oam-space", "public-space"), retrying in 10s (10 more attempts)

The same bundle works fine with 2.8.1

One of the failed test runs https://solutions.qa.canonical.com/openstack/testRun/9ae5e4fb-9c7c-4fd3-8b67-142afb59c74a
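
The provisioner's retries and the underlying error can usually be surfaced with standard Juju commands, e.g.:

    juju status --format yaml                    # per-machine provisioning status
    juju debug-log --replay --include machine-3  # log history for the host machine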

tags: added: cdo-release-blocker
summary: - "unable to setup network" when deploying lxd containers in 2.8.2
- candidate
+ [2.8.2-candidate] "unable to setup network" when deploying lxd
+ containers
Revision history for this message
Pen Gale (pengale) wrote :

This seems to have slipped through our spaces tests. I'm currently working on getting a reproducer up on guimaas.

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.8.2
Revision history for this message
Pen Gale (pengale) wrote :

I've been unable to reproduce this so far in our own MAAS setup, though I'm working with a much simpler configuration (lxd containers bound to two spaces, in a test charm).

@marosg are you 100% certain that the machine in question has all three spaces set up and properly configured? It's possible that 2.8.2 is being more strict about something, and is catching a genuine error that was slipping through before.
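
For reference, a couple of standard Juju commands should show what the controller thinks the machine's spaces and devices look like (machine number taken from the error in the description):

    juju spaces                        # known spaces and their subnets
    juju show-machine 3 --format yaml  # machine detail, including network interfaces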

I'll continue to poke at things from this end.

Revision history for this message
Pen Gale (pengale) wrote :

Looking through the output from the test ...

In buckets.yaml, it looks like each machine has four physical NICs, bridged in pairs to two VLANs. How does that map to the three spaces? Is the physical network something like internal-space, with the VLANs mapping to oam-space and public-space?

If so, that kind of torpedoes the idea that Juju is catching a legitimate error. All the spaces are accounted for ...
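
One way to sanity-check that mapping from both sides (the MAAS invocation is hypothetical and assumes a CLI profile named "admin"):

    juju subnets --format yaml    # subnet-to-space mapping as Juju sees it
    maas admin subnets read | jq '.[] | {cidr, vlan: .vlan.name, space}'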

Revision history for this message
Pen Gale (pengale) wrote :

After some discussion, our working theory is that this is a race condition when deploying multiple lxd containers to a machine.

It is possible that the machine needs to have multiple NICs.

There may need to be a fair amount of load on the machine in order to reproduce.

The 2.8.2 RC contains some fixes specifically meant to address races that pop up in the above situation, by moving the timing of a lock. It's possible that we've uncovered (or caused) a different race in doing so.
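
A sketch of the kind of concurrent placement that could exercise such a race, assuming an application with space bindings is already deployed (names are illustrative):

    # Overlap several container provisions on the same host machine.
    for i in $(seq 1 5); do
        juju add-unit space-defender --to lxd:0 &
    done
    wait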

Revision history for this message
Michael Skalka (mskalka) wrote :

Pete, regarding the network setup, there are four ports, two of which are bonded. eth0 has the native VLAN (oam-space); eth1 has eth1.2696 (internal-space) and eth1.2678 (public-space). Then eth3 & eth4 form bond0, which has bond0.2735 and bond0.2736, carrying ceph-replication and ceph-access respectively. So it's actually five spaces in total.
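
If it helps with cross-checking, that layout should be visible on the host with standard tooling (device names taken from the comment above):

    cat /proc/net/bonding/bond0   # should list eth3 and eth4 as slaves
    ip -d link show eth1.2696     # VLAN device for internal-space
    ip -d link show bond0.2735    # VLAN device for ceph-replication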

Revision history for this message
Pen Gale (pengale) wrote :

Given a machine with two network interfaces and some spaces, I can make deploying to an lxd container break. Unfortunately, it's not breaking in exactly the same way as seen in this bug.

I'm going to outline my steps here. We may want to split this into a separate bug (or fixing the one may, in fact, fix the other):

    juju bootstrap guimaas test-spaces-stable --credential guimaas --no-gui
    juju add-machine --constraints spaces=default,space-alt1
    juju deploy cs:~juju-qa/bionic/space-defender-0 --to lxd:0 --bind "default defend-b=space-alt1"
    juju add-unit -n 10 --to lxd:0 space-defender

In the above case, the first two lxd containers will deploy successfully. The rest will fail with:

    failed to start machine 5 (failed to acquire node: No available machine matches constraints

Deploying up to 10 units one by one, waiting for each to deploy, works just fine, which suggests that this is a race.

Unfortunately, the above steps break in the same way on a controller running 2.8.1, so I don't think that they count as a repro of the bug.

Revision history for this message
Pen Gale (pengale) wrote :

Darn it. This was just me misunderstanding how --to works on the CLI: with -n 10 and a single lxd:0 placement directive, only the first unit is placed in a container on machine 0, and the remaining units fall back to new machines, which MAAS could not provide.

The following (correct) command works, without a hitch:

    juju add-unit -n 10 --to lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0,lxd:0 space-defender

We're back to the drawing board as far as having a small reproducer for this :-/

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Joseph Phillips (manadart) wrote :

I have reproduced this and am working to fix. It appears to manifest specifically when using bonds.
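
A hypothetical minimal reproducer along those lines, assuming a MAAS node whose only device in the target space is a VLAN on a bond (space and charm names reused from earlier comments):

    juju add-machine --constraints spaces=ceph-replication
    juju deploy cs:~juju-qa/bionic/space-defender-0 --to lxd:0 \
        --bind "defend-b=ceph-replication"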

Revision history for this message
Joseph Phillips (manadart) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released