ifupdown wants to configure interfaces it shouldn't (lxdbr0 or veth)

Bug #1569064 reported by Serge Hallyn on 2016-04-11
This bug affects 2 people
Affects              Status        Importance  Assigned to  Milestone
cloud-init (Ubuntu)  Fix Released  Undecided   Unassigned
lxd (Ubuntu)         Fix Released  Undecided   Unassigned

Bug Description

The reproducer:

create a xenial vm. There, run

lxc launch ubuntu:

leave that container running, and reboot. If you're "lucky", networking will never come up:

[  OK  ] Found device /sys/subsystem/net/devices/lxdbr0.
[ *** ] A start job is running for Raise network interfaces (32s / 5min 11s)
[ *** ] A start job is running for Raise network interfaces (33s / 5min 11s)
[ *** ] A start job is running for Raise network interfaces (34s / 5min 11s)
[ *** ] A start job is running for Raise network interfaces (34s / 5min 11s)
[ *** ] A start job is running for Raise network interfaces (35s / 5min 11s)
[ *** ] A start job is running for Raise network interfaces (35s / 5min 11s)
[ *** ] A start job is running for Raise network interfaces (36s / 5min 11s)
[ *** ] A start job is running for Raise network interfaces (37s / 5min 11s)

(etc)

I can reproduce this 100% on serverstack (openstack) instances. I cannot reproduce it locally with uvt-kvm.

What appears to be happening is that lxd starts lxd-bridge, which creates lxdbr0 with no ipv4 address. This races with systemd's networking setup and sometimes happens first; the network setup then sees lxdbr0, wants to configure it, and so waits for it to have an IP address. Plausible?


Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in lxd (Ubuntu):
status: New → Confirmed
Changed in systemd (Ubuntu):
status: New → Confirmed
Stéphane Graber (stgraber) wrote :

Can you attach your /etc/default/lxd-bridge and "sudo systemctl status lxd-bridge"?

Just confirming that everything looks sane on that front.

It does look like systemd triggers on the interface and expects ifupdown to configure it, waiting forever for it to do so, despite the possibility that the interface isn't ifupdown-managed at all and that there is therefore no way for ifupdown to do anything about it.

Serge Hallyn (serge-hallyn) wrote :

cat lxd-bridge: http://paste.ubuntu.com/15768875/
systemctl status lxd-bridge: http://paste.ubuntu.com/15768890/

These are of course before the reboot. I cannot get in after reboot.

Serge Hallyn (serge-hallyn) wrote :

Hm, so lxdbr0 has an ipv4 address in the openstack instance. In the
uvt-kvm instance where networking comes up fine, it does not.

Stéphane Graber (stgraber) wrote :

OK, the fact that this only happens when it does have an IP may point at some kind of conflict.

Any chance you can (obviously before reboot), also post "ip addr show", "ip route show" and the content of /etc/network/interfaces and any file under /etc/network/interfaces.d/ ?

Serge Hallyn (serge-hallyn) wrote :

Note that adding an ipv4 address to the local uvt-kvm guest does *not* make it fail, sadly :( Getting the other info...

Serge Hallyn (serge-hallyn) wrote :

ip addr show: http://paste.ubuntu.com/15769059/
ip route show: http://paste.ubuntu.com/15769063/
/etc/network/interfaces: http://paste.ubuntu.com/15769066/
/etc/network/interfaces.d/50-cloud-init.cfg: http://paste.ubuntu.com/15769068/
/etc/network/interfaces.d/eth0: http://paste.ubuntu.com/15769071/

Serge Hallyn (serge-hallyn) wrote :

Removing the ipv4 configuration from /etc/default/lxd-bridge doesn't help.

Stéphane Graber (stgraber) wrote :

Not seeing anything wrong with the files above, no subnet conflict or other network error.

Serge Hallyn (serge-hallyn) wrote :

Tried adding lxdbr0 to the EXCLUDE_INTERFACES. Works on a kvm vm; but the openstack instance ends up with *only* lxdbr0 configured. Voodoo.

Martin Pitt (pitti) wrote :

Moving to ifupdown. systemd does not configure the network (unless you set them up in networkd, but we don't do that anywhere automatically and there's no evidence that you did, and it's networking.service that hangs).

I also don't see much wrong about this, except maybe that interfaces.d/50-cloud-init.cfg and interfaces.d/eth0 both configure eth0. But the latter (created by cloud-init, I suppose) is a dead file, as /etc/network/interfaces only includes *.cfg files, so that shouldn't matter.

affects: systemd (Ubuntu) → ifupdown (Ubuntu)
summary: - systemctl wants to configure interfaces it shouldn't (lxdbr0)
+ ifupdown wants to configure interfaces it shouldn't (lxdbr0)

Martin Pitt (pitti) wrote :

This will make networking's output go to the console as well:

sudo mkdir -p /etc/systemd/system/networking.service.d
printf '[Service]\nStandardOutput=journal+console\nStandardError=journal+console\n' | sudo tee /etc/systemd/system/networking.service.d/debug.conf

(I.e., do this before you set up the box to hang on networking.) After a reboot, you should then be able to see what's going on in "nova console-log".

Serge Hallyn (serge-hallyn) wrote :

Thanks, Martin, this is showing:

[  OK  ] Reached target Network (Pre).
         Starting Raise network interfaces...
[  OK  ] Started ifup for lxdbr0.
[  OK  ] Found device /sys/subsystem/net/devices/lxdbr0.
[ 18.541352] ifup[1931]: /sbin/ifup: waiting for lock on /run/network/ifstate.lxdbr0
[FAILED] Failed to start Raise network interfaces.
See 'systemctl status networking.service' for details.
[DEPEND] Dependency failed for Initial cloud... job (metadata service crawler).
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.

Serge Hallyn (serge-hallyn) wrote :

Of course there is still the question of why it is trying to deal with lxdbr0 at all. Before reboot, "ifquery --read-environment --list" only shows eth0 and lo.
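
For anyone wanting to repeat that comparison, here is a rough sketch (not something posted in this bug) that prints the interfaces the kernel knows about but that are missing from a given list of ifupdown-managed names. The function name unmanaged_ifaces and its optional directory argument are made up for illustration; the directory defaults to /sys/class/net and is only a parameter so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): print interfaces present under a
# sysfs-like directory that do not appear in a list of managed names.
#   $1: whitespace/newline-separated names, e.g. the output of "ifquery --list"
#   $2: directory laid out like /sys/class/net (defaults to the real one)
unmanaged_ifaces() {
    # Normalise the list to " name1 name2 " so whole names can be matched.
    managed=" $(printf '%s' "$1" | tr '\n' ' ') "
    netdir="${2:-/sys/class/net}"
    for path in "$netdir"/*; do
        # Skip non-directories (and the unexpanded glob if the dir is empty).
        [ -d "$path" ] || continue
        name="${path##*/}"
        case "$managed" in
            *" $name "*) ;;                    # ifupdown knows this one
            *) printf '%s\n' "$name" ;;        # kernel has it, ifupdown does not
        esac
    done
}
```

On a real host this would be called as: unmanaged_ifaces "$(ifquery --list)". An lxdbr0 or veth device showing up in the output would match the observation above that ifupdown's own configuration does not mention it.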

Ryan Harper (raharper) wrote :

Can you attach cloud-init*.log to the bug?

Serge Hallyn (serge-hallyn) wrote :

I cannot. I have not been able to get a console on the systems where I can reproduce the bug.

Ok, I guess what I need to do is script a systemd job which detects
that lxd is configured; reboots after 5 minutes with lxd
disabled. Then I'll be able to log in.

Removing /etc/network/interfaces.d/* and putting eth0 into /etc/network/interfaces fixes it.

Martin Pitt (pitti) wrote :

So based on Serge's observation that removing /etc/network/interfaces.d/* fixes this, and the log message

[CLOUDINIT] stages.py[INFO]: Applying network configuration from fallback: {'version': 1, 'config': [{'name': 'lxdbr0', 'mac_address': 'fe:68:41:8c:e7:dc', 'subnets': [{'type': 'dhcp'}], 'type': 'physical'}]}

I suspect that cloud-init is adding a /etc/network/interfaces.d/ configuration for lxdbr0, which it shouldn't do?

Changed in lxd (Ubuntu):
status: Confirmed → Invalid
affects: ifupdown (Ubuntu) → cloud-init (Ubuntu)
Scott Moser (smoser) on 2016-04-14
Changed in lxd (Ubuntu):
status: Invalid → New
Scott Moser (smoser) wrote :

OK. So under bug 1569974 I fixed cloud-init to not consider the bridge as something it should pick as a fallback NIC.
However, there is still a problem in that cloud-init may select the host side of a veth device.

The result is that
$ cat /etc/network/interfaces.d/50-cloud-init.cfg
auto vethVH9WCK
iface vethVH9WCK inet dhcp

So
a.) cloud-init needs to find some way to still do the right thing and pick the right NIC, ignoring those veth devices.
   Note, though, that in a container eth0 (a veth) *is* the right device.

b.) lxd should probably not bring up bridges or do other things before 'network-pre.target'.
cloud-init-local, the code that determines this, runs with:
 'Before=network-pre.target'
The fact that it finds vethVH9WCK means that lxd ran that early. It probably means that lxd started containers that early, which is probably wrong, as those containers might well expect to reach the interwebs during boot, and they will not be able to.
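
To make point a.) concrete, here is a minimal sketch of that kind of filtering. It is not cloud-init's actual code: the function name fallback_candidates and the directory argument are hypothetical, and the selection rules (skip lo, skip names starting with "veth", skip devices that have a "bridge" subdirectory in sysfs) are only one plausible reading of what is described in this bug.

```shell
#!/bin/sh
# Hypothetical sketch, NOT cloud-init's implementation: list interfaces that
# look like sane fallback NIC candidates.
#   $1: directory laid out like /sys/class/net (defaults to the real one)
fallback_candidates() {
    netdir="${1:-/sys/class/net}"
    for path in "$netdir"/*; do
        # Skip non-directories (and the unexpanded glob if the dir is empty).
        [ -d "$path" ] || continue
        name="${path##*/}"
        case "$name" in
            lo|veth*) continue ;;   # loopback, and host-side veth by name prefix
        esac
        # Bridge devices expose a "bridge" subdirectory in sysfs.
        [ -d "$path/bridge" ] && continue
        printf '%s\n' "$name"
    done
}
```

Because the veth check is by name prefix only, a container's eth0 (which is a veth underneath) still comes through as a candidate, which is the behaviour the note under a.) asks for.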

Scott Moser (smoser) on 2016-04-14
summary: - ifupdown wants to configure interfaces it shouldn't (lxdbr0)
+ ifupdown wants to configure interfaces it shouldn't (lxdbr0 or veth)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.7~bzr1209-0ubuntu1

---------------
cloud-init (0.7.7~bzr1209-0ubuntu1) xenial; urgency=medium

  * New upstream snapshot.
    - fallback net config: do not consider devices starting with
      'veth' (LP: #1569064)

 -- Scott Moser <email address hidden> Thu, 14 Apr 2016 16:24:38 -0400

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxd - 2.0.0-0ubuntu4

---------------
lxd (2.0.0-0ubuntu4) xenial; urgency=medium

  * Only start lxd after network-online.target has been reached.
    This avoids a cloud-init race at boot time and also makes it more
    likely for whatever IP address or bridge LXD needs to have been properly
    setup. (LP: #1569064)

 -- Stéphane Graber <email address hidden> Thu, 14 Apr 2016 16:03:02 -0400

Changed in lxd (Ubuntu):
status: New → Fix Released
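
For systems without the fixed package, the same ordering could in principle be approximated with a local drop-in. The following is only a sketch under assumptions: the unit name lxd.service, the drop-in file name, and the helper write_lxd_ordering_dropin are all illustrative, and this is not necessarily how the packaged fix was implemented.

```shell
#!/bin/sh
# Hypothetical sketch: write a systemd drop-in that orders lxd.service after
# network-online.target. The unit name "lxd.service" is an assumption.
#   $1: optional staging root (empty means the live system, which needs root)
write_lxd_ordering_dropin() {
    root="${1:-}"
    dir="$root/etc/systemd/system/lxd.service.d"
    mkdir -p "$dir" || return 1
    # After= only orders the units; Wants= pulls network-online.target in.
    printf '[Unit]\nAfter=network-online.target\nWants=network-online.target\n' \
        > "$dir/after-network-online.conf"
}
```

On a live system this would be followed by "sudo systemctl daemon-reload" so that systemd picks up the new drop-in.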