Comment 11 for bug 1636708

Tom Verdaat (tom-verdaat) wrote :

We upgraded to the vlan 1.9-3.2ubuntu1.16.04.3 package and our networking broke horribly in a very similar way.

Let me start with our networking configuration. Two slaves, a bond and a vlan on top of that bond:

auto eno1
iface eno1 inet manual
   mtu 1500
   bond-master bond1
   bond-primary eno1

auto eno2
iface eno2 inet manual
   mtu 1500
   bond-master bond1

auto bond1
iface bond1 inet static
   mtu 1500
   address 10.10.10.3
   bond-miimon 100
   bond-mode active-backup
   bond-slaves none
   bond-downdelay 200
   bond-updelay 200
   dns-nameservers 10.10.0.1
   netmask 255.255.0.0

auto bond1.2
iface bond1.2 inet static
   mtu 1500
   address 10.11.10.3
   netmask 255.255.0.0
   vlan-raw-device bond1

This fails to come up correctly, both during boot and manually. Bringing up any of eno1, eno2, bond1 or bond1.2 results in the same problem: "ifup: waiting for lock on /run/network/ifstate.bond1".

The problem seems to be that ifup tries to bring up the base bond1 interface *again*, even if it is already up. It then gets stuck waiting for the lock on bond1 to be released so it can bring the interface up, but bond1 is already up and therefore locked, so that never happens.
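
To see which process is sitting on the state file while ifup waits, something like the following might help. This is only a sketch: the helper names are made up for illustration, it assumes fuser (from the psmisc package) is installed, and it assumes the process holding the lock still has the state file open.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ifupdown or vlan.

# Build the per-interface state file path named in the ifup message.
ifstate_path() {
    printf '/run/network/ifstate.%s\n' "$1"
}

# List processes that have the state file open.
# Assumption: the lock holder keeps the file open; run as root to see
# processes owned by other users.
show_ifstate_holders() {
    fuser -v "$(ifstate_path "$1")"
}
```

For the configuration above this would be called as "show_ifstate_holders bond1".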

We also tried bringing all interfaces down and just running "ifup bond1.2" but that results in the same behavior.

The only workaround that worked for us was to:
1) temporarily remove the bond1.2.cfg from /etc/network/interfaces.d
2) bring up eno1, eno2 and bond1
3) put the bond1.2.cfg back in its place
4) run "ifup bond1.2"
5) in another terminal, list the running ifup processes using "ps -ef | grep ifup"
6) kill the "ifup bond1" process

During step 5, "ps -ef | grep ifup" shows two ifup processes: one for bond1 and one for bond1.2. As soon as we kill the "ifup bond1" process, the "ifup bond1.2" process completes immediately and correctly configures the vlan 2 subinterface.
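
Steps 5 and 6 could be scripted roughly as below. This is a sketch, not something we ran in this form: the function name is hypothetical, and it assumes the stuck process shows up in ps with "ifup" as the command and the interface name as the last argument. Matching the interface name exactly avoids picking up the "ifup bond1.2" process, which a plain grep for "bond1" would also match.

```shell
#!/bin/sh
# Hypothetical helper: read "PID COMMAND ARGS..." lines on stdin
# (for example from `ps -eo pid=,args=`) and print the PIDs of ifup
# processes whose last argument is exactly the given interface name.
stuck_ifup_pids() {
    awk -v iface="$1" '$2 ~ /(^|\/)ifup$/ && $NF == iface { print $1 }'
}

# Example usage (assumption: run as root while "ifup bond1.2" is hanging
# in another terminal):
#   ps -eo pid=,args= | stuck_ifup_pids bond1 | xargs -r kill
```

Printing the PIDs first and only then piping them to kill makes it easy to check that only the "ifup bond1" process is selected.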

This is clearly linked to the vlan package: our InfiniBand interfaces work just fine, and everything worked just fine before upgrading the package. So my best guess is that something broke in the code that detects whether the vlan-raw-device is up. Perhaps related to LP #1573272?