bridge-utils/vlan udev hooks prevent execution of upstart hook, slowing down boot

Bug #1003656 reported by andrew bezella on 2012-05-23
This bug affects 2 people
Affects                  Importance  Assigned to
bridge-utils (Ubuntu)    Medium      Stéphane Graber
  Precise                Low         Stéphane Graber
  Quantal                Medium      Stéphane Graber
vlan (Ubuntu)            Medium      Stéphane Graber
  Quantal                Medium      Stéphane Graber

Bug Description

[rationale]
With some specific configuration, the boot hangs for minutes in udev as ifupdown is waiting for an interface to show up.

[test case]
 - Use one of the configurations listed in this bug or its duplicate, boot the machine with it and observe it hanging for a couple of minutes.
 - Apply the update
 - Check that the machine now works much faster and that the interface is properly configured.

[regression potential]
I can't think of a situation where someone would depend on the broken behaviour in a way that wouldn't itself be a bug. The change landed fairly early in Ubuntu 12.10 and no regression has been reported so far. Worst case scenario, it's easy to revert.

we're trying to migrate our network configuration from lucid to precise. in 10.04 we tied eth0+eth1 together into bond0, then set br0 up on top of that and assigned an address via dhcp. in 12.04 this only works if br0 is configured with a static ip address. it fails when trying to use dhcp. to simplify testing i've removed eth1 from the configuration (sanity checked against http://www.stgraber.org/2012/01/04/networking-in-ubuntu-12-04-lts/ ):

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
  bond-master bond0

auto bond0
iface bond0 inet manual
  bond-slaves none
  bond-mode 802.3ad
  bond-miimon 100

auto br0
iface br0 inet dhcp
  bridge_ports bond0
  bridge_stp off

the above results in a system w/o network connectivity. the dhcp server reports requests from an unexpected mac addr (different each boot). udevd logs "timeout 'bridge-network-interface'". poking around a little before the timeout shows the following 2 groups of processes:

  |-ifup,1361 --allow auto eth0
  |   `-sh,1363 -c run-parts /etc/network/if-pre-up.d
  |       `-run-parts,1364 /etc/network/if-pre-up.d
  |           `-ifenslave,1392 /etc/network/if-pre-up.d/ifenslave
  |               `-sleep,2380 0.1

  |   |-udevd,599 --daemon
  |   |   `-bridge-network-,1429 /lib/udev/bridge-network-interface
  |   |       `-ifup,1457 --allow auto br0
  |   |           `-sh,1540 -c dhclient3 -e IF_METRIC=100 -pf /var/run/dhclient.br0.pid -lf /var/lib/dhcp/dhclient.br0.leases -1 br0
  |   |               `-dhclient3,1541 -e IF_METRIC=100 -pf /var/run/dhclient.br0.pid -lf /var/lib/dhcp/dhclient.br0.leases -1 br0

the ifenslave appears to be looping over that `sleep` (testing for /run/network/ifenslave.bond0) until it is killed and the dhclient is making its request w/the unexpected mac addr (also reported in `ip link show br0`). interestingly br0's mac addr matches that of eth0 (as expected) once bridge-network-interface has timed out and been killed.
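
For readers unfamiliar with this kind of wait loop, the polling pattern described above can be sketched roughly as follows. This is only an illustration, not the actual ifenslave hook: the function name wait_for_flag and the bounded retry count are assumptions made for the example (per the report, the real hook keeps looping until it is killed).

```shell
#!/bin/sh
# Hypothetical helper: poll for a flag file, in the style the report
# describes for /run/network/ifenslave.bond0.
# Assumption: unlike the behaviour seen in the report, this sketch gives
# up after a fixed number of polls instead of looping forever.
wait_for_flag() {
    flag=$1          # path of the file to wait for
    tries=${2:-50}   # number of 0.1s polls before giving up (assumed default)
    while [ "$tries" -gt 0 ]; do
        [ -e "$flag" ] && return 0
        sleep 0.1
        tries=$((tries - 1))
    done
    return 1
}
```

With an unbounded version of this loop, whatever is supposed to create the flag file has to run while the loop is waiting; if it is queued behind the waiting process, nothing progresses until a timeout kills one side.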

a workaround appears to be adding the line:
  pre-up /sbin/ifup --allow auto bond0
to the "auto br0" stanza.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: bridge-utils 1.5-2ubuntu6
ProcVersionSignature: Ubuntu 3.2.0-24.38-generic 3.2.16
Uname: Linux 3.2.0-24-generic x86_64
ApportVersion: 2.0.1-0ubuntu7
Architecture: amd64
Date: Wed May 23 13:44:43 2012
ProcEnviron:
 TERM=xterm
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/usr/bin/zsh
SourcePackage: bridge-utils
UpgradeStatus: No upgrade log present (probably fresh install)

andrew bezella (abezella) wrote :
Changed in bridge-utils (Ubuntu):
importance: Undecided → High
Stéphane Graber (stgraber) wrote :

Can you attach a tarball of /var/log/upstart?

andrew bezella (abezella) wrote :

contents of /var/log/upstart attached. as a note our legacy naming for the bridge is "br_phys" so there is some output related to trials using that name instead of "br0".

andrew bezella (abezella) wrote :

i believe that i've narrowed it down a bit. it seems improbable, but from what i can tell it is related to the character encoding of the file. i'm including 2 versions of the interfaces file. interfaces.utf8 works.

andrew bezella (abezella) wrote :

interfaces.fail does not work. as best i can tell the encoding is the only difference between them.

andrew bezella (abezella) wrote :

sorry, on closer inspection the interfaces.utf8 file has some non-printing characters. an interfaces file with no leading spaces still fails (attached). so, despite that red herring, i think the root problem remains. i'm presently unsure why the utf version of the file works. it doesn't seem to be simply ignoring the lines with the extra leading characters, because when i grep those lines out of the interfaces file and reboot the result is just an eth0 and eth1. it's also worth noting this is only a problem at boot-time. issuing an ifdown/ifup br0 once the system has killed off bridge-network-interface gives a working network.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in bridge-utils (Ubuntu):
status: New → Confirmed
Serge Hallyn (serge-hallyn) wrote :

Thanks, Andrew. I can't find any other differences, so if indentation in /etc/network/interfaces is breaking bond, then given that interfaces(5) specifically says indentation doesn't matter, this would definitely be a bug.

Serge Hallyn (serge-hallyn) wrote :

Just a note, I've reproduced this on a simple openstack instance. I used the interfaces.utf8 as my /etc/network/interfaces (except using bond-mode balance-rr since I don't have a switch capable of doing 802.3ad), rebooted and had networking. Then I simply removed all indentation, and after another reboot networking failed to come up.

Serge Hallyn (serge-hallyn) wrote :

Console output after the failed-network boot contained:

udevd[319]: timeout: killing 'bridge-network-interface' [550]

udevd[319]: 'bridge-network-interface' [550] terminated by signal 9 (Killed)

cloud-init-nonet gave up waiting for a network device.
ci-info: lo : 1 127.0.0.1 255.0.0.0 .
ci-info: eth0 : 1 . . fa:16:3e:1d:f0:28
ci-info: bond0 : 1 . . fa:16:3e:1d:f0:28
ci-info: br0 : 1 . . fa:16:3e:1d:f0:28
route_info failed

summary: - when adding a bond the bridge fails to acquire a dhcp address
+ bond entries in /etc/network/interfaces fail without indent (when adding
+ a bond the bridge fails to acquire a dhcp address)
Changed in bridge-utils (Ubuntu):
importance: High → Medium

Lowering the priority (per guidelines) since there is a workaround (use indentation).

andrew bezella (abezella) wrote :

thanks for looking into this. one thing worth noting is that the ascii version of the file attached in comment #5 (interfaces.fail) is indented but non-working. in fact, i only found that the utf8 version worked by accident; i cut'n'pasted what i had posted in the original report back into my interfaces and suddenly it worked! eventually i found the non-printing characters, but was left with no real understanding of the problem.

Stéphane Graber (stgraber) wrote :

I think the source is a broken grep somewhere in the udev hook, I'll have a look when I prepare the bridge-utils upload (likely tomorrow).

Changed in bridge-utils (Ubuntu):
assignee: nobody → Stéphane Graber (stgraber)
Changed in bridge-utils (Ubuntu):
status: Confirmed → Triaged
Stéphane Graber (stgraber) wrote :

While doing some testing here I couldn't reproduce the whitespace issue, though I did find a bug matching your description that's related to our udev hooks.

Could you try doing the two following changes?

In /lib/udev/rules.d/40-bridge-network-interface.rules:
 - replace "bridge-network-interface" by "bridge-network-interface&"

In /lib/udev/rules.d/40-vlan-network-interface.rules (if you have it on your system):
 - replace "vlan-network-interface" by "vlan-network-interface&"

Then reboot to test with your config.

On my test system (12.10), this fixes it. Another look at why indentation would fix it was unsuccessful, so hopefully that wasn't the actual source of the problem ;)

Changed in bridge-utils (Ubuntu):
status: Triaged → Incomplete
Stéphane Graber (stgraber) wrote :

If that doesn't fix it for you, I'd be interested in the post-boot output of:
 - ifquery --list --allow auto
 - ifquery br0
 - ifquery bond0
 - ifquery eth0
 - brctl show
 - ifconfig -a
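
One way to gather all of that output in a single file is sketched below. The commands themselves are the ones listed above; the run_diag function name, the "###" labels and the output path in the usage comment are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: run each diagnostic command and label its output.
run_diag() {
    for cmd in "$@"; do
        echo "### $cmd"
        # Deliberately unquoted so the string splits into command + arguments;
        # none of the commands used here need shell quoting.
        $cmd 2>&1 || echo "(exit status $?)"
        echo
    done
}

# Usage after boot (output path is an arbitrary choice):
# run_diag "ifquery --list --allow auto" "ifquery br0" "ifquery bond0" \
#          "ifquery eth0" "brctl show" "ifconfig -a" > /tmp/net-diag.txt
```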

andrew bezella (abezella) wrote :

it appears that the suggested edit to /lib/udev/rules.d/40-bridge-network-interface.rules fixed the problem i was seeing (don't have the vlan-related file on the system). thank you! the interfaces.fail and interfaces.nospc files both now work as expected, as does our standard interfaces configuration (the attached files were simplified versions of this).

the whitespace/indentation question isn't a big deal, but it is a source of confusion. without your fix, the only version of the file that i could get to work was the interfaces.utf8 (comment #4). but it isn't strictly related to indentation: both interfaces.fail (using ascii whitespace indentation) and interfaces.nospc (no indentation) failed. for whatever reason, the apparent indentation using the non-printing characters in interfaces.utf8 worked.

thanks again...

andrew bezella (abezella) wrote :

can this move out of "Incomplete" status? as noted in my last comment the suggested fix appears to work.

Stéphane Graber (stgraber) wrote :

Oops, looks like I missed that comment. Marked as Triaged now, should be able to look into pushing the fix soon.

Changed in bridge-utils (Ubuntu):
status: Incomplete → Triaged
Changed in vlan (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Stéphane Graber (stgraber)
Changed in bridge-utils (Ubuntu Precise):
assignee: nobody → Stéphane Graber (stgraber)
Changed in vlan (Ubuntu Precise):
assignee: nobody → Stéphane Graber (stgraber)
Changed in bridge-utils (Ubuntu Precise):
importance: Undecided → Low
status: New → Triaged
Changed in vlan (Ubuntu Precise):
status: New → Triaged
importance: Undecided → Low
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package vlan - 1.9-3ubuntu7

---------------
vlan (1.9-3ubuntu7) quantal; urgency=low

  * Start vlan-network-interface in the background to avoid blocking the
    rest of the udev events (most importantly the upstart one).
    (LP: #1003656)
 -- Stephane Graber <email address hidden> Fri, 07 Sep 2012 17:24:32 -0400

Changed in vlan (Ubuntu Quantal):
status: Triaged → Fix Released
summary: - bond entries in /etc/network/interfaces fail without indent (when adding
- a bond the bridge fails to acquire a dhcp address)
+ bridge-utils/vlan udev hooks prevent execution of upstart hook, slowing
+ down boot
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bridge-utils - 1.5-4ubuntu1

---------------
bridge-utils (1.5-4ubuntu1) quantal; urgency=low

  * Start bridge-network-interface in the background to avoid blocking the
    rest of the udev events (most importantly the upstart one).
    (LP: #1003656)
 -- Stephane Graber <email address hidden> Fri, 07 Sep 2012 17:35:13 -0400

Changed in bridge-utils (Ubuntu Quantal):
status: Triaged → Fix Released
Stéphane Graber (stgraber) wrote :

A quick test on my test systems and on the reporter's system didn't show any side effect, so I went ahead and pushed it to Ubuntu 12.10 in both the vlan and bridge-utils packages.
I'll wait a few more weeks before pushing this to Ubuntu 12.04 so we can spot any potential regression introduced by the change.

Steve Langasek (vorlon) wrote :

Stéphane,

> * Start vlan-network-interface in the background to avoid blocking the
> rest of the udev events (most importantly the upstart one).
> (LP: #1003656)

This doesn't appear to be a correct fix. I noticed here while working on an unrelated boot issue that udev is now spitting warning messages about these run rules not being valid, because udev tries to find a helper with '&' in the name:

Sep 9 02:28:52 virgil udevd[1887]: failed to execute '/lib/udev/bridge-network-interface&' 'bridge-network-interface&': No such file or directory
Sep 9 02:28:52 virgil udevd[1888]: failed to execute '/lib/udev/vlan-network-interface&' 'vlan-network-interface&': No such file or directory

So, reopening this report.

Changed in vlan (Ubuntu Quantal):
status: Fix Released → Triaged
Stéphane Graber (stgraber) wrote :

Hmm, that's odd, not sure why I didn't get these when I last tested the change...

I'll put this one back on my todo for tomorrow to figure out how to properly fix it or if I can't find a better way quickly, revert the change.

Stéphane Graber (stgraber) wrote :

Also re-opening the bridge-utils task.

Changed in bridge-utils (Ubuntu Quantal):
status: Fix Released → Triaged
Changed in vlan (Ubuntu Quantal):
status: Triaged → Invalid
no longer affects: vlan (Ubuntu Precise)
Stéphane Graber (stgraber) wrote :

Rationale for not touching vlan is that it's never calling ifup and so shouldn't be able to cause a similar deadlock situation as bridge-utils.

Stéphane Graber (stgraber) wrote :

I just spent a few minutes trying to figure out the ordering based on the scripts for Andrew's system, it's basically:
- eth0 appears
  - triggers udev
    - triggers upstart
      - triggers ifup eth0
        - triggers bonding
          - bond0 appears
            - triggers udev
              - triggers bridge-network-interface
                - triggers ifup br0
                  - br0 appears
                    - triggers udev
                      - triggers upstart
                        - triggers ifup br0 => fails, already configured
                  - dhclient br0 => fails as it's blocking and no interface in bond
              - triggers upstart
                - triggers ifup bond0
          - eth0 is joined in the bond
- eth1 appears
  - triggers udev
    - triggers upstart
      - triggers ifup eth1
        - triggers bonding
          - eth1 is joined in the bond

As you can see, it's relatively complex. The main problem, as seen above, is that because udev is sequential, "ifup br0" is called before the bond interface is fully set up, so it fails to acquire an IP because nothing is in the bond at that point.

A possible way around the problem would be to only create the bridge from the udev hook but not actually call ifup, letting the upstart job take care of this. This would make the code similar to what vlan and ifenslave are currently doing where, as far as I know, we're not getting a similar deadlock.

I have a system reproducing this bug, so I'll now be trying my workaround and re-read all the scripts once more to see if I missed something.
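
The idea of "create the bridge from the hook, but leave configuration to upstart" can be sketched as below. This is a hedged illustration, not the contents of the real bridge-network-interface script: the ensure_bridge function and the SYSFS_NET and BRCTL override variables are hypothetical names introduced for this example. The only real command assumed is "brctl addbr".

```shell
#!/bin/sh
# Hypothetical sketch: make sure a bridge device exists, and do nothing else.
# No ifup and no dhclient are started here, so the hook returns quickly and
# later events (including the upstart one) are not held up behind it.
ensure_bridge() {
    bridge=$1
    sysfs=${SYSFS_NET:-/sys/class/net}   # override is for testing only (assumption)
    # If an interface with this name already exists, leave it alone.
    if [ -e "$sysfs/$bridge" ]; then
        return 0
    fi
    ${BRCTL:-brctl} addbr "$bridge"
}
```

Bringing the interface up and running DHCP then happens later, when the regular network-interface job handles the newly created device.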

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bridge-utils - 1.5-4ubuntu2

---------------
bridge-utils (1.5-4ubuntu2) quantal; urgency=low

  * Revert last change as udev doesn't accept that syntax.
  * Set BRIDGE_HOTPLUG=yes as that's the behaviour we had in the past and
    need for the event based networking. Reverting change from 1.5-4.
  * Don't call ifup from bridge-network-interface, instead just call brctl
    and let udev/upstart bring the interface up (LP: #1003656)
 -- Stephane Graber <email address hidden> Tue, 11 Sep 2012 10:45:45 -0400

Changed in bridge-utils (Ubuntu Quantal):
status: Triaged → Fix Released
description: updated
Stéphane Graber (stgraber) wrote :

Fix uploaded to precise-proposed queue.

Changed in bridge-utils (Ubuntu Precise):
status: Triaged → In Progress

Hello andrew, or anyone else affected,

Accepted bridge-utils into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/bridge-utils/1.5-2ubuntu7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in bridge-utils (Ubuntu Precise):
status: In Progress → Fix Committed
tags: added: verification-needed
Gary Richards (ashak) wrote :

Hrm... I had this issue and raised a different bug. I spoke with Stéphane about it on IRC and was asked to try the quantal version of the bridge-utils package. Switching to it solved all of my problems so Stéphane pushed through the same fix to Precise.

I saw the request to try the Precise package, so I added the proposed repo to one of my boxes, but bridge-utils can't have made it there yet?

Therefore I downloaded the binary package by following the link and a few links from that and installed the Precise version of the package, rebooted my box and my bridge didn't manage to dhcp an address again *sigh*

I rebooted again to be sure, still nothing.

I reverted to the Quantal version of the package, rebooted again... still no address on my bridge.

Nothing in /etc/network/interfaces of my own has changed throughout the whole process.

I'm confused... But it seems that there's still some issue. Quite how it worked the other day and now neither version of the package seems to work I don't know.

Gary Richards (ashak) wrote :

Gah, ignore my comment, perhaps the new version is working.

The issue I see now seems to be that on the last reboot the bond has taken the MAC address of the other ethernet interface in the bond rather than the MAC address of eth0 (which is the one hard wired in my dhcpd config), meaning it's unable to obtain an IP address due to my dhcp server not knowing what address to give it.

Gary Richards (ashak) wrote :

OK... if I force the MAC address of my bridge to be the one expected by my dhcp server then 1.5-2ubuntu7 works.

tags: added: verification-done
removed: verification-needed
andrew bezella (abezella) wrote :

looks good in my testing, too. with bridge-utils 1.5-2ubuntu7 the previously failing versions of my interfaces file work. our current live version of the interfaces continues to work, and a variant w/o the "pre-up" workaround from my initial report works as well. thanks!

gadLinux (gad-aguilardelgado) wrote :

I have the same problem with quantal (image installed by MAAS)

I've just created a bridge br0 in the interfaces file and now the system does not boot.
it seems br0 is not going up

if I do a brctl addbr br0 everything works.

I don't have the bonding setup, but it seems to be the same error.

The output of initctl list shows:

upstart-udev-bridge is running
upstart-socket-bridge is running

cloud-init is running...

it seems that cloud-init needs the interface up to continue but it never goes up...

Also, I've modified /etc/network/if-pre-up.d/bridge to show what interfaces are up when the system stalls:

It shows that $IF_BRIDGE_PORTS goes empty, and is executed twice...

It should have the bridge port br0 in my case...

gadLinux (gad-aguilardelgado) wrote :

I added a rule in udev to bring up the port directly with brctl addbr br0 and now everything works.

But I don't know the cause.

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bridge-utils - 1.5-2ubuntu7

---------------
bridge-utils (1.5-2ubuntu7) precise-proposed; urgency=low

  * Don't call ifup from udev hook, instead just create the bridge and let
    upstart bring it up. Fixes hang at boot time. (LP: #1003656)
 -- Stephane Graber <email address hidden> Fri, 26 Oct 2012 17:16:59 +0200

Changed in bridge-utils (Ubuntu Precise):
status: Fix Committed → Fix Released
Hungerburg (pch-myzel) wrote :

Please excuse the incomplete information, but I strongly suspect it is related, and maybe it rings some bells with the experts.

This week a system that I manage (xubuntu 12.04.1) failed to connect to the net on reboot. The br0 bridge interface does not get fully configured. There is nothing in syslog that points at a specific failure. The startup GUI shows "waiting for network" and "waiting 60 more seconds for network" before the greeter comes up.

The bridge interface is used to bridge eth1 and OpenVPN tap0. It is solely configured from /etc/network/interfaces.

I found that after tearing down the bridge completely, "ifup br0" successfully completes. But it fails at boot time. bridge-utils may be the only package updated since the last reboot, when things were still fine (some 80 days ago).

Next week I may collect more information (ifquery br0 # of the dysfunctional bridge, brctl show, etc).

Stéphane Graber (stgraber) wrote :

The most relevant log files are in /var/log/upstart/network*

Hungerburg (pch-myzel) wrote :

Thank you Stéphane, in those log files I found the message:

  device br0 already exists; can't create bridge with the same name
  Failed to bring up br0.

In /etc/network/interfaces, below the br0 stanza, I just had to remove some lines:

  pre-up brctl addbr br0
  pre-up brctl addif br0 eth1
  pre-up brctl addif br0 tap0
  ifconfig br0 192.168.0.2 netmask 255.255.255.0 broadcast 192.168.0.255

These maybe were necessary in 10.04 lucid, when I set up the system first.

So the update of bridge-utils made the config simpler in the end. Great!
