Bonded network device is not correctly detected during boot-up.

Bug #1056792 reported by annunaki2k2
This bug affects 1 person
Affects: ifenslave-2.6 (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

We have an x86_64 Intel server running Ubuntu 12.04.1, connected using its two onboard 1G network interfaces in an LACP bond. The configuration works fine, but for some very annoying reason, when the machine boots, the start-up scripts hang for two minutes waiting for the connection to come up - yet the connection is actually already up (and pingable remotely).

Here is my interfaces configuration file:
russell@pm1 ~ $ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# Slave Definition for bond0
auto eth0
iface eth0 inet manual
 bond-master bond0

auto eth1
iface eth1 inet manual
 bond-master bond0

# The primary network interface
auto bond0
iface bond0 inet static
 address 10.0.1.151
 netmask 255.255.254.0
 broadcast 10.0.1.255
 network 10.0.0.0
 gateway 10.0.0.1
 dns-nameservers 10.0.0.120 10.0.1.120
 dns-search mps.lan wilts.mps.lan
 dns-domain mps.lan
 bond-mode 802.3ad
 bond-miimon 100
 bond-lacp_rate 1
 bond-slaves none
# bond-use_carrier 1
 post-up /usr/local/sbin/check-bond.sh $IFACE
 pre-down /usr/local/sbin/check-bond.sh stop $IFACE

And (once the machine times out and continues its boot), here is the resultant configuration:
russell@pm1 ~ $ ifconfig
bond0 Link encap:Ethernet HWaddr 00:1e:67:44:58:88
          inet addr:10.0.1.151 Bcast:10.0.1.255 Mask:255.255.254.0
          inet6 addr: fe80::21e:67ff:fe44:5888/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:2644 errors:0 dropped:827 overruns:0 frame:0
          TX packets:1575 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:282832 (282.8 KB) TX bytes:261199 (261.1 KB)

eth0 Link encap:Ethernet HWaddr 00:1e:67:44:58:88
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:803 errors:0 dropped:803 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:70241 (70.2 KB) TX bytes:992 (992.0 B)
          Memory:d0b20000-d0b40000

eth1 Link encap:Ethernet HWaddr 00:1e:67:44:58:88
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:1841 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1567 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:212591 (212.5 KB) TX bytes:260207 (260.2 KB)
          Memory:d0b00000-d0b20000

russell@pm1 ~ $ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
 Aggregator ID: 1
 Number of ports: 1
 Actor Key: 17
 Partner Key: 1
 Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:1e:67:44:58:88
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:1e:67:44:58:87
Aggregator ID: 2
Slave queue ID: 0

As you can see, it has actually booted with the correct configuration - it just decided to waste two minutes because it failed to detect correctly that the network is actually configured and ready.

Here are the relevant lines from the syslog relating to the bonding interface:
russell@pm1 ~ $ sudo cat /var/log/syslog | grep -i bond | grep kernel | grep "Sep 26 12:06"
Sep 26 12:06:38 pm1 kernel: [ 6.069287] bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Sep 26 12:06:38 pm1 kernel: [ 6.077144] bonding: bond0: Setting MII monitoring interval to 100.
Sep 26 12:06:38 pm1 kernel: [ 6.084404] bonding: bond0: setting mode to 802.3ad (4).
Sep 26 12:06:38 pm1 kernel: [ 6.086176] bonding: bond0: Setting LACP rate to fast (1).
Sep 26 12:06:38 pm1 kernel: [ 6.088046] ADDRCONF(NETDEV_UP): bond0: link is not ready
Sep 26 12:06:38 pm1 kernel: [ 6.213700] bonding: bond0: Adding slave eth1.
Sep 26 12:06:38 pm1 kernel: [ 6.296412] bonding: bond0: enslaving eth1 as a backup interface with a down link.
Sep 26 12:06:38 pm1 kernel: [ 7.083578] bonding: bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
Sep 26 12:06:38 pm1 kernel: [ 7.084460] ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
Sep 26 12:06:38 pm1 kernel: [ 7.270717] bonding: bond0: Adding slave eth0.
Sep 26 12:06:38 pm1 kernel: [ 7.354304] bonding: bond0: enslaving eth0 as a backup interface with an up link.
Sep 26 12:06:38 pm1 kernel: [ 7.594951] bonding: bond0: Setting MII monitoring interval to 100.
Sep 26 12:06:38 pm1 kernel: [ 7.595780] bonding: unable to update mode of bond0 because interface is up.
Sep 26 12:06:38 pm1 kernel: [ 7.596696] bonding: bond0: Unable to update LACP rate because interface is up.
Sep 26 12:06:46 pm1 kernel: [ 17.418840] bond0: no IPv6 routers present

It appears that the ifenslave script is trying to modify the bond network device after it is brought up - even though it had already brought it up correctly beforehand - perhaps this is the reason for the failed detection? The relevant lines are:
Sep 26 12:06:38 pm1 kernel: [ 7.595780] bonding: unable to update mode of bond0 because interface is up.
Sep 26 12:06:38 pm1 kernel: [ 7.596696] bonding: bond0: Unable to update LACP rate because interface is up.

And in fact, you see these lines on boot-up just before the big wait happens (please see the attached screenshot taken using the Remote Management Module at boot time).

Revision history for this message
annunaki2k2 (russell-knighton) wrote :

ifenslave-2.6 information:
russell@pm1 ~ $ aptitude show ifenslave-2.6
Package: ifenslave-2.6
State: installed
Automatically installed: no
Version: 1.1.0-19ubuntu5
Priority: optional
Section: net
Maintainer: Ubuntu Developers <email address hidden>
Architecture: amd64
Uncompressed Size: 103 k
Depends: libc6 (>= 2.4), iproute
Recommends: net-tools
Conflicts: ifenslave (< 2), ifenslave (< 2), ifenslave-2.4 (<= 0.07+2.5.15-6), ifenslave-2.4 (<= 0.07+2.5.15-6), ifenslave-2.6
Provides: ifenslave
Description: Attach and detach slave interfaces to a bonding device

affects: linux (Ubuntu) → ifenslave-2.6 (Ubuntu)
Revision history for this message
Stéphane Graber (stgraber) wrote :

Please attach a tarball of /var/log/upstart/

Changed in ifenslave-2.6 (Ubuntu):
status: New → Incomplete
Changed in ifenslave-2.6 (Ubuntu):
importance: Undecided → Medium
importance: Medium → Undecided
Revision history for this message
annunaki2k2 (russell-knighton) wrote :

As requested, an attached tarball file of the upstart logs.

Changed in ifenslave-2.6 (Ubuntu):
status: Incomplete → New
Revision history for this message
annunaki2k2 (russell-knighton) wrote :

Has anyone taken a look at the logs or had any thoughts on this? Can it be assigned to the maintainer of the ifenslave-2.6 package?

Revision history for this message
Stéphane Graber (stgraber) wrote :

Based on the errors in /var/log/upstart, it looks like your interface is already getting configured by something prior to ifupdown, which makes ifupdown fail to bring up the interface and causes the hang.

Can you please attach a tarball containing all of /etc/network/ and your /usr/local/sbin/check-bond.sh script?

Thanks

Changed in ifenslave-2.6 (Ubuntu):
status: New → Incomplete
Revision history for this message
annunaki2k2 (russell-knighton) wrote :

Attached is my post-up file. It is simply a bash script that monitors the status of the bonded device in /proc - so it should, of course, in no way influence the actual process of bringing up the device (it's a post-up command, after all).

That said, you got me thinking - so I tried commenting out those lines, and when I did, the system booted without error!

I have been using this script (or something very similar) since 8.04, and it was definitely working great in 10.04, so obviously somewhere between 10.04 and 12.04 something has changed in the way "post-up" is handled on bonded devices.
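
(For readers without the attachment: the script itself isn't shown in this report, but based on the description above, a minimal sketch of such a post-up/pre-down monitoring hook might look roughly like the following. The paths, logging choices, and structure here are assumptions, not the actual check-bond.sh.)

#!/bin/bash
# Hypothetical sketch only, NOT the attached check-bond.sh.
# Called as "check-bond.sh $IFACE" from post-up and "check-bond.sh stop $IFACE" from pre-down.

if [ "$1" = "stop" ]; then
    # pre-down: nothing left to monitor, just succeed.
    exit 0
fi

IFACE="$1"
STATUS="/proc/net/bonding/$IFACE"

# post-up: record the bond's aggregation state if the proc file is readable.
if [ -r "$STATUS" ]; then
    grep -E 'MII Status|Slave Interface|Aggregator ID' "$STATUS" | logger -t check-bond
fi

# Always return success so ifupdown never treats the interface as failed.
exit 0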

Revision history for this message
annunaki2k2 (russell-knighton) wrote :

As requested, complete tar of /etc/network attached to this bug.

Changed in ifenslave-2.6 (Ubuntu):
status: Incomplete → New
Revision history for this message
Stéphane Graber (stgraber) wrote :

Can you also attach your /etc/fstab?

I'm wondering if perhaps the network is coming up early enough that some filesystems aren't mounted yet, causing the script to fail.
A failure of a post-up script on bond0 will likely hold up eth0 and eth1, causing the delay. ifupdown only considers an interface fully up if all of its post-up scripts returned 0.
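
As an illustration of that rule (an assumed pattern, not taken from the attached script): a post-up hook written defensively, so that a missing mount or a failed command never propagates a non-zero exit status back to ifupdown, would avoid holding the interface:

#!/bin/bash
# Assumed example of a defensive post-up hook; the real monitoring logic would go where noted.
LOGDIR=/var/log/bond    # hypothetical location that may live on a not-yet-mounted /var

if [ ! -d "$LOGDIR" ]; then
    # /var (or another dependency) isn't available this early in boot; skip rather than fail.
    exit 0
fi

# ... real monitoring work here ...
cat "/proc/net/bonding/$1" >> "$LOGDIR/$1.status" 2>/dev/null

# Return 0 unconditionally so ifupdown considers the interface fully up.
exit 0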

Revision history for this message
annunaki2k2 (russell-knighton) wrote :

I'm very grateful for your help.

Interesting idea, that - my fstab is attached.

Revision history for this message
Stéphane Graber (stgraber) wrote :

Right, so /var may be mounted after your script is triggered; since your script depends on /var, that may explain the failure.

Can you maybe make the script "exit 0" right at the beginning, to confirm that calling the script isn't the problem and that it's indeed something done within the script that's making ifupdown fail to bring up the interface?
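
In other words (assuming the script begins with an ordinary shebang), the test is just:

#!/bin/bash
exit 0  # temporary: short-circuit the whole script so ifupdown sees an immediate success
# ... the rest of check-bond.sh stays below, unreached ...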

Revision history for this message
annunaki2k2 (russell-knighton) wrote :

I have added an exit 0 at the top of the script, and it boots correctly.

The strange thing is that this is happening on other machines built with 12.04, even where there is no separate /var partition. Do you have any further ideas why this would happen in 12.04 but didn't occur in 10.04? Has the boot priority/order been changed? Is it possible that even "/proc" is now unavailable at the time networking is started?

Revision history for this message
Stéphane Graber (stgraber) wrote :

/proc and /sys are guaranteed to be mounted by either the initramfs or by init.

I'm going to mark this bug invalid as it's not caused by anything in ifenslave-2.6 itself and probably not by Ubuntu.

Changed in ifenslave-2.6 (Ubuntu):
status: New → Invalid