[lucid] netxen_nic driver and ethernet bonding broken at boot time

Bug #688703 reported by Peter Matulis
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

An HP ProLiant DL360 G6 system running Lucid 64-bit (only certified by Canonical for 9.04 64-bit and 10.10 32-bit [1]) is using a dual-port netxen_nic-based ethernet device.

When ethernet bonding is configured via /etc/network/interfaces [2], the bonded device appears to exist after boot [5] but does not work: its RX counters stay at zero. Upon restarting the networking service [3] the device works as expected. The bonded device also works when configured manually [4]. This used to work with Hardy.

Apport information from the affected system is forthcoming.

---------------------------------------------------

[1] http://webapps.ubuntu.com/certification/make/HP/servers

[2]

auto bond0
iface bond0 inet static
address 10.2.10.70
netmask 255.255.0.0
broadcast 10.2.255.255
gateway 10.2.10.254
slaves eth0 eth1
bond-mode 1
bond-miimon 100

[3] sudo /etc/init.d/networking restart

[4]

$ sudo modprobe -r bonding
$ sudo modprobe bonding mode=active-backup miimon=100
$ sudo ifconfig bond0 10.2.10.70 netmask 255.255.0.0
$ sudo ifenslave bond0 eth0 eth1
$ ifconfig bond0
$ cat /proc/net/bonding/bond0

[5]

$ dmesg | grep bonding

[ 6.403320] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
[ 6.405552] bonding: bond0: doing slave updates when interface is down.
[ 6.405555] bonding: bond0: Adding slave eth0.
[ 6.405557] bonding bond0: master_dev is not up in bond_enslave
[ 7.192670] bonding: bond0: enslaving eth0 as an active interface with an up link.
[ 7.194312] bonding: bond0: doing slave updates when interface is down.
[ 7.194315] bonding: bond0: Adding slave eth1.
[ 7.194317] bonding bond0: master_dev is not up in bond_enslave
[ 7.721406] bonding: bond0: enslaving eth1 as an active interface with an up link.
[ 7.722009] bonding: bond0: setting mode to active-backup (1).
[ 7.722029] bonding: bond0: Setting MII monitoring interval to 100.

$ ifconfig

bond0 Link encap:Ethernet HWaddr f4:ce:46:af:58:40
inet addr:10.2.10.70 Bcast:10.1.255.255 Mask:255.255.0.0
inet6 addr: fe80::f6ce:46ff:feaf:5840/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:216 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:14159 (14.1 KB)

eth0 Link encap:Ethernet HWaddr f4:ce:46:af:58:40
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:216 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:14159 (14.1 KB)
Interrupt:62

eth1 Link encap:Ethernet HWaddr f4:ce:46:af:58:40
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Interrupt:66

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:77 errors:0 dropped:0 overruns:0 frame:0
TX packets:77 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:7050 (7.0 KB) TX bytes:7050 (7.0 KB)
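
The ifconfig output above shows the symptom as counters: bond0 and eth0 have sent packets but received none. A minimal sketch of a check for that pattern, reading the kernel's per-interface counters from sysfs. The function name and the SYSFS_NET override are assumptions made for illustration and are not part of the original report.

```shell
#!/bin/sh
# Succeed (exit 0) when an interface has transmitted packets but received none,
# which is the pattern shown for bond0 above. Returns 2 if the counters are
# missing (no such interface).
# SYSFS_NET is a hypothetical override so the function can be pointed at a
# test directory; on a real system it defaults to /sys/class/net.
rx_stalled() {
    iface=$1
    base=${SYSFS_NET:-/sys/class/net}
    stats="$base/$iface/statistics"
    [ -r "$stats/rx_packets" ] && [ -r "$stats/tx_packets" ] || return 2
    rx=$(cat "$stats/rx_packets")
    tx=$(cat "$stats/tx_packets")
    # Stalled: packets were sent but none were ever received.
    [ "$tx" -gt 0 ] && [ "$rx" -eq 0 ]
}
```

Example use shortly after boot: rx_stalled bond0 && echo "bond0: TX without RX"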

description: updated
Ryan Blazecka (rblazecka) wrote : apport information

AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 10.04
InstallationMedia: Ubuntu-Server 10.04 LTS "Lucid Lynx" - Release amd64 (20100427)
MachineType: HP ProLiant DL360 G6
Package: linux (not installed)
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/vmlinuz-2.6.32-26-server root=UUID=58f91cc3-f8b0-4443-bd20-179a0386ff59 ro --debug
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.32-26.48-server 2.6.32.24+drm33.11
Regression: Yes
Reproducible: Yes
Tags: lucid networking regression-potential needs-upstream-testing
Uname: Linux 2.6.32-26-server x86_64
UserGroups:

dmi.bios.date: 07/24/2009
dmi.bios.vendor: HP
dmi.bios.version: P64
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP64:bd07/24/2009:svnHP:pnProLiantDL360G6:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL360 G6
dmi.sys.vendor: HP

tags: added: apport-collected
Ryan Blazecka (rblazecka) wrote : BootDmesg.txt

apport information

Ryan Blazecka (rblazecka) wrote : CurrentDmesg.txt

apport information

Ryan Blazecka (rblazecka) wrote : Lspci.txt

apport information

Ryan Blazecka (rblazecka) wrote : Lsusb.txt

apport information

Ryan Blazecka (rblazecka) wrote : ProcCpuinfo.txt

apport information

Ryan Blazecka (rblazecka) wrote : ProcInterrupts.txt

apport information

Ryan Blazecka (rblazecka) wrote : ProcModules.txt

apport information

Ryan Blazecka (rblazecka) wrote : UdevDb.txt

apport information

Ryan Blazecka (rblazecka) wrote : UdevLog.txt

apport information

Andy Whitcroft (apw) wrote :

It does sound a little like a race condition, probably in the initialisation order or timing of configuration of the bonding driver. Some of the messages imply that actions are being done in the wrong order:

    [ 6.405552] bonding: bond0: doing slave updates when interface is down.

First, I see we have the dmesg ordering for the failed setup; it would be instructive to have the same information for the successful manual configuration.

Second, it might be interesting to see if we can work round this issue by triggering the bonding driver to load earlier in boot. Could we try adding it to /etc/modules? I would also recommend a second test with a configuration file for that module in /etc/modprobe.d, specifically setting the parameters as in the manual test case: mode=active-backup miimon=100.

Finally, I believe that testing with the LTS backport kernel from Maverick does correctly initiate the bond. There is one commit between these two kernels which mentions races with udev. I have pulled that patch back as a test and produced test kernels. Could you test them and see if this changes the behaviour? Please include the dmesg fragments for the bonding driver in your report. Kernels are at the URL below:

    http://people.canonical.com/~apw/lp688703-lucid/
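
For the two configuration tests suggested above, a minimal sketch of what the changes could look like. The file name bonding.conf, the function names and the ETC_DIR override are assumptions made for illustration; the options line uses the parameters from the manual test case. This only sets up the tests and is not a confirmed fix.

```shell
#!/bin/sh
# ETC_DIR is a hypothetical override for testing; it defaults to /etc.
# bonding.conf is an assumed file name under modprobe.d.

# Append a module name to the modules file unless it is already listed,
# so that the module is loaded early during boot.
ensure_module_listed() {
    module=$1
    file=${ETC_DIR:-/etc}/modules
    touch "$file"
    grep -qxF "$module" "$file" || echo "$module" >> "$file"
}

# Write the bonding options used in the manual test case.
write_bonding_options() {
    dir=${ETC_DIR:-/etc}/modprobe.d
    mkdir -p "$dir"
    echo "options bonding mode=active-backup miimon=100" > "$dir/bonding.conf"
}
```

On a real system both functions need root, for example by running them from a root shell: ensure_module_listed bonding; write_bonding_options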

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Ryan Blazecka (rblazecka) wrote :

Here [1] is the dmesg output for the manual restart of networking (running "sudo /etc/init.d/networking restart").

Triggering the bonding driver to load earlier in boot did not fix things. I added "bonding" to /etc/modules and restarted the system. Networking still failed to initialize correctly. The bonding-related dmesg output is listed below [2]. Regarding your second test, I'm not sure exactly what change to make to /etc/modprobe.d; if you let me know, I can run that test too.

Yes, testing with the backport Maverick kernel does fix the bonding issue. I'm not familiar enough with Ubuntu to know how to test the kernels you produced; could you let me know what commands to run to install them? Also, just looking at the filenames, I'm not sure which ones to test against each other. Should each .deb file be applied and tested individually, or do certain ones need to be applied together?

---------------------------------------------------

[1]

[ 150.452703] bonding: bond0: doing slave updates when interface is down.
[ 150.452706] bonding: bond0: Removing slave eth0
[ 150.452709] bonding: bond0: Warning: the permanent HWaddr of eth0 - f4:ce:46:af:58:40 - is still in use by bond0. Set the HWaddr of eth0 to a different address to avoid conflicts.
[ 150.452711] bonding: bond0: releasing active interface eth0
[ 150.452714] bonding: bond0: making interface eth1 the new active one.
[ 150.607703] bonding: bond0: doing slave updates when interface is down.
[ 150.607706] bonding: bond0: Removing slave eth1
[ 150.607708] bonding: bond0: releasing active interface eth1
[ 150.772167] bonding: bond0: doing slave updates when interface is down.
[ 150.772170] bonding: bond0: Adding slave eth0.
[ 150.772171] bonding bond0: master_dev is not up in bond_enslave
[ 151.133971] bonding: bond0: making interface eth0 the new active one.
[ 151.133976] bonding: bond0: first active interface up!
[ 151.133983] bonding: bond0: enslaving eth0 as an active interface with an up link.
[ 151.136220] bonding: bond0: doing slave updates when interface is down.
[ 151.136223] bonding: bond0: Adding slave eth1.
[ 151.136224] bonding bond0: master_dev is not up in bond_enslave
[ 151.632833] bonding: bond0: enslaving eth1 as a backup interface with an up link.
[ 151.633323] bonding: bond0: setting mode to active-backup (1).
[ 151.633343] bonding: bond0: Setting MII monitoring interval to 100.

[2]

[ 6.149943] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
[ 6.442893] bonding: bond0: doing slave updates when interface is down.
[ 6.442896] bonding: bond0: Adding slave eth0.
[ 6.442898] bonding bond0: master_dev is not up in bond_enslave
[ 7.176017] bonding: bond0: enslaving eth0 as an active interface with an up link.
[ 7.177578] bonding: bond0: doing slave updates when interface is down.
[ 7.177581] bonding: bond0: Adding slave eth1.
[ 7.177582] bonding bond0: master_dev is not up in bond_enslave
[ 7.572679] bonding: bond0: enslaving eth1 as an active interface with an up link.
[ 7.573204] bond...
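
One visible difference between the two logs above: at boot both eth0 and eth1 are enslaved "as an active interface", while after the manual restart eth1 is enslaved "as a backup interface". Whether that difference is the cause is not established in this report. A minimal sketch for pulling the role of each slave out of the bonding messages so the two cases can be compared; the function name is an assumption made for illustration.

```shell
#!/bin/sh
# Print "<slave> <role>" for every "enslaving" line in bonding kernel messages
# read from standard input. Lines that are not "enslaving" messages are ignored.
enslave_roles() {
    sed -n 's/.*bonding: [^:]*: enslaving \([^ ]*\) as an\{0,1\} \([a-z]*\) interface.*/\1 \2/p'
}
```

Example use: dmesg | enslave_roles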


Pete Graner (pgraner) wrote :

@Ryan,

You can download the kernel image from the URL that apw gave you and you can install with:

sudo dpkg -i linux-image-2.6.32-27-server_2.6.32-27.49lp688703v201012201823_amd64.deb

Then hold down the shift key on reboot to bring up the grub menu, where you can select the test kernel for boot.
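
After rebooting it is worth confirming that the test kernel is the one actually running. A minimal sketch, assuming the build tag from the package name above (lp688703) also appears in /proc/version_signature; that is an assumption, and the function name is made up for illustration.

```shell
#!/bin/sh
# Succeed if a version signature string contains the given build tag.
# "lp688703" comes from the test package name; that the tag also shows up
# in /proc/version_signature is an assumption.
signature_has_tag() {
    case $1 in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Example use: signature_has_tag "$(cat /proc/version_signature)" lp688703 && echo "running the test kernel"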

Ryan Blazecka (rblazecka) wrote :

I tried the new kernel, using the command line that Pete posted. The problem was not fixed by that kernel. In fact, it was arguably made worse: on that kernel, even a manual restart of networking does not get the network to work. I am attaching a dmesg log of the new kernel's boot.

When I manually restart networking, this is what I get:

 * Reconfiguring network interfaces...
Failed to enslave eth0 to bond0. Is bond0 ready and a bonding interface ?
Failed to enslave eth1 to bond0. Is bond0 ready and a bonding interface ?
/etc/network/if-pre-up.d/ifenslave: 129: cannot create /sys/class/net/bond0/bonding/mode: Directory nonexistent
/etc/network/if-pre-up.d/ifenslave: 129: cannot create /sys/class/net/bond0/bonding/miimon: Directory nonexistent
ssh stop/waiting
ssh start/running, process 1951
   ...done.

Finally, adding the options to /etc/modprobe.d did not improve the situation, regardless of whether "bonding" was added to /etc/modules or not.
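
The "Directory nonexistent" errors above indicate that /sys/class/net/bond0/bonding was missing when the ifenslave hook ran, meaning bond0 did not exist as a bonding device at that point. A minimal sketch of a check that separates "bonding module not loaded" from "module loaded but bond0 not created". The function names and the SYSFS_NET override are assumptions made for illustration.

```shell
#!/bin/sh
# SYSFS_NET is a hypothetical override for testing; it defaults to
# /sys/class/net.

# Succeed if the named device exists and exposes a bonding/ directory.
is_bonding_master() {
    [ -d "${SYSFS_NET:-/sys/class/net}/$1/bonding" ]
}

# Succeed if the bonding module has registered its bonding_masters file.
bonding_loaded() {
    [ -e "${SYSFS_NET:-/sys/class/net}/bonding_masters" ]
}
```

If bonding_loaded succeeds but is_bonding_master bond0 fails, the kernel bonding documentation describes creating the device by writing +bond0 to /sys/class/net/bonding_masters as root.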

tags: added: kernel-key
Stefan Bader (smb) wrote :

There is a possible race mentioned in the examples/readme of ifenslave. Basically, the suggested way is to define no slaves in the bonding definition and instead add separate definitions for the slaves. For example:

auto bond0
iface bond0 inet static
  ...
  bond-slaves none
  ...

auto eth0
iface eth0 inet manual
  bond-master bond0

auto eth1
iface eth1 inet manual
  bond-master bond0

Has that been tried?
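
A minimal sketch that prints a complete fragment in the layout suggested above, filled in with the address values and bond options from the original report. The function name is an assumption made for illustration, and this layout was not reported as tested in this bug.

```shell
#!/bin/sh
# Print an /etc/network/interfaces fragment in which the bond declares no
# slaves and each slave names its master.
# Usage: bond_stanzas BOND ADDRESS NETMASK GATEWAY SLAVE...
# bond-mode 1 and bond-miimon 100 are the values from the original report.
bond_stanzas() {
    bond=$1 address=$2 netmask=$3 gateway=$4
    shift 4
    printf 'auto %s\niface %s inet static\n' "$bond" "$bond"
    printf '  address %s\n  netmask %s\n  gateway %s\n' "$address" "$netmask" "$gateway"
    printf '  bond-slaves none\n  bond-mode 1\n  bond-miimon 100\n'
    # One stanza per slave, each pointing back at the bond.
    for slave in "$@"; do
        printf '\nauto %s\niface %s inet manual\n  bond-master %s\n' "$slave" "$slave" "$bond"
    done
}
```

Example use: bond_stanzas bond0 10.2.10.70 255.255.0.0 10.2.10.254 eth0 eth1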

hansaplast (9orums) wrote :

Damn. This just took me 8 hours.
My system is a supermicro H8DGU-F which uses an Intel 82576 Gigabit NIC.

In my case there are two problems.
- Firstly, the igb network driver supplied by the xen kernel (2.6.32-5-xen-amd64) isn't up to date, so I compiled it from source:
  http://downloadcenter.intel.com/Default.aspx?lang=eng

- Secondly, the driver isn't inserted (in time) during boot, so I followed Andy Whitcroft's advice and put the module (igb) into /etc/modules.

This did the trick. The funny thing is, you would expect the same behaviour in the non-xen kernel (2.6.32-5-amd64), but that isn't the case. That makes me believe some drivers in the xen kernel differ from the regular kernel drivers. I wonder why?

Hope this helps..
Please let me know.

Stefan Bader (smb) wrote :

Well, the main issue I can see is that I have no clue where that kernel comes from. The main version number indicates Lucid (10.04). But the official kernel package was released as 2.6.32-21 and there is no -xen flavour. So all bets are off...

hansaplast (9orums) wrote :

Sorry, I forgot to mention: the issue is the same, but I'm running Debian Squeeze with that kernel.
Andy Whitcroft's advice solved it for me (at least partly). It seemed more than reasonable to report back here..

Joseph Salisbury (jsalisbury) wrote :

@Peter

Can you confirm if this is still an issue for you?

Peter Matulis (petermatulis) wrote :

@Joseph

Not me personally, nor professionally, but I did want to keep this one open because I *thought* I was getting access to the appropriate hardware. So this is the blocker for me right now (ability to test).

tags: removed: kernel-key
penalvch (penalvch) wrote :

Peter Matulis / Ryan Blazecka, the potential fix commit noted in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/688703/comments/11 is available for Lucid as per http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-lucid.git;a=commit;h=6151b3d435feeeae7487032fcd5c8c7f281ba05c . When you updated to the latest version of Lucid, was this issue resolved for you?

Changed in linux (Ubuntu):
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired