802.3ad interface bonding fails if started too early

Bug #482419 reported by Bryan McLellan on 2009-11-13
This bug affects 23 people
Affects (status, importance, assigned to):
ifenslave-2.6 (Debian): Fix Released, importance Unknown
ifenslave-2.6 (Ubuntu): importance Medium, Unassigned
    Declined for Karmic by Steve Langasek
    Lucid: importance Medium, assigned to Stéphane Graber

Bug Description

Impact: see original report below
How the patch fixes it: pre-up sets up master before attempting to enslave and setup slaves
Patch: https://bugs.edge.launchpad.net/ubuntu/+source/ifenslave-2.6/+bug/482419/+attachment/1455658/+files/ifenslave-2.6-sru.diff
Reproducing: http://ubuntuforums.org/showpost.php?p=8285696&postcount=3
Regression potential: none known

== Original report ==
802.3ad bonding configurations that formerly worked on jaunty are now failing on startup under karmic. After the system has started, restarting networking will bring the bond up correctly. This only applies to bond_mode 4 / 802.3ad: I've tested that switching to bond_mode 0 corrects the issue, and the other users experiencing this bug were all using bond_mode 4 as well.

dmesg output fills with "bonding: bond0: Warning: Found an uninitialized port", even after the system starts up and the port should be "initialized"

It appears to occur on multiple drivers (bnx2, e1000 confirmed).

One initially wants to blame the startup ordering due to the switch to upstart, but I believe it is an edge case that hasn't been seen before because we haven't been starting up so quickly that the hardware hasn't had time to fully initialize.

Configuration and output from multiple users is in this thread:
http://ubuntuforums.org/showthread.php?p=8311572
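To put a number on the "dmesg output fills with ..." symptom, the warnings can be counted in a saved kernel log. This is only a sketch: the helper name count_uninit_warnings is made up for illustration, and it assumes the log was first saved to a file (for example with "dmesg > /tmp/dmesg.txt").

```shell
# Hypothetical helper: count "Found an uninitialized port" warnings
# for one bond in a saved kernel log file.
count_uninit_warnings() {
    log_file="$1"
    bond="${2:-bond0}"
    # -c prints the number of matching lines, -F matches the text literally.
    grep -cF "bonding: $bond: Warning: Found an uninitialized port" "$log_file"
}
```

Comparing the count right after boot and again a minute later would show whether the warnings are still being logged.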

Bryan McLellan (btm) on 2009-11-13
description: updated
description: updated
affects: ubuntu → linux (Ubuntu)
Richard Huddleston (rhuddusa) wrote :

I'm also seeing this with e1000e driver on an intel s5000psl motherboard

iface bond0 inet dhcp
        slaves eth0 eth1
        bond-mode 4
        bond-miimon 10
        bond-lacp-rate fast
        bond-xmit_hash_policy layer2+3

after boot up
/proc/net/bonding/bond0 has
MII Status: down

i can bring the bond up if i do
sudo ifdown bond0; sudo ifup bond0

i tried playing with the bonding updelay, just for kicks, but no change
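The check described above (look at the MII Status in /proc/net/bonding/bond0) could be scripted roughly as follows. This is a sketch, not part of any package: bond_mii_status is a hypothetical name, and it assumes the first "MII Status:" line in the file belongs to the bond itself while later ones belong to the slaves.

```shell
# Hypothetical helper: print the bond-level MII status ("up" or "down").
bond_mii_status() {
    status_file="${1:-/proc/net/bonding/bond0}"
    # Stop at the first "MII Status:" line; per-slave lines come later.
    awk -F': ' '/^MII Status:/ { print $2; exit }' "$status_file"
}
```

A boot-time script could run the "ifdown bond0; ifup bond0" workaround only when this prints something other than "up".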

04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
        Subsystem: Intel Corporation Device 3476
        Flags: bus master, fast devsel, latency 0, IRQ 57
        Memory at b8820000 (32-bit, non-prefetchable) [size=128K]
        Memory at b8400000 (32-bit, non-prefetchable) [size=4M]
        I/O ports at 3020 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [e0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting <?>
        Kernel driver in use: e1000e
        Kernel modules: e1000e

04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
        Subsystem: Intel Corporation Device 3476
        Flags: bus master, fast devsel, latency 0, IRQ 58
        Memory at b8800000 (32-bit, non-prefetchable) [size=128K]
        Memory at b8000000 (32-bit, non-prefetchable) [size=4M]
        I/O ports at 3000 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [e0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting <?>
        Kernel driver in use: e1000e
        Kernel modules: e1000e

Bryan McLellan (btm) wrote :

Marking medium. While there is a functional workaround, it requires access to the console on startup. Since this configuration is most often seen in a server environment, the workaround is less practical and the bug thus more severe.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
Tessa (unit3) wrote :

Seeing this on a Supermicro server system, motherboard X8STi, with dual e1000e NICs.

Andy Whitcroft (apw) on 2009-11-30
tags: added: kernel-series-unknown
Chad Netzer (chad-netzer) wrote :

I'm also seeing this running on a Penguin Relion 2612, using the forcedeth gigabit ethernet driver, bonding eth0 and eth1 with bond-mode 4. Also, I'm running a vlan setup on top of the bonded driver.

I can generally get the setup to work by hand, although seemingly not with a mere "/etc/init.d/networking restart" or "ifdown bond0; ifup bond0". On initial startup, it seems that only one of the two slaves, either eth0 or eth1 is listed with "ifconfig", but not both. If I remove all devices with "ifconfig eth0 down; ifconfig bond0 down; ifconfig lo down", etc. and then "ifdown bond0", and do a network restart, it will end up working after some fiddling. In which case the bond0 interface, both eth0 and eth1, the vlan, and lo, are all listed by ifconfig.

This is on a fresh karmic install from alternate install CD, btw. I just reinstalled from a system running Hardy, where this worked fine, although I had to update the /etc/network/interfaces configs (I was using post-up/down ifenslave passages before)

Tessa (unit3) wrote :

Just switched my karmic config to "mode 1", and I'm still seeing this issue, so it isn't limited to strict 802.3ad mode (4).

On Fri, Dec 4, 2009 at 11:28 AM, Graeme Humphries <email address hidden> wrote:
> Just switched my karmic config to "mode 1", and I'm still seeing this
> issue, so it isn't limited to strict 802.3ad mode (4).

Are you sure the system is using bonding mode 1? What does 'cat
/sys/class/net/bond0/link_mode' display?

The "Warning: Found an uninitialized port" error message is in
bond_3ad.c in the kernel source. I believe that only the code
supporting 802.3ad is separated into this source file, and thus
if you are getting this error the system is using 802.3ad.

Tessa (unit3) wrote :

Oh, you're right, this might be a different issue. /sys/class/net/bond0/link_mode shows "0", even though the device in /etc/network/interfaces is configured with "bond-mode 1". And I'm not seeing the "uninitialized port" errors anymore, but it still fails to come up when initially started; I have to give it a "service networking restart" before it starts.

Bryan McLellan (btm) wrote :

On Fri, Dec 4, 2009 at 2:49 PM, Graeme Humphries <email address hidden> wrote:
> Oh, you're right, this might be a different issue.
> /sys/class/net/bond0/link_mode shows "0", even though the device in
> /etc/network/interfaces is configured with "bond-mode 1". And I'm not
> seeing the "uninitialized port" errors anymore, but it still fails to
> come up when initially started, I have to give it a "service networking
> restart" before it starts.

The lack of the error message aside, out of curiosity, what about
'/sys/class/net/bond0/bonding/mode' ? I think I was wrong in pointing
to the earlier file.

Tessa (unit3) wrote :

Ahhh yes, that shows "active-backup 1", which is what I'd expect for configuring with mode 1.
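As the exchange above settles, the active mode is read from /sys/class/net/bond0/bonding/mode, which holds a name followed by a number ("active-backup 1" in this case). A small sketch for pulling out the number follows; bond_mode_number is a hypothetical name, and the file path is a parameter so other bonds can be checked too.

```shell
# Hypothetical helper: print the numeric bonding mode from the sysfs mode file.
bond_mode_number() {
    mode_file="${1:-/sys/class/net/bond0/bonding/mode}"
    # The file contains two fields, e.g. "active-backup 1".
    read -r mode_name mode_number < "$mode_file"
    echo "$mode_number"
}
```

A result of 4 would mean 802.3ad, which is the only mode where the "uninitialized port" warning is expected according to the comment above.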

tags: added: karmic
removed: kernel-series-unknown
cfreak (cfreak) wrote :

i'm seeing this on karmic with 2.6.31-16-server with Broadcom Corporation NetXtreme BCM5721 cards. Is a blocker for me, i can't restart networking on each server after booting..

Scott Call (scott-call-viz) wrote :

Is this possibly the same as, or related to, Debian bug 588097 in ifenslave-2.6?
 http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/9f05eea3b27339a6

cfreak (cfreak) wrote :

that debug output looks the same, so i'd say yes.

Some links about the ifenslave in Lucid vs. Karmic vs. Debian Sid

http://changelogs.ubuntu.com/changelogs/pool/main/i/ifenslave-2.6/ifenslave-2.6_1.1.0-14ubuntu1/changelog

Current package versions
Karmic 1.1.0-13ubuntu1 08-Jun-2009
Lucid 1.1.0-14ubuntu1 06-Nov-2009
Debian Sid 1.1.0-15 18-Dec-2009

Bhavani Shankar (bhavi) wrote :

Thanks, I'll prepare a diff against the latest version in Debian.

Regards

Changed in linux (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Bhavani Shankar (bhavi)
Bhavani Shankar (bhavi) wrote :
Changed in linux (Ubuntu):
status: In Progress → Confirmed
assignee: Bhavani Shankar (bhavi) → nobody

ifenslave-2.6_1.1.0-15ubuntu1_amd64.deb is working for me on this box:

$ sudo dmidecode | grep Prod
        Product Name: SUN FIRE X4450

$ sudo dmidecode | grep Desc.*LAN
        Description: LAN 1 of Gilgal (Intel 82563EB)
        Description: LAN 2 of Gilgal (Intel 82563EB)
        Description: LAN 3 of Gilgal (Intel 82571EB)
        Description: LAN 4 of Gilgal (Intel 82571EB)

:)

Bryn Hughes (linux-nashira) wrote :

This package fixes things for me too... I've got two e1000's and two bnx2's, they all work properly with this package.

Is this bug/fix eligible for an SRU?

https://wiki.ubuntu.com/StableReleaseUpdates

Alexander Usyskin (sanniu) wrote :

Affected by this bug too.

With bhavi's package the problem went away. Please do an SRU!

Thank you.

Maybe it's about the language. I'm testing to see if it makes a difference if it's worded like LP: #552017

Michael Rolli (mrolli) wrote :

Same issue here on 10.04 LTS.

Installed the package of Bhavani (ifenslave-2.6_1.1.0-15ubuntu1_amd64.deb)

Bonding of 4 NICs (802.3ad) now works like a charm, even after a restart. +1 for SRU!

Neil Wilson (neil-aldur) wrote :

This is confirmed as a failure within 10.04 LTS on boot.

I feel this is a critical bug that shouldn't be in an LTS and the ifenslave package should be updated with the patch above.

I've added the ifenslave package above to my release archive at ppa:neil-aldur/ppa

Please SRU Lucid, and look at SRU for Karmic as well.

Steve Langasek (vorlon) on 2010-06-04
affects: linux (Ubuntu) → ifenslave-2.6 (Ubuntu)
Changed in ifenslave-2.6 (Ubuntu):
status: Confirmed → Fix Released
Changed in ifenslave-2.6 (Ubuntu Lucid):
status: New → Triaged
importance: Undecided → Medium
Steve Langasek (vorlon) wrote :

For an SRU we need a targeted backport of just the bugfix, without taking changes that entirely redo the packaging. Bhavani, can you prepare this or should I?

Benjamin Drung (bdrung) wrote :

unsubscribing ubuntu-sponsors. Please resubscribe the team once you have a debdiff prepared.

Bhavani Shankar (bhavi) wrote :

yeah steve on it now

Bhavani Shankar (bhavi) wrote :

Hello steve, ubuntu-sru and others

Attached is the lucid SRU diff

Please test out the patch

regards

Steve Langasek (vorlon) wrote :

Hi Bhavani,

What dpkg trigger error are you trying to fix? The changes to the maintainer scripts should have no effect on dpkg triggers. In fact, they should have no effect in Ubuntu at all because the change only applies to upgrades from version 1.1.0-6 or before, which is the version that was in dapper. (If a user *does* try to upgrade directly from dapper, then this won't work either, because you use the wrong conffile names in your patch to the preinst). If you believe this is an important change to make in spite of this, please provide a separate bug reference in the changelog; there should be a separate bug report in Launchpad for each issue being fixed in SRU.

As for the pre-up change: why is this function being split into two? The two halves of the function are still being called, in the same order, with nothing else between them, so that seems unnecessary for an SRU? The change appears to be equivalent to this much shorter patch:

@@ -128,2 +131,2 @@
-enslave_slaves
 setup_master
+enslave_slaves

Have I overlooked some reason that we want to split the function?
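The ordering in that shorter patch (configure the master first, then attach slaves) can be illustrated with the kernel's bonding sysfs files. This is a sketch under assumptions, not the package's pre-up script: the function bodies below are stand-ins that only reuse the names from the diff, bring_up_bond and SYSFS_ROOT are hypothetical, and a real system needs further steps (bringing interfaces up or down) that are left out here.

```shell
# SYSFS_ROOT is a hypothetical override so the sketch can be tried
# against a scratch directory instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Stand-in for setup_master: write bond-level options before any slave exists.
setup_master() {
    bond="$1"
    mode="$2"
    echo "setup_master $bond mode=$mode"
    echo "$mode" > "$SYSFS_ROOT/class/net/$bond/bonding/mode"
}

# Stand-in for enslave_slaves: attach each slave once the master is configured.
enslave_slaves() {
    bond="$1"
    shift
    for slave in "$@"; do
        echo "enslave $bond $slave"
        echo "+$slave" > "$SYSFS_ROOT/class/net/$bond/bonding/slaves"
    done
}

# Hypothetical wrapper showing the call order from the patch.
bring_up_bond() {
    bond="$1"
    mode="$2"
    shift 2
    setup_master "$bond" "$mode"
    enslave_slaves "$bond" "$@"
}
```

Something like "bring_up_bond bond0 802.3ad eth0 eth1" would then set the mode before either slave is added, which is the point of swapping the two calls.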

Bhavani Shankar (bhavi) wrote :

Okay steve!

Shortening the patch !

regards

Bhavani Shankar (bhavi) wrote :

here is the shortened patch

regards

Any chance we'll see this as part of the 10.04.1 iso?

Ryan Tandy (rtandy) wrote :

The patch supplied by Steve in comment #28 worked for me in a virtual machine and I will test it on a physical one when I get the chance. Is there anything holding up the release of this fixed package?

Somebody to execute the SRU process I'd imagine.

--
Neil Wilson

Chad Netzer (chad-netzer) wrote :

I was hit by this bug, on multiple machines and architectures (x86, x86_64). I installed the provisional version 1.1.0-15ubuntu1, from comment #16, and the problems went away.

description: updated
Martin Pitt (pitti) wrote :

Bhavani,

this patch looks a bit weird. It doesn't just change the order of the function calls, it also introduces a new function call "setup_slaves", without any further changes. However, in the original Debian patch (still visible at http://launchpadlibrarian.net/49642996/ifenslave-2.6_1.1.0-14ubuntu2_1.1.0-15ubuntu1.diff.gz) that new function is introduced (by splitting the original one). So can you please confirm that our version already has the new function, but not the call to it?

Any chance we'll see this fix as part of the January 10.04.2 iso?

(Any chance Ubuntu LTS will have this fixed before CentOS 6 is out?) (*ducks*)

Neil Wilson (neil-aldur) on 2010-11-14
Changed in ifenslave-2.6 (Ubuntu Lucid):
assignee: nobody → Neil Wilson (neil-aldur)
Neil Wilson (neil-aldur) on 2010-11-14
Changed in ifenslave-2.6 (Ubuntu Lucid):
assignee: Neil Wilson (neil-aldur) → nobody
status: Triaged → Incomplete
Neil Wilson (neil-aldur) wrote :

I'm no longer sure this is a bug in the package and may instead be more of a duplicate of #559090. I cannot get the fault to replicate if I use the correct hotplug configurations for the bonding system on a fresh install of lucid with the vanilla lucid ifenslave-2.6 package.

All the posted configs seem to have slaves defined in the bond master interface rather than using 'bond-master' in the slaves.

If you believe you are affected by this bug, can you reconfirm it please after following the guide I've put together here:

http://www.3spoken.co.uk/2010/11/how-to-do-ethernet-bonding-on-ubuntu.html

Steve Langasek (vorlon) wrote :

I am sure this is a bug in the package. That's why it was marked triaged.

Changed in ifenslave-2.6 (Ubuntu Lucid):
status: Incomplete → Triaged
Neil Wilson (neil-aldur) wrote :

I had the problem. I was going to sort the SRU.

Changing the configuration to the hotplug configuration fixed the
problem using the standard package.

Can you tell me what status I should use to get that checked by everybody else?

Martin Pitt (pitti) wrote :

Setting to incomplete, see comment 35.

Changed in ifenslave-2.6 (Ubuntu Lucid):
status: Triaged → Incomplete
Dave Walker (davewalker) wrote :

As per current status, unsubscribing Ubuntu sponsors team. Please feel free to re-subscribe the team when the patch is ready for review and upload. Thanks!

Steve Langasek (vorlon) wrote :

The proposed patch from Bhavani is incorrect, but the bug as originally described is real, is present even when using the hotplug configuration, and the fix should be backported to lucid. Setting this as 'triaged' again.

Changed in ifenslave-2.6 (Ubuntu Lucid):
status: Incomplete → Triaged
assignee: nobody → Steve Langasek (vorlon)
Neil Wilson (neil-aldur) wrote :

Can you demonstrate the configuration to make it fail then.

My configuration does not work - note that there is also a bridge set up on two vlan interfaces.

# The loopback network interface
auto lo
iface lo inet loopback

auto bond0
iface bond0 inet manual
        bond-slaves none
        bond-mode 4
        bond-miimon 100

# Internet
auto bond0.1501
iface bond0.1501 inet dhcp

# Wireless
auto bond0.1502
iface bond0.1502 inet manual

# Wired
auto bond0.1503
iface bond0.1503 inet manual

# Wireless-Wired bridge
auto br0
iface br0 inet static
 address 10.15.0.1
 network 10.15.0.0
 netmask 255.255.0.0
 bridge_ports bond0.1502 bond0.1503

# Printer
auto bond0.1504
iface bond0.1504 inet static
 address 10.16.0.1
 netmask 255.255.255.252

auto eth0
iface eth0 inet manual
        bond-master bond0
        bond-primary eth0 eth1

auto eth1
iface eth1 inet manual
        bond-master bond0
        bond-primary eth0 eth1

Malcolm Scott (malcscott) wrote :

re #44, Michal, you're bridging together two VLANs -- beware, this will cause most VLAN-aware switches to break down (as VLANs share one forwarding database, and a MAC address can't appear on two ports; in my experience this leads to intermittent very high packet loss as the forwarding database starts flapping). Are you sure that that isn't the problem you're seeing? What are your symptoms?

My symptoms are that interfaces are not brought up at boot. I have to manually:
1. stop networking
2. rmmod bonding
3. modprobe bonding
4. start networking
5. bring up all vlans one by one (ifup bond0.150x)
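The five manual steps above could be wrapped in a script along these lines. It is a sketch with assumptions: recover_bond, run and DRY_RUN are hypothetical names, "stop networking" and "start networking" are read here as "/etc/init.d/networking stop" and "/etc/init.d/networking start", and by default the script only prints what it would do.

```shell
# DRY_RUN=1 (the default in this sketch) prints the commands instead of
# running them; set DRY_RUN=0 and run as root to actually execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Arguments: the VLAN interfaces to bring up at the end.
recover_bond() {
    run /etc/init.d/networking stop
    run rmmod bonding
    run modprobe bonding
    run /etc/init.d/networking start
    for vlan_if in "$@"; do
        run ifup "$vlan_if"
    done
}
```

For the configuration shown earlier this would be called as "recover_bond bond0.1501 bond0.1502 bond0.1503 bond0.1504".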

Once interfaces are up I do not have any problems with the network.

The history is that it didn't work after a fresh Lucid installation. Then, after one of the updates, it started working and now it is broken again (so I'm not sure if it is actually an ifenslave issue). Sorry, but I don't remember what updates they were.
I've tried different versions of ifenslave and various kernels. For me it looks like it has something to do with vlans being started too early.

BenLake (me-benlake) wrote :

Just confirming Neil Wilson's findings. I had the "no bond connection upon reboot" issue and was using the non hot-plug configuration method. I switched to hot-plug and the problem was sorted.

Description: Ubuntu 10.04.2 LTS

--- current config ----

# bond/trunk eth0-1
auto bond0
iface bond0 inet static
    address 192.168.1.94
    netmask 255.255.255.0
    gateway 192.168.1.1
# bond-slaves eth0 eth1
    bond-slaves none
    bond-mode 802.3ad
    bond-miimon 100

# enslave eth0-1
auto eth0
iface eth0 inet manual
    bond-master bond0

auto eth1
iface eth1 inet manual
    bond-master bond0
--------------

Previous config had bond-slaves eth0 eth1 as comment shows and both bond-master lines on eth0 and eth1 were non-existent.
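To see at a glance whether a given interfaces file uses the hotplug style (a bond-master line in each slave stanza), the slave-to-master pairs can be listed. This is a rough sketch: list_bond_masters is a hypothetical name, and the parsing only looks at "iface" and "bond-master" lines, so it ignores anything more exotic.

```shell
# Hypothetical helper: print "<slave> <master>" for every bond-master line.
list_bond_masters() {
    interfaces_file="${1:-/etc/network/interfaces}"
    awk '
        $1 == "iface"       { current = $2 }        # remember the current stanza
        $1 == "bond-master" { print current, $2 }   # slave and its master
    ' "$interfaces_file"
}
```

With the current config above this would print "eth0 bond0" and "eth1 bond0"; with the previous config it would print nothing.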

Mark Favas (mark-favas) wrote :

I can confirm that this bug still occurs. Although I've been using this configuration for ~6 months, it bit me today. This is on a box running Ubuntu server, with two bonded interfaces. Bond0 failed to come up after a reboot, while bond1 did come up. After I got onto the box via bond1, a reboot worked: both bond0 and bond1 came up. However, /var/log/messages still contains many lines of the below:

Apr 23 11:02:43 server1 kernel: [104080.510023] bonding: bond0: Warning: Found an uninitialized port

and many lines of:

Apr 23 11:05:14 server1 kernel: [ 16.985610] bonding: bond0: doing slave updates when interface is down.

Details:

root@server1:/etc/network# cat /etc/issue
Ubuntu 10.04.2 LTS \n \l

root@server1:/etc/network#

root@server1e:/etc/network# uname -a
Linux server1 2.6.32-31-server #61-Ubuntu SMP Fri Apr 8 19:44:42 UTC 2011 x86_64 GNU/Linux
root@server1:/etc/network#

root@servers:/etc/network# apt-cache search ifenslave
ifenslave-2.6 - Attach and detach slave interfaces to a bonding device
root@server1:/etc/network#

root@server1:/etc/network# grep -r -i bond /etc/modprobe.d/
root@server1:/etc/network#

root@server1:/etc/network# cat interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
# Bonded interfaces
auto bond0
iface bond0 inet static
 address a.b.c.d
 netmask a.b.c.d
 network a.b.c.d
 broadcast a.b.c.d
 metric 99
 gateway a.b.c.d
 bond-slaves none
 bond-mode 4
 bond-miimon 100
 bond-xmit_hash_policy layer2+3

auto eth0
iface eth0 inet manual
 bond-master bond0

auto eth2
iface eth2 inet manual
 bond-master bond0

auto bond1
iface bond1 inet static
 address w.x.y.z
 netmask w.x.y.z
 network w.x.y.z
 broadcast w.x.y.z
 bond-slaves none
 bond-mode 4
 bond-miimon 100
 bond-xmit_hash_policy layer2+3

auto eth1
iface eth1 inet manual
 bond-master bond1

auto eth3
iface eth3 inet manual
 bond-master bond1

root@server1:/etc/network#

@Mark Favas - dude, it's not about fixing things, it's about declaring them fixed. Fiat bug fixes are what Canonical and the other Ubuntunistas do best.

Steve Langasek (vorlon) wrote :

On Sat, Apr 23, 2011 at 01:23:30PM -0000, nutznboltz wrote:
> @Mark Favas - dude, it's not about fixing things, it's about declaring
> them fixed. Fiat bug fixes are what Canonical and the other
> Ubuntunistas do best.

Such false and slanderous comments have no place in bug reports. Take it
somewhere else.

Is there a workaround for this on 10.04.2? I'd rather that the bond came up the first time so the subsequent network dependent daemons could run properly rather than having to do a "/etc/init.d/networking restart" in rc.local as suggested by the forum thread.

@Pete Ashdown yes, there is a workaround.

First, one of the things that goes wrong is that Ethernet media autonegotiation takes time, and the network interfaces can take so long to come up that this may interfere with the bonding configuration. You can hardcode the media speed/duplex/etc. on the switch and in /etc/network/interfaces

auto eth0
iface eth0 inet manual
    media 1000baseTx-FD

for each ethN device to be bonded.

Next you can get bonding to work by using pre-up, etc. in /etc/network/interfaces like this:

auto bond0
iface bond0 inet manual
    pre-up modprobe bonding mode=802.3ad ad_select=bandwidth downdelay=400 miimon=100 lacp_rate=0 max_bonds=2 ; ifconfig bond0 up ; ifconfig eth0 up ; ifconfig eth1 up
    post-up ifenslave bond0 eth0 eth1
    pre-down ifenslave -d bond0 eth0 eth1
    post-down ifconfig eth0 down ; ifconfig eth1 down ; ifconfig bond0 down

scott (scott.phelps) wrote :

speed/duplex auto-negotiation was not the source of the problem in my experience (not saying that it couldn't potentially cause some problem). Relying on auto-negotiation is not a good idea when using ether-channel.

What I found was that the problem originates in the udev (hotplug) ethernet system, which should be loading the bonding module prior to bringing up the bond interface. This does not happen for some reason, and finding out why would require detailed inspection of the uevents that are triggered when the interfaces file is parsed.

The workaround, which I posted on 5/17/11 on related bug, 574456 [ https://bugs.launchpad.net/ubuntu/+source/ifenslave-2.6/+bug/574456/comments/5 ], is to use the pre-up scripts to explicitly state the _proper_ order things should happen. My solution works across reboots and is udev agnostic.

Unfortunately this workaround isn't as reliable as adding (the sloppier) "/etc/init.d/networking restart" to my /etc/rc.local.

Short followup to my previous comment #54. The workaround is functioning well for me now. I think I had a typo in my first round of testing.

Steve Langasek (vorlon) on 2011-06-06
Changed in ifenslave-2.6 (Ubuntu Lucid):
assignee: Steve Langasek (vorlon) → Stéphane Graber (stgraber)
Stéphane Graber (stgraber) wrote :

I'm having a look at this bug now.

I used a spare DELL PowerEdge 750 with two Intel Gigabit NICs connected to a DELL managed switch with LACP setup.
My /etc/network/interfaces looks like:
 auto bond0
 iface bond0 inet static
     address 10.145.15.20
     netmask 255.255.255.0
     gateway 10.145.15.1

     slaves eth0 eth1
     bond-mode 4
     bond-miimon 10
     bond-lacp-rate fast
     bond-xmit_hash_policy layer2+3

The switch configuration for the two ports is:
  port-channel load-balance layer-2-3-4
  interface range ethernet g(11-12)
  channel-group 4 mode auto
  exit

Stéphane Graber (stgraber) wrote :

With the current package, I get the issue described in this bug report, with the following output:

Switch:
sw06# show lacp port-channel 4
Port-Channel ch4
       Port Type Gigabit Ethernet
       Attached Lag id:
       Actor
               System Priority:1
               MAC Address: 00:14:22:66:25:10
               Admin Key: 28
               Oper Key: 28
       Partner
               System Priority:0
               MAC Address: 00:00:00:00:00:00
               Oper Key: 0

sw06# show lacp ethernet g11
g11 LACP parameters:
      Actor
              system priority: 1
              system mac addr: 00:14:22:66:25:10
              port Admin key: 28
              port Oper key: 28
              port Oper number: 11
              port Admin priority: 1
              port Oper priority: 1
              port Admin timeout: LONG
              port Oper timeout: LONG
              LACP Activity: ACTIVE
              Aggregation: AGGREGATABLE
              synchronization: FALSE
              collecting: FALSE
              distributing: FALSE
              expired: FALSE
      Partner
              system priority: 65535
              system mac addr: 00:c0:9f:3f:3b:2c
              port Admin key: 0
              port Oper key: 17
              port Oper number: 2
              port Admin priority: 0
              port Oper priority: 255
              port Oper timeout: SHORT
              LACP Activity: ACTIVE
              Aggregation: INDIVIDUAL
              synchronization: FALSE
              collecting: TRUE
              distributing: TRUE
              expired: FALSE
g11 LACP statistics:
      LACP Pdus sent: 252
      LACP Pdus received: 27
g11 LACP Protocol State:
      LACP State Machines:
              Receive FSM: Port Disabled State
              Mux FSM: Detached State
              Periodic Tx FSM: No Periodic State
      Control Variables:
              BEGIN: FALSE
              LACP_Enabled: TRUE
              Ready_N: FALSE
              Selected: UNSELECTED
              Port_moved: FALSE
              NNT: FALSE
              Port_enabled: FALSE
      Timer counters:
              periodic tx timer: 0
              current while timer: 0
              wait while timer: 0
sw06# show lacp ethernet g12
g12 LACP parameters:
      Actor
              system priority: 1
              system mac addr: 00:14:22:66:25:10
              port Admin key: 28
              port Oper key: 28
              port Oper number: 12
              port Admin priority: 1
              port Oper priority: 1
              port Admin timeout: LONG
              port Oper timeout: LONG
              LACP Activity: ACTIVE
              Aggregation: AGGREGATABLE
              synchronization: TRUE
    ...

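In the "show lacp port-channel 4" output above, the Partner MAC address is all zeros, which suggests the switch had not learned an LACP partner on the channel. A sketch for checking this in saved switch output follows. It assumes the output layout shown in this report (an "Actor" section followed by a "Partner" section, each with a "MAC Address:" line); lacp_partner_mac and lacp_has_partner are hypothetical names, and other switch models may print something different.

```shell
# Hypothetical helper: print the Partner MAC from saved
# "show lacp port-channel" output.
lacp_partner_mac() {
    awk '
        $1 == "Actor"   { in_partner = 0 }
        $1 == "Partner" { in_partner = 1 }
        in_partner && $1 == "MAC" && $2 == "Address:" { print $3; exit }
    ' "$1"
}

# Succeeds only if a partner MAC is present and not all zeros.
lacp_has_partner() {
    mac=$(lacp_partner_mac "$1")
    [ -n "$mac" ] && [ "$mac" != "00:00:00:00:00:00" ]
}
```

Running it against output captured before and after a package change gives a quick yes/no on whether the switch sees the host as an LACP partner.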

Stéphane Graber (stgraber) wrote :

With the fix applied, I now get:

Switch:

sw06# show lacp port-channel 4
Port-Channel ch4
       Port Type Gigabit Ethernet
       Attached Lag id:
       Actor
               System Priority:1
               MAC Address: 00:14:22:66:25:10
               Admin Key: 28
               Oper Key: 28
       Partner
               System Priority:65535
               MAC Address: 00:c0:9f:3f:3b:2c
               Oper Key: 17

sw06# show lacp ethernet g11
g11 LACP parameters:
      Actor
              system priority: 1
              system mac addr: 00:14:22:66:25:10
              port Admin key: 28
              port Oper key: 28
              port Oper number: 11
              port Admin priority: 1
              port Oper priority: 1
              port Admin timeout: LONG
              port Oper timeout: LONG
              LACP Activity: ACTIVE
              Aggregation: AGGREGATABLE
              synchronization: FALSE
              collecting: FALSE
              distributing: FALSE
              expired: FALSE
      Partner
              system priority: 65535
              system mac addr: 00:c0:9f:3f:3b:2c
              port Admin key: 0
              port Oper key: 17
              port Oper number: 2
              port Admin priority: 0
              port Oper priority: 255
              port Oper timeout: SHORT
              LACP Activity: ACTIVE
              Aggregation: INDIVIDUAL
              synchronization: FALSE
              collecting: TRUE
              distributing: TRUE
              expired: FALSE
g11 LACP statistics:
      LACP Pdus sent: 252
      LACP Pdus received: 27
g11 LACP Protocol State:
      LACP State Machines:
              Receive FSM: Port Disabled State
              Mux FSM: Detached State
              Periodic Tx FSM: No Periodic State
      Control Variables:
              BEGIN: FALSE
              LACP_Enabled: TRUE
              Ready_N: FALSE
              Selected: UNSELECTED
              Port_moved: FALSE
              NNT: FALSE
              Port_enabled: FALSE
      Timer counters:
              periodic tx timer: 0
              current while timer: 0
              wait while timer: 0
sw06# show lacp ethernet g12
g12 LACP parameters:
      Actor
              system priority: 1
              system mac addr: 00:14:22:66:25:10
              port Admin key: 28
              port Oper key: 28
              port Oper number: 12
              port Admin priority: 1
              port Oper priority: 1
              port Admin timeout: LONG
              port Oper timeout: LONG
              LACP Activity: ACTIVE
              Aggregation: AGGREGATABLE
              synchronization: TRUE
              collecting: TRUE
              distribu...


Stéphane Graber (stgraber) wrote :

The only applied change is the following:

--- ../../../orig/ifenslave-2.6-1.1.0/debian/pre-up 2011-06-07 14:03:19.000000000 -0400
+++ pre-up 2011-06-07 14:26:30.449289528 -0400
@@ -125,5 +125,5 @@ fi
 [ -z "$BOND_MASTER$BOND_SLAVES" ] && exit

 add_master
-enslave_slaves
 setup_master
+enslave_slaves
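
For readers who have not looked inside the pre-up script, the ordering after the patch can be sketched roughly as below. This is an illustration under assumptions, not the actual ifenslave code: `bring_up_bond`, `sysfs_write` and `DRY_RUN` are hypothetical names, and the sketch leaves out everything else the real script does (miimon, lacp_rate, and so on). Only the sysfs paths are the ones the bonding driver documents. By default it just prints what it would write.

```shell
#!/bin/sh
# Sketch only, not the real pre-up script. DRY_RUN, sysfs_write and
# bring_up_bond are hypothetical names; with DRY_RUN=1 (the default)
# nothing is written and each step is printed instead.
DRY_RUN="${DRY_RUN:-1}"

sysfs_write() {
    # $1 = value, $2 = sysfs path
    if [ "$DRY_RUN" = 1 ]; then
        echo "echo $1 > $2"
    else
        echo "$1" > "$2"
    fi
}

bring_up_bond() {
    master="$1"
    mode="$2"
    shift 2

    # add_master: create the bond device
    sysfs_write "+$master" /sys/class/net/bonding_masters

    # setup_master: configure the bond before it has any slaves
    sysfs_write "$mode" "/sys/class/net/$master/bonding/mode"

    # enslave_slaves: attach the physical interfaces last
    for slave in "$@"; do
        sysfs_write "+$slave" "/sys/class/net/$master/bonding/slaves"
    done
}

bring_up_bond bond0 802.3ad eth0 eth1
```

The point is only the order of the three steps: the mode is written while the bond is still empty, and the slaves are added afterwards.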

Stéphane Graber (stgraber) wrote :

The above patch has now been uploaded to lucid-proposed and is waiting for an SRU team member to review it and let it through.
Once it's through, I'd recommend that anyone with a similar setup run the test.

The above /etc/network/interfaces and switch configuration should consistently reproduce the issue with current lucid's ifenslave.
So far (5 reboots), the proposed fix also seems to work fine.

Accepted into lucid-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in ifenslave-2.6 (Ubuntu Lucid):
status: Triaged → Fix Committed
Stéphane Graber (stgraber) wrote :

I enabled proposed on my test server and made sure that my upload indeed works (as in, that the patch didn't vanish for some reason).

It works great here, but I want at least someone else to confirm that the proposed package also works for them (and I'm sure the SRU team wants that too).

Ryan Tandy (rtandy) wrote :

The package from proposed works here.

Steve Langasek (vorlon) on 2011-06-08
tags: added: verification-done
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ifenslave-2.6 - 1.1.0-14ubuntu2.2

---------------
ifenslave-2.6 (1.1.0-14ubuntu2.2) lucid-proposed; urgency=low

  * Fix pre-up script to first setup the master and only
    then setup the slaves. (LP: #482419)
 -- Stephane Graber <email address hidden> Tue, 07 Jun 2011 13:41:43 -0400

Changed in ifenslave-2.6 (Ubuntu Lucid):
status: Fix Committed → Fix Released

After applying the update on my servers, I ran into massive stability problems with my network connections. I was running the bond in mode 6, which no longer seems to work.

auto bond0
iface bond0 inet static
    hwaddress ether 00:30:88:88:88:88

    address 192.168.111.5
    netmask 255.255.254.0
    network 192.168.111.0
    broadcast 192.168.112.255
    gateway 192.168.111.1
    dns-nameservers 127.0.0.1
    dns-search bar.local

    # Both network interfaces
    slaves eth0 eth1

    # (balance-alb) Adaptive load balancing
    bond_mode 6

    bond_miimon 100
    bond_updelay 200
    bond_downdelay 200

My kern.log got flooded with the message "Jun 20 12:33:57 foo kernel: [ 1043.270668] bonding: bond0: Error: found a client with no channel in the client's hash table", followed by link up/down messages from time to time:

[...]
Jun 20 12:33:31 foo kernel: [ 1009.900122] bonding: bond0: Error: found a client with no channel in the client's hash table
Jun 20 12:33:31 foo kernel: [ 1009.900123] bonding: bond0: Error: found a client with no channel in the client's hash table
Jun 20 12:33:31 foo kernel: [ 1009.900125] bonding: bond0: Error: found a client with no channel in the client's hash table
Jun 20 12:33:31 foo kernel: [ 1010.774755] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jun 20 12:33:31 foo kernel: [ 1010.862510] bonding: bond0: link status up for interface eth0, enabling it in 0 ms.
Jun 20 12:33:31 foo kernel: [ 1010.862514] bonding: bond0: link status definitely up for interface eth0.
Jun 20 12:33:31 foo kernel: [ 1010.862517] bonding: bond0: making interface eth0 the new active one.
Jun 20 12:33:31 foo kernel: [ 1010.863951] bonding: bond0: first active interface up!
Jun 20 12:33:31 foo kernel: [ 1010.900005] bonding: bond0: Error: found a client with no channel in the client's hash table
Jun 20 12:33:31 foo kernel: [ 1010.900007] bonding: bond0: Error: found a client with no channel in the client's hash table
Jun 20 12:33:31 foo kernel: [ 1010.900008] bonding: bond0: Error: found a client with no channel in the client's hash table
Jun 20 12:33:31 foo kernel: [ 1010.900010] bonding: bond0: Error: found a client with no channel in the client's hash table
[...]
Jun 20 12:34:53 foo kernel: [ 1050.282624] bonding: bond0: Error: found a client with no channel in the client's hash table
Jun 20 12:34:53 foo kernel: [ 1050.673490] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jun 20 12:34:53 foo kernel: [ 1050.733759] bonding: bond0: link status up for interface eth0, enabling it in 0 ms.
Jun 20 12:34:53 foo kernel: [ 1050.733762] bonding: bond0: link status definitely up for interface eth0.
Jun 20 12:34:53 foo kernel: [ 1050.733765] bonding: bond0: making interface eth0 the new active one.
Jun 20 12:34:53 foo kernel: [ 1050.735167] bonding: bond0: first active interface up!
Jun 20 12:34:53 foo kernel: [ 1063.804210] e1000e: eth0 NIC Link is Down
Jun 20 12:34:53 foo kernel: [ 1063.833760] bonding: bond0: link status down for interface eth0, disabling it in 200 ms.
Jun 20 12:34:53 foo kernel: [ 1064.033767] bonding: bond0: link status definitely down for interface eth0, disabling it
Jun ...


Stéphane Graber (stgraber) wrote :

Can you confirm that downgrading ifenslave to the older version fixes it for you?

I'm going to set up an identical server (2x e1000e with bond_mode 6) this afternoon and see if I can reproduce the same regression.

Stéphane Graber (stgraber) wrote :

Hmm, I don't seem to be able to reproduce your problem.

I currently use:

stgraber@halla:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto bond0
iface bond0 inet static
    address 10.145.15.20
    netmask 255.255.255.0
    gateway 10.145.15.1

    slaves eth0 eth1

    bond_mode 6
    bond_miimon 100
    bond_updelay 200
    bond_downdelay 200

and I get the following in my dmesg:

[ 7.151348] Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
[ 7.151354] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
[ 7.155412] bonding: bond0: setting mode to balance-alb (6).
[ 7.155520] bonding: bond0: Setting MII monitoring interval to 100.
[ 7.155610] bonding: bond0: Setting up delay to 200.
[ 7.155686] bonding: bond0: Setting down delay to 200.
[ 7.160557] bonding: bond0: doing slave updates when interface is down.
[ 7.160568] bonding: bond0: Adding slave eth0.
[ 7.160573] bonding bond0: master_dev is not up in bond_enslave
[ 7.172556] bonding: bond0: enslaving eth0 as an active interface with a down link.
[ 7.197647] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 7.198734] bonding: bond0: doing slave updates when interface is down.
[ 7.198741] bonding: bond0: Adding slave eth1.
[ 7.198744] bonding bond0: master_dev is not up in bond_enslave
[ 7.209650] bonding: bond0: enslaving eth1 as an active interface with a down link.
[ 7.214369] bonding: bond0: link status up for interface eth0, enabling it in 0 ms.
[ 7.214539] ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 7.214550] bonding: bond0: link status definitely up for interface eth0.
[ 7.214556] bonding: bond0: making interface eth0 the new active one.
[ 7.214956] bonding: bond0: first active interface up!
[ 7.215301] ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[ 18.148005] bond0: no IPv6 routers present

Did I miss something?

Stéphane Graber (stgraber) wrote :

Oh, and output from /proc/net/bonding/bond0:

Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:c0:9f:3f:3b:2c

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:c0:9f:3f:3b:2d
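
For anyone testing the proposed package, the per-slave link state can be pulled out of that file with a few lines of shell. This is just a convenience sketch: `bond_slave_status` is a hypothetical helper name, and it assumes the "Slave Interface:" / "MII Status:" layout shown above.

```shell
#!/bin/sh
# bond_slave_status [FILE]
# Prints "<slave> <MII status>" for each slave listed in a bonding status
# file. FILE defaults to /proc/net/bonding/bond0. The bond-level
# "MII Status" line is skipped because no slave has been seen yet.
bond_slave_status() {
    awk -F': ' '
        $1 == "Slave Interface" { slave = $2 }
        $1 == "MII Status" && slave != "" { print slave, $2 }
    ' "${1:-/proc/net/bonding/bond0}"
}

# Only run against the live file if it actually exists on this machine.
if [ -r /proc/net/bonding/bond0 ]; then
    bond_slave_status
fi
```

Fed the output above, this would print one line per slave (eth0 and eth1), each with its MII status.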

Changed in ifenslave-2.6 (Debian):
status: Unknown → Fix Released

The patch for this bug breaks the use of bond_primary in /etc/network/interfaces. See Bug #823366.

The following piece of code from the "setup_master()" function in /etc/network/if-pre-up.d/ifenslave will only set up a primary if the primary is already listed as a slave (which will not be the case with the patch that reorders the enslave_slaves and setup_master functions).

---------------------------------------
        # The first slave in bond-primary found in current slaves becomes the primary.
        # If no slave in bond-primary is found, then primary does not change.
        for slave in $IF_BOND_PRIMARY ; do
                if grep -sq "\\<$slave\\>" "/sys/class/net/$BOND_MASTER/bonding/slaves" ; then
                        sysfs "$slave" primary
                        break
                fi
        done
---------------------------------------
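
The effect can be reproduced outside the script with a plain file standing in for /sys/class/net/$BOND_MASTER/bonding/slaves. The following is a sketch, not the packaged code: `pick_primary` is a hypothetical helper, and it uses `grep -w` in place of the `\<...\>` anchors from the original loop.

```shell
#!/bin/sh
# pick_primary SLAVES_FILE CANDIDATE...
# Prints the first candidate that appears as a whole word in SLAVES_FILE
# and returns 0; prints nothing and returns 1 if none is found.
# SLAVES_FILE stands in for /sys/class/net/$BOND_MASTER/bonding/slaves.
pick_primary() {
    slaves_file="$1"
    shift
    for slave in "$@"; do
        if grep -sqwF -- "$slave" "$slaves_file"; then
            echo "$slave"
            return 0
        fi
    done
    return 1
}

tmp="$(mktemp)"

# Patched order: setup_master runs before enslave_slaves, so the list is empty.
: > "$tmp"
pick_primary "$tmp" eth1 eth0 || echo "no primary set"

# Old order: the slaves are already enslaved when setup_master runs.
echo "eth0 eth1" > "$tmp"
pick_primary "$tmp" eth1 eth0

rm -f "$tmp"
```

With an empty slaves list the loop never matches, which is why bond_primary is silently ignored after the reordering.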

Thanks,
Deric

Had the same problem on 11.10 x86_64, but this can be easily solved by changing the network configuration a bit.

One has to use:

iface ethX ...
   bond-master bondX

instead of

iface bondX ...
   slaves ethX

Either the manuals/howtos should be updated or the 'slaves' functionality should be restored.
I will update https://help.ubuntu.com/community/UbuntuBonding ASAP.

Stéphane Graber (stgraber) wrote :

Janusz: Please try with oneiric-proposed added as described above.

Indeed, you should be using bond-master on a slave interface instead of listing the slaves on the bond, at least in an 802.3ad configuration. Some other modes may require an explicit list of master and slaves.
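
As a rough illustration of that layout, an 802.3ad setup in /etc/network/interfaces might look like the fragment below. The interface names and option values are placeholders, and the `bond-slaves none` line on the master is an assumption based on the pattern commonly shown in Ubuntu's bonding documentation, so check it against the ifenslave version you have installed.

```
# Illustrative only: names and values are placeholders.
auto eth0
iface eth0 inet manual
    bond-master bond0

auto eth1
iface eth1 inet manual
    bond-master bond0

auto bond0
iface bond0 inet dhcp
    # Assumption: slaves attach themselves via bond-master above.
    bond-slaves none
    bond-mode 4
    bond-miimon 100
    bond-lacp-rate fast
```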
