Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

Bug #1834322 reported by Przemyslaw Hausman on 2019-06-26
This bug affects 3 people
Affects: linux (Ubuntu), status tracked in Focal

Series   Importance  Assigned to
Bionic   High        Unassigned
Disco    High        Unassigned
Eoan     High        Unassigned
Focal    High        Unassigned

Bug Description

We are losing port channel aggregation on reboot.

After the reboot, /var/log/syslog contains the entries:
[ 250.790758] bond2: An illegal loopback occurred on adapter (enp24s0f1np1)
               Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
[ 282.029426] bond2: An illegal loopback occurred on adapter (enp24s0f1np1)
               Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
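
For anyone triaging similar logs, here is a minimal sketch of how these entries could be pulled out of syslog text. The function name `find_loopback_events` and the regular expression are illustrative assumptions based only on the two messages quoted above, not something taken from the bonding driver.

```python
import re

# Pattern inferred from the two syslog lines quoted in this report (an assumption,
# not an official message format): "[ <uptime>] <bond>: An illegal loopback
# occurred on adapter (<nic>)".
LOOPBACK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<bond>\S+): "
    r"An illegal loopback occurred on adapter \((?P<adapter>[^)]+)\)"
)


def find_loopback_events(log_text):
    """Return (timestamp, bond, adapter) tuples for each loopback message found."""
    return [
        (float(m.group("ts")), m.group("bond"), m.group("adapter"))
        for m in LOOPBACK_RE.finditer(log_text)
    ]


SAMPLE_LOG = """\
[ 250.790758] bond2: An illegal loopback occurred on adapter (enp24s0f1np1)
               Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
[ 282.029426] bond2: An illegal loopback occurred on adapter (enp24s0f1np1)
               Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
"""

events = find_loopback_events(SAMPLE_LOG)
```

On a live system the same function could be fed the contents of /var/log/syslog, since the pattern does not depend on the leading syslog prefix.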

Aggregator IDs of the slave interfaces are different:
ubuntu@node-6:~$ cat /proc/net/bonding/bond2
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable

Slave Interface: enp24s0f1np1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b0:26:28:48:9f:51
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0

Slave Interface: enp24s0f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b0:26:28:48:9f:50
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1

The mismatch in "Aggregator ID" on the port is a symptom of the issue. If we do 'ip link set dev bond2 down' and 'ip link set dev bond2 up', the port with the mismatched ID appears to renegotiate with the port-channel and becomes aggregated.
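
The Aggregator ID check can also be scripted. The following is a rough sketch, assuming the /proc/net/bonding layout shown above; the helper names `aggregator_ids` and `is_aggregated` are hypothetical, and the sample text is a trimmed copy of the output in this report.

```python
import re


def aggregator_ids(bonding_text):
    """Map each slave interface name to the Aggregator ID listed under it."""
    ids = {}
    current = None
    for raw in bonding_text.splitlines():
        line = raw.strip()
        m = re.match(r"Slave Interface:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"Aggregator ID:\s*(\d+)", line)
        # Ignore any "Aggregator ID" line that appears before the first slave section.
        if m and current is not None:
            ids[current] = int(m.group(1))
    return ids


def is_aggregated(bonding_text):
    """True when at least one slave is present and all slaves share one Aggregator ID."""
    ids = aggregator_ids(bonding_text)
    return len(ids) > 0 and len(set(ids.values())) == 1


SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation

Slave Interface: enp24s0f1np1
MII Status: up
Aggregator ID: 1
Actor Churn State: none

Slave Interface: enp24s0f0np0
MII Status: up
Aggregator ID: 2
Actor Churn State: churned
"""

ids = aggregator_ids(SAMPLE)
ok = is_aggregated(SAMPLE)
```

To check a running host, read the file first, for example `is_aggregated(open("/proc/net/bonding/bond2").read())`.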

Another way to work around this issue is to take the bond ports down, then bring up port enp24s0f0np0 first and port enp24s0f1np1 second.

When I change the order of bringing the ports up (first enp24s0f1np1, and second enp24s0f0np0), the issue is still there.

When the issue occurs, the switch port corresponding to interface enp24s0f0np0 is in the Suspended state. After applying the workaround, the port is no longer in the Suspended state and the Aggregator IDs in /proc/net/bonding/bond2 are equal.
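
The port-ordering workaround could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the order in which the ports are taken down is an assumption (the report only specifies the order in which they are brought up), and running the commands for real needs root.

```python
import subprocess


def workaround_commands(first, second):
    """Build ip(8) commands: both ports down, then up in the given order."""
    return [
        ["ip", "link", "set", "dev", first, "down"],
        ["ip", "link", "set", "dev", second, "down"],
        ["ip", "link", "set", "dev", first, "up"],
        ["ip", "link", "set", "dev", second, "up"],
    ]


def apply_workaround(first="enp24s0f0np0", second="enp24s0f1np1", dry_run=True):
    """Print the commands (dry run) or execute them; returns the command list."""
    cmds = workaround_commands(first, second)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Raises CalledProcessError if ip exits with a non-zero status.
            subprocess.run(cmd, check=True)
    return cmds


# Dry run only: prints the four commands without touching any interface.
cmds = apply_workaround()
```

The default argument order matches the sequence reported to work here (enp24s0f0np0 before enp24s0f1np1); the reverse order did not clear the problem on this system.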

I also installed the 5.0.0 kernel; the issue is still there.

Operating System:
Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)

ubuntu@node-6:~$ uname -a
Linux node-6 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@node-6:~$ sudo lspci -vnvn
https://pastebin.ubuntu.com/p/Dy2CKDbySC/

Hardware: Dell PowerEdge R740xd
BIOS version: 2.1.7

sosreport: https://drive.google.com/open?id=1-eN7cZJIeu-AQBEU7Gw8a_AJTuq0AOZO

ubuntu@node-6:~$ lspci | grep Ethernet | grep 10G
https://pastebin.ubuntu.com/p/sqCx79vZWM/

ubuntu@node-6:~$ lspci -n | grep 18:00
18:00.0 0200: 14e4:16d8 (rev 01)
18:00.1 0200: 14e4:16d8 (rev 01)

ubuntu@node-6:~$ modinfo bnx2x
https://pastebin.ubuntu.com/p/pkmzsFjK8M/

ubuntu@node-6:~$ ip -o l
https://pastebin.ubuntu.com/p/QpW7TjnT2v/

ubuntu@node-6:~$ ip -o a
https://pastebin.ubuntu.com/p/MczKtrnmDR/

ubuntu@node-6:~$ cat /etc/netplan/98-juju.yaml
https://pastebin.ubuntu.com/p/9cZpPc7C6P/

ubuntu@node-6:~$ sudo lshw -c network
https://pastebin.ubuntu.com/p/gmfgZptzDT/
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jun 26 10:21 seq
 crw-rw---- 1 root audio 116, 33 Jun 26 10:21 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.6
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 18.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 004: ID 1604:10c0 Tascam
 Bus 001 Device 003: ID 1604:10c0 Tascam
 Bus 001 Device 002: ID 1604:10c0 Tascam
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. PowerEdge R740xd
Package: linux (not installed)
PciMultimedia:

ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-52-generic root=UUID=9b1eae60-0941-4638-a22a-98a719104259 ro console=tty0 console=ttyS0,115200 console=ttyS1,115200 raid=noautodetect intel_iommu=on iommu=pt pti=off
ProcVersionSignature: Ubuntu 4.15.0-52.56-generic 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-52-generic N/A
 linux-backports-modules-4.15.0-52-generic N/A
 linux-firmware 1.173.6
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic uec-images
Uname: Linux 4.15.0-52-generic x86_64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy libvirt netdev plugdev sudo video
_MarkForUpload: False
dmi.bios.date: 04/03/2019
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.1.7
dmi.board.name: 0JMK61
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.1.7:bd04/03/2019:svnDellInc.:pnPowerEdgeR740xd:pvr:rvnDellInc.:rn0JMK61:rvrA00:cvnDellInc.:ct23:cvr:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R740xd
dmi.sys.vendor: Dell Inc.

Przemyslaw Hausman (phausman) wrote :

subscribed ~field-critical

The issue is critically impairing networking to the instance during the ongoing customer deployment.

Dean Henrichsmeyer (dean) wrote :

Unsubscribing ~field-critical as the kernel isn't covered under the Field SLA.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1834322

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic

apport information

tags: added: apport-collected uec-images
description: updated

Brad Figg (brad-figg) on 2019-07-24
tags: added: ubuntu-certified
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Changed in linux (Ubuntu):
status: Expired → In Progress
tags: added: sts

https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/

A test kernel is available at the URL above (built for LP bug 1852077).

FWIW, the fix has been committed to -stable:

"bonding: fix state transition issue in link monitoring"
Commit: 1899bb325149e481de31a4f32b59ea6f24e176ea

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/bonding?id=1899bb325149e481de31a4f32b59ea6f24e176ea

The fix has been committed to Bionic, Disco, and Eoan (B, D, E). I've manually updated this
bug for now (it was not formally DUP'd to LP Bug 1852077).

Changed in linux (Ubuntu Focal):
importance: Undecided → High
Changed in linux (Ubuntu Eoan):
importance: Undecided → High
Changed in linux (Ubuntu Disco):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Fix Committed
Changed in linux (Ubuntu Disco):
status: New → Fix Committed
Changed in linux (Ubuntu Eoan):
status: New → Fix Committed