[i40e] LACP bonding start up race conditions

Bug #1753662 reported by Nobuto Murata
This bug affects 2 people
Affects          Status   Importance  Assigned to        Milestone
linux (Ubuntu)   Triaged  High        Joseph Salisbury
Xenial           Triaged  High        Joseph Salisbury

Bug Description

When provisioning multiple Ubuntu servers with MAAS at once, some bonding pairs end up with an unexpected LACP status such as "Expired". It happens randomly at each provisioning with the default xenial kernel (4.4), but is not reproducible with the HWE kernel (4.13). I'm using Intel X710 cards (Dell-branded).

Using the HWE kernel works as a short-term workaround, but it's not ideal since 4.13 is not covered by the Canonical Livepatch service.

How to reproduce:
1. configure LACP bonding with MAAS
2. provision machines
3. check the bonding status in /proc/net/bonding/bond*
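
A quick way to spot a bad pair in step 3 (a sketch added for illustration; 61, i.e. 0x3d, is the fully established LACP port state, Activity|Aggregation|Synchronization|Collecting|Distributing):

grep state /proc/net/bonding/bond* | grep -v -w 61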

Frequency of occurrence:
About 5 out of 22 bond pairs at each provisioning.

[reproducible combination]
$ uname -a
Linux comp006 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ sudo ethtool -i eno1
driver: i40e
version: 1.4.25-k
firmware-version: 6.00 0x800034e6 18.3.6
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[non-reproducible combination]
$ uname -a
Linux comp006 4.13.0-36-generic #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ sudo ethtool -i eno1
driver: i40e
version: 2.1.14-k
firmware-version: 6.00 0x800034e6 18.3.6
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-116-generic 4.4.0-116.140
ProcVersionSignature: Ubuntu 4.4.0-116.140-generic 4.4.98
Uname: Linux 4.4.0-116-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 6 06:37 seq
 crw-rw---- 1 root audio 116, 33 Mar 6 06:37 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.15
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Tue Mar 6 06:46:32 2018
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 002: ID 8087:8002 Intel Corp.
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 413c:a001 Dell Computer Corp. Hub
 Bus 001 Device 002: ID 8087:800a Intel Corp.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. PowerEdge R730
PciMultimedia:

ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-116-generic.efi.signed root=UUID=0528f88e-cf1a-43e2-813a-e7261b88d460 ro console=tty0 console=ttyS0,115200n8
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-116-generic N/A
 linux-backports-modules-4.4.0-116-generic N/A
 linux-firmware 1.157.17
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 08/16/2017
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.5.5
dmi.board.name: 072T6D
dmi.board.vendor: Dell Inc.
dmi.board.version: A08
dmi.chassis.asset.tag: 0018880
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.5.5:bd08/16/2017:svnDellInc.:pnPowerEdgeR730:pvr:rvnDellInc.:rn072T6D:rvrA08:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R730
dmi.sys.vendor: Dell Inc.

Revision history for this message
Nobuto Murata (nobuto) wrote :

This might be related (not exactly the same):
https://sourceforge.net/p/e1000/bugs/524/
One commenter says 1.6.42 fixed their issue.

It looks like Intel has around 10 releases between 1.4.25 and 2.1.14, so bisecting the out-of-tree driver may not be practical.
https://downloadcenter.intel.com/download/24411/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Connections-Under-Linux-?product=82947

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

There is one LACP commit that sticks out between v4.4 and v4.13:
c15e07b02bf0 ("team: loadbalance: push lacpdus to exact delivery")

I built a Xenial test kernel with this commit. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.
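
For example (a sketch; the actual filenames depend on the build):

sudo dpkg -i linux-image-*.deb linux-image-extra-*.deb
sudo reboot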

Thanks in advance!

Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

If that commit doesn't fix the issue, we can perform a "Reverse" bisect between 4.4 and 4.13 to find the fix.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The kernel with c15e07b02bf0 didn't make a difference to the race condition; the issue is still reproducible. Let me know when you need me to test different kernels.

So far, I'm using rc.local below to reboot the same node multiple times.

====
#!/bin/sh

exec >> /root/bond_check.log 2>&1

echo
echo '############################################################'
echo

date -R
uname -a
ethtool -i eno1
modinfo i40e
path=$(modinfo i40e | grep filename: | awk '{print $2}')
sha256sum "$path"
package=$(dpkg -S "$path" | cut -d: -f1)
apt-cache policy "$package"

if [ "$(grep state /proc/net/bonding/bond* | grep -c -v -w 61)" != 0 ]; then
    echo
    echo '*** Unexpected LACP status ***'
    grep -r '.*' /proc/net/bonding/bond*
fi

sleep 300
reboot

exit 0
====

Revision history for this message
Nobuto Murata (nobuto) wrote :

For the record of testing.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The record of xenial default kernel.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

To perform a "Reverse" bisect, we need to identify the last kernel version that had the bug and the first kernel version that does not.

Can you test the following upstream kernels:

4.4 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-wily/
4.6 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-yakkety/
4.8 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
4.10 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/
4.13 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13/

You don't have to test every one, just enough until we know the first kernel version that does not have the bug.
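
For reference, a reverse bisect is typically driven like a regular git bisect with the terms swapped, roughly like this (a sketch, not the exact procedure used for the test kernels here):

git bisect start --term-old broken --term-new fixed
git bisect broken <last release that still shows the bug>
git bisect fixed <first release that does not>
# build and boot the commit git checks out, test it, then mark it:
git bisect broken    # or: git bisect fixed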

Revision history for this message
Nobuto Murata (nobuto) wrote :

Thanks, I was running some tests with existing HWE kernels in the xenial repo, like linux-image-4.8.0-58-generic, linux-image-4.10.0-42-generic and linux-image-4.13.0-36-generic. It looks like 4.10 is the last bad one and 4.13 is the first good one.

Let me double-check with those two:
4.10 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/
4.13 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13/

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

To narrow it down further, can you also test the following kernels:

v4.11 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11/
v4.13-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc1/

Revision history for this message
Nobuto Murata (nobuto) wrote :

Reproducible with:
4.10 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/

The next test will be with:
v4.11 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11/

Revision history for this message
Nobuto Murata (nobuto) wrote :

Reproducible with:
v4.11 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11/

The next is v4.13-rc1.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Not reproducible with 4.13-rc1 after 5 reboots.

4.10 - bad
4.11 - bad
4.13-rc1 - good

The next is 4.12.

Revision history for this message
Nobuto Murata (nobuto) wrote :

v4.11 with i40e 1.6.27 - bad
v4.12 with i40e 2.1.14 - good

next: v4.12-rc1 with i40e 2.1.7 - ?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

If v4.12-rc1 is still bad, we would need to test some of the other release candidates, such as rc2, rc3, rc4, etc.

Once we have the last bad and first good, I'll start the reverse bisect and build a kernel.

Revision history for this message
Nobuto Murata (nobuto) wrote :

v4.12-rc1 with i40e 2.1.7 - bad
v4.12 with i40e 2.1.14 - good

I'm running out of time today, so further bisection will continue tomorrow.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Correction. I thought v4.12-rc1 had i40e 2.1.7 because of:
https://github.com/torvalds/linux/commit/15990832cd3e7e8904f8dacdabfa33adb9a836d6
But it actually has 2.1.14 from the log output.

So the correct status is:

v4.12-rc1 with i40e 2.1.14 - bad
v4.12 with i40e 2.1.14 - good

Revision history for this message
Nobuto Murata (nobuto) wrote :

I ran an overnight test with v4.12 just to make sure it really fixed the issue. It still happened occasionally, but far less frequently. We may need to run the "good" cases longer, since there may be more than one relevant patch. Anyway, the current status is:

v4.12-rc1 with i40e 2.1.14 - bad (3 out of 3)
v4.12 with i40e 2.1.14 - relatively good (5 out of 68)

The next test would be with v4.12-rc4.

Revision history for this message
Nobuto Murata (nobuto) wrote :

For the record,

v4.12-rc1 - bad (3 out of 3)
v4.12 - relatively good (5 out of 68)
v4.13 - good (0 out of 41)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Do you happen to have results from any of the other release candidates, such as 4.12-rc4?

Revision history for this message
Nobuto Murata (nobuto) wrote :

I have let rc4 run for hours.

v4.12-rc1 - bad (3 out of 3)
v4.12-rc4 - relatively good (1 out of 70)
v4.12 - relatively good (5 out of 68)
v4.13 - good (0 out of 41)

I will let rc3 run overnight (my time).

Revision history for this message
Nobuto Murata (nobuto) wrote :

Here is the rc3 result; will test rc2 next.

v4.12-rc1 - bad (3 out of 3)
v4.12-rc3 - mixture result (24 out of 90)
v4.12-rc4 - relatively good (1 out of 70)
v4.12 - relatively good (5 out of 68)
v4.13 - good (0 out of 41)

Revision history for this message
Nobuto Murata (nobuto) wrote :

Here is the rc2 result. It looks like there is a noticeable difference between v4.12-rc3 and v4.12-rc4.

@Joseph, can you please start looking into the diffs? I'm keeping one node dedicated to this testing, so I can run the same script kernel by kernel for further bisection.

v4.12-rc1 - bad (3 of 3)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 out of 68 - 7.4%)
v4.13 - good (0 out of 41 - 0%)

FWIW, I will run the same test with xenial's 4.4 kernel to make sure around 30% is the baseline for "bad".

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between v4.12-rc3 and v4.12-rc4. The kernel bisect will require testing of about 7-10 test kernels.

I built the first test kernel, up to the following commit:

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Nobuto Murata (nobuto) wrote :

OK, 25%-30% seems to be the baseline. I'd like to make sure v4.13 is really 0% with a longer-running test, but will do the bisection between v4.12-rc3 and v4.12-rc4 first.

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)

v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I just noticed I forgot to paste the SHA1 for the test kernel posted in comment #29:
ff5a20169b98d84ad8d7f99f27c5ebbb008204d6

Do you have results from that kernel? Once you do, I'll update the bisect and build the next kernel.

Revision history for this message
Nobuto Murata (nobuto) wrote :

So far 0 of 6 with 4.12.0-041200rc3-generic #201803080803. But I will keep it running for a while to see if it becomes close to 30% or 0%.

Revision history for this message
Nobuto Murata (nobuto) wrote :

4.12.0-041200rc3-generic #201803080803 looks good. Please proceed to the next one. I will test it tomorrow my time, which would be about 12 hours from now.

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)

v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc3 - good (0 of 22 - 0%) #201803080803 ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
0bb230399fd337cc9a838d47a0c9ec3433aa612e

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Nobuto Murata (nobuto) wrote :

I ran the xenial HWE kernel overnight; the result was 0/119. The next test is with:
v4.12.0-041200rc3 #201803081620 - ? 0bb230399fd337cc9a838d47a0c9ec3433aa612e

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)

====

v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36 (xenial HWE) - good (0 of 119 - 0%)

Revision history for this message
Nobuto Murata (nobuto) wrote :

0bb230399fd337cc9a838d47a0c9ec3433aa612e seems good. I'm ready for the next test.

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)

====

v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337cc9a838d47a0c9ec3433aa612e
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36 (xenial HWE) - good (0 of 119 - 0%)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
400129f0a3ae989c30b37104bbc23b35c9d7a9a4

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Nobuto Murata (nobuto) wrote :

400129f0a3ae989c30b37104bbc23b35c9d7a9a4 looks good.

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)

====
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989c30b37104bbc23b35c9d7a9a4
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337cc9a838d47a0c9ec3433aa612e
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36 (xenial HWE) - good (0 of 119 - 0%)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
25f480e89a022d382ddc5badc23b49426e89eabc

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Nobuto Murata (nobuto) wrote :

25f480e89a022d382ddc5badc23b49426e89eabc looks good.

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)

====
v4.12.0-041200rc3 #201803121355 - relatively good (1 of 252 - 0.4%) 25f480e89a022d382ddc5badc23b49426e89eabc
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989c30b37104bbc23b35c9d7a9a4
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337cc9a838d47a0c9ec3433aa612e
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36 (xenial HWE) - good (0 of 119 - 0%)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
d38162e4b5c643733792f32be4ea107c831827b4

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Nobuto Murata (nobuto) wrote :

@Joseph,

Will do. Just as a possibility, I could build the kernel on the host if that's helpful, since the host is already reserved for this testing and has hundreds of GB of memory and many CPU cores. If you have a pointer on how to replicate your build process, that would be great.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Oh wait,

> I built the next test kernel, up to the following commit:
> d38162e4b5c643733792f32be4ea107c831827b4
>
> The test kernel can be downloaded from:
> http://kernel.ubuntu.com/~jsalisbury/lp1753662

d38162e4b5c643733792f32be4ea107c831827b4 looks to be in between v4.12-rc3 and rc4, which is expected.
https://github.com/torvalds/linux/commit/d38162e4b5c643733792f32be4ea107c831827b4

However, the link says "rc1".
http://kernel.ubuntu.com/~jsalisbury/lp1753662/linux-image-4.12.0-041200rc1-generic_4.12.0-041200rc1.201803131457_amd64.deb

Can you confirm that 4.12.0-041200rc1 is the expected test kernel?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Yes, that is the correct kernel. The mainline-build-one script uses the 'git describe' command to come up with the name. That command returns the closest preceding git tag, not the tag that contains the commit. So in the case of commit d38162e4b5c643733792f32be4ea107c831827b4:

git describe d38162e4b5c643733792f32be4ea107c831827b4
v4.12-rc1-13-gd38162e

However, that test kernel is actually using commits in -rc4, which you can see if the '--contains' option is given to git describe:

git describe --contains d38162e4b5c643733792f32be4ea107c831827b4
v4.12-rc4~20^2~4^2~3

I can continue to build the kernels; there are about 3 left. It only takes me about 15 minutes to build a kernel.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The test is still in progress, but so far d38162e4b5c643733792f32be4ea107c831827b4 looks good (1 of 37). Since I already downloaded the kernel locally, please go ahead and build the next one. Thanks,

Revision history for this message
Nobuto Murata (nobuto) wrote :

d38162e4b5c643733792f32be4ea107c831827b4 looks good.

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)

====
v4.12.0-041200rc1 #201803131457 - relatively good (1 of 93 - 1.1%) d38162e4b5c643733792f32be4ea107c831827b4
v4.12.0-041200rc3 #201803121355 - relatively good (1 of 252 - 0.4%) 25f480e89a022d382ddc5badc23b49426e89eabc
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989c30b37104bbc23b35c9d7a9a4
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337cc9a838d47a0c9ec3433aa612e
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36 (xenial HWE) - good (0 of 119 - 0%)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
171d8b9363725e122b164e6b9ef2acf2f751e387

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Nobuto Murata (nobuto) wrote :

The test is still in progress, but so far 171d8b9363725e122b164e6b9ef2acf2f751e387 looks good (0 of 21). Since I already downloaded the kernel locally, please go ahead and build the next one. Thanks,

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
4681ee21d62cfed4364e09ec50ee8e88185dd628

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
Revision history for this message
Nobuto Murata (nobuto) wrote :

171d8b9363725e122b164e6b9ef2acf2f751e387 looks good. The next test is with 4681ee21d62cfed4364e09ec50ee8e88185dd628.

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)

====
v4.12.0-041200rc1 #201803141333 - relatively good (1 of 217 - 0.5%) 171d8b9363725e122b164e6b9ef2acf2f751e387
v4.12.0-041200rc1 #201803131457 - relatively good (1 of 93 - 1.1%) d38162e4b5c643733792f32be4ea107c831827b4
v4.12.0-041200rc3 #201803121355 - relatively good (1 of 252 - 0.4%) 25f480e89a022d382ddc5badc23b49426e89eabc
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989c30b37104bbc23b35c9d7a9a4
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337cc9a838d47a0c9ec3433aa612e
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36 (xenial HWE) - good (0 of 119 - 0%)

Revision history for this message
Nobuto Murata (nobuto) wrote :

4681ee21d62cfed4364e09ec50ee8e88185dd628 looks good.

4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)

====
v4.12.0-041200rc1 #201803141835 - relatively good (1 of 113 - 0.9%) 4681ee21d62cfed4364e09ec50ee8e88185dd628
v4.12.0-041200rc1 #201803141333 - relatively good (1 of 217 - 0.5%) 171d8b9363725e122b164e6b9ef2acf2f751e387
v4.12.0-041200rc1 #201803131457 - relatively good (1 of 93 - 1.1%) d38162e4b5c643733792f32be4ea107c831827b4
v4.12.0-041200rc3 #201803121355 - relatively good (1 of 252 - 0.4%) 25f480e89a022d382ddc5badc23b49426e89eabc
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989c30b37104bbc23b35c9d7a9a4
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337cc9a838d47a0c9ec3433aa612e
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36 (xenial HWE) - good (0 of 119 - 0%)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The reverse bisect reported the following commit as the fix, but I'm doubtful since it's an i915 commit:

commit 4681ee21d62cfed4364e09ec50ee8e88185dd628
Author: Joonas Lahtinen <email address hidden>
Date: Thu May 18 11:49:39 2017 +0300
    drm/i915: Do not sync RCU during shrinking

We may have gone wrong somewhere in the bisect. However, just to be sure, I built a v4.12-rc4 test kernel. This kernel should be bad and contain the bug. If it does not, it may be due to the configs I'm using to build the test kernels.

My v4.12-rc4 test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not?

Revision history for this message
Nobuto Murata (nobuto) wrote :

> We may have gone wrong somewhere in the bisect. However, just to be sure, I built a v4.12-rc4 test kernel. This kernel should be bad and contain the bug. If it does not, it may be due to the configs I'm using to build the test kernels.

I'm not following, since I thought we already tested that and it was good.

> v4.12-rc4 - relatively good (1 of 70 - 1.4%)

But anyway I will let it run with your newly built one.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Yes, sorry this is a "Reverse" bisect, so v4.12-rc4 should be good and not bad.

I'm also going to build a v4.12-rc3 kernel with my configs to confirm it's bad. I'll post that shortly.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

A re-build of v4.12-rc3 is now available here:
http://kernel.ubuntu.com/~jsalisbury/lp1753662/v4.12-rc3

Can you confirm that this kernel is bad and contains the bug?

Revision history for this message
Nobuto Murata (nobuto) wrote :

v4.12-rc4 is good, 1 of 146. Going to test v4.12-rc3.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The new build of v4.12-rc3 is a good build (2 of 151).
4.12.0-041200rc3-generic #201803151851

v4.12-rc3 - bad (24 of 90 - 26.6%)
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc3/

v4.12-rc3 - relatively good (2 of 151 - 1.3%)
http://kernel.ubuntu.com/~jsalisbury/lp1753662/v4.12-rc3/

So what is the difference between those two? The build config?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built another v4.12-rc3 test kernel, this time with Xenial configs instead of Artful configs. This test kernel can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc3-xenial-configs

Can you see if this kernel exhibits the bug?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sorry the correct link for the new 4.12-rc3 kernel with Xenial configs is:
http://kernel.ubuntu.com/~jsalisbury/lp1753662/v4.12-rc3-xenial-configs/

Revision history for this message
Nobuto Murata (nobuto) wrote :

Ok, we see some differences with the three kernels. How do we want to proceed from here?

v4.12-rc3 - bad (24 of 90 - 26.6%)
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc3/

v4.12-rc3 #201803151851 - relatively good (2 of 151 - 1.3%)
http://kernel.ubuntu.com/~jsalisbury/lp1753662/v4.12-rc3/
(artful configs)

v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
http://kernel.ubuntu.com/~jsalisbury/lp1753662/v4.12-rc3-xenial-configs/
(xenial configs)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It's good that the v4.12-rc3 with xenial configs was bad. It means we should use Xenial configs when performing the bisect and not Artful configs. I'll kick off another bisect and post the first test kernel.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I restarted the bisect. This time using Xenial configs and not Artful configs.

I built the first test kernel, up to the following commit:
ff5a20169b98d84ad8d7f99f27c5ebbb008204d6

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

You've tested this SHA1 in prior test kernels. However, that test kernel was built with the Artful configs, and this one is with the Xenial configs.

Revision history for this message
Nobuto Murata (nobuto) wrote :

ff5a20169b98d84ad8d7f99f27c5ebbb008204d6 looks bad.

[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 relatively bad (21 of 150 - 14.0%)

Revision history for this message
Nobuto Murata (nobuto) wrote :

ff5a20169b98d84ad8d7f99f27c5ebbb008204d6 looks bad.

[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 relatively bad (21 of 150 - 14.0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
ea094f3c830a67f252677aacba5d04ebcf55c4d9

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Nobuto Murata (nobuto) wrote :

ea094f3c830a67f252677aacba5d04ebcf55c4d9 looks bad.

[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 relatively bad (21 of 150 - 14.0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12.0-041200rc3 #201803200004 relatively bad (6 of 56 - 10.7%) ea094f3c830a67f252677aacba5d04ebcf55c4d9

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
55cbdaf6399de16b61d40d49b6c8bb739a877dea

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Nobuto Murata (nobuto) wrote :

I was pretty occupied today, so I'm going to test 55cbdaf6399de16b61d40d49b6c8bb739a877dea now and report back tomorrow morning my time.

[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 relatively bad (21 of 150 - 14.0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12.0-041200rc3 #201803200004 relatively bad (60 of 499 - 12.0%) ea094f3c830a67f252677aacba5d04ebcf55c4d9

Revision history for this message
Nobuto Murata (nobuto) wrote :

55cbdaf6399de16b61d40d49b6c8bb739a877dea looks bad.

[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 - relatively bad (21 of 150 - 14.0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12.0-041200rc3 #201803200004 - relatively bad (60 of 499 - 12.0%) ea094f3c830a67f252677aacba5d04ebcf55c4d9
v4.12.0-041200rc3 #201803201426 - relatively bad (30 of 208 - 14.4%) 55cbdaf6399de16b61d40d49b6c8bb739a877dea

Revision history for this message
Nobuto Murata (nobuto) wrote :

BTW, have we set the baseline of "good" in this bisection with xenial configs?

> 4.13.0-36(xenial HW) - good (0 of 119 - 0%)

Does HWE kernel man with xenial configs? Or was it built with the source release config i.e. artful?

Revision history for this message
Nobuto Murata (nobuto) wrote :

Correction: Does HWE kernel mean it's with xenial configs? Or was it built with the source release config i.e. artful?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The HWE kernel was built with the Artful configs. I restarted the bisect using the Xenial configs, marking 4.12-rc4 as good and 4.12-rc3 as bad. We should re-test those to confirm we are going down the right path. I built a 4.12-rc4 kernel with Xenial configs, which can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1753662

If this kernel ends up being bad and not good, then the bisect should be stopped. It may be that a patch did not fix this bug at all, just a change in one of the config options. If that is the case, I'll review the diff between Xenial and Artful configs in more detail.

Revision history for this message
Nobuto Murata (nobuto) wrote :

4.12-rc4 kernel with Xenial configs looks bad.

[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 - relatively bad (21 of 150 - 14.0%) ff5a20169b98d84ad8d7f99f27c5ebbb008204d6
v4.12.0-041200rc3 #201803200004 - relatively bad (60 of 499 - 12.0%) ea094f3c830a67f252677aacba5d04ebcf55c4d9
v4.12.0-041200rc3 #201803201426 - relatively bad (30 of 208 - 14.4%) 55cbdaf6399de16b61d40d49b6c8bb739a877dea
v4.12.0-041200rc4 #201803221452 - relatively bad (60 of 595 - 10.1%)

Is it possible to build the xenial kernel (4.4) with the artful configs just for testing?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a 4.4 kernel using the Artful configs, it can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Revision history for this message
Nobuto Murata (nobuto) wrote : Re: [Bug 1753662] Re: [i40e] LACP bonding start up race conditions

On Tue, Mar 27, 2018 at 0:11, Joseph Salisbury <email address hidden> wrote:

> I built a 4.4 kernel using the Artful configs, it can be downloaded from:
> http://kernel.ubuntu.com/~jsalisbury/lp1753662

Thanks. I just lost access to the machine today, so I have to use
another host in a different data center. Will resume testing in a few days.


Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update. I'll start comparing the configs between Xenial and Artful to see if the change that caused this sticks out.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Just curious if you had a chance to test the kernel posted in #77?

I compared the configs between these two kernels:
v4.12-rc3: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc3/
v4.12-rc4: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc4/

Nothing sticks out as a fix in v4.12-rc4. I'll attach a screenshot of the diffs.

We could try reverse bisecting with the Ubuntu kernels instead of mainline kernels.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The new environment is not fully up for testing yet. ETA is by the end of this week.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Finally got a machine up and running. Will resume testing shortly.

Revision history for this message
Nobuto Murata (nobuto) wrote :

FWIW, a kernel trace happens with the kernel in:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1753662/comments/77

But I will let it run anyway since I'm not sure whether it affects the testing or not.

====
[ 5.999557] rtc_cmos 00:00: setting system clock to 2018-04-11 15:52:23 UTC (1523461943)
[ 5.999637] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 5.999637] EDD information not available.
[ 6.856436] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x229834d6ac9, max_idle_ns: 440795285103 ns
[ 6.872996] Freeing unused kernel memory: 1844K (ffffffff81f71000 - ffffffff8213e000)
[ 6.885727] Write protecting the kernel read-only data: 14336k
[ 6.897566] Freeing unused kernel memory: 1964K (ffff880001815000 - ffff880001a00000)
[ 6.911315] Freeing unused kernel memory: 272K (ffff880001dbc000 - ffff880001e00000)
[ 6.924438] ------------[ cut here ]------------
[ 6.934158] WARNING: CPU: 16 PID: 1 at /home/jsalisbury/bugs/lp1753662/v4.4/linux/arch/x86/mm/dump_pagetables.c:225 note_page+0x649/0x840()
[ 6.958330] x86/mm: Found insecure W+X mapping at address ffff880000010000/0xffff880000010000
[ 6.973431] Modules linked in:
[ 6.982498] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 4.4.0-040400-generic #201803261439
[ 6.997629] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.5.5 08/16/2017
[ 7.012149] 0000000000000000 000000009d064fa9 ffff883f6576fd48 ffffffff813da8d8
[ 7.026822] ffff883f6576fd90 ffff883f6576fd80 ffffffff8107ec52 ffff883f6576fe90
[ 7.041667] 8000000000000163 0000000000000004 0000000000000000 0000000000000000
[ 7.056684] Call Trace:
[ 7.066183] [<ffffffff813da8d8>] dump_stack+0x44/0x5c
[ 7.078824] [<ffffffff8107ec52>] warn_slowpath_common+0x82/0xc0
[ 7.092586] [<ffffffff8107ecec>] warn_slowpath_fmt+0x5c/0x80
[ 7.106139] [<ffffffff81070a09>] note_page+0x649/0x840
[ 7.119215] [<ffffffff81070ef9>] ptdump_walk_pgd_level_core+0x2f9/0x430
[ 7.134087] [<ffffffff81071067>] ptdump_walk_pgd_level_checkwx+0x17/0x20
[ 7.149121] [<ffffffff81066e2c>] mark_rodata_ro+0xec/0x100
[ 7.162852] [<ffffffff81801b60>] ? rest_init+0x80/0x80
[ 7.176273] [<ffffffff81801b7d>] kernel_init+0x1d/0xe0
[ 7.189752] [<ffffffff8180e18f>] ret_from_fork+0x3f/0x70
[ 7.203505] [<ffffffff81801b60>] ? rest_init+0x80/0x80
[ 7.217134] ---[ end trace 43f75b6421b8d01c ]---
[ 7.235677] x86/mm: Checked W+X mappings: FAILED, 235610 W+X pages found.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The 4.4 kernel using the Artful configs didn't make much difference.

Failure rate: 117/726 (16.1%), 4.4.0-040400-generic #201803261439 SMP Mon Mar 26 14:43:35 UTC 2018

I will let the stock 4.4 and 4.13 HWE kernels run just to establish the baseline occurrence rate on this host.

Revision history for this message
Nobuto Murata (nobuto) wrote :

FWIW, I tried PCI hot-plugging as another way to iterate faster without rebooting.
https://paste.ubuntu.com/p/qDVkMcTYPQ/

However, the issue wasn't reproducible with hot-plugging. Rebooting is the easiest reproduction so far.
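
For reference, a sysfs-driven remove/rescan cycle along those lines looks roughly like this (a sketch, not the exact commands from the paste; 0000:01:00.0 is the adapter address from the ethtool output above):

echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
sleep 5
echo 1 > /sys/bus/pci/rescan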

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. I would have thought 4.4 with the Artful configs would not have had the bug if a config change is the fix.

We know:
4.12-rc4 with Artful configs is good.
4.12-rc4 with Xenial configs is bad.
Any kernel with Xenial configs is bad.

It is possible a patch in combination with a config change fixed the bug. We could try testing kernels from 4.4 forward using the Artful configs to find where the bug stops happening.

We should first confirm that v4.12-rc3 with Artful configs is good before working backwards. I built that kernel and posted it here:

http://kernel.ubuntu.com/~jsalisbury/lp1753662

Can you test that kernel when you have a chance?

Revision history for this message
Nobuto Murata (nobuto) wrote :

I ran HWE 4.13 just to make sure the result is the same as on the previous host. As we confirmed before, the issue is not reproducible with HWE 4.13.

[HWE 4.13]
Failure rate: 0/407 (0.0%), 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018

> We should first confirm that v4.12-rc3 with Artful configs is good before working backwards. I built that kernel and posted it here:
>
> http://kernel.ubuntu.com/~jsalisbury/lp1753662

I will test it and report back early next week.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Ok, we have some numbers with the new host.

Failure rate: 45/112 (40.2%), 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018
Failure rate: 87/222 (39.2%), 4.4.0-120-generic #144-Ubuntu SMP Thu Apr 5 14:11:49 UTC 2018
Failure rate: 117/726 (16.1%), 4.4.0-040400-generic #201803261439 SMP Mon Mar 26 14:43:35 UTC 2018
Failure rate: 4/277 (1.4%), 4.12.0-041200rc3-generic #201804132111 SMP Fri Apr 13 21:13:47 UTC 2018
Failure rate: 0/407 (0.0%), 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018

The xenial stock kernel and -120 from -proposed (tested just to see if LP: #1723127 is related or not) have around a 40% failure rate.

However, the xenial kernel with the artful config (4.4.0-040400-generic #201803261439) cuts the failure rate down somewhat, although we didn't expect any difference from a failure-rate point of view.

4.12.0-041200rc3 cuts it down further, and HWE eliminates the occurrences.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Just for the record, I'm using the attached rc.local for testing.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Just for the record, up-to-date numbers after the weekend.

Failure rate: 167/422 (39.6%), 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018
Failure rate: 87/222 (39.2%), 4.4.0-120-generic #144-Ubuntu SMP Thu Apr 5 14:11:49 UTC 2018
Failure rate: 117/726 (16.1%), 4.4.0-040400-generic #201803261439 SMP Mon Mar 26 14:43:35 UTC 2018
Failure rate: 4/277 (1.4%), 4.12.0-041200rc3-generic #201804132111 SMP Fri Apr 13 21:13:47 UTC 2018
Failure rate: 0/407 (0.0%), 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018

Revision history for this message
Jay Vosburgh (jvosburgh) wrote :

I would suggest testing

commit de77ecd4ef02ca783f7762e04e92b3d0964be66b
Author: Mahesh Bandewar <email address hidden>
Date: Mon Mar 27 11:37:33 2017 -0700

    bonding: improve link-status update in mii-monitoring

and

commit d94708a553022bf012fa95af10532a134eeb5a52
Author: WANG Cong <email address hidden>
Date: Tue Jul 25 09:44:25 2017 -0700

    bonding: commit link status change after propose

backported to 4.4.0-120 (in the order above; the second is a fix to the first).

The first patch initially appears in 4.12-rc1, the second in 4.13.
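
For reference, such a backport is typically prepared roughly like this (a sketch; the tree URL and tag are assumptions, and any context conflicts would need resolving by hand):

git clone https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/xenial
cd xenial
git checkout Ubuntu-4.4.0-120.144
git cherry-pick de77ecd4ef02ca783f7762e04e92b3d0964be66b
git cherry-pick d94708a553022bf012fa95af10532a134eeb5a52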

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the pointer, Jay! I'll build a Xenial test kernel with these two commits and post a link to it.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with the two commits pointed out by Jay. The test kernel also required commit f307668bfc as a prereq.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753662

Revision history for this message
Nobuto Murata (nobuto) wrote :

#146~lp1753662ThreeCommits is somewhat better (failure rate down from around 40% to 20%).

Failure rate: 187/470 (39.8%), 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018
Failure rate: 87/222 (39.2%), 4.4.0-120-generic #144-Ubuntu SMP Thu Apr 5 14:11:49 UTC 2018
Failure rate: 138/712 (19.4%), 4.4.0-122-generic #146~lp1753662ThreeCommits SMP Fri Apr 27 16:52:26 UTC 2018
Failure rate: 308/696 (44.3%), 4.4.0-040400-generic #201601101930 SMP Mon Jan 11 00:32:41 UTC 2016
Failure rate: 117/726 (16.1%), 4.4.0-040400-generic #201803261439 SMP Mon Mar 26 14:43:35 UTC 2018
Failure rate: 4/277 (1.4%), 4.12.0-041200rc3-generic #201804132111 SMP Fri Apr 13 21:13:47 UTC 2018
Failure rate: 0/407 (0.0%), 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018

Revision history for this message
Jay Vosburgh (jvosburgh) wrote :

We've seen a similar-sounding issue in the past, but couldn't get it tracked down to the root cause.

Is it possible to enable some instrumentation in /etc/network/interfaces and obtain some data on a failing occurrence?

What we've used in the past is adding something like

pre-up echo 'file bond_3ad.c +p' > /sys/kernel/debug/dynamic_debug/control
pre-up echo 'file bond_main.c +p' > /sys/kernel/debug/dynamic_debug/control

to the /e/n/i section for the bond itself, and

post-up tcpdump -U -p -w /tmp/eth4.td -i eth4 ether proto 0x8809 &

to the sections for each slave in the bond (adjusting the "eth4" above to the actual interface name).

The bond debug will appear in the kernel log, and the tcpdump data will have to be copied from the output file specified on the tcpdump command line (and the tcpdump process terminated if need be).
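
Put together, the instrumented /etc/network/interfaces stanzas would look roughly like this (a sketch; bond0 with slaves eth4 and eth5, and the address, are placeholders for the actual configuration):

auto bond0
iface bond0 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    bond-slaves eth4 eth5
    bond-mode 802.3ad
    bond-miimon 100
    pre-up echo 'file bond_3ad.c +p' > /sys/kernel/debug/dynamic_debug/control
    pre-up echo 'file bond_main.c +p' > /sys/kernel/debug/dynamic_debug/control

auto eth4
iface eth4 inet manual
    bond-master bond0
    post-up tcpdump -U -p -w /tmp/eth4.td -i eth4 ether proto 0x8809 &

auto eth5
iface eth5 inet manual
    bond-master bond0
    post-up tcpdump -U -p -w /tmp/eth5.td -i eth5 ether proto 0x8809 &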

Revision history for this message
Nivedita Singhvi (niveditasinghvi) wrote :

Hi Joseph,

We're continuing the investigation into this issue, and I was wondering
if you and Nobuto could summarize the last point you had reached,
and/or the next step you were going to take.

From what I can summarize (please confirm/correct):

* Artful (4.13.*) kernels (with any Artful config) are good
* Artful (4.13.*) kernels (with any Xenial config) are also bad

* 4.12-rc4 - relatively good (1.x%) but still not 0% (<5%)
* 4.12-rc3 - also bad (~ 27%)

* Xenial (4.4.*) kernels (with any Xenial config) are bad
* Xenial (4.4.*) kernels (with any Artful config) are still bad

[data point: 4.12-rc4 with Artful configs is good. 4.12-rc4 with Xenial configs is bad.]

So a kernel change + config change results in masked/fixed behavior, I guess?

Is the remaining bisect window basically 4.12-rc4 -> 4.13 ?

Revision history for this message
Nivedita Singhvi (niveditasinghvi) wrote :

I would have thought this would be the relevant patch:

commit 4d2c0cda07448ea6980f00102dc3964eb25e241c
Author: Mahesh Bandewar
Committed by davem330 on Sep 28, 2017

    bonding: speed/duplex update at NETDEV_UP event

However, it was first available in v4.15-rc1.

At least as far as bonding kernel changes go, there does not
seem to be another obvious candidate that might have fixed this
problem between 4.12 and 4.13 (on a first skim).

At least for one scenario I looked at, we got a bad speed/duplex
setting, which eventually ended up with the bond interface
aggregating on a separate port, and/or ending up in LACP DISABLED
state which it never got out of. We only checked correct/latest
device speed/duplex settings via the NETDEV_CHANGE path, where
we called _ethtool_get_settings(). If we don't receive a change
event again to correct the speed/duplex, we never recover.

There are some other patches which help address this at different
points, but they are either before or after the window (see above).

I'll take a look at code outside the bonding dir which might
impact this.

Joseph, could you provide the raw config files you used as well?
It was not super clear in the png image if those were the only
diffs. They did not seem very relevant diffs either.

Revision history for this message
Jeffrey Honig (jchonig) wrote :

We are also seeing this running trusty with the HWE kernel (i.e. 4.4) in which we have upgraded to the upstream i40e drivers.

When we started using i40e 2.0.26 we found that we needed to add

pre-up sleep 15

for bond0 and this seems to work all the time.

However, when using i40e 2.3.6 or 2.4.6 we find that the sleep does not work most of the time, even when increasing it significantly.

Manually running "ifdown bond0 && ifup bond0 && ifdown eth0 && ifup eth0 && ifdown eth1 && ifup eth1" does result in bringing bond0 up, but is less than ideal.

Will the debugging output you requested be useful from a trusty system (currently with 4.4.0-116)?

Thanks.

Jeff

Revision history for this message
Nivedita Singhvi (niveditasinghvi) wrote :

Jeff,

Please do provide your logs and whatever other information you can share from your error case, any piece of info will help here. I do not yet have a repro environment myself.

I suspect that most of the changes which seem to help or fix the issue are simply changing the timing enough to affect the race window, making it less likely to occur, so are masking the problem rather than fixing the root cause.
