[i40e] LACP bonding start up race conditions
- Xenial (16.04)
- Bug #1753662
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Triaged | High | Unassigned |
Xenial | Triaged | High | Unassigned |
Bug Description
When provisioning multiple Ubuntu servers with MAAS at once, some bonding pairs end up with an unexpected LACP status such as "Expired". It happens randomly on each provisioning with the default xenial kernel (4.4), but is not reproducible with the HWE kernel (4.13). I'm using Intel X710 cards (Dell-branded).
Using the HWE kernel is a short-term workaround, but it's not ideal since 4.13 is not covered by the Canonical Livepatch service.
How to reproduce:
1. configure LACP bonding with MAAS
2. provision machines
3. check the bonding status in /proc/net/
frequency of occurrence:
About 5 of 22 bond pairs on each provisioning.
[reproducible combination]
$ uname -a
Linux comp006 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ sudo ethtool -i eno1
driver: i40e
version: 1.4.25-k
firmware-version: 6.00 0x800034e6 18.3.6
expansion-
bus-info: 0000:01:00.0
supports-
supports-test: yes
supports-
supports-
supports-
[non-reproducible combination]
$ uname -a
Linux comp006 4.13.0-36-generic #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ sudo ethtool -i eno1
driver: i40e
version: 2.1.14-k
firmware-version: 6.00 0x800034e6 18.3.6
expansion-
bus-info: 0000:01:00.0
supports-
supports-test: yes
supports-
supports-
supports-
ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-
ProcVersionSign
Uname: Linux 4.4.0-116-generic x86_64
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Mar 6 06:37 seq
crw-rw---- 1 root audio 116, 33 Mar 6 06:37 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.15
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Tue Mar 6 06:46:32 2018
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
Bus 002 Device 002: ID 8087:8002 Intel Corp.
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 003: ID 413c:a001 Dell Computer Corp. Hub
Bus 001 Device 002: ID 8087:800a Intel Corp.
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. PowerEdge R730
PciMultimedia:
ProcEnviron:
TERM=screen
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=
RelatedPackageV
linux-
linux-
linux-firmware 1.157.17
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 08/16/2017
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.5.5
dmi.board.name: 072T6D
dmi.board.vendor: Dell Inc.
dmi.board.version: A08
dmi.chassis.
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.
dmi.product.name: PowerEdge R730
dmi.sys.vendor: Dell Inc.
Nobuto Murata (nobuto) wrote : | #1 |
- CRDA.txt (422 bytes, text/plain; charset="utf-8")
- CurrentDmesg.txt (119.0 KiB, text/plain; charset="utf-8")
- Dependencies.txt (2.3 KiB, text/plain; charset="utf-8")
- JournalErrors.txt (5.9 KiB, text/plain; charset="utf-8")
- Lspci.txt (175.4 KiB, text/plain; charset="utf-8")
- ProcCpuinfo.txt (66.2 KiB, text/plain; charset="utf-8")
- ProcCpuinfoMinimal.txt (1.2 KiB, text/plain; charset="utf-8")
- ProcInterrupts.txt (306.9 KiB, text/plain; charset="utf-8")
- ProcModules.txt (4.7 KiB, text/plain; charset="utf-8")
- UdevDb.txt (357.0 KiB, text/plain; charset="utf-8")
- WifiSyslog.txt (143.3 KiB, text/plain; charset="utf-8")
Nobuto Murata (nobuto) wrote : | #2 |
Nobuto Murata (nobuto) wrote : | #3 |
Nobuto Murata (nobuto) wrote : | #4 |
Nobuto Murata (nobuto) wrote : | #5 |
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed | #6 |
This change was made by a bot.
Changed in linux (Ubuntu): | |
status: | New → Confirmed |
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Xenial): | |
status: | New → Triaged |
importance: | Undecided → High |
Changed in linux (Ubuntu): | |
status: | Confirmed → Triaged |
Joseph Salisbury (jsalisbury) wrote : | #7 |
There is one LACP commit that sticks out between v4.4 and v4.13:
c15e07b02bf0 ("team: loadbalance: push lacpdus to exact delivery")
I built a Xenial test kernel with this commit. The test kernel can be downloaded from:
http://
Can you test this kernel and see if it resolves this bug?
Note: to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.
Thanks in advance!
Changed in linux (Ubuntu): | |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu Xenial): | |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Joseph Salisbury (jsalisbury) wrote : | #8 |
If that commit doesn't fix the issue, we can perform a "Reverse" bisect between 4.4 and 4.13 to find the fix.
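For reference, this kind of "reverse" bisect can be driven with git's custom bisect terms. Below is a minimal sketch on a throwaway repository; the demo commits and the "broken"/"fixed" tags stand in for the real v4.4 and v4.13 kernel trees, and this is not necessarily the exact procedure used here.

```shell
#!/bin/sh
# git bisect normally hunts the commit that *introduced* a bug; to hunt
# the commit that *fixed* one, swap the roles with custom terms.
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email demo@example.com
git config user.name demo
for i in 1 2 3 4 5; do
    git commit -q --allow-empty -m "commit $i"   # stand-ins for kernel history
done
git bisect start --term-old=broken --term-new=fixed
git bisect broken HEAD~4   # oldest kernel known to show the bug (v4.4's role)
git bisect fixed HEAD      # newest kernel known to be free of it (v4.13's role)
# git now checks out midpoints; build and test each one, then mark it with
# "git bisect broken" or "git bisect fixed" until the first fixed commit is named.
git bisect log
```

With the terms swapped, git converges on the first commit that makes the bug disappear instead of the first one that makes it appear.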
Nobuto Murata (nobuto) wrote : | #9 |
The kernel with c15e07b02bf0 didn't make a difference to the race condition; the issue is still reproducible. Let me know when you need more testing with different kernels.
So far, I'm using the rc.local below to reboot the same node multiple times.
====
#!/bin/sh
exec >> /root/bond_
echo
echo '######
echo
date -R
uname -a
ethtool -i eno1
modinfo i40e
path=$(modinfo i40e | grep filename: | awk '{print $2}')
sha256sum "$path"
package=$(dpkg -S "$path" | cut -d: -f1)
apt-cache policy "$package"
if [ "$(grep state /proc/net/
echo
echo '*** Unexpected LACP status ***'
grep -r '.*' /proc/net/
fi
sleep 300
reboot
exit 0
====
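The script above arrived truncated, so here is a self-contained sketch of the core check it performs. The layout under /proc/net/bonding is the bonding driver's standard status interface, but the exact grep pattern and the bond name are assumptions for illustration, not recovered from the original.

```shell
#!/bin/sh
# check_lacp FILE: print "ok" when every "port state" line in a bonding
# status file carries the same value, "mismatch" otherwise.
check_lacp() {
    states=$(grep 'port state:' "$1" | awk -F: '{ print $2 }' | sort -u)
    if [ "$(printf '%s\n' "$states" | grep -c .)" -eq 1 ]; then
        echo ok
    else
        echo mismatch
    fi
}
# On a live system (bond name assumed):
# check_lacp /proc/net/bonding/bond0
```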
Nobuto Murata (nobuto) wrote : | #10 |
Nobuto Murata (nobuto) wrote : | #11 |
Joseph Salisbury (jsalisbury) wrote : | #12 |
To perform a "Reverse" bisect, we need to identify the last kernel version that had the bug and the first kernel version that does not.
Can you test the following upstream kernels:
4.4 Final: http://
4.6 Final: http://
4.8 Final: http://
4.10 Final: http://
4.13 Final: http://
You don't have to test every one of them, just go up until we know the first kernel version that does not have the bug.
Nobuto Murata (nobuto) wrote : | #13 |
Thanks. I was running some tests with existing HWE kernels in the xenial repo, like linux-image-
Let me double-check with those two:
4.10 Final: http://
4.13 Final: http://
Joseph Salisbury (jsalisbury) wrote : | #14 |
To narrow it down further, can you also test the following kernels:
v4.11 final: http://
v4.13-rc1: http://
Nobuto Murata (nobuto) wrote : | #15 |
- bond_check_xenial_mainline_4.10.log (20.8 KiB, text/plain)
Reproducible with:
4.10 Final: http://
The next test will be with:
v4.11 final: http://
Nobuto Murata (nobuto) wrote : | #16 |
- bond_check_xenial_mainline_4.11.log (20.7 KiB, text/plain)
Reproducible with:
v4.11 final: http://
The next is v4.13-rc1.
Nobuto Murata (nobuto) wrote : | #17 |
- bond_check_xenial_mainline_4.13-rc1.log (11.5 KiB, text/plain)
Not reproducible with 4.13-rc1 with 5 reboots.
4.10 - bad
4.11 - bad
4.13-rc1 - good
The next is 4.12.
Nobuto Murata (nobuto) wrote : | #18 |
- bond_check_xenial_mainline_4.12.log (11.3 KiB, text/plain)
v4.11 with i40e 1.6.27 - bad
v4.12 with i40e 2.1.14 - good
next: v4.12-rc1 with i40e 2.1.7 - ?
Joseph Salisbury (jsalisbury) wrote : | #19 |
If v4.12-rc1 is still bad, we would need to test some of the other release candidates, such as rc2, rc3, rc4, etc.
Once we have the last bad and first good, I'll start the reverse bisect and build a kernel.
Nobuto Murata (nobuto) wrote : | #20 |
- bond_check_xenial_mainline_4.12-rc1.log (31.1 KiB, text/plain)
v4.12-rc1 with i40e 2.1.7 - bad
v4.12 with i40e 2.1.14 - good
I'm running out of time, so more bisections will have to wait for tomorrow.
Nobuto Murata (nobuto) wrote : | #21 |
Correction. I thought v4.12-rc1 had i40e 2.1.7 because of:
https:/
But it actually has 2.1.14 from the log output.
So the correct status is:
v4.12-rc1 with i40e 2.1.14 - bad
v4.12 with i40e 2.1.14 - good
Nobuto Murata (nobuto) wrote : | #22 |
I ran an overnight test with v4.12 just to make sure it really fixed the issue. The issue still happened occasionally, but far less frequently. We may need to run the "good" cases longer, since the fix may not be a single patch. Anyway, the current status is:
v4.12-rc1 with i40e 2.1.14 - bad (3 out of 3)
v4.12 with i40e 2.1.14 - relatively good (5 out of 68)
The next test would be with v4.12-rc4.
Nobuto Murata (nobuto) wrote : | #23 |
- bond_check_xenial_mainline_4.13_full.log (93.2 KiB, text/plain)
For the record,
v4.12-rc1 - bad (3 out of 3)
v4.12 - relatively good (5 out of 68)
v4.13 - good (0 out of 41)
Nobuto Murata (nobuto) wrote : | #24 |
Joseph Salisbury (jsalisbury) wrote : | #25 |
Do you happen to have results from any of the other release candidates, such as 4.12-rc4?
Nobuto Murata (nobuto) wrote : | #26 |
- bond_check_xenial_mainline_4.12-rc4_full.log (167.4 KiB, text/plain)
I have let rc4 run for hours.
v4.12-rc1 - bad (3 out of 3)
v4.12-rc4 - relatively good (1 out of 70)
v4.12 - relatively good (5 out of 68)
v4.13 - good (0 out of 41)
I will let rc3 run overnight.
Nobuto Murata (nobuto) wrote : | #27 |
- bond_check_xenial_mainline_4.12-rc3_full.log (399.4 KiB, text/plain)
Here are the rc3 results; I will test rc2 next.
v4.12-rc1 - bad (3 out of 3)
v4.12-rc3 - mixture result (24 out of 90)
v4.12-rc4 - relatively good (1 out of 70)
v4.12 - relatively good (5 out of 68)
v4.13 - good (0 out of 41)
Nobuto Murata (nobuto) wrote : | #28 |
- bond_check_xenial_mainline_4.12-rc2_full.log (242.2 KiB, text/plain)
Here is the rc2 result. It looks like there is a noticeable difference between v4.12-rc3 and v4.12-rc4.
@Joseph, can you please start looking into diffs? I'm keeping one dedicated node just for this testing, so I can run the same script one by one for more bisections.
v4.12-rc1 - bad (3 of 3)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 out of 68 - 7.4%)
v4.13 - good (0 out of 41 - 0%)
FWIW, I will run the same test with xenial's 4.4 kernel to make sure that around 30% is the baseline for "bad".
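As a sanity check, the percentages quoted in these tallies follow directly from the boot counts (bad boots over total boots); a quick recomputation of a few of them:

```shell
#!/bin/sh
# Recompute failure rates from "bad/total" boot counts quoted above.
for r in 15/53 1/70 5/68 0/41; do
    bad=${r%/*}; total=${r#*/}
    pct=$(awk "BEGIN { printf \"%.1f\", 100 * $bad / $total }")
    echo "$r -> $pct%"
done
```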
Joseph Salisbury (jsalisbury) wrote : | #29 |
I started a kernel bisect between v4.12-rc3 and v4.12-rc4. The kernel bisect will require testing of about 7-10 test kernels.
I built the first test kernel, up to the following commit:
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Nobuto Murata (nobuto) wrote : | #30 |
- bond_check_xenial_4.4.0-116_full.log (142.3 KiB, text/plain)
OK, 25%-30% seems to be the baseline. I'd like to confirm that v4.13 is really 0% with a longer-running test, but I will do the bisection between v4.12-rc3 and v4.12-rc4 first.
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
Joseph Salisbury (jsalisbury) wrote : | #31 |
I just noticed I forgot to paste the SHA1 for the test kernel posted in comment #29:
ff5a20169b98d84
Do you have results from that kernel? Once you do, I'll update the bisect and build the next kernel.
Nobuto Murata (nobuto) wrote : | #32 |
So far 0 of 6 with 4.12.0-
Nobuto Murata (nobuto) wrote : | #33 |
- bond_check_xenial_4.12.0-041200rc3_201803080803.log (50.1 KiB, text/plain)
4.12.0-
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc3 - good (0 of 22 - 0%) #201803080803 ff5a20169b98d84
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
Joseph Salisbury (jsalisbury) wrote : | #34 |
I built the next test kernel, up to the following commit:
0bb230399fd337c
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Nobuto Murata (nobuto) wrote : | #35 |
I ran the xenial HWE kernel overnight; the result was 0/119. The next test is with:
v4.12.0-041200rc3 #201803081620 - ? 0bb230399fd337c
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36(xenial HW) - good (0 of 119 - 0%)
Nobuto Murata (nobuto) wrote : | #36 |
Nobuto Murata (nobuto) wrote : | #37 |
- bond_check_xenial_4.12.0-041200rc3_201803081620.log (90.0 KiB, text/plain)
0bb230399fd337c
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337c
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36(xenial HW) - good (0 of 119 - 0%)
Joseph Salisbury (jsalisbury) wrote : | #38 |
I built the next test kernel, up to the following commit:
400129f0a3ae989
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Nobuto Murata (nobuto) wrote : | #39 |
- bond_check_xenial_4.12.0-041200rc3_201803090724.log (191.4 KiB, text/plain)
400129f0a3ae989
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337c
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36(xenial HW) - good (0 of 119 - 0%)
Joseph Salisbury (jsalisbury) wrote : | #40 |
I built the next test kernel, up to the following commit:
25f480e89a022d3
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Nobuto Murata (nobuto) wrote : | #41 |
- bond_check_xenial_4.12.0-041200rc3_201803121355.log (581.7 KiB, text/plain)
25f480e89a022d3
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc3 #201803121355 - relatively good (1 of 252 - 0.4%) 25f480e89a022d3
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337c
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36(xenial HW) - good (0 of 119 - 0%)
Joseph Salisbury (jsalisbury) wrote : | #42 |
I built the next test kernel, up to the following commit:
d38162e4b5c6437
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Nobuto Murata (nobuto) wrote : | #43 |
@Joseph,
Will do. Just as a possibility, I could build kernels on this host if that's helpful, since it is already reserved for this testing and has hundreds of GBs of memory and many CPU cores. If you have a pointer on how to replicate your build process, that would be great.
Nobuto Murata (nobuto) wrote : | #44 |
Oh wait,
> I built the next test kernel, up to the following commit:
> d38162e4b5c6437
>
> The test kernel can be downloaded from:
> http://
d38162e4b5c6437
https:/
However, the link says "rc1".
http://
Can you confirm that the 4.12.0-041200rc1 is the expected kernel test?
Joseph Salisbury (jsalisbury) wrote : | #45 |
Yes, that is the correct kernel. The mainline-build-one script uses the 'git describe' command to come up with the name, and that command returns the closest preceding tag, not the first tag that contains the commit. So in the case of commit d38162e4b5c6437:
git describe d38162e4b5c6437
v4.12-rc1-
However, that test kernel is actually using commits in -rc4, which you can see if the '--contains' option is given to git describe:
git describe --contains d38162e4b5c6437
v4.12-rc4~
I can continue to build the kernels; there are about 3 left. It only takes me about 15 minutes to build one.
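The naming quirk described here is easy to reproduce on any repository. A sketch on a throwaway repo (the tag names merely mimic the kernel's):

```shell
#!/bin/sh
# `git describe` names a commit after the nearest tag *behind* it, while
# `git describe --contains` names the first tag that *includes* it.
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m base
git tag -a v4.12-rc1 -m v4.12-rc1
git commit -q --allow-empty -m "the commit under test"
git commit -q --allow-empty -m followup
git tag -a v4.12-rc4 -m v4.12-rc4
git describe HEAD~1             # v4.12-rc1-1-g<sha>  (what named the .deb)
git describe --contains HEAD~1  # v4.12-rc4~1         (the tag that matters)
```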
Nobuto Murata (nobuto) wrote : | #46 |
The test is still in progress, but so far d38162e4b5c6437
Nobuto Murata (nobuto) wrote : | #47 |
- bond_check_xenial_4.12.0-041200rc1_201803131457.log (219.8 KiB, text/plain)
d38162e4b5c6437
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc1 #201803131457 - relatively good (1 of 93 - 1.1%) d38162e4b5c6437
v4.12.0-041200rc3 #201803121355 - relatively good (1 of 252 - 0.4%) 25f480e89a022d3
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337c
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36(xenial HW) - good (0 of 119 - 0%)
Joseph Salisbury (jsalisbury) wrote : | #48 |
I built the next test kernel, up to the following commit:
171d8b9363725e1
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Nobuto Murata (nobuto) wrote : | #49 |
The test is still in progress, but so far 171d8b9363725e1
Joseph Salisbury (jsalisbury) wrote : | #50 |
I built the next test kernel, up to the following commit:
4681ee21d62cfed
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
tags: | added: cdo-qa-blocker |
Nobuto Murata (nobuto) wrote : | #51 |
- bond_check_xenial_4.12.0-041200rc1_201803141333.log (502.1 KiB, text/plain)
171d8b9363725e1
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc1 #201803141333 - relatively good (1 of 217 - 0.5%) 171d8b9363725e1
v4.12.0-041200rc1 #201803131457 - relatively good (1 of 93 - 1.1%) d38162e4b5c6437
v4.12.0-041200rc3 #201803121355 - relatively good (1 of 252 - 0.4%) 25f480e89a022d3
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337c
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36(xenial HW) - good (0 of 119 - 0%)
Nobuto Murata (nobuto) wrote : | #52 |
4681ee21d62cfed
4.4.0-116(xenial) - bad (9 of 31 - 29.0%)
v4.12-rc2 - bad (15 of 53 - 28.3%)
v4.12-rc3 - bad (24 of 90 - 26.6%)
====
v4.12.0-041200rc1 #201803141835 - relatively good (1 of 113 - 0.9%) 4681ee21d62cfed
v4.12.0-041200rc1 #201803141333 - relatively good (1 of 217 - 0.5%) 171d8b9363725e1
v4.12.0-041200rc1 #201803131457 - relatively good (1 of 93 - 1.1%) d38162e4b5c6437
v4.12.0-041200rc3 #201803121355 - relatively good (1 of 252 - 0.4%) 25f480e89a022d3
v4.12.0-041200rc3 #201803090724 - relatively good (2 of 77 - 2.6%) 400129f0a3ae989
v4.12.0-041200rc3 #201803081620 - relatively good (1 of 36 - 2.8%) 0bb230399fd337c
v4.12.0-041200rc3 #201803080803 - good (0 of 22 - 0%) ff5a20169b98d84
v4.12-rc4 - relatively good (1 of 70 - 1.4%)
v4.12 - relatively good (5 of 68 - 7.4%)
v4.13 - good (0 of 41 - 0%)
4.13.0-36(xenial HW) - good (0 of 119 - 0%)
Nobuto Murata (nobuto) wrote : | #53 |
Joseph Salisbury (jsalisbury) wrote : | #54 |
The reverse bisect reported the following commit as the fix, but I'm doubtful since it's an i915 commit:
commit 4681ee21d62cfed
Author: Joonas Lahtinen <email address hidden>
Date: Thu May 18 11:49:39 2017 +0300
drm/i915: Do not sync RCU during shrinking
We may have gone wrong somewhere in the bisect. However, just to be sure, I built a v4.12-rc4 test kernel. This kernel should be bad and contain the bug. If it does not, it may be due to the configs I'm using to build the test kernels.
My v4.12-rc4 test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not?
Nobuto Murata (nobuto) wrote : | #55 |
> We may have went wrong somewhere in the bisect. However, just to be sure, I built a v4.12-rc4 test kernel. This kernel should be bad and contain the bug. If it does not, it may be due to the configs I'm using to build the test kernels.
I'm not following, since I thought we had already tested that and it was good:
> v4.12-rc4 - relatively good (1 of 70 - 1.4%)
But anyway, I will let it run with your newly built one.
Joseph Salisbury (jsalisbury) wrote : | #56 |
Yes, sorry this is a "Reverse" bisect, so v4.12-rc4 should be good and not bad.
I'm also going to build a v4.12-rc3 kernel with my configs to confirm it's bad. I'll post that shortly.
Joseph Salisbury (jsalisbury) wrote : | #57 |
A re-build of v4.12-rc3 is now available here:
http://
Can you confirm that this kernel is bad and contains the bug?
Nobuto Murata (nobuto) wrote : | #58 |
v4.12-rc4 is good, 1 of 146. Going to test v4.12-rc3.
Nobuto Murata (nobuto) wrote : | #59 |
The new build of v4.12-rc3 is a good build (2 of 151).
4.12.0-
v4.12-rc3 - bad (24 of 90 - 26.6%)
http://
v4.12-rc3 - relatively good (2 of 151 - 1.3%)
http://
So what is the difference between those two? The build config?
Nobuto Murata (nobuto) wrote : | #60 |
Joseph Salisbury (jsalisbury) wrote : | #61 |
I built another v4.12-rc3 test kernel. This time with Xenial configs instead of Artful configs. This test kernel can be downloaded from:
http://
Can you see if this kernel exhibits the bug?
Joseph Salisbury (jsalisbury) wrote : | #62 |
Sorry the correct link for the new 4.12-rc3 kernel with Xenial configs is:
http://
Nobuto Murata (nobuto) wrote : | #63 |
Ok, we see some differences with the three kernels. How do we want to proceed from here?
v4.12-rc3 - bad (24 of 90 - 26.6%)
http://
v4.12-rc3 #201803151851 - relatively good (2 of 151 - 1.3%)
http://
(artful configs)
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
http://
(xenial configs)
Joseph Salisbury (jsalisbury) wrote : | #64 |
It's good that the v4.12-rc3 with xenial configs was bad. It means we should use Xenial configs when performing the bisect and not Artful configs. I'll kick off another bisect and post the first test kernel.
Joseph Salisbury (jsalisbury) wrote : | #65 |
I restarted the bisect. This time using Xenial configs and not Artful configs.
I built the first test kernel, up to the following commit:
ff5a20169b98d84
The test kernel can be downloaded from:
http://
You've tested this SHA1 in prior test kernels. However, that test kernel was built with the Artful configs, and this one is with the Xenial configs.
Nobuto Murata (nobuto) wrote : | #66 |
ff5a20169b98d84
[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 - relatively bad (21 of 150 - 14.0%)
Nobuto Murata (nobuto) wrote : | #67 |
ff5a20169b98d84
[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 - relatively bad (21 of 150 - 14.0%) ff5a20169b98d84
Joseph Salisbury (jsalisbury) wrote : | #68 |
I built the next test kernel, up to the following commit:
ea094f3c830a67f
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Nobuto Murata (nobuto) wrote : | #69 |
ea094f3c830a67f
[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 - relatively bad (21 of 150 - 14.0%) ff5a20169b98d84
v4.12.0-041200rc3 #201803200004 - relatively bad (6 of 56 - 10.7%) ea094f3c830a67f
Joseph Salisbury (jsalisbury) wrote : | #70 |
I built the next test kernel, up to the following commit:
55cbdaf6399de16
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Nobuto Murata (nobuto) wrote : | #71 |
I was pretty occupied today, so I'm going to test 55cbdaf6399de16
[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 - relatively bad (21 of 150 - 14.0%) ff5a20169b98d84
v4.12.0-041200rc3 #201803200004 - relatively bad (60 of 499 - 12.0%) ea094f3c830a67f
Nobuto Murata (nobuto) wrote : | #72 |
55cbdaf6399de16
[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 - relatively bad (21 of 150 - 14.0%) ff5a20169b98d84
v4.12.0-041200rc3 #201803200004 - relatively bad (60 of 499 - 12.0%) ea094f3c830a67f
v4.12.0-041200rc3 #201803201426 - relatively bad (30 of 208 - 14.4%) 55cbdaf6399de16
Nobuto Murata (nobuto) wrote : | #73 |
BTW, have we set the baseline of "good" in this bisection with xenial configs?
> 4.13.0-36(xenial HW) - good (0 of 119 - 0%)
Does HWE kernel man with xenial configs? Or was it built with the source release config i.e. artful?
Nobuto Murata (nobuto) wrote : | #74 |
Correction: Does HWE kernel mean it's with xenial configs? Or was it built with the source release config i.e. artful?
Joseph Salisbury (jsalisbury) wrote : | #75 |
The HWE kernel was built with the Artful configs. I restarted the bisect using the Xenial configs, marking 4.12-rc4 as good and 4.12-rc3 as bad. We should re-test those to confirm we are going down the right path. I built a 4.12-rc4 kernel with Xenial configs, which can be downloaded from:
http://
If this kernel ends up being bad and not good, then the bisect should be stopped. It may be that a patch did not fix this bug at all, just a change in one of the config options. If that is the case, I'll review the diff between Xenial and Artful configs in more detail.
Nobuto Murata (nobuto) wrote : | #76 |
4.12-rc4 kernel with Xenial configs looks bad.
[xenial configs]
v4.12-rc3 #201803161156 - relatively bad (36 of 249 - 14.5%)
v4.12.0-041200rc3 #201803191316 - relatively bad (21 of 150 - 14.0%) ff5a20169b98d84
v4.12.0-041200rc3 #201803200004 - relatively bad (60 of 499 - 12.0%) ea094f3c830a67f
v4.12.0-041200rc3 #201803201426 - relatively bad (30 of 208 - 14.4%) 55cbdaf6399de16
v4.12.0-041200rc4 #201803221452 - relatively bad (60 of 595 - 10.1%)
Is it possible to build the xenial kernel (4.4) with the artful config, just for testing?
Joseph Salisbury (jsalisbury) wrote : | #77 |
I built a 4.4 kernel using the Artful configs, it can be downloaded from:
http://
Nobuto Murata (nobuto) wrote : Re: [Bug 1753662] Re: [i40e] LACP bonding start up race conditions | #78 |
On Tue, 27 Mar 2018 at 00:11, Joseph Salisbury <email address hidden> wrote:
> I built a 4.4 kernel using the Artful configs, it can be downloaded from:
> http://
Thanks. I just lost access to the machine today, so I have to use another host in a different data center. Will resume testing in a few days.
Joseph Salisbury (jsalisbury) wrote : | #79 |
Thanks for the update. I'll start comparing the configs between Xenial and Artful to see if the change that caused this sticks out.
Joseph Salisbury (jsalisbury) wrote : | #80 |
- ConfigDiffs-rc3-rc4.png (947.0 KiB, image/png)
Just curious if you had a chance to test the kernel posted in #77?
I compared the configs between these two kernels:
v4.12-rc3: http://
v4.12-rc4: http://
Nothing sticks out as a fix in v4.12-rc4. I'll attach a screen shot of the diffs.
We could try reverse bisecting with the Ubuntu kernels instead of mainline kernels.
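Besides eyeballing a screenshot, config differences can be isolated textually. A minimal sketch with made-up fragments standing in for the real Xenial/Artful config files (the kernel tree also ships scripts/diffconfig for this job):

```shell
#!/bin/sh
# List options present in only one of two kernel configs.
set -e
work=$(mktemp -d) && cd "$work"
cat > xenial.config <<'EOF'
CONFIG_HZ_250=y
CONFIG_PREEMPT_NONE=y
EOF
cat > artful.config <<'EOF'
CONFIG_HZ_250=y
CONFIG_PREEMPT_VOLUNTARY=y
EOF
sort xenial.config > a.sorted
sort artful.config > b.sorted
# comm -3 drops lines common to both files; column 1 = only in the first,
# indented column 2 = only in the second.
comm -3 a.sorted b.sorted
```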
Nobuto Murata (nobuto) wrote : | #81 |
The new environment is not fully up yet to test. ETA would be by the end of this week.
Nobuto Murata (nobuto) wrote : | #82 |
Finally got a machine up and running. Will resume testing shortly.
Nobuto Murata (nobuto) wrote : | #83 |
FWIW, a kernel trace happens with the kernel from:
https:/
But I will let it run anyway, since I'm not sure whether it affects the testing or not.
====
[ 5.999557] rtc_cmos 00:00: setting system clock to 2018-04-11 15:52:23 UTC (1523461943)
[ 5.999637] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 5.999637] EDD information not available.
[ 6.856436] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x229834d6ac9, max_idle_ns: 440795285103 ns
[ 6.872996] Freeing unused kernel memory: 1844K (ffffffff81f71000 - ffffffff8213e000)
[ 6.885727] Write protecting the kernel read-only data: 14336k
[ 6.897566] Freeing unused kernel memory: 1964K (ffff880001815000 - ffff880001a00000)
[ 6.911315] Freeing unused kernel memory: 272K (ffff880001dbc000 - ffff880001e00000)
[ 6.924438] ------------[ cut here ]------------
[ 6.934158] WARNING: CPU: 16 PID: 1 at /home/jsalisbur
[ 6.958330] x86/mm: Found insecure W+X mapping at address ffff88000001000
[ 6.973431] Modules linked in:
[ 6.982498] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 4.4.0-040400-
[ 6.997629] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.5.5 08/16/2017
[ 7.012149] 0000000000000000 000000009d064fa9 ffff883f6576fd48 ffffffff813da8d8
[ 7.026822] ffff883f6576fd90 ffff883f6576fd80 ffffffff8107ec52 ffff883f6576fe90
[ 7.041667] 8000000000000163 0000000000000004 0000000000000000 0000000000000000
[ 7.056684] Call Trace:
[ 7.066183] [<ffffffff813da
[ 7.078824] [<ffffffff8107e
[ 7.092586] [<ffffffff8107e
[ 7.106139] [<ffffffff81070
[ 7.119215] [<ffffffff81070
[ 7.134087] [<ffffffff81071
[ 7.149121] [<ffffffff81066
[ 7.162852] [<ffffffff81801
[ 7.176273] [<ffffffff81801
[ 7.189752] [<ffffffff8180e
[ 7.203505] [<ffffffff81801
[ 7.217134] ---[ end trace 43f75b6421b8d01c ]---
[ 7.235677] x86/mm: Checked W+X mappings: FAILED, 235610 W+X pages found.
Nobuto Murata (nobuto) wrote : | #84 |
4.4 kernel using the Artful configs didn't make much difference.
Failure rate: 117/726 (16.1%), 4.4.0-040400-
I will let stock 4.4 and 4.13 HWE run just to confirm the occurrence rate on this host.
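Counting occurrences per boot presumably means scanning the bonding status files; a minimal sketch of such a check (the file layout and the "Expired"/"churned" marker strings are assumptions based on typical 802.3ad output in /proc/net/bonding, not taken from this report):

```shell
#!/bin/sh
# Flag a bond as failed if its status file shows an expired or churned
# LACP state; the marker strings here are illustrative assumptions.
bond_failed() {
    grep -qiE 'expired|churned' "$1"
}

# Demo against a hypothetical excerpt of /proc/net/bonding/bond0:
cat > /tmp/bond0.sample <<'EOF'
Slave Interface: eno1
MII Status: up
Partner Churn State: churned
EOF
if bond_failed /tmp/bond0.sample; then echo "FAIL"; else echo "OK"; fi
```

Run once per boot (or per provisioning cycle) and tally the FAIL lines to get the fail/total counts quoted in the thread.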
Nobuto Murata (nobuto) wrote : | #85 |
FWIW, I tried PCI hot-plugging as another way to get faster iterations without rebooting.
https:/
However, the issue wasn't reproducible with hot-plugging. Rebooting is the easiest reproduction so far.
Joseph Salisbury (jsalisbury) wrote : | #86 |
Thanks for testing. I would have thought 4.4 with the Artful configs would not have had the bug if a config change is the fix.
We know:
4.12-rc4 with Artful configs is good.
4.12-rc4 with Xenial configs is bad.
Any kernel with Xenial configs is bad.
It is possible that a patch in combination with a config change fixed the bug. We could try testing kernels starting at 4.4 and moving forward using the Artful configs to find where the bug stops happening.
We should first confirm that v4.12-rc3 with Artful configs is good before working backwards. I built that kernel and posted it here:
http://
Can you test that kernel when you have a chance?
Nobuto Murata (nobuto) wrote : | #88 |
I ran HWE 4.13 just to make sure the result is the same as on the previous host. And as we confirmed before, the issue is not reproducible with HWE 4.13.
[HWE 4.13]
Failure rate: 0/407 (0.0%), 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018
> We should first confirm that v4.12-rc3 with Artful configs is good before working backwards. I built that kernel and posted it here:
>
> http://
I will test it and report back early next week.
Nobuto Murata (nobuto) wrote : | #89 |
Ok, we have some numbers with the new host.
Failure rate: 45/112 (40.2%), 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018
Failure rate: 87/222 (39.2%), 4.4.0-120-generic #144-Ubuntu SMP Thu Apr 5 14:11:49 UTC 2018
Failure rate: 117/726 (16.1%), 4.4.0-040400-
Failure rate: 4/277 (1.4%), 4.12.0-
Failure rate: 0/407 (0.0%), 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018
The xenial stock kernel and -120 from -proposed (just to see if LP: #1723127 is related or not) have around 40% failure rate.
However, xenial kernel with artful config (4.4.0-
4.12.0-041200rc3 cuts the rate down further, and HWE eliminates the occurrences entirely.
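For reference, the percentages in these tallies are just fail/total; a throwaway one-liner (a sketch, not the actual script used here) reproduces them:

```shell
# Recompute the quoted failure rates from fail/total pairs (illustrative).
for pair in 45/112 87/222 117/726 4/277 0/407; do
    f=${pair%/*}; t=${pair#*/}
    awk -v f="$f" -v t="$t" -v p="$pair" \
        'BEGIN { printf "%s -> %.1f%%\n", p, 100 * f / t }'
done
```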
Nobuto Murata (nobuto) wrote : | #90 |
Nobuto Murata (nobuto) wrote : | #91 |
Just for the record, up-to-date numbers after the weekend.
Failure rate: 167/422 (39.6%), 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018
Failure rate: 87/222 (39.2%), 4.4.0-120-generic #144-Ubuntu SMP Thu Apr 5 14:11:49 UTC 2018
Failure rate: 117/726 (16.1%), 4.4.0-040400-
Failure rate: 4/277 (1.4%), 4.12.0-
Failure rate: 0/407 (0.0%), 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018
Jay Vosburgh (jvosburgh) wrote : | #92 |
I would suggest testing
commit de77ecd4ef02ca7
Author: Mahesh Bandewar <email address hidden>
Date: Mon Mar 27 11:37:33 2017 -0700
bonding: improve link-status update in mii-monitoring
and
commit d94708a553022bf
Author: WANG Cong <email address hidden>
Date: Tue Jul 25 09:44:25 2017 -0700
bonding: commit link status change after propose
backported to 4.4.0-120 (in the order above; the second is a fix to the first).
The first patch initially appears in 4.12-rc1, the second in 4.13.
Joseph Salisbury (jsalisbury) wrote : | #93 |
Thanks for the pointer, Jay! I'll build a Xenial test kernel with these two commits and post a link to it.
Joseph Salisbury (jsalisbury) wrote : | #94 |
I built a test kernel with the two commits pointed out by Jay. The test kernel also required commit f307668bfc as a prereq.
The test kernel can be downloaded from:
http://
Nobuto Murata (nobuto) wrote : | #95 |
#146~lp1753662T
Failure rate: 187/470 (39.8%), 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018
Failure rate: 87/222 (39.2%), 4.4.0-120-generic #144-Ubuntu SMP Thu Apr 5 14:11:49 UTC 2018
Failure rate: 138/712 (19.4%), 4.4.0-122-generic #146~lp1753662T
Failure rate: 308/696 (44.3%), 4.4.0-040400-
Failure rate: 117/726 (16.1%), 4.4.0-040400-
Failure rate: 4/277 (1.4%), 4.12.0-
Failure rate: 0/407 (0.0%), 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018
Jay Vosburgh (jvosburgh) wrote : | #96 |
We've seen a similar-sounding issue in the past, but couldn't get it tracked down to the root cause.
Is it possible to enable some instrumentation in the /etc/network/ configuration and obtain some data on a failing occurrence?
What we've used in the past is adding something like
pre-up echo 'file bond_3ad.c +p' > /sys/kernel/
pre-up echo 'file bond_main.c +p' > /sys/kernel/
to the /e/n/i section for the bond itself, and
post-up tcpdump -U -p -w /tmp/eth4.td -i eth4 ether proto 0x8809 &
to the sections for each slave in the bond (adjusting the "eth4" above to the actual interface name).
The bond debug will appear in the kernel log, and the tcpdump data will have to be copied from the output file specified on the tcpdump command line (and the tcpdump process terminated if need be).
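Spelled out as a complete /etc/network/interfaces stanza (bond0/eno1/eno2 are placeholder names, and the sysfs path shown is the standard dynamic-debug control file; treat this as a sketch of the suggestion above, not verbatim configuration):

```text
auto bond0
iface bond0 inet dhcp
    bond-mode 802.3ad
    bond-slaves eno1 eno2
    pre-up echo 'file bond_3ad.c +p' > /sys/kernel/debug/dynamic_debug/control
    pre-up echo 'file bond_main.c +p' > /sys/kernel/debug/dynamic_debug/control

iface eno1 inet manual
    bond-master bond0
    post-up tcpdump -U -p -w /tmp/eno1.td -i eno1 ether proto 0x8809 &
```

EtherType 0x8809 is the Slow Protocols type used by LACP, so each capture holds only the LACPDUs for that slave.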
Nivedita Singhvi (niveditasinghvi) wrote : | #97 |
Hi Joseph,
We're continuing the investigation into this issue, and I was wondering if you and Nobuto could share the last point you had reached and/or the next step you were planning.
From what I can summarize (please confirm/correct):
* Artful (4.13.*) kernels (with any Artful config) are good
* Artful (4.13.*) kernels (with any Xenial config) are also bad
* 4.12-rc4 - relatively good (1.x%) but still not 0% (<5%)
* 4.12-rc3 - also bad (~ 27%)
* Xenial (4.4.*) kernels (with any Xenial config) are bad
* Xenial (4.4.*) kernels (with any Artful config) are still bad
[data point: 4.12-rc4 with Artful configs is good. 4.12-rc4 with Xenial configs is bad.]
So a kernel change + config change results in masked/fixed behavior, I guess?
Is the remaining bisect window basically 4.12-rc4 -> 4.13 ?
Nivedita Singhvi (niveditasinghvi) wrote : | #98 |
I would have thought this would be the relevant patch:
bonding: speed/duplex update at NETDEV_UP event
Mahesh Bandewar authored and davem330 committed on Sep 28, 2017
commit 4d2c0cda07448ea (parent b5c7d4e)
However, it was first available in v4.15-rc1.
At least as far as bonding kernel changes go, there does not seem to be another obvious candidate that might have fixed this problem between 4.12 and 4.13 (on a first skim).
At least for one scenario I looked at, we got a bad speed/duplex
setting, which eventually ended up with the bond interface
aggregating on a separate port, and/or ending up in LACP DISABLED
state which it never got out of. We only checked for the correct/latest device speed/duplex settings via the NETDEV_CHANGE path, where we called _ethtool_; if we never get that event again to correct the speed/duplex, we never recover.
There are some other patches which help address this at different
points, but are either before or later (see above) the window.
I'll take a look at code outside the bonding dir which might
impact this.
Joseph, could you provide the raw config files you used as well?
It was not super clear in the png image if those were the only
diffs. They did not seem very relevant diffs either.
Jeffrey Honig (jchonig) wrote : | #99 |
We are also seeing this on trusty with the HWE kernel (i.e. 4.4), on which we have upgraded to the upstream i40e drivers.
When we started using i40e 2.0.26 we found that we needed to add
pre-up sleep 15
for bond0 and this seems to work all the time.
However, when using i40e 2.3.6 or 2.4.6 we find that the sleep does not work most of the time, even when increased significantly.
Manually running "ifdown bond0 && ifup bond0 && ifdown eth0 && ifup eth0 && ifdown eth1 && ifup eth1" does result in bringing bond0 up, but is less than ideal.
Will the debugging output you requested be useful from a trusty system (currently with 4.4.0-116)?
Thanks.
Jeff
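The manual bounce above can at least be gated on a health check so it only runs when needed; a hedged sketch (the interface names, the "Expired" marker, and the use of ifupdown here are assumptions, not Jeff's actual setup):

```shell
#!/bin/sh
# Bounce bond0 and its slaves only when the bond reports an expired
# LACP state. Names and the marker string are illustrative assumptions.
needs_bounce() {
    grep -qi 'expired' /proc/net/bonding/bond0 2>/dev/null
}

if needs_bounce; then
    ifdown bond0 && ifup bond0
    for s in eth0 eth1; do
        ifdown "$s" && ifup "$s"
    done
fi
```

This avoids unconditionally adding latency with `pre-up sleep`, though as noted throughout the thread it only works around the race rather than fixing it.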
Nivedita Singhvi (niveditasinghvi) wrote : | #100 |
Jeff,
Please do provide your logs and whatever other information you can share from your error case, any piece of info will help here. I do not yet have a repro environment myself.
I suspect that most of the changes which seem to help or fix the issue are simply changing the timing enough to affect the race window, making it less likely to occur, so are masking the problem rather than fixing the root cause.
This might be related (not exactly the same):
https://sourceforge.net/p/e1000/bugs/524/
One says 1.6.42 fixed his issue.
Looks like Intel has around 10 releases between 1.4.25 and 2.1.14, so it may not be handy to bisect.
https://downloadcenter.intel.com/download/24411/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Connections-Under-Linux-?product=82947