e1000e in 4.4.0-97-generic breaks 82574L under heavy load.

Bug #1730550 reported by david
This bug affects 3 people
Affects         Status        Importance  Assigned to       Milestone
linux (Ubuntu)  Fix Released  Medium      Joseph Salisbury
Xenial          Fix Released  Medium      Joseph Salisbury
Zesty           Won't Fix     Medium      Joseph Salisbury
Artful          Fix Released  Medium      Joseph Salisbury

Bug Description

== SRU Justification ==
This issue was first reported on the netdev email list by Lennart Sorensen:
https://<email address hidden>/msg178170.html

Commit 16ecba59bc333d6282ee057fb02339f77a880beb causes link drops on the 82574L under heavy load.

"Unfortunately this commit changed the driver to assume
that the Other Causes interrupt can only mean link state change and
hence sets the flag that (unfortunately) means both link is down and link
state should be checked. Since this now happens 3000 times per second,
the chances of it happening while the watchdog_task is checking the link
state becomes pretty high, and if it does happen to coincide, then the
watchdog_task will reset the adapter, which causes a real loss of link."

The original reporter experienced this issue on a Supermicro X7SPA-HF-D525 server board.
However, the bug is now also seen on many servers running X9DBL-1F server boards.

This bug is fixed by commits 19110cfbb34 and 4aea7a5c5e9, which were both added
to mainline in v4.15-rc1.

The commit that introduced this bug, 16ecba5, was added to mainline in v4.5-rc1. However,
Xenial (4.4) also received this commit, as well as commit 531ff577a. Bionic master-next does not need
these commits, since it got them via bug 1735843 and the 4.14.3 updates.

== Fixes ==
19110cfbb34 ("e1000e: Separate signaling for link check/link up")
4aea7a5c5e9 ("e1000e: Avoid receiver overrun interrupt bursts")
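
The race described in the justification above, and the idea behind commit 19110cfbb34, can be illustrated with a small model. This is a simplified sketch, not the driver's code: the `Mac` class, the function names, and the `interrupt_races` switch are hypothetical, and the PHY read is reduced to a boolean. It assumes, as the quoted report says, that a single flag (`get_link_status`) signals both "the link must be checked" and "the link is down".

```python
"""Simplified model of the e1000e link-check race (not the driver's actual code)."""


class Mac:
    """Hypothetical stand-in for the adapter state shared with the interrupt handler."""

    def __init__(self):
        # True means "link state must be checked" and, after a check, "link is down".
        self.get_link_status = True


def other_cause_interrupt(mac):
    """Model of the handler after 16ecba5: every Other Cause interrupt sets the flag."""
    mac.get_link_status = True


def check_for_link(mac, phy_link_up):
    """Read the simulated PHY and clear the flag when the link is up.

    Returns the PHY result, which is the separate signal used by the
    fixed variant below.
    """
    if phy_link_up:
        mac.get_link_status = False
    return phy_link_up


def has_link_shared_flag(mac, phy_link_up, interrupt_races):
    """Racy pattern: the watchdog derives its answer from the shared flag."""
    check_for_link(mac, phy_link_up)
    if interrupt_races:
        # The interrupt lands after the check but before the flag is read.
        other_cause_interrupt(mac)
    return not mac.get_link_status


def has_link_separate_signal(mac, phy_link_up, interrupt_races):
    """Fixed pattern: the watchdog uses the value returned by the check."""
    link_up = check_for_link(mac, phy_link_up)
    if interrupt_races:
        other_cause_interrupt(mac)  # only requests another check later
    return link_up


if __name__ == "__main__":
    print("shared flag, interrupt races:", has_link_shared_flag(Mac(), True, True))
    print("separate signal, interrupt races:", has_link_separate_signal(Mac(), True, True))
```

With the shared flag, an interrupt that lands between the check and the read makes a healthy link look down, which is the condition under which the watchdog resets the adapter. Returning the check result separately means a late interrupt can only request another check.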

== Regression Potential ==
These commits are specific to the e1000e driver.

== Test Case ==
A test kernel was built with these patches and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

== Original Bug Description ==
This issue was first reported on the netdev email list by Lennart Sorensen:
https://<email address hidden>/msg178170.html

Commit 16ecba59bc333d6282ee057fb02339f77a880beb causes link drops on the 82574L under heavy load.

"Unfortunately this commit changed the driver to assume
that the Other Causes interrupt can only mean link state change and
hence sets the flag that (unfortunately) means both link is down and link
state should be checked. Since this now happens 3000 times per second,
the chances of it happening while the watchdog_task is checking the link
state becomes pretty high, and if it does happen to coincide, then the
watchdog_task will reset the adapter, which causes a real loss of link."

A fix for this issue was accepted into the net-next branch, along with other e1000e/igb patches: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=f44dea3421b47d355a835e9cfcc59ca7318575a9

The original reporter experienced this issue on a Supermicro X7SPA-HF-D525 server board. We see this issue on many servers running X9DBL-1F server boards. Both boards use the Intel 82574L for the network interfaces. We see messages like this under heavy load:

[Nov 6 15:42] e1000e: eth0 NIC Link is Down
[ +0.001670] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Nov 6 16:10] e1000e: eth0 NIC Link is Down
[ +0.008505] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Nov 7 00:49] e1000e: eth0 NIC Link is Down
[ +2.235111] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

We have confirmed that the connected switch also sees the link drops, so these are not false alarms from the e1000e driver.
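
To gauge how often and for how long the link drops, log lines in the format shown above can be paired up with a short script. This is a minimal sketch: the function name `link_flaps` and the pairing rule are assumptions for illustration, and it only pairs an "Up" line that directly follows its "Down" line, since the relative `+N.NNNNNN` stamp measures the time since the previous log line.

```python
import re

# Matches "[Nov 6 15:42] e1000e: eth0 NIC Link is Down" as well as
# "[ +0.001670] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, ...".
LINK_EVENT = re.compile(
    r"^\[\s*(?P<stamp>[^\]]*?)\s*\]\s+e1000e: (?P<iface>\S+) "
    r"NIC Link is (?P<state>Up|Down)"
)

SAMPLE = """\
[Nov 6 15:42] e1000e: eth0 NIC Link is Down
[ +0.001670] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Nov 6 16:10] e1000e: eth0 NIC Link is Down
[ +0.008505] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Nov 7 00:49] e1000e: eth0 NIC Link is Down
[ +2.235111] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
""".splitlines()


def link_flaps(lines):
    """Return (interface, down_stamp, seconds_until_up) for adjacent Down/Up pairs.

    seconds_until_up is None when the Up line has no relative "+" stamp.
    """
    flaps = []
    last_down = None  # (iface, stamp) when the previous log line was a Down event
    for line in lines:
        text = line.strip()
        if not text:
            continue
        match = LINK_EVENT.match(text)
        if match and match["state"] == "Down":
            last_down = (match["iface"], match["stamp"])
            continue
        if match and last_down and last_down[0] == match["iface"]:
            stamp = match["stamp"]
            seconds = float(stamp) if stamp.startswith("+") else None
            flaps.append((last_down[0], last_down[1], seconds))
        # Any other line breaks the adjacency needed to trust the relative stamp.
        last_down = None
    return flaps


if __name__ == "__main__":
    for iface, when, seconds in link_flaps(SAMPLE):
        print(f"{iface} went down at {when}, back up after {seconds} s")
```

On the excerpt above this yields three drops on eth0, the longest lasting a little over two seconds.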

# lsb_release -rd
Description: Ubuntu 16.04.2 LTS
Release: 16.04

I could not cleanly apply the net-next patch to 4.4.0, so I tested with just the following cherry-picked changes on the latest 4.4.0 kernel source package:
https://patchwork.ozlabs.org/patch/823942/
https://patchwork.ozlabs.org/patch/823945/
https://patchwork.ozlabs.org/patch/823940/
https://patchwork.ozlabs.org/patch/823941/
https://patchwork.ozlabs.org/patch/823939/

It's my understanding that the first two are the critical ones for the race condition. I have been running the patched e1000e kernel driver under network load for 7 days and I no longer see the network interface drops.

Could we pull these changes into the Ubuntu 4.4.0 kernel?

Thanks
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jul 19 07:34 seq
 crw-rw---- 1 root audio 116, 33 Jul 19 07:34 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=UUID=49ca52b8-cf08-4485-b296-0dffb098e557
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
 Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 0557:2221 ATEN International Co., Ltd Winbond Hermon
 Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Supermicro X9DBL-3F/X9DBL-iF
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-83-generic root=UUID=957d7126-5452-4606-942d-1d58adbeb253 ro net.ifnames=0 biosdevname=0 quiet splash nomdmonddf nomdmonisw
ProcVersionSignature: Ubuntu 4.4.0-83.106-generic 4.4.70
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-83-generic N/A
 linux-backports-modules-4.4.0-83-generic N/A
 linux-firmware 1.157.11
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial xenial
Uname: Linux 4.4.0-83-generic x86_64
UnreportableReason: The report belongs to a package that is not installed.
UpgradeStatus: Upgraded to xenial on 2016-12-05 (337 days ago)
UserGroups:

_MarkForUpload: False
dmi.bios.date: 12/28/2012
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.00
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: X9DBL-3F/X9DBL-iF
dmi.board.vendor: Supermicro
dmi.board.version: 0123456789
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.00:bd12/28/2012:svnSupermicro:pnX9DBL-3F/X9DBL-iF:pvr0123456789:rvnSupermicro:rnX9DBL-3F/X9DBL-iF:rvr0123456789:cvnSupermicro:ct3:cvr0123456789:
dmi.product.name: X9DBL-3F/X9DBL-iF
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

david (w-david-cole) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1730550

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
david (w-david-cole)
tags: added: apport-collected
description: updated
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with linux-next commit 4aea7a5c5. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1730550/

To install this kernel, please install both the linux-image and linux-image-extra .deb packages.

Can you test this kernel and see if it resolves this bug?

Thanks in advance!

david (w-david-cole) wrote :

Thank you Joseph for taking a look at this bug request. Does your new kernel include only commit 4aea7a5c5? It's my understanding that linux-next commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013 is also related, as that fixed a race condition due to get_link_status being used for multiple different states.

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=19110cfbb34d4af0cdfe14cd243f3b09dc95b013

Joseph Salisbury (jsalisbury) wrote :

Yes, I was trying to see if just one commit fixed the bug. It's always best to introduce the fewest commits needed to fix a bug in a stable release.

I'll build another test kernel with the other commit as well and post it shortly.

Joseph Salisbury (jsalisbury) wrote :

I built a v2 Xenial test kernel. This test kernel has both commits 19110cfbb34 and 4aea7a5c5.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1730550/

Can you test this kernel and see if it resolves this bug?

david (w-david-cole) wrote :

Thanks for the updated kernel. I've installed the new kernel on a X9DBL-iF system and have rebooted.

# uname -a
Linux SM-X9DBL36B-S-11-LAP12-US 4.4.0-98-generic #121~lp1730550v2 SMP Wed Nov 8 15:08:45 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I will check back at the end of the week with an update.

Callum Lewis-Smith (picoutput.cls) wrote :

As far as I can work out I was having the issues described here at the end of last week.

On Friday afternoon I installed Joseph's patched kernel and I am no longer getting my NICs dropping out under heavy load. I would be interested to hear how others have been getting on with this patch?

david (w-david-cole) wrote :

After running the lp1730550v2 kernel for two weeks, I can confirm I am no longer seeing the link drops under heavy network load that I saw with the standard Xenial 4.4.0 kernel. Thanks for your help with this.

Iain Buclaw (iainb) wrote :

Hi, will these patches also land in the HWE kernel?

Joseph Salisbury (jsalisbury) wrote :

Yes, these commits will also land in the HWE kernels.

Changed in linux (Ubuntu Zesty):
status: New → In Progress
Changed in linux (Ubuntu Artful):
status: New → In Progress
Changed in linux (Ubuntu Zesty):
importance: Undecided → Medium
Changed in linux (Ubuntu Artful):
importance: Undecided → Medium
Changed in linux (Ubuntu Zesty):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Artful):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :
description: updated
Iain Buclaw (iainb) wrote :

Thanks.

By the way, I can also confirm that the patch works here (I have been running the patched kernel for 3 days now). The hardware is a Supermicro X9SCL board with an Intel 82574L NIC.

Stefan Bader (smb)
Changed in linux (Ubuntu Zesty):
status: In Progress → Won't Fix
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-artful
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

david (w-david-cole) wrote :

I have verified the -proposed Xenial kernel on a production system (Supermicro X9DBL-iF). After 24hrs I have had no network link issues, so I have updated the tag.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Stefan Bader (smb) wrote :

Since the Xenial backport was verified, accepting this as verification for Artful.

tags: added: verification-done-artful
removed: verification-needed-artful
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-119.143

---------------
linux (4.4.0-119.143) xenial; urgency=medium

  * linux: 4.4.0-119.143 -proposed tracker (LP: #1760327)

  * Dell XPS 13 9360 bluetooth scan can not detect any device (LP: #1759821)
    - Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"

linux (4.4.0-118.142) xenial; urgency=medium

  * linux: 4.4.0-118.142 -proposed tracker (LP: #1759607)

  * Kernel panic with AWS 4.4.0-1053 / 4.4.0-1015 (Trusty) (LP: #1758869)
    - x86/microcode/AMD: Do not load when running on a hypervisor

  * CVE-2018-8043
    - net: phy: mdio-bcm-unimac: fix potential NULL dereference in
      unimac_mdio_probe()

linux (4.4.0-117.141) xenial; urgency=medium

  * linux: 4.4.0-117.141 -proposed tracker (LP: #1755208)

  * Xenial update to 4.4.114 stable release (LP: #1754592)
    - x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
    - usbip: prevent vhci_hcd driver from leaking a socket pointer address
    - usbip: Fix implicit fallthrough warning
    - usbip: Fix potential format overflow in userspace tools
    - x86/microcode/intel: Fix BDW late-loading revision check
    - x86/retpoline: Fill RSB on context switch for affected CPUs
    - sched/deadline: Use the revised wakeup rule for suspending constrained dl
      tasks
    - can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
    - can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
    - PM / sleep: declare __tracedata symbols as char[] rather than char
    - time: Avoid undefined behaviour in ktime_add_safe()
    - timers: Plug locking race vs. timer migration
    - Prevent timer value 0 for MWAITX
    - drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
    - drivers: base: cacheinfo: fix boot error message when acpi is enabled
    - PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID
    - PCI: layerscape: Fix MSG TLP drop setting
    - mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version
    - fs/select: add vmalloc fallback for select(2)
    - hwpoison, memcg: forcibly uncharge LRU pages
    - cma: fix calculation of aligned offset
    - mm, page_alloc: fix potential false positive in __zone_watermark_ok
    - ipc: msg, make msgrcv work with LONG_MIN
    - x86/ioapic: Fix incorrect pointers in ioapic_setup_resources()
    - ACPI / processor: Avoid reserving IO regions too early
    - ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
    - ACPICA: Namespace: fix operand cache leak
    - netfilter: x_tables: speed up jump target validation
    - netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed
      in 64bit kernel
    - netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags
    - netfilter: nf_ct_expect: remove the redundant slash when policy name is
      empty
    - netfilter: nfnetlink_queue: reject verdict request from different portid
    - netfilter: restart search if moved to other chain
    - netfilter: nf_conntrack_sip: extend request line validation
    - netfilter: use fwmark_reflect in nf_send_reset
    - ext2: Don't clear SGID when inheriting ACLs
    - reiserfs: fix race in prealloc discard
    - re...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Po-Hsu Lin (cypressyew) wrote :

I think the status for this bug should be "Fix Released" now.

Changed in linux (Ubuntu):
status: In Progress → Fix Released
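
For anyone checking whether a running stock kernel is at or past the releases named in the Launchpad Janitor messages above (4.4.0-119.143 for Xenial, 4.13.0-38.43 for Artful), the comparison can be sketched as follows. The function name `has_fix` and the lookup table are hypothetical. The sketch assumes the ABI number in `uname -r` output matches the package version, as it does in the apport data in this report, and it cannot recognise custom builds such as the lp1730550v2 test kernel or other kernel series.

```python
import re

# First package versions this report names as carrying the fix, keyed by
# upstream base version; the value is the Ubuntu ABI number.
FIXED_IN = {
    (4, 4, 0): 119,   # Xenial, linux 4.4.0-119.143
    (4, 13, 0): 38,   # Artful, linux 4.13.0-38.43
}


def has_fix(release):
    """Classify a `uname -r` style string such as "4.4.0-98-generic".

    Returns True or False for a series listed in FIXED_IN, and None when
    the string cannot be parsed or the series is not listed.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if not match:
        return None
    major, minor, patch, abi = (int(part) for part in match.groups())
    threshold = FIXED_IN.get((major, minor, patch))
    if threshold is None:
        return None
    return abi >= threshold


if __name__ == "__main__":
    for release in ("4.4.0-83-generic", "4.4.0-119-generic", "4.13.0-38-generic"):
        print(release, has_fix(release))
```

A result of False only means the ABI number is below the named release; the truncated changelogs above do not show which upload first contained the two e1000e commits.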