Ethernet E1000 Controller Hangs

Bug #1766377 reported by Robert Dinse on 2018-04-23
This bug affects 3 people
Affects         Status  Importance  Assigned to  Milestone
linux (Ubuntu)          High        Unassigned
Bionic                  High        Unassigned

Bug Description

     With Bionic kernel 4.15.0-15 and 4.15.0-17 I am experiencing periodic hanging of the LAN connection. This is happening on an Asus X99-DELUX motherboard, controller specifications:
Intel® I218V, 1 x Gigabit LAN Controller(s)
Intel® I211-AT, 1 x Gigabit LAN
Dual Gigabit LAN controllers- 802.3az Energy Efficient Ethernet (EEE) appliance
Support Teaming Technology
ASUS Turbo LAN Utility
The CPU is an i7-6850 and it is configured with 128GB of DDR4 RAM.
This machine has a number of Qemu/KVM virtual guests and is using a software bridge to share the interface.
This did not happen with 17.10 and 4.13.0 kernel. It is happening on multiple machines here.
Here are the messages from dmesg:
[1016198.957850] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                   TDH <ea>
                   TDT <2d>
                   next_to_use <2d>
                   next_to_clean <e9>
                 buffer_info[next_to_clean]:
                   time_stamp <13c8d0008>
                   next_to_watch <ea>
                   jiffies <13c8d0880>
                   next_to_watch.status <0>
                 MAC Status <80083>
                 PHY Status <796d>
                 PHY 1000BASE-T Status <3c00>
                 PHY Extended Status <3000>
                 PCI Status <10>
[1016200.942072] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                   TDH <ea>
                   TDT <2d>
                   next_to_use <2d>
                   next_to_clean <e9>
                 buffer_info[next_to_clean]:
                   time_stamp <13c8d0008>
                   next_to_watch <ea>
                   jiffies <13c8d1040>
                   next_to_watch.status <0>
                 MAC Status <80083>
                 PHY Status <796d>
                 PHY 1000BASE-T Status <3c00>
                 PHY Extended Status <3000>
                 PCI Status <10>
[1016202.413607] e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
[1016202.413701] bridge0: port 1(eno1) entered disabled state
[1016202.413732] bridge0: topology change detected, propagating
[1016206.666676] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[1016206.666708] bridge0: port 1(eno1) entered blocking state
[1016206.666712] bridge0: port 1(eno1) entered listening state
[1016216.750911] bridge0: port 1(eno1) entered learning state
[1016232.110291] bridge0: port 1(eno1) entered forwarding state
[1016232.110294] bridge0: topology change detected, sending tcn bpdu
[1017834.390579] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[1017834.390770] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[1017834.414792] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[1017834.414794] cfg80211: failed to load regulatory.db
If there is any other information I can provide to aid in resolution, please contact me, <email address hidden>. Thank you!
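To put a number on the outage each reset causes, the kernel timestamps in the dmesg excerpt above can simply be subtracted. The following is a minimal Python sketch, assuming the relevant log lines are pasted in as text; the function name `outage_seconds` is made up for illustration.

```python
import re

# Lines copied from the dmesg excerpt in the bug description.
LOG = """\
[1016202.413607] e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
[1016202.413701] bridge0: port 1(eno1) entered disabled state
[1016206.666676] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[1016206.666712] bridge0: port 1(eno1) entered listening state
[1016216.750911] bridge0: port 1(eno1) entered learning state
[1016232.110291] bridge0: port 1(eno1) entered forwarding state
"""

# "[seconds.micros] message"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def outage_seconds(log):
    """Seconds from the adapter reset until the bridge port forwards again."""
    start = end = None
    for line in log.splitlines():
        m = STAMP.match(line)
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if start is None and "Reset adapter unexpectedly" in msg:
            start = ts
        elif start is not None and "entered forwarding state" in msg:
            end = ts
            break
    if start is None or end is None:
        raise ValueError("log does not contain a full reset/forwarding cycle")
    return end - start


print(f"{outage_seconds(LOG):.1f} s")
```

For the excerpt above this works out to just under 30 seconds. The link itself is back about 4 seconds after the reset; the rest is the bridge port passing through the listening and learning states before it forwards traffic to the guests again.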

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-15-lowlatency 4.15.0-15.16
ProcVersionSignature: Ubuntu 4.15.0-15.16-lowlatency 4.15.15
Uname: Linux 4.15.0-15-lowlatency x86_64
ApportVersion: 2.20.9-0ubuntu6
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/hwC1D3', '/dev/snd/hwC1D2', '/dev/snd/hwC1D1', '/dev/snd/hwC1D0', '/dev/snd/pcmC1D9p', '/dev/snd/pcmC1D8p', '/dev/snd/pcmC1D7p', '/dev/snd/pcmC1D3p', '/dev/snd/controlC1', '/dev/snd/by-path', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D2c', '/dev/snd/pcmC0D1p', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/controlC0', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CurrentDesktop: MATE
Date: Mon Apr 23 16:45:30 2018
HibernationDevice: RESUME=UUID=963cb206-8962-4fc0-82a1-fc4f02a9b5c5
InstallationDate: Installed on 2017-05-05 (353 days ago)
InstallationMedia: Ubuntu-MATE 17.04 "Zesty Zapus" - Release amd64 (20170412)
MachineType: ASUS All Series
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.15.0-15-lowlatency root=UUID=28825f5b-a6fd-4e09-982c-0513ae4d2842 ro quiet splash vt.handoff=1
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-15-lowlatency N/A
 linux-backports-modules-4.15.0-15-lowlatency N/A
 linux-firmware 1.173
RfKill:

SourcePackage: linux
UpgradeStatus: Upgraded to bionic on 2018-04-12 (11 days ago)
dmi.bios.date: 08/11/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1801
dmi.board.asset.tag: Default string
dmi.board.name: X99-E
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1801:bd08/11/2017:svnASUS:pnAllSeries:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnX99-E:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: ASUS MB
dmi.product.name: All Series
dmi.product.version: System Version
dmi.sys.vendor: ASUS

Robert Dinse (nanook) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the proposed kernel and post back if it resolves this bug?
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.

Thank you in advance!

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Changed in linux (Ubuntu Bionic):
status: Confirmed → Incomplete

      Yes, though I will need to reboot at night when usage is low. Can you tell
me what the kernel version is so I can be sure to get the correct kernel?

-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-
  Eskimo North Linux Friendly Internet Access, Shell Accounts, and Hosting.
    Knowledgeable human assistance, not telephone trees or script readers.
  See our web site: http://www.eskimo.com/ (206) 812-0051 or (800) 246-6874.


Robert Dinse (nanook) wrote :

     There is a 4.15.0-20.21 in Proposed; is this the correct kernel?


Joseph Salisbury (jsalisbury) wrote :

Yes, that is the correct kernel version.

Robert Dinse (nanook) wrote :

      Ok, will reboot tonight when traffic is low and let you know how it goes.


Robert Dinse (nanook) wrote :

      Looks like you nailed it. Three machines running it haven't barfed in over
11 hours; they used to hang several times an hour.


Robert Dinse (nanook) wrote :

Hate to say it, but it happened again. It was only once, which is a lot better in terms of frequency, but it is still happening. Here are the details:

[23144.764734] hrtimer: interrupt took 46767 ns
[41628.563552] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                 TDH <db>
                 TDT <65>
                 next_to_use <65>
                 next_to_clean <da>
               buffer_info[next_to_clean]:
                 time_stamp <1027691ea>
                 next_to_watch <db>
                 jiffies <102769c40>
                 next_to_watch.status <0>
               MAC Status <80083>
               PHY Status <796d>
               PHY 1000BASE-T Status <7c00>
               PHY Extended Status <3000>
               PCI Status <10>
[41630.611608] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                 TDH <db>
                 TDT <65>
                 next_to_use <65>
                 next_to_clean <da>
               buffer_info[next_to_clean]:
                 time_stamp <1027691ea>
                 next_to_watch <db>
                 jiffies <10276a440>
                 next_to_watch.status <0>
               MAC Status <80083>
               PHY Status <796d>
               PHY 1000BASE-T Status <7c00>
               PHY Extended Status <3000>
               PCI Status <10>
[41632.595800] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                 TDH <db>
                 TDT <65>
                 next_to_use <65>
                 next_to_clean <da>
               buffer_info[next_to_clean]:
                 time_stamp <1027691ea>
                 next_to_watch <db>
                 jiffies <10276ac00>
                 next_to_watch.status <0>
               MAC Status <80083>
               PHY Status <796d>
               PHY 1000BASE-T Status <7c00>
               PHY Extended Status <3000>
               PCI Status <10>
[41634.579772] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                 TDH <db>
                 TDT <65>
                 next_to_use <65>
                 next_to_clean <da>
               buffer_info[next_to_clean]:
                 time_stamp <1027691ea>
                 next_to_watch <db>
                 jiffies <10276b3c0>
                 next_to_watch.status <0>
               MAC Status <80083>
               PHY Status <796d>
               PHY 1000BASE-T Status <7c00>
               PHY Extended Status <3000>
               PCI Status <10>
[41635.667409] ------------[ cut here ]------------
[41635.667411] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
[41635.667424] WARNING: CPU: 9 PID: 65 at /build/linux-5s7Xkn/linux-4.15.0/net/sched/sch_generic.c:323 dev_watchdog+0...
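The hex fields in these dumps can be turned into a stall duration. As far as I understand the driver, `time_stamp` is the jiffies value recorded when the stuck descriptor was queued and `jiffies` is the value when the dump was printed, so their difference is how long the descriptor has been waiting. A rough Python sketch follows; the tick rate is not stated anywhere in the thread, so it is inferred here from two consecutive dumps, and the helper names are made up for illustration.

```python
# First two dumps from the comment above, reduced to the fields used here.
DUMPS = [
    {"dmesg_ts": 41628.563552, "time_stamp": "1027691ea", "jiffies": "102769c40"},
    {"dmesg_ts": 41630.611608, "time_stamp": "1027691ea", "jiffies": "10276a440"},
]


def estimate_hz(a, b):
    """Ticks per second, inferred from two dumps of the same hang."""
    ticks = int(b["jiffies"], 16) - int(a["jiffies"], 16)
    seconds = b["dmesg_ts"] - a["dmesg_ts"]
    return ticks / seconds


def stalled_ticks(dump):
    """Jiffies elapsed between queuing the descriptor and printing the dump."""
    return int(dump["jiffies"], 16) - int(dump["time_stamp"], 16)


hz = round(estimate_hz(*DUMPS))  # assumption: HZ is a whole number
for d in DUMPS:
    print(f"stalled {stalled_ticks(d) / hz:.2f} s")
```

With these two dumps the inferred rate comes out at 1000 ticks per second, and the same descriptor has been stuck for about 2.65 s in the first dump and 4.69 s in the second. The unchanged `time_stamp` across dumps is what shows it is one hang being reported repeatedly rather than several separate ones.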


Robert Dinse (nanook) wrote :

Still happening. I am not sure why, but it is far more frequent on the i7-6850k platform than on the i7-6700k.

Robert Dinse (nanook) wrote :

I discovered a way to trigger this instantly. I attempted to change the size of the ring buffers from 512 to the hardware maximum of 4096 descriptors using ethtool -G eno1 rx 4096 tx 4096, and it instantly hung the interface with the following in dmesg:

[458611.154752] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                  TDH <48>
                  TDT <73>
                  next_to_use <73>
                  next_to_clean <47>
                buffer_info[next_to_clean]:
                  time_stamp <11b5117a3>
                  next_to_watch <48>
                  jiffies <11b511d40>
                  next_to_watch.status <0>
                MAC Status <80083>
                PHY Status <796d>
                PHY 1000BASE-T Status <7c00>
                PHY Extended Status <3000>
                PCI Status <10>
[458613.138731] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                  TDH <48>
                  TDT <73>
                  next_to_use <73>
                  next_to_clean <47>
                buffer_info[next_to_clean]:
                  time_stamp <11b5117a3>
                  next_to_watch <48>
                  jiffies <11b512500>
                  next_to_watch.status <0>
                MAC Status <80083>
                PHY Status <796d>
                PHY 1000BASE-T Status <7c00>
                PHY Extended Status <3000>
                PCI Status <10>
[458615.122888] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                  TDH <48>
                  TDT <73>
                  next_to_use <73>
                  next_to_clean <47>
                buffer_info[next_to_clean]:
                  time_stamp <11b5117a3>
                  next_to_watch <48>
                  jiffies <11b512cc0>
                  next_to_watch.status <0>
                MAC Status <80083>
                PHY Status <796d>
                PHY 1000BASE-T Status <7c00>
                PHY Extended Status <3000>
                PCI Status <10>
[458617.106832] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                  TDH <48>
                  TDT <73>
                  next_to_use <73>
                  next_to_clean <47>
                buffer_info[next_to_clean]:
                  time_stamp <11b5117a3>
                  next_to_watch <48>
                  jiffies <11b513480>
                  next_to_watch.status <0>
                MAC Status <80083>
                PHY Status <796d>
                PHY 1000BASE-T Status <7c00>
                PHY Extended Status <3000>
                PCI Status <10>
[458619.154912] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                  TDH <48>
      ...
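TDH and TDT in these dumps are the transmit ring's head and tail indexes: the head is advanced by the hardware as it processes descriptors, and the tail by the driver as it queues them. The gap between them is the number of descriptors still waiting on the NIC. A small Python sketch of that ring arithmetic follows; the ring size of 256 is an assumption and should be replaced with the value reported by `ethtool -g` on the machine in question.

```python
def pending_descriptors(tdh, tdt, ring_size):
    """Descriptors queued by the driver (tail) that the NIC has not yet processed (head).

    The modulo handles the case where the tail has wrapped around the ring
    and is numerically smaller than the head.
    """
    return (tdt - tdh) % ring_size


# ring_size=256 is an assumption; substitute the real value from `ethtool -g`.
print(pending_descriptors(0x48, 0x73, ring_size=256))  # dump directly above
print(pending_descriptors(0xEA, 0x2D, ring_size=256))  # dump in the bug description
```

With a 256-entry ring this gives 43 for the dump above and 67 for the wrapped case in the original report. In every dump in this thread the head stays fixed while the timestamps advance, which is consistent with the hardware having stopped consuming descriptors.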


Joseph Salisbury (jsalisbury) wrote :

Can you see if this bug also happens with the latest mainline kernel, or if it was already fixed upstream? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc3

Robert Dinse (nanook) wrote :

      This kernel did make it so I could not reproduce it on demand using
ethtool -G; however, it broke so many other things that I could not leave it
running to see if it fixed the spontaneous hangs.

      Strangely it broke nfs-kernel-server on the i7-6850k machine but not the
i7-6700k machine. I did much stare and compare to make sure they were
configured the same. This forced me to back out this kernel.

      But in addition to NFS, the nouveau drivers needed on the i7-6850k
machine had some bug that would pixelize much of the screen in a semi-random
fashion. Also for whatever reason x2goserver would not work properly with
that kernel.

      On one of the i7-6700k machines I had to restart lightdm several times
to get it to actually start; it did not start on boot. On another I was unable
to get lightdm to start at all and only console graphics worked, and for some
reason they were in yellow instead of white. The i7-6700k machines are using
the internal graphics of the i7-6700k processor, clocked very slow to minimize
the impact on the heat budget.

      So on the kernel developers' PPA I saw another test kernel, 4.15.0-21,
and installed it. It also made ethtool -G not induce an Ethernet hang, but on
the i7-6850 it has already hung once spontaneously:

[ 4112.809034] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                  TDH <4f>
                  TDT <6f>
                  next_to_use <6f>
                  next_to_clean <4e>
                buffer_info[next_to_clean]:
                  time_stamp <1003a2221>
                  next_to_watch <4f>
                  jiffies <1003a2d80>
                  next_to_watch.status <0>
                MAC Status <80083>
                PHY Status <796d>
                PHY 1000BASE-T Status <7c00>
                PHY Extended Status <3000>
                PCI Status <10>
[ 4114.793198] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                  TDH <4f>
                  TDT <6f>
                  next_to_use <6f>
                  next_to_clean <4e>
                buffer_info[next_to_clean]:
                  time_stamp <1003a2221>
                  next_to_watch <4f>
                  jiffies <1003a3540>
                  next_to_watch.status <0>
                MAC Status <80083>
                PHY Status <796d>
                PHY 1000BASE-T Status <7c00>
                PHY Extended Status <3000>
                PCI Status <10>
[ 4116.008748] ------------[ cut here ]------------
[ 4116.008750] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
[ 4116.008765] WARNING: CPU: 8 PID: 59 at /build/linux-QLn4bB/linux-4.15.0/net/sched/sch_generic.c:323 dev_watchdog+0x21d/0x230
[ 4116.008765] Modules linked in: tcp_diag inet_diag vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt...

Robert Dinse (nanook) wrote :

I read an article at https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa which said that some people had success preventing this by disabling hardware offloading with ethtool -K eth0 gso off gro off tso off. I did this on the i7-6850 machine, which has been the most problematic and is running the 4.15.0-21-lowlatency kernel (#22-Ubuntu SMP PREEMPT Tue May 1 15:47:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux), and have not had a hang since. Obviously this comes with a performance penalty, so it is undesirable as a permanent fix, but I am hoping it helps narrow down the cause.
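For anyone who needs to apply the same workaround to several machines, here is a small Python sketch that builds the ethtool command used above and runs it only on request. The function names and the dry-run default are my own choices for illustration; the only ethtool options used are the `-K ... gso off gro off tso off` ones quoted in this comment.

```python
import subprocess

# The three offload features named in the comment above.
OFFLOADS = ("gso", "gro", "tso")


def offload_off_cmd(iface):
    """Build the argv for `ethtool -K <iface> gso off gro off tso off`."""
    cmd = ["ethtool", "-K", iface]
    for feature in OFFLOADS:
        cmd += [feature, "off"]
    return cmd


def disable_offloads(iface, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = offload_off_cmd(iface)
    if not dry_run:
        # Needs root and an installed ethtool; raises if the command fails.
        subprocess.run(cmd, check=True)
    return cmd


print(" ".join(disable_offloads("eno1")))
```

The setting does not survive a reboot or a driver reload, so it has to be reapplied at boot in whatever way suits the system.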

tags: added: kernel-da-key
removed: kernel-key
Robert Dinse (nanook) wrote :

      Just to make sure you got the latest: the 4.17.x kernel did not work well
enough to leave it running; it broke nfs-kernel-server among other things.

      I am presently running 4.15.0-21, and with this kernel I would still
get these hangs, except that I discovered that disabling certain hardware
offload functions stops them. So presently, in my /etc/rc.local file on the
affected servers, I have: /sbin/ethtool -K eno1 gso off gro off tso off

      With this in place no hangs, slight performance penalty but no hangs.


Joseph Salisbury (jsalisbury) wrote :

We could perform a kernel bisect to identify the commit that introduced this regression. To perform a bisect, we need to identify the last kernel that did not have the bug and the first kernel version that did.

Do you recall the last kernel that didn't exhibit the bug? If not, would you be able to test some kernels to narrow it down? I could post a link to the kernels to test.
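The idea behind the bisect can be sketched as a binary search over an ordered list of kernels, halving the range between the last known good and the first known bad build with each test. The Python below is only an illustration of that search; the candidate list and the pretend test results are hypothetical and are not findings from this bug.

```python
def first_bad(versions, is_bad):
    """Return the first version for which is_bad() is true.

    Assumes versions are ordered oldest to newest, the first entry is known
    good, the last entry is known bad, and the bug stays present once it
    has been introduced.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Hypothetical candidate list and test outcomes, for illustration only.
candidates = ["v4.13", "v4.14", "v4.15-rc1", "v4.15-rc4", "v4.15-rc8", "v4.15"]
pretend_bad = {"v4.15-rc4", "v4.15-rc8", "v4.15"}
print(first_bad(candidates, lambda v: v in pretend_bad))
```

In practice each `is_bad` call means booting a test kernel and waiting long enough to be confident the hang would have shown up, which is why narrowing the starting range matters.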

Robert Dinse (nanook) wrote :

      I did not see this behavior with 4.13.0 and did with 4.15.0.


Joseph Salisbury (jsalisbury) wrote :

Could you test the following two upstream kernels, so we can narrow down the last good and first bad further:

v4.14 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14/
v4.15-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/

Robert Dinse (nanook) wrote :

      I booted the most problematic machine, the i7-6850K (probably the worst
affected because it carries the most traffic), on the 4.14.0 kernel. So far so
good.


Robert Dinse (nanook) wrote :

      4.14 has the problem. Do I still need to try the 4.15-rc1 kernel, given
that 4.14 is already affected?


Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. We should work backwards towards 4.13 now. Can you test the following:

4.14-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc1/
4.14-rc4: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc4/
4.14-rc7: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc7/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!
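
The narrowing-down procedure described in this comment can be sketched in Python. This is only an illustration, not a tool used in this report: `tag_key`, `first_bad`, and the `is_bad` callback are hypothetical names, and in practice "is this kernel bad?" means booting it and waiting (the hang can take many hours to appear). The sketch assumes the oldest tag is good, the newest is bad, and that the bug stays present once introduced.

```python
import re


def tag_key(tag):
    """Sort key for mainline tags such as 'v4.14-rc4' or 'v4.14'."""
    m = re.fullmatch(r"v(\d+)\.(\d+)(?:-rc(\d+))?", tag)
    if m is None:
        raise ValueError(f"unrecognised tag: {tag}")
    major, minor, rc = m.groups()
    # A final release sorts after all of its release candidates.
    rc_rank = int(rc) if rc is not None else float("inf")
    return (int(major), int(minor), rc_rank)


def first_bad(tags, is_bad):
    """Return (last good tag, first bad tag) using a binary search.

    Assumes the oldest tag is good and the newest tag is bad.
    """
    ordered = sorted(tags, key=tag_key)
    lo, hi = 0, len(ordered) - 1  # ordered[lo] is good, ordered[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(ordered[mid]):
            hi = mid
        else:
            lo = mid
    return ordered[lo], ordered[hi]
```

With the tags from this thread, a call might look like `first_bad(["v4.13", "v4.14-rc1", "v4.14-rc4", "v4.14-rc7", "v4.14"], is_bad)`, where `is_bad` records the outcome of each manual test.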

Robert Dinse (nanook) wrote :

      4.14.0 final crashed hard after running for 17 hours. Not only was the
Ethernet unresponsive, so was the console; even the magic SysRq key did
nothing. I had to power-cycle the machine to recover it. I then booted
4.14.0-rc1 and it crashed immediately; however, I had set a 20-second
timeout, so the machine rebooted itself into 4.15.0-21, and I turned hardware
offloading back off.

      I cannot continue testing this on production machines, and the one Intel
machine I have with that interface chip is currently broken. I'll work on
getting it running, set up a web server on it, and use some test software
to put a load on it.

      One thing I found while digging on GitHub is that there has been only one
commit against the E1000 driver in the last three years, made on
February 13th, so it might be worth looking at.
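
For readers wanting to reproduce the offloading workaround mentioned in this comment, here is a minimal Python sketch that builds an `ethtool -K` command line. The helper name `offload_command` is hypothetical, and the feature list (`tso`, `gso`, `gro`) is an assumption: the comment does not say which offloads were disabled. The interface name `eno1` is taken from the dmesg output in the bug description.

```python
import shlex


def offload_command(iface, features=("tso", "gso", "gro"), state="off"):
    """Build an 'ethtool -K' argument list that sets each feature to state."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    args = ["ethtool", "-K", iface]
    for feature in features:
        args += [feature, state]
    return args


# Show the command rather than running it; applying it requires root.
print(shlex.join(offload_command("eno1")))
```

The list form can be passed to `subprocess.run` directly if the command should actually be executed, and the setting does not persist across reboots unless it is reapplied.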


Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Bionic) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Bionic):
status: Incomplete → Expired