e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang (Intel I219-LM)

Bug #1750165 reported by Jarod
This bug affects 3 people
Affects: linux-lts-xenial (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am running Ubuntu 16.04 on a Supermicro X11SAE mainboard.
The machine was put into service several months ago. Ever since, the NIC eno1 has been causing problems from time to time. After boot, everything seems fine for a while. Then, for no particular reason that I could identify so far, the device hangs, causing performance drops and lots of log messages.

This is the setup:

root@server2:~# lsb_release -rd
Description: Ubuntu 16.04.3 LTS
Release: 16.04

root@server2:~# uname -a
Linux server2 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

There are two onboard NICs bonded together as bond0, plus a third NIC that connects to the gateway.

This is my /etc/network/interfaces file:

auto lo
iface lo inet loopback

auto enp7s0
iface enp7s0 inet static
    address 192.168.178.254
    netmask 255.255.255.0
    network 192.168.178.0
    broadcast 192.168.178.255
    gateway 192.168.178.1

auto eno1
iface eno1 inet manual
    bond-master bond0

auto eno2
iface eno2 inet manual
    bond-master bond0

auto bond0
iface bond0 inet manual
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate 1
    slaves eno1 eno2
    post-up ifup br0

iface br0 inet static
    address 192.168.3.254
    netmask 255.255.255.0
    network 192.168.3.0
    dns-nameserver 192.168.3.254
    dns-search localdomain
    broadcast 192.168.3.255
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0
    bridge_maxwait 0
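
With eno1 and eno2 enslaved to bond0 as configured above, one way to quantify how often a slave drops out is the per-slave "Link Failure Count" that the bonding driver reports in /proc/net/bonding/bond0. The following is a minimal sketch, not something taken from the affected machine: the function name bond_failures is hypothetical, and it assumes the usual "Slave Interface:" / "Link Failure Count:" line layout of that file.

```shell
# Print "<slave> <link failure count>" for every slave listed in
# bonding status text read from stdin (e.g. /proc/net/bonding/bond0).
# Assumption: each slave section starts with "Slave Interface: <name>"
# and contains a "Link Failure Count: <n>" line.
bond_failures() {
    awk -F': ' '
        /^Slave Interface:/    { slave = $2 }
        /^Link Failure Count:/ { if (slave != "") print slave, $2 }
    '
}
```

A possible invocation would be "bond_failures < /proc/net/bonding/bond0". A count that keeps rising for eno1 but not for eno2 would be consistent with the adapter resets described in this report.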

The onboard NIC eno1 and the gateway NIC enp7s0 are served by the e1000e driver. The onboard NIC eno2 uses the igb driver (see below).

root@server2:~# lspci |grep Ether
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
07:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

root@server2:~# ethtool -i eno1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.8-4
expansion-rom-version:
bus-info: 0000:00:1f.6
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

root@server2:~# ethtool -i eno2
driver: igb
version: 5.3.0-k
firmware-version: 3.25, 0x800005cc
expansion-rom-version:
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

root@server2:~# ethtool -i enp7s0
driver: e1000e
version: 3.2.6-k
firmware-version: 1.8-0
expansion-rom-version:
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

After every reboot it takes a while for eno1 to start hanging (anywhere from hours to days).

dmesg then shows messages like this every few seconds:

[1874222.304742] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                   TDH <0>
                   TDT <2>
                   next_to_use <2>
                   next_to_clean <0>
                 buffer_info[next_to_clean]:
                   time_stamp <11bece461>
                   next_to_watch <0>
                   jiffies <11becebd3>
                   next_to_watch.status <0>
                 MAC Status <80083>
                 PHY Status <796d>
                 PHY 1000BASE-T Status <3800>
                 PHY Extended Status <3000>
                 PCI Status <10>
[1874224.304604] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                   TDH <0>
                   TDT <2>
                   next_to_use <2>
                   next_to_clean <0>
                 buffer_info[next_to_clean]:
                   time_stamp <11bece461>
                   next_to_watch <0>
                   jiffies <11becedc7>
                   next_to_watch.status <0>
                 MAC Status <80083>
                 PHY Status <796d>
                 PHY 1000BASE-T Status <3800>
                 PHY Extended Status <3000>
                 PCI Status <10>
[1874224.308396] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
[1874224.308447] e1000e 0000:00:1f.6 eno1: speed changed to 0 for port eno1
[1874224.396431] bond0: link status definitely down for interface eno1, disabling it
[1874228.310205] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[1874228.312176] bond0: link status definitely up for interface eno1, 1000 Mbps full duplex
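
To make the hang reports above easier to compare, the gap between time_stamp and jiffies in each block can be computed. It shows how long the oldest pending transmit buffer has been waiting. The sketch below is only an illustration: the function name hang_ages is hypothetical, and it assumes the block layout shown in the log above.

```shell
# Read kernel log text on stdin and, for every e1000e
# "Detected Hardware Unit Hang" block, print "<iface> <jiffies - time_stamp>".
# Both values are hexadecimal in the log; the difference is printed in decimal.
# Note: plain POSIX sh, so the helper variables are not function-local.
hang_ages() {
    iface="" ts=""
    while IFS= read -r line; do
        case $line in
            *"e1000e "*": Detected Hardware Unit Hang"*)
                # Interface name is the last word before ": Detected ..."
                iface=${line%%": Detected Hardware Unit Hang"*}
                iface=${iface##* }
                ts=""
                ;;
            *"time_stamp <"*">"*)
                ts=${line#*"time_stamp <"}
                ts=${ts%%">"*}
                ;;
            *"jiffies <"*">"*)
                if [ -n "$iface" ] && [ -n "$ts" ]; then
                    now=${line#*"jiffies <"}
                    now=${now%%">"*}
                    echo "$iface $(( 0x$now - 0x$ts ))"
                    iface="" ts=""
                fi
                ;;
        esac
    done
}
```

Run as "dmesg | hang_ages", the two blocks above give 1906 and 2406 jiffies for eno1. The two reports are about 2 seconds apart while jiffies advance by 500, which suggests 250 jiffies per second on this kernel. Under that assumption, the same buffer had been pending for roughly 7.6 s and then 9.6 s before the adapter was reset.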

These are the default offload settings for the affected NIC:

root@server2:~# ethtool -k eno1
Features for eno1:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]
hw-tc-offload: off [fixed]

root@server2:~# lspci -vv -s 0000:00:1f.6
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
        DeviceName: Intel Ethernet i219 #1
        Subsystem: Super Micro Computer Inc Ethernet Connection (2) I219-LM
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 147
        Region 0: Memory at df800000 (32-bit, non-prefetchable) [size=128K]
        Capabilities: [c8] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00598 Data: 0000
        Capabilities: [e0] PCI Advanced Features
                AFCap: TP+ FLR+
                AFCtrl: FLR-
                AFStatus: TP+
        Kernel driver in use: e1000e
        Kernel modules: e1000e

06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
        DeviceName: Intel Ethernet i210 #2
        Subsystem: Super Micro Computer Inc I210 Gigabit Network Connection
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at d000 [size=32]
        Region 3: Memory at df480000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000 Data: 0000
                Masking: 00000000 Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Device Serial Number ac-1f-6b-ff-ff-21-b1-8f
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                Steering table in TPH capability structure
        Kernel driver in use: igb
        Kernel modules: igb

root@server2:~# lspci -vv -s 0000:07:00.0
07:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
        Subsystem: Intel Corporation Gigabit CT Desktop Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at df3c0000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at df300000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at c000 [size=32]
        Region 3: Memory at df3e0000 (32-bit, non-prefetchable) [size=16K]
        Expansion ROM at df380000 [disabled] [size=256K]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000 Data: 0000
        Capabilities: [e0] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <128ns, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-1b-21-ff-ff-36-06-40
        Kernel driver in use: e1000e
        Kernel modules: e1000e

I looked around to see if there was a chance to mitigate the problem.
Someone mentioned that "ethtool -K eno1 sg off tso off gro off" should help circumvent the problem.
Unfortunately it did not help.
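
When trying a workaround like this, it is worth confirming that the features were really switched off, since "ethtool -K" takes short names (sg, tso, gro) while "ethtool -k" lists the long ones (scatter-gather, tcp-segmentation-offload, generic-receive-offload). A small sketch for that check follows. The function name still_on is hypothetical, and it only inspects the top-level feature lines in the format shown in the "ethtool -k eno1" output above.

```shell
# Read `ethtool -k <iface>` output on stdin and print every feature
# named in the arguments whose state is still "on".
# No output means all of the named features are off.
still_on() {
    awk -v want="$*" '
        BEGIN {
            n = split(want, w, " ")
            for (i = 1; i <= n; i++) watch[w[i]] = 1
        }
        {
            name = $1
            sub(/:$/, "", name)          # "scatter-gather:" -> "scatter-gather"
            if ((name in watch) && $2 == "on") print name
        }'
}
```

For example: "ethtool -k eno1 | still_on scatter-gather tcp-segmentation-offload generic-receive-offload". With the default settings listed earlier, all three names are printed. After the workaround has been applied, the command should print nothing.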

Jarod (jarod42)
affects: ubuntu → linux-lts-xenial (Ubuntu)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-lts-xenial (Ubuntu):
status: New → Confirmed
Dominik Röttsches (drott) wrote :

This affects me with kernel 5.8.0-36-generic and Ubuntu 20.10.
