"Reset adapter unexpectedly" - NIC hangs using e1000e driver under average I/O

Bug #1391674 reported by gianfilippo
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

Using Ubuntu 14.04 LTS

> uname -a
Linux argeste 3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

/var/log/syslog:

Nov 11 23:29:00 argeste kernel: [846163.967778] e1000e 0000:00:19.0 eth1: Reset adapter unexpectedly
Nov 11 23:29:01 argeste kernel: [846163.982417] xen-br1: port 1(eth1) entered disabled state
Nov 11 23:29:05 argeste kernel: [846168.028959] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Nov 11 23:29:05 argeste kernel: [846168.029085] xen-br1: port 1(eth1) entered forwarding state
Nov 11 23:29:05 argeste kernel: [846168.029094] xen-br1: port 1(eth1) entered forwarding state
Nov 11 23:29:08 argeste kernel: [846171.956810] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
Nov 11 23:29:08 argeste kernel: [846171.956810] TDH <0>
Nov 11 23:29:08 argeste kernel: [846171.956810] TDT <1>
Nov 11 23:29:08 argeste kernel: [846171.956810] next_to_use <1>
Nov 11 23:29:08 argeste kernel: [846171.956810] next_to_clean <0>
Nov 11 23:29:08 argeste kernel: [846171.956810] buffer_info[next_to_clean]:
Nov 11 23:29:08 argeste kernel: [846171.956810] time_stamp <10c9ab819>
Nov 11 23:29:08 argeste kernel: [846171.956810] next_to_watch <0>
Nov 11 23:29:08 argeste kernel: [846171.956810] jiffies <10c9ab9d1>
Nov 11 23:29:08 argeste kernel: [846171.956810] next_to_watch.status <0>
Nov 11 23:29:08 argeste kernel: [846171.956810] MAC Status <40080183>
Nov 11 23:29:08 argeste kernel: [846171.956810] PHY Status <796d>
Nov 11 23:29:08 argeste kernel: [846171.956810] PHY 1000BASE-T Status <3800>
Nov 11 23:29:08 argeste kernel: [846171.956810] PHY Extended Status <3000>
Nov 11 23:29:08 argeste kernel: [846171.956810] PCI Status <10>
Nov 11 23:29:10 argeste kernel: [846173.956722] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
Nov 11 23:29:10 argeste kernel: [846173.956722] TDH <0>
Nov 11 23:29:10 argeste kernel: [846173.956722] TDT <1>
Nov 11 23:29:10 argeste kernel: [846173.956722] next_to_use <1>
Nov 11 23:29:10 argeste kernel: [846173.956722] next_to_clean <0>
Nov 11 23:29:10 argeste kernel: [846173.956722] buffer_info[next_to_clean]:
Nov 11 23:29:10 argeste kernel: [846173.956722] time_stamp <10c9ab819>
Nov 11 23:29:10 argeste kernel: [846173.956722] next_to_watch <0>
Nov 11 23:29:10 argeste kernel: [846173.956722] jiffies <10c9abbc5>
Nov 11 23:29:10 argeste kernel: [846173.956722] next_to_watch.status <0>
Nov 11 23:29:10 argeste kernel: [846173.956722] MAC Status <40080183>
Nov 11 23:29:10 argeste kernel: [846173.956722] PHY Status <796d>
Nov 11 23:29:10 argeste kernel: [846173.956722] PHY 1000BASE-T Status <3800>
Nov 11 23:29:10 argeste kernel: [846173.956722] PHY Extended Status <3000>
Nov 11 23:29:10 argeste kernel: [846173.956722] PCI Status <10>
Nov 11 23:29:12 argeste kernel: [846175.956759] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
Nov 11 23:29:12 argeste kernel: [846175.956759] TDH <0>
Nov 11 23:29:12 argeste kernel: [846175.956759] TDT <1>
Nov 11 23:29:12 argeste kernel: [846175.956759] next_to_use <1>
Nov 11 23:29:12 argeste kernel: [846175.956759] next_to_clean <0>
Nov 11 23:29:12 argeste kernel: [846175.956759] buffer_info[next_to_clean]:
Nov 11 23:29:12 argeste kernel: [846175.956759] time_stamp <10c9ab819>
Nov 11 23:29:12 argeste kernel: [846175.956759] next_to_watch <0>
Nov 11 23:29:12 argeste kernel: [846175.956759] jiffies <10c9abdb9>
Nov 11 23:29:12 argeste kernel: [846175.956759] next_to_watch.status <0>
Nov 11 23:29:12 argeste kernel: [846175.956759] MAC Status <40080183>
Nov 11 23:29:12 argeste kernel: [846175.956759] PHY Status <796d>
Nov 11 23:29:12 argeste kernel: [846175.956759] PHY 1000BASE-T Status <3800>
Nov 11 23:29:12 argeste kernel: [846175.956759] PHY Extended Status <3000>
Nov 11 23:29:12 argeste kernel: [846175.956759] PCI Status <10>
Nov 11 23:29:14 argeste kernel: [846177.956686] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
Nov 11 23:29:14 argeste kernel: [846177.956686] TDH <0>
Nov 11 23:29:14 argeste kernel: [846177.956686] TDT <1>
Nov 11 23:29:14 argeste kernel: [846177.956686] next_to_use <1>
Nov 11 23:29:14 argeste kernel: [846177.956686] next_to_clean <0>
Nov 11 23:29:14 argeste kernel: [846177.956686] buffer_info[next_to_clean]:
Nov 11 23:29:14 argeste kernel: [846177.956686] time_stamp <10c9ab819>
Nov 11 23:29:14 argeste kernel: [846177.956686] next_to_watch <0>
Nov 11 23:29:14 argeste kernel: [846177.956686] jiffies <10c9abfad>
Nov 11 23:29:14 argeste kernel: [846177.956686] next_to_watch.status <0>
Nov 11 23:29:14 argeste kernel: [846177.956686] MAC Status <40080183>
Nov 11 23:29:14 argeste kernel: [846177.956686] PHY Status <796d>
Nov 11 23:29:14 argeste kernel: [846177.956686] PHY 1000BASE-T Status <3800>
Nov 11 23:29:14 argeste kernel: [846177.956686] PHY Extended Status <3000>
Nov 11 23:29:14 argeste kernel: [846177.956686] PCI Status <10>
Nov 11 23:29:14 argeste kernel: [846177.967764] e1000e 0000:00:19.0 eth1: Reset adapter unexpectedly
Nov 11 23:29:15 argeste kernel: [846177.987961] xen-br1: port 1(eth1) entered disabled state
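
The dumps above can be read quantitatively. The following is a minimal sketch, not part of the original report: it assumes that `time_stamp` is the jiffies value recorded when the stuck buffer was queued, so `jiffies - time_stamp` is how long that buffer has been waiting. The tick rate is not printed in the log, so the sketch estimates it from two consecutive dumps. The helper names (`parse_dumps`, `estimate_hz`, `stalled_seconds`) are hypothetical.

```python
import re

# Two consecutive dumps from the syslog above, trimmed to the fields used here.
LOG = """\
[846171.956810] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
[846171.956810] time_stamp <10c9ab819>
[846171.956810] jiffies <10c9ab9d1>
[846173.956722] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
[846173.956722] time_stamp <10c9ab819>
[846173.956722] jiffies <10c9abbc5>
"""

FIELD = re.compile(r"(time_stamp|jiffies)\s+<([0-9a-f]+)>")
KTIME = re.compile(r"\[\s*(\d+\.\d+)\]")


def parse_dumps(log):
    """Return one dict per hang dump with its kernel time and hex fields."""
    dumps = []
    for line in log.splitlines():
        if "Detected Hardware Unit Hang" in line:
            dumps.append({"ktime": float(KTIME.search(line).group(1))})
        elif dumps:
            match = FIELD.search(line)
            if match:
                dumps[-1][match.group(1)] = int(match.group(2), 16)
    return dumps


def estimate_hz(first, second):
    """Estimate the tick rate from the jiffies advance between two dumps."""
    ticks = second["jiffies"] - first["jiffies"]
    seconds = second["ktime"] - first["ktime"]
    return round(ticks / seconds)


def stalled_seconds(dump, hz):
    """Seconds the buffer at next_to_clean has waited (assumed meaning)."""
    return (dump["jiffies"] - dump["time_stamp"]) / hz


dumps = parse_dumps(LOG)
hz = estimate_hz(dumps[0], dumps[1])
for d in dumps:
    print(f"t={d['ktime']:.6f}s stalled for {stalled_seconds(d, hz):.2f}s (HZ={hz})")
```

With these two dumps the estimated tick rate is 250 Hz and the same buffer has been pending for about 1.76 s and then 3.76 s, which matches the unchanged `time_stamp` across successive dumps.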

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-39-generic 3.13.0-39.66
ProcVersionSignature: Ubuntu 3.13.0-39.66-generic 3.13.11.8
Uname: Linux 3.13.0-39-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 nov 2 04:26 seq
 crw-rw---- 1 root audio 116, 33 nov 2 04:26 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.14.1-0ubuntu3.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory: 'iw'
Date: Tue Nov 11 23:30:17 2014
HibernationDevice: RESUME=UUID=327b5850-ea70-4b94-8205-9c64aeb99e19
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 0557:2221 ATEN International Co., Ltd Winbond Hermon
 Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Supermicro X9SCL/X9SCM
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=it_IT.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: placeholder root=UUID=74a1154f-b6bc-49fa-bcf4-4e8edf793248 ro quiet splash
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-39-generic N/A
 linux-backports-modules-3.13.0-39-generic N/A
 linux-firmware 1.127.7
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to trusty on 2014-11-02 (9 days ago)
dmi.bios.date: 09/17/2012
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.0b
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: X9SCL/X9SCM
dmi.board.vendor: Supermicro
dmi.board.version: 1.11A
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.0b:bd09/17/2012:svnSupermicro:pnX9SCL/X9SCM:pvr0123456789:rvnSupermicro:rnX9SCL/X9SCM:rvr1.11A:cvnSupermicro:ct3:cvr0123456789:
dmi.product.name: X9SCL/X9SCM
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

gianfilippo (gianfi) wrote :
description: updated
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
penalvch (penalvch)
Changed in linux (Ubuntu):
importance: Undecided → Low
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
gianfilippo (gianfi) wrote :

Hello Cristopher,
I have managed to update the firmware of all the servers. The issue is still happening:

# dmidecode -s bios-version && sudo dmidecode -s bios-release-date
2.10
01/09/2014

# uname -a
Linux <hostname> 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

After starting a high-bandwidth load (backups):

[95407.726267] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
[95407.726267] TDH <0>
[95407.726267] TDT <1>
[95407.726267] next_to_use <1>
[95407.726267] next_to_clean <0>
[95407.726267] buffer_info[next_to_clean]:
[95407.726267] time_stamp <1016abfd3>
[95407.726267] next_to_watch <0>
[95407.726267] jiffies <1016ac6eb>
[95407.726267] next_to_watch.status <0>
[95407.726267] MAC Status <40080183>
[95407.726267] PHY Status <796d>
[95407.726267] PHY 1000BASE-T Status <3800>
[95407.726267] PHY Extended Status <3000>
[95407.726267] PCI Status <10>
[95409.726297] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
[95409.726297] TDH <0>
[95409.726297] TDT <1>
[95409.726297] next_to_use <1>
[95409.726297] next_to_clean <0>
[95409.726297] buffer_info[next_to_clean]:
[95409.726297] time_stamp <1016abfd3>
[95409.726297] next_to_watch <0>
[95409.726297] jiffies <1016ac8df>
[95409.726297] next_to_watch.status <0>
[95409.726297] MAC Status <40080183>
[95409.726297] PHY Status <796d>
[95409.726297] PHY 1000BASE-T Status <3800>
[95409.726297] PHY Extended Status <3000>
[95409.726297] PCI Status <10>
[95409.729969] e1000e 0000:00:19.0 eth1: Reset adapter unexpectedly

Changed in linux (Ubuntu):
status: Expired → Confirmed
penalvch (penalvch) wrote :

gianfilippo, could you please test the latest upstream kernel, listed at the very top of the page (the release names are irrelevant for testing, and please do not test the daily folder), following https://wiki.ubuntu.com/KernelMainlineBuilds ? This will allow additional upstream developers to examine the issue.

If that kernel does not let you test for the issue (e.g. you cannot boot into the OS), please note this in a comment on your report and continue with the next most recent kernel version until you can test for the issue. Once you have tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags by clicking on the yellow circle with a black pencil icon, next to the word Tags, located at the bottom of the report description:
kernel-fixed-upstream
kernel-fixed-upstream-3.XY-rcZ

Where XY and Z are numbers corresponding to the kernel version.

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-3.XY-rcZ
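
As an illustration of the naming scheme above, here is a small sketch that derives the two tags from a mainline kernel release string as printed by `uname -r`. The function name `upstream_tags` is hypothetical, and the handling of non-rc releases (dropping the `-rcZ` suffix) is an assumption, since the template above only shows the rc form.

```python
import re

# Matches mainline build releases such as "3.19.0-031900rc6-generic".
RELEASE = re.compile(r"^(\d+\.\d+)\.\d+-\d+(?:rc(\d+))?-")


def upstream_tags(uname_release, fixed):
    """Build the pair of Launchpad tags for a tested mainline kernel."""
    match = RELEASE.match(uname_release)
    if match is None:
        raise ValueError(f"not a mainline release string: {uname_release!r}")
    version, rc = match.groups()
    prefix = "kernel-fixed-upstream" if fixed else "kernel-bug-exists-upstream"
    suffix = f"{version}-rc{rc}" if rc else version  # non-rc form is assumed
    return [prefix, f"{prefix}-{suffix}"]


print(upstream_tags("3.19.0-031900rc6-generic", fixed=False))
```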

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results.

Thank you for your understanding.

tags: added: latest-bios-2.10
Changed in linux (Ubuntu):
importance: Low → High
status: Confirmed → Incomplete
gianfilippo (gianfi) wrote :

I switched my 3 Supermicro X9SCL/X9SCM servers to 3.19.0-031900rc6-generic and the issue still occurs under heavy I/O (during backups on DRBD-backed Xen guests). It happens on all the servers.

Feb 1 00:47:55 hostname kernel: [179433.583173] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
Feb 1 00:47:55 hostname kernel: [179433.583173] TDH <7c>
Feb 1 00:47:55 hostname kernel: [179433.583173] TDT <89>
Feb 1 00:47:55 hostname kernel: [179433.583173] next_to_use <89>
Feb 1 00:47:55 hostname kernel: [179433.583173] next_to_clean <7a>
Feb 1 00:47:55 hostname kernel: [179433.583173] buffer_info[next_to_clean]:
Feb 1 00:47:55 hostname kernel: [179433.583173] time_stamp <102ab4e95>
Feb 1 00:47:55 hostname kernel: [179433.583173] next_to_watch <7c>
Feb 1 00:47:55 hostname kernel: [179433.583173] jiffies <102ab5089>
Feb 1 00:47:55 hostname kernel: [179433.583173] next_to_watch.status <0>
Feb 1 00:47:55 hostname kernel: [179433.583173] MAC Status <40080183>
Feb 1 00:47:55 hostname kernel: [179433.583173] PHY Status <796d>
Feb 1 00:47:55 hostname kernel: [179433.583173] PHY 1000BASE-T Status <3800>
Feb 1 00:47:55 hostname kernel: [179433.583173] PHY Extended Status <3000>
Feb 1 00:47:55 hostname kernel: [179433.583173] PCI Status <10>

>> uname -a
Linux hostname 3.19.0-031900rc6-generic #201501261152 SMP Mon Jan 26 16:53:27 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
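
For reading the ring pointers in a dump like this one, a short sketch follows. It relies on the usual meaning of the registers (TDH is the head that the hardware advances, TDT is the tail that the driver advances) and assumes a transmit ring of 256 descriptors. The ring size is an assumption, not something the log states, and `ethtool -g eth1` would show the actual value. The function name is hypothetical.

```python
def pending_descriptors(tdh_hex, tdt_hex, ring_size=256):
    """Descriptors handed to the NIC but not yet fetched: (TDT - TDH) mod ring size.

    ring_size=256 is an assumed default, not a value taken from the log.
    """
    return (int(tdt_hex, 16) - int(tdh_hex, 16)) % ring_size


# Values from the 3.19-rc6 dump above: TDH <7c>, TDT <89>.
print(pending_descriptors("7c", "89"))
# Values from the 3.13 dumps: TDH <0>, TDT <1>.
print(pending_descriptors("0", "1"))
```

The modulo keeps the result correct when the tail has wrapped around past the end of the ring while the head has not.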

tags: added: kernel-bug-exists-upstream kernel-bug-exists-upstream-3.19-rc6
gianfilippo (gianfi) wrote :

On a side note: the issue does not happen with the older kernel linux-image-3.5.0-45-generic_3.5.0-45.68~precise1.

gianfilippo (gianfi)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
penalvch (penalvch) wrote :

gianfilippo, the next step is to fully commit bisect from kernel 3.5 to 3.13 in order to identify the last good kernel commit, followed immediately by the first bad one. This will allow for a more expedited analysis of the root cause of your issue. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection ?

Please note, finding adjacent kernel versions is not fully commit bisecting.
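
To make the distinction concrete, here is a toy sketch of what a commit bisection does. It is not the procedure from the wiki page, only the underlying binary search: given an ordered list of commits with a known good start and a known bad end, it narrows down to the adjacent pair (last good, first bad). The commit names, the `is_bad` test and the position of the regression are all made up for illustration, and the sketch assumes a linear history in which every commit after the first bad one is also bad.

```python
def bisect_first_bad(commits, is_bad):
    """Return (last_good, first_bad) from commits ordered oldest to newest.

    Assumes commits[0] is good, commits[-1] is bad, and that the bug
    never disappears again once introduced.
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(commits[mid]):  # in practice: build, boot, run the backup load
            bad = mid
        else:
            good = mid
    return commits[good], commits[bad], steps


# Hypothetical history of 1000 commits with the regression introduced at c637.
commits = [f"c{i}" for i in range(1000)]
last_good, first_bad, steps = bisect_first_bad(commits, lambda c: int(c[1:]) >= 637)
print(last_good, first_bad, steps)
```

Two adjacent release versions can still be separated by hundreds of commits, which is why the search has to continue down to individual commits before the result is useful.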

Thank you for your understanding.

Helpful bug reporting tips:
https://wiki.ubuntu.com/ReportingBugs

tags: added: needs-bisect regression-release
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Halvor Lyche Strandvoll (halvors) wrote :

Any progress on this? I know a lot of users are still experiencing this bug.

Changed in linux (Ubuntu):
status: Expired → New
Brad Figg (brad-figg) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
penalvch (penalvch) wrote :

Halvor Lyche Strandvoll, it would help immensely if you filed a new report with the Ubuntu repository kernel (not mainline/upstream) via a terminal:
ubuntu-bug linux

Please feel free to subscribe me to it.

For more on why this is helpful, please see https://wiki.ubuntu.com/ReportingBugs.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired