igb Detected Tx Unit Hang
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux-lts-utopic (Ubuntu) | | High | Unassigned | |
| Trusty | | High | Luis Henriques | |
Bug Description
Hello!
I have:
>lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty
with kernel 3.16.0-46-generic.
Today I did a dist-upgrade and the kernel was upgraded to version 3.16.0-48-generic.
After rebooting I got this:
Sep 4 09:02:52 mail kernel: [ 310.616324] igb 0000:02:00.0 em1: Reset adapter
Sep 4 09:02:52 mail kernel: [ 310.831157] igb 0000:02:00.1 em2: Reset adapter
Sep 4 09:02:56 mail kernel: [ 315.154686] igb 0000:02:00.0 em1: igb: em1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Sep 4 09:02:56 mail kernel: [ 315.202651] igb 0000:02:00.1 em2: igb: em2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Sep 4 09:03:02 mail kernel: [ 321.608099] igb 0000:02:00.0: Detected Tx Unit Hang
Sep 4 09:03:02 mail kernel: [ 321.608099] Tx Queue <6>
Sep 4 09:03:02 mail kernel: [ 321.608099] TDH <23>
Sep 4 09:03:02 mail kernel: [ 321.608099] TDT <23>
Sep 4 09:03:02 mail kernel: [ 321.608099] next_to_use <25>
Sep 4 09:03:02 mail kernel: [ 321.608099] next_to_clean <23>
Sep 4 09:03:02 mail kernel: [ 321.608099] buffer_
Sep 4 09:03:02 mail kernel: [ 321.608099] time_stamp <1000012af>
Sep 4 09:03:02 mail kernel: [ 321.608099] next_to_watch <ffff880272571240>
Sep 4 09:03:02 mail kernel: [ 321.608099] jiffies <100001531>
Sep 4 09:03:02 mail kernel: [ 321.608099] desc.status <120200>
Sep 4 09:03:04 mail kernel: [ 323.607349] igb 0000:02:00.0: Detected Tx Unit Hang
Sep 4 09:03:04 mail kernel: [ 323.607349] Tx Queue <6>
Sep 4 09:03:04 mail kernel: [ 323.607349] TDH <23>
Sep 4 09:03:04 mail kernel: [ 323.607349] TDT <23>
Sep 4 09:03:04 mail kernel: [ 323.607349] next_to_use <25>
Sep 4 09:03:04 mail kernel: [ 323.607349] next_to_clean <23>
Sep 4 09:03:04 mail kernel: [ 323.607349] buffer_
Sep 4 09:03:04 mail kernel: [ 323.607349] time_stamp <1000012af>
Sep 4 09:03:04 mail kernel: [ 323.607349] next_to_watch <ffff880272571240>
Sep 4 09:03:04 mail kernel: [ 323.607349] jiffies <100001725>
Sep 4 09:03:04 mail kernel: [ 323.607349] desc.status <120200>
Sep 4 09:03:06 mail kernel: [ 325.606602] igb 0000:02:00.0: Detected Tx Unit Hang
Sep 4 09:03:06 mail kernel: [ 325.606602] Tx Queue <6>
Sep 4 09:03:06 mail kernel: [ 325.606602] TDH <23>
Sep 4 09:03:06 mail kernel: [ 325.606602] TDT <23>
Sep 4 09:03:06 mail kernel: [ 325.606602] next_to_use <25>
Sep 4 09:03:06 mail kernel: [ 325.606602] next_to_clean <23>
Sep 4 09:03:06 mail kernel: [ 325.606602] buffer_
Sep 4 09:03:06 mail kernel: [ 325.606602] time_stamp <1000012af>
Sep 4 09:03:06 mail kernel: [ 325.606602] next_to_watch <ffff880272571240>
Sep 4 09:03:06 mail kernel: [ 325.606602] jiffies <100001919>
Sep 4 09:03:06 mail kernel: [ 325.606602] desc.status <120200>
All network connections dropped after that, and the system stays unusable.
Only after booting with the old linux-image-
It's a critical bug for me; can anybody help?
ethtool -i em1
driver: igb
version: 5.2.13-k
firmware-version: 1.61, 0x80000cd5, 1.949.0
bus-info: 0000:02:00.0
supports-
supports-test: yes
supports-
supports-
supports-
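The three hang reports above appear to describe one stuck transmit buffer rather than three separate events: time_stamp <1000012af> and next_to_watch stay the same while jiffies keeps advancing. Subtracting the two hex counters shows how long that buffer has been pending. Consecutive reports are 500 jiffies (0x1725 - 0x1531) and about 2 seconds apart, which suggests this kernel runs with HZ=250; that value is inferred from the log, not read from the kernel config. A minimal POSIX shell sketch (stall_jiffies and stall_ms are hypothetical helper names, not part of any tool):

```sh
#!/bin/sh
# Hypothetical helpers for reading an igb "Detected Tx Unit Hang" report.
# Arguments are the hex values printed between <...>, without a 0x prefix.

# stall_jiffies JIFFIES TIME_STAMP
# Prints how many jiffies the watched Tx buffer has been pending.
stall_jiffies() {
    echo $(( 0x$1 - 0x$2 ))
}

# stall_ms JIFFIES TIME_STAMP HZ
# Converts the same difference to milliseconds for a given HZ.
# HZ=250 below is an assumption inferred from the log timestamps.
stall_ms() {
    echo $(( (0x$1 - 0x$2) * 1000 / $3 ))
}

# First report above: jiffies <100001531>, time_stamp <1000012af>
stall_jiffies 100001531 1000012af   # 642
stall_ms 100001531 1000012af 250    # 2568
```

Under that HZ=250 assumption, the first report corresponds to roughly 2.6 seconds without the buffer being cleaned, and the third (jiffies <100001919>) to 1642 jiffies, roughly 6.6 seconds.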
| Dzmitry Shykuts (boot0user) wrote : | #1 |
| description: | updated |
| description: | updated |
| description: | updated |
| Launchpad Janitor (janitor) wrote : | #2 |
| Changed in linux-lts-utopic (Ubuntu): | |
| status: | New → Confirmed |
| gollum53 (smid) wrote : | #3 |
I have the same problem: same driver, distro, and kernel. I had to revert to the older kernel. My motherboard with the NICs is an X9DRD-7JLN4F.
| Mark Sapiro (msapiro) wrote : | #4 |
I have the same issue with similar kern.log entries after upgrading to kernel 3.16.0-48. Removing that and falling back to 3.16.0-46 fixed it for me.
| wizhippo (wizhippo) wrote : | #5 |
I have a very similar issue running in Hyper-V. Networking stops after a minute or two. Reverting to 3.16.0-46 fixes the issue.
| Guy Baconniere (lordbaco) wrote : | #6 |
This bug should be a top priority because people will hit it as soon as they reboot
their 14.04 LTS with an Intel Gigabit NIC and the "current" Utopic kernel (3.16.0-
I had the same problem on an HP ProLiant DL380e Gen8 with an Intel I350 Gigabit NIC
(Hewlett-Packard Company Ethernet 1Gb 4-port 366i Adapter).
It was hard to get a shell with 30-50% packet loss and the igb driver resetting ALL NICs:
"blind-typing" at the shell and waiting 1-2 minutes to get the output... of dmesg ;-)
I switched the 14.04 LTS kernel from Utopic to Vivid and everything is working again.
The workaround is:
screen -S kernel
apt-get -y purge linux-{
apt-get -y install linux-image-
reboot
Output of dmesg:
igb 0000:02:00.1 em2: igb: em2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
igb 0000:02:00.1: Detected Tx Unit Hang
Tx Queue <3>
TDH <0>
TDT <0>
next_to_use <4>
next_to_clean <0>
buffer_
time_stamp <100039540>
next_to_watch <ffff880230b0e030>
jiffies <1000397b6>
desc.status <0>
igb 0000:02:00.1 em2: Reset adapter
igb 0000:02:00.2 em3: Reset adapter
igb 0000:02:00.0 em1: Reset adapter
igb 0000:02:00.0 em1: igb: em1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
igb 0000:02:00.1 em2: igb: em2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
igb 0000:02:00.1: Detected Tx Unit Hang
| Guy Baconniere (lordbaco) wrote : | #7 |
| Guy Baconniere (lordbaco) wrote : | #8 |
Maybe it's another kernel regression involving Intel NICs and TSO,
so you can try this:
# tso => tcp-segmentatio
# gso => generic-
# gro => generic-
# sg => scatter-gather
# ufo => udp-fragmentati
# lro => large-receive-
ethtool -K em1 tso off gso off gro off sg off
ethtool -K em2 tso off gso off gro off sg off
ethtool -K em3 tso off gso off gro off sg off
ethtool -K em4 tso off gso off gro off sg off
# ethtool -K eth0 tso off gso off gro off sg off
# ...
Add this to each iface in /etc/network/
pre-up /sbin/ethtool -K $IFACE tso off gso off gro off sg off || true
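Whether the ethtool -K settings actually took effect can be checked with ethtool -k (lowercase), which lists each feature as "name: on" or "name: off". A minimal sketch, assuming that output format with sub-features indented under their parent; offloads_still_on is a hypothetical helper name:

```sh
#!/bin/sh
# Hypothetical helper: reads `ethtool -k IFACE` output on stdin and prints
# which of the four offloads disabled above (TSO, GSO, GRO, SG) are still on.
# Only top-level lines are checked; indented sub-features such as
# tx-scatter-gather do not match the ^ anchor and are skipped.
# "on [fixed]" also counts as on.
offloads_still_on() {
    awk -F': ' '
        /^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|scatter-gather):/ &&
            $2 ~ /^on/ { print $1 }
    '
}

# Usage on a live system:
#   for i in em1 em2 em3 em4; do echo "$i: $(ethtool -k "$i" | offloads_still_on)"; done
```

Empty output for an interface means all four offloads are reported as off.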
| Dzmitry Shykuts (boot0user) wrote : | #9 |
I tried "ethtool -K em1 tso off gso off lro off" and it doesn't help.
The igb driver version is the same in the -46 and -48 kernels. It seems that something changed in the kernel, not in the igb driver.
| Guy Baconniere (lordbaco) wrote : | #10 |
If it's directly related to the igb module, maybe this:
LP: #1465653
https:/
https:/
https:/
If it's not directly related to the igb module, maybe it's linked to
hv_netvsc (Microsoft Hyper-V Network Virtual Service Client):
LP: #1454892
http://
http://
https:/
| Kunzhou (likunzhou) wrote : | #11 |
I have the same problem with Intel I210AT.
| Dzmitry Shykuts (boot0user) wrote : | #12 |
Personally, I prefer to install a new kernel by running "apt-get install linux-signed-
| Tedesco (tedesco-z) wrote : | #13 |
I have the same problem
FUJITSU Server PRIMERGY RX1330 M1
-cpu
product: Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
vendor: Intel Corp.
physical id: 1
bus info: cpu@0
size: 3100MHz
capacity: 3100MHz
width: 64 bits
-network
bus info: pci@0000:02:00.0
-network
bus info: pci@0000:03:00.0
| Torsten Gollnick (tngk) wrote : | #14 |
Same problem here with
Dell Inc. PowerEdge R730/0H21J3, BIOS 1.2.10 03/09/2015
and
Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
Renders the machine useless.
Kernel 3.16.0-43 is OK
| Guy Baconniere (lordbaco) wrote : | #15 |
@boot0user I agree with you. The best workaround for now is to update kernel to Vivid!
# Physical Server (with EFI):
sudo apt-get -y purge linux-{
sudo apt-get -y install linux-signed-
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-signed-
sudo apt-get -y purge linux-{
# Physical Server (without EFI, but signed is also fine):
sudo apt-get -y purge linux-{
sudo apt-get -y install linux-generic-
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-generic-
sudo apt-get -y purge linux-{
# Virtual Server:
sudo apt-get -y purge linux-{
sudo apt-get -y install linux-virtual-
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-virtual-
sudo apt-get -y purge linux-{
# (optional)
# If you want to clean old kernels after the reboot (issue 1267059, 1089195) :
dpkg --get-selections | awk '/linux-
| sort -r -V -t- -k3 | tail -n+4 \
| grep -v "$(uname -r | sed -e 's/-generic//')" \
| xargs -r apt-get -qq -y purge
| Luis Henriques (henrix) wrote : | #16 |
I believe the problem lies in a bad backport of a set of Hyper-V patches. I've uploaded a test kernel that simply reverts this Hyper-V patchset. Here's the URL:
http://
Could anyone please see if this kernel solves the issue? Thanks!
| Changed in linux-lts-utopic (Ubuntu Trusty): | |
| status: | New → Confirmed |
| assignee: | nobody → Luis Henriques (henrix) |
| Changed in linux-lts-utopic (Ubuntu Trusty): | |
| importance: | Undecided → High |
| Changed in linux-lts-utopic (Ubuntu): | |
| importance: | Undecided → High |
| tags: | added: kernel-key |
| rozie (rozie) wrote : | #18 |
3.16.0-48-generic #64~14.
| Rudy (rudys) wrote : | #19 |
sudo apt-get -y purge linux-{
sudo apt-get -y install linux-generic-
sudo reboot [0]
-------------
linux-
linux-
Use 'apt-get autoremove' to remove them.
| Changed in linux-lts-utopic (Ubuntu Trusty): | |
| status: | Confirmed → Fix Committed |
| Luis Henriques (henrix) wrote : | #20 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
| tags: | added: verification-needed-trusty |
| rozie (rozie) wrote : | #21 |
Tested 3.16.0-49-generic #65~14.04.1-Ubuntu SMP Wed Sep 9 10:03:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Looks stable for this issue: 523 packets transmitted, 523 received, 0% packet loss, time 526808ms
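A ping run like this one can be complemented by checking the kernel log for new hang reports afterwards. A sketch, assuming log lines in the format shown earlier in this report ("igb <PCI address>: Detected Tx Unit Hang"); count_tx_hangs is a hypothetical helper name that prints one line per affected PCI device:

```sh
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<pci-address> <count>" for every igb device that logged a Tx Unit Hang.
count_tx_hangs() {
    awk '/Detected Tx Unit Hang/ {
            for (i = 1; i < NF; i++)
                if ($i == "igb") {
                    dev = $(i + 1)
                    sub(/:$/, "", dev)   # "0000:02:00.0:" -> "0000:02:00.0"
                    n[dev]++
                    break
                }
         }
         END { for (d in n) print d, n[d] }' | sort
}

# Usage after a soak test (ping, iperf, ...):
#   dmesg | count_tx_hangs
#   count_tx_hangs < /var/log/kern.log
```

No output means no hang reports were found in the lines that were read.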
| Luis Henriques (henrix) wrote : | #22 |
As per comment #21, I'm tagging this bug as verified.
| tags: | added: verification-done-trusty removed: verification-needed-trusty |
| Dzmitry Shykuts (boot0user) wrote : | #23 |
Tested 3.16.0-49-generic #65~14.04.1-Ubuntu SMP Wed Sep 9 10:03:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux from trusty/proposed. Looks stable.
| Launchpad Janitor (janitor) wrote : | #24 |
This bug was fixed in the package linux-lts-utopic - 3.16.0-
---------------
linux-lts-utopic (3.16.0-
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1493759
[ Upstream Kernel Changes ]
* Revert "hv_netvsc: Use the xmit_more skb flag to optimize signaling the
host"
- LP: #1492146
* Revert "Drivers: hv: vmbus: Export the
vmbus_
- LP: #1492146
* Revert "Drivers: hv: vmbus: Suport an API to send pagebuffers with
additional control"
- LP: #1492146
* Revert "Drivers: hv: vmbus: Suport an API to send packet with
additional control"
- LP: #1492146
* Revert "hv_netvsc: Fix a bug in netvsc_
- LP: #1492146
* Revert "hv_netvsc: Implement partial copy into send buffer"
- LP: #1492146
* Revert "hv_netvsc: Fix the packet free when it is in skb headroom"
- LP: #1492146
* Revert "hv_netvsc: Eliminate memory allocation in the packet send path"
- LP: #1492146
* Revert "hv_netvsc: Cleanup the test for freeing skb when we use sendbuf
mechanism"
- LP: #1492146
* Revert "hv_netvsc: Implement batching in send buffer"
- LP: #1492146
* Revert "hyperv: fix sparse warnings"
- LP: #1492146
* Revert "hyperv: Add support for vNIC hot removal"
- LP: #1492146
* Revert "hyperv: Increase the buffer length for netvsc_
- LP: #1492146
* Revert "net: Remove ndo_xmit_flush netdev operation, use signalling
instead."
- LP: #1492146
-- Luis Henriques <email address hidden> Wed, 09 Sep 2015 10:28:29 +0100
| Changed in linux-lts-utopic (Ubuntu Trusty): | |
| status: | Fix Committed → Fix Released |
| Joseph Salisbury (jsalisbury) wrote : | #26 |
The commits that caused this bug were introduced by the fixes for bug 1454892.
I've created a new test kernel for bug 1454892, but I would like to ensure it does not introduce this regression again. Could folks affected by this bug test my new test kernel? It can be downloaded from:
http://
Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.
Thanks in advance!
| Mark Sapiro (msapiro) wrote : | #27 |
I have installed
linux-image-
linux-image-
from http://
| Mark Sapiro (msapiro) wrote : | #28 |
I spoke too soon. It took about 20 minutes for the issue to develop, but it has reappeared just as before with the new kernel.
| Joseph Salisbury (jsalisbury) wrote : | #29 |
Thanks for your help testing, Mark. I'll investigate further.
| mrk (cvs-src) wrote : | #30 |
Hello,
Any news on this one? We are also experiencing this problem on two servers with kernel 3.19.0-33-generic #38~14.04.1-Ubuntu. Is there anything we can do to get this fixed ASAP? I'm open to any tests. Thank you!
| mrk (cvs-src) wrote : | #31 |
The igb hang appears sporadically, roughly once every two days. I can't reliably reproduce it; I can only wait a couple of days until it breaks.
We are seeing this on Ubuntu 14.04 with Xen hypervisor 4.4.2-0ubuntu0.
| Joseph Salisbury (jsalisbury) wrote : | #32 |
@Mark
Thanks for testing my last kernel and confirming the regression still exists.
I've created one more test kernel for bug 1454892. Could you and any other folks affected by this bug test my new test kernel? It can be downloaded from:
http://
Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.
Thanks again!
| Mark Sapiro (msapiro) wrote : | #33 |
Sorry to report that I have the same issue after installing linux-image-
| Joseph Salisbury (jsalisbury) wrote : | #34 |
Thanks again for testing, Mark. I've created one more test kernel. This kernel makes no changes to the igb code at all, so if the bug does not exist with your current up-to-date kernel, it shouldn't occur with the test kernel.
Could you and any other folks affected by this bug test my new test kernel? It can be downloaded from:
http://
Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.
| Mark Sapiro (msapiro) wrote : | #35 |
I have installed linux-image-
It's been running without issues for significantly longer than the versions with problems ever did. I will continue to monitor and will report again.
| Mark Sapiro (msapiro) wrote : | #36 |
My server has been running on this kernel (3.16.0-56-generic) for almost 24 hours now with no recurrence of the igb Tx Unit Hang.
I'm still monitoring, but it looks like this kernel is stable on my server.
| Mark Sapiro (msapiro) wrote : | #37 |
My server has now been running on this kernel (3.16.0-56-generic) for over 48 hours with no recurrence of the igb Tx Unit Hang.
I think we can say it's working for me.
| Changed in linux-lts-utopic (Ubuntu): | |
| status: | Confirmed → Fix Released |
| Roland Sommer (rsommer) wrote : | #38 |
Hi, I'm encountering the same or a similar bug on Xenial 4.4.0-28-generic. If I apply network load via iperf, I can reproducibly make the unit (Intel 210i) hang. Maybe this is a regression, or another bug. The network interface does not recover; I have to reboot the machine to get it back online. dmesg output attached.
| Manuel Hilbing (manuel-hilbing) wrote : | #39 |
Hi rsommer,
Are you using the ASRock C2550D4I?
I am currently hunting the same problem on Ubuntu and on Debian.
Some related links: ...
https:/
http://
http://
http://
It could be a hardware problem on this specific board (ASRock C2550D4I).
| Roland Sommer (rsommer) wrote : | #40 |
I am using the C2550. I just tried the "disable Intel SpeedStep and C-states" hint, but within 60 seconds I got the Tx Unit Hang again.
| Manuel Hilbing (manuel-hilbing) wrote : | #41 |
You could try compiling the igb driver as a DKMS module.
My solution is to run the working 3.2 kernel on Debian Wheezy.
I read that the kernel PCIe code was changed in newer kernels, and the igb behind the PLX 8608 bridge chip has problems.
You can try the following kernel parameter:
pcie_aspm=off
https:/
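On Ubuntu, a boot parameter such as pcie_aspm=off is usually made persistent by adding it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running sudo update-grub. After rebooting, it is worth confirming that the parameter actually reached the kernel: /proc/cmdline holds the command line the running kernel was booted with. This only checks the command line, not whether ASPM is really off on every link. A small sketch; has_kernel_param is a hypothetical helper name:

```sh
#!/bin/sh
# Hypothetical helper: succeeds if PARAM appears as a whole word on the
# kernel command line. CMDLINE defaults to /proc/cmdline (the running kernel).
# has_kernel_param PARAM [CMDLINE]
has_kernel_param() {
    cmdline=${2:-$(cat /proc/cmdline)}
    # Pad with spaces so only whole, space-separated words match.
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage after reboot:
#   has_kernel_param pcie_aspm=off && echo "pcie_aspm=off is on the command line"
```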
Today I contacted ASRock (Rack) support and asked about the problem.
| Roland Sommer (rsommer) wrote : | #42 |
I just tried booting with pcie_aspm=off. It froze 7 seconds after starting iperf. The funny thing is that I'm also using the igb driver on the other side of the test, but with an I354 controller.
| no longer affects: | linux-lts-xenial (Ubuntu) |
| no longer affects: | linux-lts-xenial (Ubuntu Trusty) |
| Roland Sommer (rsommer) wrote : | #43 |
Strictly speaking, the "no longer affects" is not correct; the problem was that the bug had been assigned to the wrong source package.
| Manuel Hilbing (manuel-hilbing) wrote : | #44 |
Answer from ASRock:
"I think that this is more of an issue with the kernel. I can ask the engineers to look into it, but as this OS is not on the tested list, this does not warrant an RMA."
| Roland Sommer (rsommer) wrote : | #45 |
I got a replacement board and the error seems to be gone. I ran iperf for over an hour and no hang was detected.
| Manuel Hilbing (manuel-hilbing) wrote : | #46 |
@Roland Sommer
Can you check which board revision you got?


Status changed to 'Confirmed' because the bug affects multiple users.