igb Detected Tx Unit Hang

Bug #1492146 reported by Dzmitry Shykuts
This bug affects 17 people
Affects                     Status         Importance   Assigned to      Milestone
linux-lts-utopic (Ubuntu)   Fix Released   High         Unassigned
Trusty                      Fix Released   High         Luis Henriques

Bug Description

Hello!

I have:

>lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

with kernel 3.16.0-46-generic.

Today I ran dist-upgrade and the kernel was upgraded to version 3.16.0-48-generic.

After the reboot I got this:

Sep 4 09:02:52 mail kernel: [ 310.616324] igb 0000:02:00.0 em1: Reset adapter
Sep 4 09:02:52 mail kernel: [ 310.831157] igb 0000:02:00.1 em2: Reset adapter
Sep 4 09:02:56 mail kernel: [ 315.154686] igb 0000:02:00.0 em1: igb: em1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Sep 4 09:02:56 mail kernel: [ 315.202651] igb 0000:02:00.1 em2: igb: em2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Sep 4 09:03:02 mail kernel: [ 321.608099] igb 0000:02:00.0: Detected Tx Unit Hang
Sep 4 09:03:02 mail kernel: [ 321.608099] Tx Queue <6>
Sep 4 09:03:02 mail kernel: [ 321.608099] TDH <23>
Sep 4 09:03:02 mail kernel: [ 321.608099] TDT <23>
Sep 4 09:03:02 mail kernel: [ 321.608099] next_to_use <25>
Sep 4 09:03:02 mail kernel: [ 321.608099] next_to_clean <23>
Sep 4 09:03:02 mail kernel: [ 321.608099] buffer_info[next_to_clean]
Sep 4 09:03:02 mail kernel: [ 321.608099] time_stamp <1000012af>
Sep 4 09:03:02 mail kernel: [ 321.608099] next_to_watch <ffff880272571240>
Sep 4 09:03:02 mail kernel: [ 321.608099] jiffies <100001531>
Sep 4 09:03:02 mail kernel: [ 321.608099] desc.status <120200>
Sep 4 09:03:04 mail kernel: [ 323.607349] igb 0000:02:00.0: Detected Tx Unit Hang
Sep 4 09:03:04 mail kernel: [ 323.607349] Tx Queue <6>
Sep 4 09:03:04 mail kernel: [ 323.607349] TDH <23>
Sep 4 09:03:04 mail kernel: [ 323.607349] TDT <23>
Sep 4 09:03:04 mail kernel: [ 323.607349] next_to_use <25>
Sep 4 09:03:04 mail kernel: [ 323.607349] next_to_clean <23>
Sep 4 09:03:04 mail kernel: [ 323.607349] buffer_info[next_to_clean]
Sep 4 09:03:04 mail kernel: [ 323.607349] time_stamp <1000012af>
Sep 4 09:03:04 mail kernel: [ 323.607349] next_to_watch <ffff880272571240>
Sep 4 09:03:04 mail kernel: [ 323.607349] jiffies <100001725>
Sep 4 09:03:04 mail kernel: [ 323.607349] desc.status <120200>
Sep 4 09:03:06 mail kernel: [ 325.606602] igb 0000:02:00.0: Detected Tx Unit Hang
Sep 4 09:03:06 mail kernel: [ 325.606602] Tx Queue <6>
Sep 4 09:03:06 mail kernel: [ 325.606602] TDH <23>
Sep 4 09:03:06 mail kernel: [ 325.606602] TDT <23>
Sep 4 09:03:06 mail kernel: [ 325.606602] next_to_use <25>
Sep 4 09:03:06 mail kernel: [ 325.606602] next_to_clean <23>
Sep 4 09:03:06 mail kernel: [ 325.606602] buffer_info[next_to_clean]
Sep 4 09:03:06 mail kernel: [ 325.606602] time_stamp <1000012af>
Sep 4 09:03:06 mail kernel: [ 325.606602] next_to_watch <ffff880272571240>
Sep 4 09:03:06 mail kernel: [ 325.606602] jiffies <100001919>
Sep 4 09:03:06 mail kernel: [ 325.606602] desc.status <120200>
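
The time_stamp and jiffies fields in the dump above are hexadecimal tick counters, so subtracting one from the other gives a rough idea of how long the descriptor at next_to_clean has been pending. A minimal sketch using the values from the first and third reports; the helper name pending_ticks is made up for illustration. Converting ticks to seconds depends on the kernel's tick rate, which this report does not state, although the counter advancing by 500 between reports logged about 2 seconds apart is consistent with 250 ticks per second.

```shell
# pending_ticks: hypothetical helper, prints jiffies - time_stamp in decimal.
# Both arguments are hex strings as printed in the log, without the <> brackets.
pending_ticks() {
    local time_stamp=$1 jiffies=$2
    echo $(( 0x$jiffies - 0x$time_stamp ))
}

pending_ticks 1000012af 100001531   # first report: prints 642
pending_ticks 1000012af 100001919   # third report: prints 1642
```

The growing difference with an unchanged time_stamp suggests the same descriptor stays stuck across reports rather than new ones hanging each time.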

All network connections dropped after that, and the system remains unusable.

My production mail server works only after booting the old linux-image-3.16.0-46-generic kernel.

This is a critical bug for me. Can anybody help?

ethtool -i em1
driver: igb
version: 5.2.13-k
firmware-version: 1.61, 0x80000cd5, 1.949.0
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Dzmitry Shykuts (boot0user) wrote :
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-lts-utopic (Ubuntu):
status: New → Confirmed
gollum53 (smid) wrote :

I have the same problem: same driver, distro, and kernel. I had to revert to the older kernel. My motherboard with the NICs is X9DRD-7JLN4F.

Mark Sapiro (msapiro) wrote :

I have the same issue with similar kern.log entries after upgrading to kernel 3.16.0-48. Removing that and falling back to 3.16.0-46 fixed it for me.

wizhippo (wizhippo) wrote :

I have a very similar issue running in Hyper-V. Networking stops after a minute or two. Reverting to 3.16.0-46 fixes the issue.

B. (b-deactivatedaccount-deactivatedaccount) wrote :

This bug should be a top priority because people will suffer from it as soon as they reboot
their 14.04 LTS with an Intel Gigabit NIC and the "current" Utopic kernel (3.16.0-48-generic).
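
A quick way to tell whether a host is already running the affected build is to compare the kernel release string against 3.16.0-48-generic, the release named above. A minimal sketch; the function name is_affected_release is hypothetical, and it checks only the release string, not whether an Intel NIC driven by igb is present.

```shell
# is_affected_release: hypothetical helper.
# Returns 0 (true) only for the kernel release named in this report.
is_affected_release() {
    [ "$1" = "3.16.0-48-generic" ]
}

if is_affected_release "$(uname -r)"; then
    echo "running 3.16.0-48-generic: this host may hit the Tx Unit Hang"
else
    echo "not running 3.16.0-48-generic"
fi
```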

I had the same problem with an HP ProLiant DL380e Gen8 with an Intel I350 Gigabit NIC
(Hewlett-Packard Company Ethernet 1Gb 4-port 366i Adapter)

It was hard to get a shell with 30-50% packet loss and the igb driver resetting ALL NICs...
"blind-typing" on the shell and waiting 1-2 minutes to get the output... of dmesg ;-)

I updated the kernel of 14.04 LTS from Utopic to Vivid and everything is working again.

Workaround is:
screen -S kernel
apt-get -y purge linux-{headers,image,image-extra}-3.16.0-48-generic
apt-get -y install linux-image-generic-lts-vivid linux-headers-generic-lts-vivid
reboot

Output of dmesg:
 igb 0000:02:00.1 em2: igb: em2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
 igb 0000:02:00.1: Detected Tx Unit Hang
   Tx Queue <3>
   TDH <0>
   TDT <0>
   next_to_use <4>
   next_to_clean <0>
 buffer_info[next_to_clean]
   time_stamp <100039540>
   next_to_watch <ffff880230b0e030>
   jiffies <1000397b6>
   desc.status <0>
igb 0000:02:00.1 em2: Reset adapter
igb 0000:02:00.2 em3: Reset adapter
igb 0000:02:00.0 em1: Reset adapter
igb 0000:02:00.0 em1: igb: em1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
igb 0000:02:00.1 em2: igb: em2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
igb 0000:02:00.1: Detected Tx Unit Hang
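
To gauge how often each port hangs, the "Detected Tx Unit Hang" lines in output like the above can be counted per PCI address. A minimal sketch, assuming the log format shown in this report (the PCI address follows the word "igb"); the function name count_tx_hangs is made up for illustration.

```shell
# count_tx_hangs: hypothetical helper.
# Reads kernel log text on stdin and prints "<pci-address> <count>" per device.
count_tx_hangs() {
    awk '/Detected Tx Unit Hang/ {
             for (i = 1; i < NF; i++) {
                 if ($i == "igb") {
                     dev = $(i + 1)
                     sub(/:$/, "", dev)   # drop the trailing colon
                     hangs[dev]++
                     break
                 }
             }
         }
         END { for (d in hangs) print d, hangs[d] }' | sort
}

# Typical use on a live system: dmesg | count_tx_hangs
printf '%s\n' \
    'igb 0000:02:00.1: Detected Tx Unit Hang' \
    'igb 0000:02:00.1 em2: Reset adapter' \
    'igb 0000:02:00.1: Detected Tx Unit Hang' | count_tx_hangs
# prints: 0000:02:00.1 2
```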

B. (b-deactivatedaccount-deactivatedaccount) wrote :

Maybe it's another kernel regression involving Intel NICs and TSO,
so you can try this:

# tso => tcp-segmentation-offload
# gso => generic-segmentation-offload
# gro => generic-receive-offload
# sg => scatter-gather
# ufo => udp-fragmentation-offload (Cannot change)
# lro => large-receive-offload (Cannot change)
ethtool -K em1 tso off gso off gro off sg off
ethtool -K em2 tso off gso off gro off sg off
ethtool -K em3 tso off gso off gro off sg off
ethtool -K em4 tso off gso off gro off sg off
# ethtool -K eth0 tso off gso off gro off sg off
# ...

Add this to each iface stanza in /etc/network/interfaces:
pre-up /sbin/ethtool -K $IFACE tso off gso off gro off sg off || true
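
The per-interface commands above can also be generated in a loop rather than typed out one by one. A minimal sketch that only prints the ethtool invocations so they can be reviewed before running them as root; the function name print_offload_cmds is an assumption for illustration, and the interface names are whatever you pass in.

```shell
# print_offload_cmds: hypothetical helper.
# Prints one "ethtool -K" command per interface name given as an argument.
print_offload_cmds() {
    local iface
    for iface in "$@"; do
        echo "ethtool -K $iface tso off gso off gro off sg off"
    done
}

print_offload_cmds em1 em2 em3 em4
# To apply instead of print (requires root): print_offload_cmds em1 em2 | sh
```

Printing first keeps the step reversible: nothing changes on the NIC until the output is piped to a shell.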

Dzmitry Shykuts (boot0user) wrote :

I tried ethtool -K em1 tso off gso off lro off and it doesn't help.

The igb driver version is the same in the 46 and 48 kernels. It seems that something changed in the kernel, but not in the igb driver.

Kunzhou (likunzhou) wrote :

I have the same problem with Intel I210AT.

Dzmitry Shykuts (boot0user) wrote :

Personally, I prefer to install a new kernel by running "apt-get install linux-signed-generic-lts-vivid".

Tedesco (tedesco-z) wrote :

I have the same problem
FUJITSU Server PRIMERGY RX1330 M1
-cpu
          product: Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          size: 3100MHz
          capacity: 3100MHz
          width: 64 bits
-network
                description: Ethernet interface
                product: I210 Gigabit Network Connection
                vendor: Intel Corporation
                physical id: 0
                bus info: pci@0000:02:00.0
                logical name: em1
                version: 03
                serial: 90:1b:0e:10:34:96
                size: 100Mbit/s
                capacity: 1Gbit/s
                width: 32 bits
                clock: 33MHz
-network
                description: Ethernet interface
                product: I210 Gigabit Network Connection
                vendor: Intel Corporation
                physical id: 0
                bus info: pci@0000:03:00.0
                logical name: em2
                version: 03
                serial: 90:1b:0e:10:32:82
                size: 100Mbit/s
                capacity: 1Gbit/s
                width: 32 bits
                clock: 33MHz

Torsten Gollnick (tngk) wrote :

Same problem here with
     Dell Inc. PowerEdge R730/0H21J3, BIOS 1.2.10 03/09/2015
and
Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Renders the machine useless.

Kernel 3.16.0-43 is OK

B. (b-deactivatedaccount-deactivatedaccount) wrote :

@boot0user I agree with you. The best workaround for now is to update kernel to Vivid!

# Physical Server (with EFI):
sudo apt-get -y purge linux-{headers,image}-3.16.0-48-generic
sudo apt-get -y install linux-signed-generic-lts-vivid
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-signed-generic-lts-utopic
sudo apt-get -y purge linux-{image,headers}-generic-lts-utopic

# Physical Server (without EFI, but signed is also fine):
sudo apt-get -y purge linux-{headers,image}-3.16.0-48-generic
sudo apt-get -y install linux-generic-lts-vivid
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-generic-lts-utopic
sudo apt-get -y purge linux-{image,headers}-generic-lts-utopic

# Virtual Server:
sudo apt-get -y purge linux-{headers,image,image-extra}-3.16.0-48-generic
sudo apt-get -y install linux-virtual-lts-vivid
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-virtual-lts-utopic
sudo apt-get -y purge linux-{image,headers}-virtual-lts-utopic

# (optional)
# If you want to clean old kernels after the reboot (issue 1267059, 1089195) :
dpkg --get-selections | awk '/linux-(headers|image)-[0-9]\./ { print $1 }' \
| sort -r -V -t- -k3 | tail -n+4 \
| grep -v "$(uname -r | sed -e 's/-generic//')" \
| xargs -r apt-get -qq -y purge

Luis Henriques (henrix) wrote :

I believe the problem lies in a bad backport in a set of patches for hyper-v. I've uploaded a test kernel that simply reverts this hyper-v patchset. Here's the URL:

http://people.canonical.com/~henrix/lp1492146/v1/amd64/

Could anyone please see if this kernel solves the issue? Thanks!

Changed in linux-lts-utopic (Ubuntu Trusty):
status: New → Confirmed
assignee: nobody → Luis Henriques (henrix)
Changed in linux-lts-utopic (Ubuntu Trusty):
importance: Undecided → High
Changed in linux-lts-utopic (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
rozie (rozie) wrote :

3.16.0-48-generic #64~14.04.1~lp1492146v1 SMP Tue Sep 8 13:08:54 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux runs stable for ~1h.

Rudy (rudys) wrote :

sudo apt-get -y purge linux-{headers,image}-3.16.0-48-generic
sudo apt-get -y install linux-generic-lts-vivid

sudo reboot [0]

-------------

  linux-headers-3.16.0-46 linux-headers-3.16.0-46-generic
  linux-headers-3.16.0-48 linux-image-3.16.0-46-generic
Use 'apt-get autoremove' to remove them.

Luis Henriques (henrix)
Changed in linux-lts-utopic (Ubuntu Trusty):
status: Confirmed → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
rozie (rozie) wrote :

Tested 3.16.0-49-generic #65~14.04.1-Ubuntu SMP Wed Sep 9 10:03:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Looks stable for this issue: 523 packets transmitted, 523 received, 0% packet loss, time 526808ms
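
For scripted verification, the packet-loss figure can be pulled out of a ping summary line like the one above. A minimal sketch, assuming the summary format quoted in this comment; the function name ping_loss_percent is hypothetical.

```shell
# ping_loss_percent: hypothetical helper.
# Reads ping output on stdin and prints the number before "% packet loss".
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

echo "523 packets transmitted, 523 received, 0% packet loss, time 526808ms" \
    | ping_loss_percent   # prints 0
```

A monitoring loop could compare this value against a threshold after each kernel change, for example treating anything above 0 on a local link as a sign the hang is back.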

Luis Henriques (henrix) wrote :

As per comment #21, I'm tagging this bug as verified.

tags: added: verification-done-trusty
removed: verification-needed-trusty
Dzmitry Shykuts (boot0user) wrote :

Tested 3.16.0-49-generic #65~14.04.1-Ubuntu SMP Wed Sep 9 10:03:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux from trusty/proposed. Looks stable.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-lts-utopic - 3.16.0-49.65~14.04.1

---------------
linux-lts-utopic (3.16.0-49.65~14.04.1) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1493759

  [ Upstream Kernel Changes ]

  * Revert "hv_netvsc: Use the xmit_more skb flag to optimize signaling the
    host"
    - LP: #1492146
  * Revert "Drivers: hv: vmbus: Export the
    vmbus_sendpacket_pagebuffer_ctl()"
    - LP: #1492146
  * Revert "Drivers: hv: vmbus: Suport an API to send pagebuffers with
    additional control"
    - LP: #1492146
  * Revert "Drivers: hv: vmbus: Suport an API to send packet with
    additional control"
    - LP: #1492146
  * Revert "hv_netvsc: Fix a bug in netvsc_start_xmit()"
    - LP: #1492146
  * Revert "hv_netvsc: Implement partial copy into send buffer"
    - LP: #1492146
  * Revert "hv_netvsc: Fix the packet free when it is in skb headroom"
    - LP: #1492146
  * Revert "hv_netvsc: Eliminate memory allocation in the packet send path"
    - LP: #1492146
  * Revert "hv_netvsc: Cleanup the test for freeing skb when we use sendbuf
    mechanism"
    - LP: #1492146
  * Revert "hv_netvsc: Implement batching in send buffer"
    - LP: #1492146
  * Revert "hyperv: fix sparse warnings"
    - LP: #1492146
  * Revert "hyperv: Add support for vNIC hot removal"
    - LP: #1492146
  * Revert "hyperv: Increase the buffer length for netvsc_channel_cb()"
    - LP: #1492146
  * Revert "net: Remove ndo_xmit_flush netdev operation, use signalling
    instead."
    - LP: #1492146

 -- Luis Henriques <email address hidden> Wed, 09 Sep 2015 10:28:29 +0100

Changed in linux-lts-utopic (Ubuntu Trusty):
status: Fix Committed → Fix Released
Joseph Salisbury (jsalisbury) wrote :

The commits that caused this bug were introduced by the fixes for bug 1454892.

I've created a new test kernel for bug 1454892, but I would like to ensure it does not introduce this regression again. Could folks affected by this bug test my new test kernel? It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

Mark Sapiro (msapiro) wrote :

I have installed

linux-image-3.16.0-52-generic_3.16.0-52.71~14.04.1_amd64.deb and
linux-image-extra-3.16.0-52-generic_3.16.0-52.71~14.04.1_amd64.deb

from http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/ and rebooted and I do not see the issue reported in this bug. It appears at least for me that the above kernel does not have the regression.

Mark Sapiro (msapiro) wrote :

I spoke too soon. It took about 20 minutes for the issue to develop, but it has reappeared just as before with the new kernel.

Joseph Salisbury (jsalisbury) wrote :

Thanks for your help testing, Mark. I'll investigate further.

mrk (cvs-src) wrote :

Hello,

Any news on this one? We are also experiencing this problem on two servers, with kernel 3.19.0-33-generic #38~14.04.1-Ubuntu. Is there anything we can do to get this fixed ASAP? I'm open to running any tests. Thank you!

mrk (cvs-src) wrote :

The igb crash appears sporadically, about once every two days. I can't reproduce it reliably; I can only wait a couple of days until it breaks.
We are seeing this on Ubuntu 14.04 with Xen hypervisor 4.4.2-0ubuntu0.14.04.3 installed.

Joseph Salisbury (jsalisbury) wrote :

@Mark

Thanks for testing my last kernel and confirming the regression still exists.

I've created one more test kernel for bug 1454892. Could you and any other folks affected by this bug test my new test kernel? It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.

Thanks again!

Mark Sapiro (msapiro) wrote :

Sorry to report that I have the same issue after installing linux-image-3.16.0-55-generic_3.16.0-55.74~14.04.1_amd64.deb and linux-image-extra-3.16.0-55-generic_3.16.0-55.74~14.04.1_amd64.deb from http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

Joseph Salisbury (jsalisbury) wrote :

Thanks again for testing, Mark. I've created one more test kernel. This kernel makes no changes to the igb code at all, so if the bug does not exist with your current up-to-date kernel, it shouldn't occur with the test kernel.

Could you and any other folks affected by this bug test my new test kernel? It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.

Mark Sapiro (msapiro) wrote :

I have installed linux-image-3.16.0-56-generic_3.16.0-56.75~14.04.1_amd64.deb and linux-image-extra-3.16.0-56-generic_3.16.0-56.75~14.04.1_amd64.deb from http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

It's been running without issues for significantly longer than the versions with problems ever did. I will continue to monitor and will report again.

Mark Sapiro (msapiro) wrote :

My server has been running on this kernel (3.16.0-56-generic) for almost 24 hours now with no recurrence of the igb Tx Unit Hang.

I'm still monitoring, but it looks like this kernel is stable on my server.

Mark Sapiro (msapiro) wrote :

My server has now been running on this kernel (3.16.0-56-generic) for over 48 hours with no recurrence of the igb Tx Unit Hang.

I think we can say it's working for me.

penalvch (penalvch)
Changed in linux-lts-utopic (Ubuntu):
status: Confirmed → Fix Released
Roland Sommer (rsommer) wrote :

Hi, I'm encountering the same or a similar bug on xenial 4.4.0-28-generic. If I apply network load via iperf, I can reproducibly get the unit (Intel 210i) to hang. Maybe this is a regression, or another bug. The network interface does not recover; I have to reboot the machine to get it back online. dmesg output attached.

Manuel Hilbing (manuel-hilbing) wrote :

Hi rsommer,

Are you using the Asrock C2550D4I?

I am currently hunting the same problem on Ubuntu and on Debian.

Some related links: ...

https://sourceforge.net/p/e1000/bugs/424/
http://enira.net/?p=709
http://forums.tweaktown.com/asrock/56730-c2750d4i-stability-problems-2.html
http://comments.gmane.org/gmane.linux.drivers.e1000.devel/14111

It may be a hardware problem on this specific board, the Asrock C2550D4I.

Roland Sommer (rsommer) wrote :

I am using the C2550. I just tried the "disable Intel SpeedStep and C-state" hint, but within 60 seconds I got the Tx unit hang again.

Manuel Hilbing (manuel-hilbing) wrote :

You can try to compile a dkms igb driver.

My solution is to run the working 3.2 kernel on Debian wheezy.

I read that the kernel PCIe code was updated in newer kernels, and that the igb on the PLX 8608 bridge chip has problems.

You can try the following
pcie_aspm=off
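
pcie_aspm=off is a kernel boot parameter. On Ubuntu it is typically set by appending it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then running update-grub. A minimal sketch of that edit; the helper add_kernel_param is hypothetical, and the example runs against a scratch file with an assumed "quiet splash" default rather than the real configuration.

```shell
# add_kernel_param: hypothetical helper.
# Appends PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE unless it is already there.
add_kernel_param() {
    local file=$1 param=$2
    # Do nothing if the parameter is already on the default command line.
    grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file" && return 0
    # Insert the parameter just before the closing quote.
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file" > "$file.new" \
        && mv "$file.new" "$file"
}

scratch=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$scratch"
add_kernel_param "$scratch" pcie_aspm=off
cat "$scratch"   # GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
# On a real system: edit /etc/default/grub as root, run update-grub, then reboot.
```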

https://sourceforge.net/p/e1000/bugs/410/

Today I contacted Asrock (Rack) support and asked about the problem.

Roland Sommer (rsommer) wrote :

I just tried booting with pcie_aspm=off. It took 7 seconds until the freeze after starting iperf. The funny thing is that I'm using the igb driver on the other side of the test too, but on an I354 controller.

Roland Sommer (rsommer)
no longer affects: linux-lts-xenial (Ubuntu)
no longer affects: linux-lts-xenial (Ubuntu Trusty)
Roland Sommer (rsommer) wrote :

The "no longer affects" is not correct, but the assignment to the correct source package was wrong.

Manuel Hilbing (manuel-hilbing) wrote :

Answer from Asrock:
"I think that this is more of an issue with the kernel. I can ask the engineers to look into it, but as this OS is not on the tested list this does not warrant an RMA."

Roland Sommer (rsommer) wrote :

I got a replacement board and the error seems to be gone. I ran iperf for over an hour and no hang was detected.

Manuel Hilbing (manuel-hilbing) wrote :

@Roland Sommer
Can you check which board revision you got?
