Ubuntu
linux-lts-utopic package

igb Detected Tx Unit Hang

Bug #1492146 reported by Dzmitry Shykuts on 2015-09-04

This bug affects 17 people

Affects                      Status         Importance   Assigned to      Milestone
linux-lts-utopic (Ubuntu)    Fix Released   High         Unassigned
  Trusty                     Fix Released   High         Luis Henriques

Bug Description

Hello!

I have:

>lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

with kernel 3.16.0-46-generic.

Today I ran dist-upgrade and the kernel was upgraded to version 3.16.0-48-generic.

After the reboot I got this:

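Sep 4 09:02:52 mail kernel: [ 310.616324] igb 0000:02:00.0: Detected Tx Unit Hang
Sep 4 09:03:02 mail kernel: [ 321.608099]   Tx Queue <6>
Sep 4 09:03:02 mail kernel: [ 321.608099]   TDH <23>
Sep 4 09:03:02 mail kernel: [ 321.608099]   TDT <23>
Sep 4 09:03:02 mail kernel: [ 321.608099]   next_to_use <25>
Sep 4 09:03:02 mail kernel: [ 321.608099]   next_to_clean <23>
Sep 4 09:03:02 mail kernel: [ 321.608099] buffer_info[next_to_clean]
Sep 4 09:03:02 mail kernel: [ 321.608099]   time_stamp <1000012af>
Sep 4 09:03:02 mail kernel: [ 321.608099]   next_to_watch <ffff880272571240>
Sep 4 09:03:02 mail kernel: [ 321.608099]   jiffies <100001531>
Sep 4 09:03:02 mail kernel: [ 321.608099]   desc.status <120200>
igb 0000:02:00.0: Detected Tx Unit Hang
   Tx Queue <6>
   TDH <23>
   TDT <23>
   next_to_use <25>
   next_to_clean <23>
buffer_info[next_to_clean]
   time_stamp <1000012af>
   next_to_watch <ffff880272571240>
   jiffies <100001725>
   desc.status <120200>
igb 0000:02:00.0: Detected Tx Unit Hang
   Tx Queue <6>
   TDH <23>
   TDT <23>
   next_to_use <25>
   next_to_clean <23>
buffer_info[next_to_clean]
   time_stamp <1000012af>
   next_to_watch <ffff880272571240>
   jiffies <100001919>
   desc.status <120200>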

All network connections dropped after that, and the system remains unusable.
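
To see how often each NIC port reports the hang, the kernel log can be summarised per PCI address. This is a minimal sketch, not part of the original report: `count_tx_hangs` is a hypothetical helper name, and it assumes the message format shown in this report, where the PCI address follows the word `igb`.

```shell
#!/bin/sh
# count_tx_hangs (hypothetical helper): read kernel log text on stdin and
# print "<pci-address> <count>" for every device that logged a Tx Unit Hang.
count_tx_hangs() {
    grep 'Detected Tx Unit Hang' |
        # The PCI address (e.g. "0000:02:00.0:") is the field right after "igb".
        awk '{
            for (i = 1; i < NF; i++) {
                if ($i == "igb") {
                    addr = $(i + 1)
                    sub(/:$/, "", addr)   # drop the trailing colon
                    print addr
                    break
                }
            }
        }' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Typical use would be `dmesg | count_tx_hangs` or `count_tx_hangs < /var/log/kern.log`.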

My production mail server only works after booting the old linux-image-3.16.0-46-generic.

This is a critical bug for me. Can anybody help?

ethtool -i em1
driver: igb
version: 5.2.13-k
firmware-version: 1.61, 0x80000cd5, 1.949.0
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
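
When comparing kernels, it helps to record the driver version reported under each one. The sketch below pulls a single field out of `ethtool -i` output; `ethtool_field` is a hypothetical helper name, and it assumes the `key: value` layout shown above.

```shell
#!/bin/sh
# ethtool_field (hypothetical helper): print the value of one field from
# `ethtool -i` output supplied on stdin, e.g. "version" or "driver".
ethtool_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}
```

For example, running `ethtool -i em1 | ethtool_field version` after booting each kernel shows whether the igb version differs between them.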

Dzmitry Shykuts (boot0user) wrote on 2015-09-04:

kernel log file (629.8 KiB, text/plain)

Dzmitry Shykuts (boot0user) on 2015-09-04

description:	updated

Launchpad Janitor (janitor) wrote on 2015-09-04:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-lts-utopic (Ubuntu):
status:	New → Confirmed

gollum53 (smid) wrote on 2015-09-04:

I have the same problem with the same driver, distro, and kernel, and had to revert to the older kernel. My motherboard with the NICs is X9DRD-7JLN4F.

Mark Sapiro (msapiro) wrote on 2015-09-04:

Excerpts from kern.log, lshw, ifconfig (9.3 KiB, text/plain)

I have the same issue with similar kern.log entries after upgrading to kernel 3.16.0-48. Removing that and falling back to 3.16.0-46 fixed it for me.

wizhippo (wizhippo) wrote on 2015-09-05:

I have a very similar issue running in Hyper-V. Networking stops after a minute or two. Reverting to 3.16.0-46 fixes the issue.

B. (b-deactivatedaccount-deactivatedaccount) wrote on 2015-09-06:

This bug should be a top priority because people will suffer from it as soon as they reboot
their 14.04 LTS with an Intel Gigabit NIC and the "current" Utopic kernel (3.16.0-48-generic).

I had the same problem with an HP ProLiant DL380e Gen8, which has an Intel I350 Gigabit NIC
(Hewlett-Packard Company Ethernet 1Gb 4-port 366i Adapter).

It was hard to get a shell with 30-50% packet loss and the igb driver resetting ALL NICs...
"blind-typing" on the shell and waiting 1-2 minutes to get the output... of dmesg ;-)

I updated the kernel of 14.04 LTS from Utopic to Vivid and everything is working again.

Workaround is:
screen -S kernel
apt-get -y purge linux-{headers,image,image-extra}-3.16.0-48-generic
apt-get -y install linux-image-generic-lts-vivid linux-headers-generic-lts-vivid
reboot

Output of dmesg:
igb 0000:02:00.1 em2: igb: em2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
igb 0000:02:00.1: Detected Tx Unit Hang
   Tx Queue <3>
   TDH <0>
   TDT <0>
   next_to_use <4>
   next_to_clean <0>
buffer_info[next_to_clean]
   time_stamp <100039540>
   next_to_watch <ffff880230b0e030>
   jiffies <1000397b6>
   desc.status <0>
igb 0000:02:00.1 em2: Reset adapter
igb 0000:02:00.2 em3: Reset adapter
igb 0000:02:00.0 em1: Reset adapter
igb 0000:02:00.0 em1: igb: em1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
igb 0000:02:00.1 em2: igb: em2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
igb 0000:02:00.1: Detected Tx Unit Hang

B. (b-deactivatedaccount-deactivatedaccount) wrote on 2015-09-06:

Same bug as https://bugs.launchpad.net/ubuntu/+source/linux-lts-utopic/+bug/1488024

B. (b-deactivatedaccount-deactivatedaccount) wrote on 2015-09-06:

Maybe it's another kernel regression with Intel NICs and TSO,
so you can try this:

# tso => tcp-segmentation-offload
# gso => generic-segmentation-offload
# gro => generic-receive-offload
# sg => scatter-gather
# ufo => udp-fragmentation-offload (Cannot change)
# lro => large-receive-offload (Cannot change)
ethtool -K em1 tso off gso off gro off sg off
ethtool -K em2 tso off gso off gro off sg off
ethtool -K em3 tso off gso off gro off sg off
ethtool -K em4 tso off gso off gro off sg off
# ethtool -K eth0 tso off gso off gro off sg off
# ...

Add this to each iface in /etc/network/interfaces
pre-up /sbin/ethtool -K $IFACE tso off gso off gro off sg off || true
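
Rather than listing each interface by hand, the same commands can be generated for every interface bound to the igb driver. This is only a sketch: `igb_ifaces` and `offload_off_cmds` are hypothetical helper names, the driver lookup assumes the usual sysfs layout under /sys/class/net, and the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# igb_ifaces (hypothetical helper): list network interfaces whose sysfs
# driver symlink points at the igb driver.
igb_ifaces() {
    for dev in /sys/class/net/*; do
        [ -e "$dev/device/driver" ] || continue
        drv=$(basename "$(readlink "$dev/device/driver")")
        [ "$drv" = igb ] && basename "$dev"
    done
    return 0
}

# offload_off_cmds (hypothetical helper): print, without running it, the
# ethtool command from the comment above for each interface argument.
offload_off_cmds() {
    for iface in "$@"; do
        printf 'ethtool -K %s tso off gso off gro off sg off\n' "$iface"
    done
}
```

Running `offload_off_cmds $(igb_ifaces)` prints one command per igb port; piping that output to `sh` as root would apply them.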

Dzmitry Shykuts (boot0user) wrote on 2015-09-06:

I tried ethtool -K em1 tso off gso off lro off and it doesn't help.

The igb driver version is the same in the 46 and 48 kernels. It seems that something changed in the kernel, but not in the igb driver.

B. (b-deactivatedaccount-deactivatedaccount) wrote on 2015-09-06:

#10

If directly related to igb module maybe this
LP: #1465653
https://lists.ubuntu.com/archives/kernel-team/2015-June/058671.html
https://lists.ubuntu.com/archives/kernel-team/2015-June/058586.html
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1465653

If not directly related to the igb module, maybe something linked to
hv_netvsc (Microsoft Hyper-V Network Virtual Service Consumer)
LP: #1454892
http://lists.openwall.net/netdev/2015/09/01/18
http://marc.info/?l=linux-netdev&m=140900971718712&w=2
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1454892

Kunzhou (likunzhou) wrote on 2015-09-07:

#11

I have the same problem with Intel I210AT.

Dzmitry Shykuts (boot0user) wrote on 2015-09-07:

#12

Personally, I prefer to install the new kernel by running "apt-get install linux-signed-generic-lts-vivid".

Tedesco (tedesco-z) wrote on 2015-09-08:

#13

Syslog data only, from Ubuntu Server 14.04.3 LTS (GNU/Linux 3.16.0-48-generic x86_64) (197.2 KiB, text/plain)

I have the same problem
FUJITSU Server PRIMERGY RX1330 M1
-cpu
          product: Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          size: 3100MHz
          capacity: 3100MHz
          width: 64 bits
-network
                description: Ethernet interface
                product: I210 Gigabit Network Connection
                vendor: Intel Corporation
                physical id: 0
                bus info: pci@0000:02:00.0
                logical name: em1
                version: 03
                serial: 90:1b:0e:10:34:96
                size: 100Mbit/s
                capacity: 1Gbit/s
                width: 32 bits
                clock: 33MHz
-network
                description: Ethernet interface
                product: I210 Gigabit Network Connection
                vendor: Intel Corporation
                physical id: 0
                bus info: pci@0000:03:00.0
                logical name: em2
                version: 03
                serial: 90:1b:0e:10:32:82
                size: 100Mbit/s
                capacity: 1Gbit/s
                width: 32 bits
                clock: 33MHz

Torsten Gollnick (tngk) wrote on 2015-09-08:

#14

Same problem here with
Dell Inc. PowerEdge R730/0H21J3, BIOS 1.2.10 03/09/2015
and
Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Renders the machine useless.

Kernel 3.16.0-43 is OK

B. (b-deactivatedaccount-deactivatedaccount) wrote on 2015-09-08:

#15

@boot0user I agree with you. The best workaround for now is to update kernel to Vivid!

# Physical Server (with EFI):
sudo apt-get -y purge linux-{headers,image}-3.16.0-48-generic
sudo apt-get -y install linux-signed-generic-lts-vivid
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-signed-generic-lts-utopic
sudo apt-get -y purge linux-{image,headers}-generic-lts-utopic

# Physical Server (without EFI, but signed is also fine):
sudo apt-get -y purge linux-{headers,image}-3.16.0-48-generic
sudo apt-get -y install linux-generic-lts-vivid
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-generic-lts-utopic
sudo apt-get -y purge linux-{image,headers}-generic-lts-utopic

# Virtual Server:
sudo apt-get -y purge linux-{headers,image,image-extra}-3.16.0-48-generic
sudo apt-get -y install linux-virtual-lts-vivid
sudo reboot
uname -r # 3.19.0-26-generic
sudo apt-get -y purge linux-virtual-lts-utopic
sudo apt-get -y purge linux-{image,headers}-virtual-lts-utopic

# (optional)
# If you want to clean old kernels after the reboot (issue 1267059, 1089195) :
dpkg --get-selections | awk '/linux-(headers|image)-[0-9]\./ { print $1 }' \
| sort -r -V -t- -k3 | tail -n+4 \
| grep -v "$(uname -r | sed -e 's/-generic//')" \
| xargs -r apt-get -qq -y purge

Luis Henriques (henrix) wrote on 2015-09-08:

#16

I believe the problem lies in a bad backport in a set of patches for hyper-v. I've uploaded a test kernel that simply reverts this hyper-v patchset. Here's the URL:

http://people.canonical.com/~henrix/lp1492146/v1/amd64/

Could anyone please see if this kernel solves the issue? Thanks!

Changed in linux-lts-utopic (Ubuntu Trusty):
status:	New → Confirmed
assignee:	nobody → Luis Henriques (henrix)

Joseph Salisbury (jsalisbury) on 2015-09-08

Changed in linux-lts-utopic (Ubuntu Trusty):
importance:	Undecided → High
Changed in linux-lts-utopic (Ubuntu):
importance:	Undecided → High
tags:	added: kernel-key

rozie (rozie) wrote on 2015-09-08:

#18

3.16.0-48-generic #64~14.04.1~lp1492146v1 SMP Tue Sep 8 13:08:54 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux runs stable for ~1h.

Rudy (rudys) wrote on 2015-09-08:

#19

sudo apt-get -y purge linux-{headers,image}-3.16.0-48-generic
sudo apt-get -y install linux-generic-lts-vivid

sudo reboot [0]

-------------

linux-headers-3.16.0-46 linux-headers-3.16.0-46-generic
linux-headers-3.16.0-48 linux-image-3.16.0-46-generic
Use 'apt-get autoremove' to remove them.

Luis Henriques (henrix) on 2015-09-09

Changed in linux-lts-utopic (Ubuntu Trusty):
status:	Confirmed → Fix Committed

Luis Henriques (henrix) wrote on 2015-09-10:

#20

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:

added: verification-needed-trusty

rozie (rozie) wrote on 2015-09-10:

#21

Tested 3.16.0-49-generic #65~14.04.1-Ubuntu SMP Wed Sep 9 10:03:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Looks stable for this issue: 523 packets transmitted, 523 received, 0% packet loss, time 526808ms
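
A long-running ping like the one summarised above can be checked automatically by extracting the loss percentage from its summary line. This is a minimal sketch assuming the summary format quoted in this comment; `ping_loss_percent` is a hypothetical helper name.

```shell
#!/bin/sh
# ping_loss_percent (hypothetical helper): read ping output on stdin and
# print the number in front of "% packet loss" from the summary line.
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

For instance, `ping -c 500 <host> | ping_loss_percent` prints the loss figure, which a monitoring script could compare against 0 while a test kernel is running.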

Luis Henriques (henrix) wrote on 2015-09-10:

#22

As per comment #21, I'm tagging this bug as verified.

tags:

added: verification-done-trusty
removed: verification-needed-trusty

Dzmitry Shykuts (boot0user) wrote on 2015-09-10:

#23

Tested 3.16.0-49-generic #65~14.04.1-Ubuntu SMP Wed Sep 9 10:03:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux from trusty/proposed. Looks stable.

Launchpad Janitor (janitor) wrote on 2015-09-10:

#24

This bug was fixed in the package linux-lts-utopic - 3.16.0-49.65~14.04.1

---------------
linux-lts-utopic (3.16.0-49.65~14.04.1) trusty; urgency=low

[ Luis Henriques ]

* Release Tracking Bug
- LP: #1493759

[ Upstream Kernel Changes ]

  * Revert "hv_netvsc: Use the xmit_more skb flag to optimize signaling the
    host"
    - LP: #1492146
  * Revert "Drivers: hv: vmbus: Export the
    vmbus_sendpacket_pagebuffer_ctl()"
    - LP: #1492146
  * Revert "Drivers: hv: vmbus: Suport an API to send pagebuffers with
    additional control"
    - LP: #1492146
  * Revert "Drivers: hv: vmbus: Suport an API to send packet with
    additional control"
    - LP: #1492146
  * Revert "hv_netvsc: Fix a bug in netvsc_start_xmit()"
    - LP: #1492146
  * Revert "hv_netvsc: Implement partial copy into send buffer"
    - LP: #1492146
  * Revert "hv_netvsc: Fix the packet free when it is in skb headroom"
    - LP: #1492146
  * Revert "hv_netvsc: Eliminate memory allocation in the packet send path"
    - LP: #1492146
  * Revert "hv_netvsc: Cleanup the test for freeing skb when we use sendbuf
    mechanism"
    - LP: #1492146
  * Revert "hv_netvsc: Implement batching in send buffer"
    - LP: #1492146
  * Revert "hyperv: fix sparse warnings"
    - LP: #1492146
  * Revert "hyperv: Add support for vNIC hot removal"
    - LP: #1492146
  * Revert "hyperv: Increase the buffer length for netvsc_channel_cb()"
    - LP: #1492146
  * Revert "net: Remove ndo_xmit_flush netdev operation, use signalling
    instead."
    - LP: #1492146

-- Luis Henriques <email address hidden> Wed, 09 Sep 2015 10:28:29 +0100
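
As a rough way to tell whether a given 3.16.0 kernel is at least as new as the release that carries these reverts, its ABI number can be compared against 49, the release named in the changelog above. This is a sketch under assumptions: `kernel_abi` and `has_lp1492146_revert` are hypothetical names, and the release string is expected in the usual `uname -r` form such as 3.16.0-49-generic. Test builds with higher numbers, like those discussed later in this report, may still contain the offending patches, so treat the result as a hint only.

```shell
#!/bin/sh
# kernel_abi (hypothetical helper): print the ABI number from a release
# string such as "3.16.0-49-generic" (prints nothing if it does not parse).
kernel_abi() {
    printf '%s\n' "$1" | sed -n 's/^[0-9.]*-\([0-9][0-9]*\)-.*/\1/p'
}

# has_lp1492146_revert (hypothetical helper):
#   returns 0 if the release is a 3.16.0 kernel with ABI >= 49,
#   returns 1 if it is an older 3.16.0 kernel,
#   returns 2 if it is not a 3.16.0 kernel (this check says nothing then).
has_lp1492146_revert() {
    case "$1" in
        3.16.0-*) ;;
        *) return 2 ;;
    esac
    abi=$(kernel_abi "$1")
    [ -n "$abi" ] && [ "$abi" -ge 49 ]
}
```

On a running system this would be called as `has_lp1492146_revert "$(uname -r)"`.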

Changed in linux-lts-utopic (Ubuntu Trusty):
status:	Fix Committed → Fix Released

Joseph Salisbury (jsalisbury) wrote on 2015-11-03:

#26

The commits that caused this bug were introduced by the fixes for bug 1454892.

I've created a new test kernel for bug 1454892, but I would like to ensure it does not introduce this regression again. Could folks affected by this bug test my new test kernel? It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

Mark Sapiro (msapiro) wrote on 2015-11-03:

#27

I have installed

linux-image-3.16.0-52-generic_3.16.0-52.71~14.04.1_amd64.deb and
linux-image-extra-3.16.0-52-generic_3.16.0-52.71~14.04.1_amd64.deb

from http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/ and rebooted and I do not see the issue reported in this bug. It appears at least for me that the above kernel does not have the regression.

Mark Sapiro (msapiro) wrote on 2015-11-03:

#28

I spoke too soon. It took about 20 minutes for the issue to develop, but it has reappeared just as before with the new kernel.

Joseph Salisbury (jsalisbury) wrote on 2015-11-10:

#29

Thanks for your help testing, Mark. I'll investigate further.

mrk (cvs-src) wrote on 2015-11-21:

#30

Hello,

Any news on this one? We are also experiencing this problem on two servers with kernel 3.19.0-33-generic #38~14.04.1-Ubuntu. Is there anything we can do to get this fixed ASAP? I'm open to any tests. Thank you!

mrk (cvs-src) wrote on 2015-11-21:

#31

The igb crash appears sporadically, once every two days or so. I can't reliably reproduce it, only wait a couple of days until it breaks.
We are seeing this on Ubuntu 14.04 with Xen hypervisor 4.4.2-0ubuntu0.14.04.3 installed.

Joseph Salisbury (jsalisbury) wrote on 2015-12-09:

#32

@Mark

Thanks for testing my last kernel and confirming the regression still exists.

I've created one more test kernel for bug 1454892. Could you and any other folks affected by this bug test my new test kernel? It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.

Thanks again!

Mark Sapiro (msapiro) wrote on 2015-12-11:

#33

Sorry to report that I have the same issue after installing linux-image-3.16.0-55-generic_3.16.0-55.74~14.04.1_amd64.deb and linux-image-extra-3.16.0-55-generic_3.16.0-55.74~14.04.1_amd64.deb from http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

Joseph Salisbury (jsalisbury) wrote on 2015-12-17:

#34

Thanks again for testing, Mark. I've created one more test kernel. This kernel makes no changes to the igb code at all, so if the bug does not exist with your current up-to-date kernel, it shouldn't occur with the test kernel.

Could you and any other folks affected by this bug test my new test kernel? It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

Note, with this test kernel you would need to install both the linux-image and linux-image-extra .deb packages.

Mark Sapiro (msapiro) wrote on 2015-12-19:

#35

I have installed linux-image-3.16.0-56-generic_3.16.0-56.75~14.04.1_amd64.deb and linux-image-extra-3.16.0-56-generic_3.16.0-56.75~14.04.1_amd64.deb from http://kernel.ubuntu.com/~jsalisbury/lp1454892/lts-backport-utopic/

It's been running without issues for significantly longer than the versions with problems ever did. I will continue to monitor and will report again.

Mark Sapiro (msapiro) wrote on 2015-12-19:

#36

My server has been running on this kernel (3.16.0-56-generic) for almost 24 hours now with no recurrence of the igb Tx Unit Hang.

I'm still monitoring, but it looks like this kernel is stable on my server.

Mark Sapiro (msapiro) wrote on 2015-12-20:

#37

My server has now been running on this kernel (3.16.0-56-generic) for over 48 hours with no recurrence of the igb Tx Unit Hang.

I think we can say it's working for me.

penalvch (penalvch) on 2015-12-22

Changed in linux-lts-utopic (Ubuntu):
status:	Confirmed → Fix Released

Roland Sommer (rsommer) wrote on 2016-07-04:

#38

dmesg.log (6.7 KiB, text/plain)

Hi, I'm encountering the same or a similar bug on xenial 4.4.0-28-generic. If I apply network load via iperf, I can reproducibly get the unit (Intel 210i) to hang. Maybe this is a regression or another bug. The network interface does not recover; I have to reboot the machine to get it back online. dmesg output attached.

Manuel Hilbing (manuel-hilbing) wrote on 2016-07-04:

#39

Hi rsommer,

Are you using the ASRock C2550D4I?

I am currently hunting the same problem on Ubuntu and on Debian.

Some related links:

https://sourceforge.net/p/e1000/bugs/424/
http://enira.net/?p=709
http://forums.tweaktown.com/asrock/56730-c2750d4i-stability-problems-2.html
http://comments.gmane.org/gmane.linux.drivers.e1000.devel/14111

It may be a hardware problem specific to this board, the ASRock C2550D4I.

Roland Sommer (rsommer) wrote on 2016-07-04:

#40

I am using the C2550. I just tried the "disable Intel SpeedStep and C-state" hint, but within 60 seconds I got the Tx unit hang again.

Manuel Hilbing (manuel-hilbing) wrote on 2016-07-05:

#41

You can try to compile a dkms igb driver.

My solution is to run the working 3.2 kernel on Debian wheezy

I read that the kernel PCIe code was updated in newer kernels, and that the igb behind the PLX 8608 bridge chip has problems.

You can try the following boot parameter:
pcie_aspm=off
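
After rebooting, it is worth confirming that the parameter actually reached the kernel. A minimal sketch follows; `has_boot_param` is a hypothetical helper name. On Ubuntu the parameter is normally added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by `update-grub`.

```shell
#!/bin/sh
# has_boot_param (hypothetical helper): succeed if the kernel command line
# supplied on stdin contains exactly the given parameter, e.g. pcie_aspm=off.
has_boot_param() {
    # One parameter per line, then match the whole line as a fixed string.
    tr ' ' '\n' | grep -qxF -- "$1"
}
```

On a live system: `has_boot_param pcie_aspm=off < /proc/cmdline && echo "ASPM disabled at boot"`.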

https://sourceforge.net/p/e1000/bugs/410/

Today I contacted ASRock (Rack) support to ask about the problem.

Roland Sommer (rsommer) wrote on 2016-07-06:

#42

I just tried booting with pcie_aspm=off. It took 7 seconds until the freeze after starting iperf. The funny thing is that I'm using the igb driver on the other side of the test as well, but on an I354 controller.

Roland Sommer (rsommer) on 2016-07-06

no longer affects:	linux-lts-xenial (Ubuntu)
no longer affects:	linux-lts-xenial (Ubuntu Trusty)

Roland Sommer (rsommer) wrote on 2016-07-06:

#43

The "no longer affects" entry is misleading: the bug still exists, but it had been assigned to the wrong source package.

Manuel Hilbing (manuel-hilbing) wrote on 2016-07-07:

#44

Answer from asrock...
"I think that this is more of an issue with the kernel. I can ask the engineers to look into it, but as this OS is not on the tested list this does not warrant an RMA."

Roland Sommer (rsommer) wrote on 2016-07-22:

#45

I got a replacement board and the error seems to have gone. I did run iperf for over an hour and no hanging was detected.

Manuel Hilbing (manuel-hilbing) wrote on 2016-08-03:

#46

@Roland Sommer
Can you check which board revision you got?

Duplicates of this bug

Bug #1488024
