Intel igb driver infinite loop in ksoftirqd, uses 100% of cpu 0

Bug #1291113 reported by Aaron Stone on 2014-03-12
This bug affects 2 people
Affects                      Importance  Assigned to
linux (Ubuntu)               Medium      Unassigned
  Precise                    Undecided   Tim Gardner
  Quantal                    Undecided   Unassigned
  Trusty                     Medium      Unassigned
linux-lts-raring (Ubuntu)    Undecided   Unassigned
  Precise                    Undecided   Unassigned
  Quantal                    Undecided   Unassigned
  Trusty                     Undecided   Unassigned

Bug Description

This bug is present in 3.11.0-17 and 3.11.0-18:

https://lkml.org/lkml/2014/2/19/658

The afflicted network interface's lspci output:

02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
 Subsystem: Super Micro Computer Inc Device 1521
 Flags: bus master, fast devsel, latency 0, IRQ 27
 Memory at df920000 (32-bit, non-prefetchable) [size=128K]
 I/O ports at 8020 [size=32]
 Memory at df944000 (32-bit, non-prefetchable) [size=16K]
 Capabilities: [40] Power Management version 3
 Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
 Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
 Capabilities: [a0] Express Endpoint, MSI 00
 Capabilities: [100] Advanced Error Reporting
 Capabilities: [140] Device Serial Number 00-xx-xx-xx-xx-xx-xx-xx
 Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
 Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
 Capabilities: [1a0] Transaction Processing Hints
 Capabilities: [1c0] Latency Tolerance Reporting
 Capabilities: [1d0] Access Control Services
 Kernel driver in use: igb
 Kernel modules: igb

[I have masked out the MAC address]
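
To check which devices on a host are bound to the same driver, output in the style of the listing above can be scanned for the "Kernel driver in use" line. This is only a sketch: the function name `devices_using_driver` is made up for illustration, and it assumes the text has the layout shown above (one non-indented header line per device, indented detail lines below it).

```python
def devices_using_driver(lspci_text, driver="igb"):
    """Return the header lines of devices whose 'Kernel driver in use' matches.

    Assumes lspci-style verbose text: a non-indented header line per device,
    followed by indented detail lines.
    """
    matches = []
    current = None
    for line in lspci_text.splitlines():
        if not line.strip():
            continue  # blank lines separate device entries
        if not line[0].isspace():
            # A non-indented line starts a new device entry.
            current = line.strip()
        elif current and line.strip() == f"Kernel driver in use: {driver}":
            matches.append(current)
    return matches


sample = (
    "02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)\n"
    "\tFlags: bus master, fast devsel, latency 0, IRQ 27\n"
    "\tKernel driver in use: igb\n"
    "\tKernel modules: igb\n"
    "\n"
    "00:1f.2 SATA controller: Example device\n"  # made-up second entry for contrast
    "\tKernel driver in use: ahci\n"
)
print(devices_using_driver(sample))
```

On a live system the input text would come from running lspci in verbose mode; the second device in the sample is invented purely to show that non-matching entries are skipped.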

Aaron Stone (sodabrew) wrote :

With no other major CPU usage, but CPU 0 pegged, "perf top" shows this as the top usage:

50.02% [kernel] [k] tasklet_action
23.48% [kernel] [k] __do_softirq
 5.66% [kernel] [k] __raise_softirq_irqoff
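
For anyone comparing their own "perf top" output against the signature above, a rough check could look like the following sketch. The function names and the 50% cut-off are assumptions chosen for illustration, not anything from the kernel or from perf; the only thing taken from this report is that tasklet_action and __do_softirq together dominate the profile.

```python
import re

# Matches a "perf top" symbol line such as:
#   50.02% [kernel] [k] tasklet_action
# The overhead may use "." or "," as the decimal separator, depending on locale.
PERF_LINE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)%\s+(\S+)\s+\[(.)\]\s+(\S+)\s*$")

# Hypothetical cut-off: the report shows these two symbols at about 73% combined.
SOFTIRQ_SPIN_THRESHOLD = 50.0


def parse_perf_top(text):
    """Return {symbol: overhead_percent} for every line that parses."""
    overhead = {}
    for line in text.splitlines():
        match = PERF_LINE.match(line)
        if match:
            percent = float(match.group(1).replace(",", "."))
            overhead[match.group(4)] = percent
    return overhead


def looks_like_tasklet_spin(text, threshold=SOFTIRQ_SPIN_THRESHOLD):
    """True when tasklet_action and __do_softirq together exceed the threshold."""
    overhead = parse_perf_top(text)
    combined = overhead.get("tasklet_action", 0.0) + overhead.get("__do_softirq", 0.0)
    return combined > threshold


sample = """\
50.02% [kernel] [k] tasklet_action
23.48% [kernel] [k] __do_softirq
 5.66% [kernel] [k] __raise_softirq_irqoff
"""
print(looks_like_tasklet_spin(sample))
```

A profile where some other symbol dominates and these two are only a few percent would not match, which suggests a different problem than the one reported here.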

Aaron Stone (sodabrew) wrote :

I emailed Dan Williams, igb driver maintainer, and he pointed me at this commit. He said that, to his knowledge, it had not been picked up in the 3.11 tree, but was in other 3.1x trees:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/dma/ioat?id=da87ca4d4ca101f177fffd84f1f0a5e4c0343557

Aaron Stone (sodabrew) wrote :

Changing from package linux-lts-saucy to linux; this impacts the 3.11 kernel in general, not just the Precise build.

affects: linux-lts-saucy (Ubuntu) → linux (Ubuntu)

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1291113

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Aaron Stone (sodabrew) wrote :

This does not require logs; it is a known issue upstream. Changing to Confirmed.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Can you give the 3.11.10.6 kernel a test, to confirm it fixes this bug? It can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11.10.6-saucy/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: bot-stop-nagging kernel-bug-exists-upstream saucy
Changed in linux (Ubuntu):
status: Confirmed → Triaged
tags: added: kernel-da-key
Aaron Stone (sodabrew) wrote :

I am now seeing this bug with the Precise linux-3.2.0-60 package on the same hardware.

I will try to create a lab-conditions reproduction of the issue to test the 3.11.10.6 kernel, but so far the problem only manifests on my production workload and/or production hardware.

Aaron Stone (sodabrew) wrote :

Hey, so heads up that this absolutely does affect kernel 3.2.0-60.

Per the changelog here:
http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_3.2.0-60.91/changelog

This commit was added:
  * net_dma: mark broken
    - LP: #1281620

According to the commit message for "ioat: fix tasklet tear down" it is that net_dma commit which activates the bug. Please include the necessary fix in 3.2.0-61, or back out the triggering change as needed.

Aaron Stone (sodabrew) wrote :

(Hmm, you're simply tracking 3.2.55, that's how the net_dma commit got into the 3.2.0-60 build. Do we have to wait for 3.2.56 for a fix to be shipped in an Ubuntu build?)

Tim Gardner (timg-tpi) on 2014-03-15
Changed in linux (Ubuntu Trusty):
status: Triaged → Fix Released
Changed in linux (Ubuntu Precise):
status: New → In Progress
Changed in linux (Ubuntu Quantal):
status: New → In Progress
Changed in linux-lts-raring (Ubuntu Precise):
status: New → In Progress
Changed in linux-lts-raring (Ubuntu Quantal):
status: New → Invalid
Changed in linux-lts-raring (Ubuntu Trusty):
status: New → Invalid
Tim Gardner (timg-tpi) wrote :

sodabrew - please try the kernel at http://kernel.ubuntu.com/~rtg/3.2.0-61.92-ioat/

wget http://kernel.ubuntu.com/~rtg/3.2.0-61.92-ioat/linux-image-3.2.0-61-generic_3.2.0-61.92_amd64.deb
sudo dpkg -i linux-image-3.2.0-61-generic_3.2.0-61.92_amd64.deb

Changed in linux (Ubuntu Precise):
assignee: nobody → Tim Gardner (timg-tpi)
Aaron Stone (sodabrew) wrote :

Thank you! Testing the two kernels above:
3.2.0-61.92 does not exhibit the issue.
3.11.10.6 does not exhibit the issue.

Aaron Stone (sodabrew) wrote :

Bug 1300928 has the fixed linux kernel for Saucy.
Bug 1301505 has the fixed linux-lts-saucy backport for Precise.

Tim Gardner (timg-tpi) wrote :

Ubuntu-3.5.0-49.73

Changed in linux (Ubuntu Quantal):
status: In Progress → Fix Released
Tim Gardner (timg-tpi) wrote :

Ubuntu-3.2.0-61.92

Changed in linux (Ubuntu Precise):
status: In Progress → Fix Released
Tim Gardner (timg-tpi) wrote :

Ubuntu-lts-3.8.0-39.57

Changed in linux-lts-raring (Ubuntu Precise):
status: In Progress → Fix Released
Aaron Stone (sodabrew) wrote :

Tim, could you add an "Also affects" section for linux-lts-saucy? That fix has not been released.

Aaron Stone (sodabrew) wrote :

Precise kernel package linux-3.2.0-61 is not released yet: https://bugs.launchpad.net/kernel-sru-workflow/+bug/1300455

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-precise' to 'verification-done-precise'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-precise
tags: added: verification-needed-quantal
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-quantal' to 'verification-done-quantal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Aaron Stone (sodabrew) wrote :

Bug 1300455 has the fixed linux kernel for Precise (3.2.0-61)
Bug 1300928 has the fixed linux kernel for Saucy (3.11.0-20)
Bug 1301505 has the fixed linux-lts-saucy backport for Precise (3.11.0-20~precise)

None of the kernels above have been released, so this bug cannot be closed yet.

I don't understand why you've marked any of these as "Fix Released"; this is very much an open issue in the latest kernels for Precise and Saucy.

Aaron Stone (sodabrew) wrote :

linux-3.2.0-61 is NOT released, I don't understand why it's been marked Fix Released.

Brad Figg (brad-figg) wrote :

sodabrew, can you test the -precise version and tell me if the issue is fixed?

Changed in linux (Ubuntu Precise):
status: Fix Released → Fix Committed
Changed in linux (Ubuntu Quantal):
status: Fix Released → Fix Committed
Changed in linux-lts-raring (Ubuntu Precise):
status: Fix Released → Fix Committed
Aaron Stone (sodabrew) wrote :

Testing 3.11.0-20~precise1 and this issue appears to be resolved. Thank you!

Brad Figg (brad-figg) on 2014-04-21
tags: added: verification-done-precise verification-done-quantal
removed: verification-needed-precise verification-needed-quantal
andrew mcintyre (amcintyre) wrote :

I have this issue on an IBM x3650 M4 so am anxiously waiting for the new kernel.

Has happened twice since 4/4 when I upgraded to 3.2.0.60.

Thanks!!

Aaron Stone (sodabrew) wrote :

Andrew, have you tested the 3.2.0-61 kernel from the proposed channel? These types of bugfix releases can often be helped along by test result comments of the form: "I tried the proposed package and it resolves this problem for me. Hope it is published soon." (Or a comment saying that it does not resolve the problem, which is the most important thing to know about a proposed bugfix package!)

andrew mcintyre (amcintyre) wrote :

Thanks, but the problem is intermittent on my box (only 2 occurrences in 20 days), and I can't find a test case to recreate the issue, so I can't reproduce it on demand...

Since I prefer to stay with apt-get, I will wait for the normal release, which hasn't happened yet:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1300455

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.2.0-61.92

---------------
linux (3.2.0-61.92) precise; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1300455

  [ Upstream Kernel Changes ]

  * cifs: set MAY_SIGN when sec=krb5
    - LP: #1285723
  * veth: reduce stat overhead
    - LP: #1201869
  * veth: extend device features
    - LP: #1201869
  * veth: avoid a NULL deref in veth_stats_one
    - LP: #1201869
  * veth: fix a NULL deref in netif_carrier_off
    - LP: #1201869
  * veth: fix NULL dereference in veth_dellink()
    - LP: #1201869
  * ioat: fix tasklet tear down
    - LP: #1291113
 -- Kamal Mostafa <email address hidden> Mon, 31 Mar 2014 14:33:18 -0700

Changed in linux (Ubuntu Precise):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-lts-raring - 3.8.0-39.57~precise1

---------------
linux-lts-raring (3.8.0-39.57~precise1) precise; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1300956

  [ Tim Gardner ]

  * [Config] updateconfigs after Linux 3.8.13.19 stable update
  * [Config] CONFIG_ARPD=y
    - LP: #1295666

  [ Upstream Kernel Changes ]

  * kernel.h: define u8, s8, u32, etc. limits
    - LP: #1300343
  * kernel.h: undef clashing U64_MAX, U32_MAX size limits
    - LP: #1300343
  * ata: enable quirk from jmicron JMB350 for JMB394
    - LP: #1300343
  * sata_sil: apply MOD15WRITE quirk to TOSHIBA MK2561GSYN
    - LP: #1300343
  * cgroup: fix locking in cgroup_cfts_commit()
    - LP: #1300343
  * xfs: ensure correct timestamp updates from truncate
    - LP: #1300343
  * ARM: 7953/1: mm: ensure TLB invalidation is complete before enabling
    MMU
    - LP: #1300343
  * ARM: barrier: allow options to be passed to memory barrier instructions
    - LP: #1300343
  * ARM: 7955/1: spinlock: ensure we have a compiler barrier before sev
    - LP: #1300343
  * ASoC: da9055: Fix device registration of PMIC and CODEC devices
    - LP: #1300343
  * ARM: dma-mapping: fix GFP_ATOMIC macro usage
    - LP: #1300343
  * x86: dma-mapping: fix GFP_ATOMIC macro usage
    - LP: #1300343
  * SUNRPC: Fix races in xs_nospace()
    - LP: #1300343
  * drm/i915: Add intel_ring_cachline_align()
    - LP: #1300343
  * drm/i915: Prevent MI_DISPLAY_FLIP straddling two cachelines on IVB
    - LP: #1300343
  * can: kvaser_usb: check number of channels returned by HW
    - LP: #1300343
  * ext4: don't try to modify s_flags if the the file system is read-only
    - LP: #1300343
  * drm/vmwgfx: Fix possible integer overflow
    - LP: #1300343
  * drm/i915/dp: increase native aux defer retry timeout
    - LP: #1300343
  * drm/i915/dp: add native aux defer retry limit
    - LP: #1300343
  * rtlwifi: rtl8192ce: Fix too long disable of IRQs
    - LP: #1300343
  * rtlwifi: Fix incorrect return from rtl_ps_enable_nic()
    - LP: #1300343
  * rtl8187: fix regression on MIPS without coherent DMA
    - LP: #1300343
  * PCI: Enable INTx if BIOS left them disabled
    - LP: #1300343
  * cifs: ensure that uncached writes handle unmapped areas correctly
    - LP: #1300343
  * CIFS: Fix too big maxBuf size for SMB3 mounts
    - LP: #1300343
  * ext4: fix online resize with very large inode tables
    - LP: #1300343
  * ext4: fix online resize with a non-standard blocks per group setting
    - LP: #1300343
  * ext4: don't leave i_crtime.tv_sec uninitialized
    - LP: #1300343
  * ALSA: usb-audio: Add a quirk for Plantronics Gamecom 780
    - LP: #1300343
  * ALSA: usb-audio: work around KEF X300A firmware bug
    - LP: #1300343
  * avr32: fix missing module.h causing build failure in mimc200/fram.c
    - LP: #1300343
  * avr32: Makefile: add '-D__linux__' flag for gcc-4.4.7 use
    - LP: #1300343
  * ARM: 7957/1: add DSB after icache flush in __flush_icache_all()
    - LP: #1300343
  * ACPI / PCI: Fix memory leak in acpi_pci_irq_enable()
    - LP: #1300343
  * ahci: disable NCQ on Samsung pci-e SSDs on macbooks
    - LP: #1300343
  * usb: gadget: bcm63xx_udc...


Changed in linux-lts-raring (Ubuntu Precise):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.5.0-49.73

---------------
linux (3.5.0-49.73) quantal; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1300894

  [ Kamal Mostafa ]

  * [config] updateconfigs after Linux 3.5.7.32 stable update

  [ Upstream Kernel Changes ]

  * ata: enable quirk from jmicron JMB350 for JMB394
    - LP: #1295768
  * sata_sil: apply MOD15WRITE quirk to TOSHIBA MK2561GSYN
    - LP: #1295768
  * ARM: 7953/1: mm: ensure TLB invalidation is complete before enabling
    MMU
    - LP: #1295768
  * x86: dma-mapping: fix GFP_ATOMIC macro usage
    - LP: #1295768
  * SUNRPC: Fix races in xs_nospace()
    - LP: #1295768
  * ext4: don't try to modify s_flags if the the file system is read-only
    - LP: #1295768
  * drm/i915/dp: increase native aux defer retry timeout
    - LP: #1295768
  * drm/i915/dp: add native aux defer retry limit
    - LP: #1295768
  * rtlwifi: rtl8192ce: Fix too long disable of IRQs
    - LP: #1295768
  * rtlwifi: Fix incorrect return from rtl_ps_enable_nic()
    - LP: #1295768
  * rtl8187: fix regression on MIPS without coherent DMA
    - LP: #1295768
  * PCI: Enable INTx if BIOS left them disabled
    - LP: #1295768
  * cifs: ensure that uncached writes handle unmapped areas correctly
    - LP: #1295768
  * ext4: fix online resize with a non-standard blocks per group setting
    - LP: #1295768
  * ext4: don't leave i_crtime.tv_sec uninitialized
    - LP: #1295768
  * ALSA: usb-audio: work around KEF X300A firmware bug
    - LP: #1295768
  * avr32: fix missing module.h causing build failure in mimc200/fram.c
    - LP: #1295768
  * avr32: Makefile: add '-D__linux__' flag for gcc-4.4.7 use
    - LP: #1295768
  * ARM: 7957/1: add DSB after icache flush in __flush_icache_all()
    - LP: #1295768
  * ahci: disable NCQ on Samsung pci-e SSDs on macbooks
    - LP: #1295768
  * USB: serial: option: blacklist interface 4 for Cinterion PHS8 and PXS8
    - LP: #1295768
  * workqueue: ensure @task is valid across kthread_stop()
    - LP: #1295768
  * cgroup: update cgroup_enable_task_cg_lists() to grab siglock
    - LP: #1295768
  * hwmon: (max1668) Fix writing the minimum temperature
    - LP: #1295768
  * cpufreq: powernow-k8: Initialize per-cpu data-structures properly
    - LP: #1295768
  * ACPI / video: Filter the _BCL table for duplicate brightness values
    - LP: #1295768
  * perf tools: Remove extraneous newline when parsing hardware cache
    events
    - LP: #1295768
  * perf tools: Fix cache event name generation
    - LP: #1295768
  * net: fix 'ip rule' iif/oif device rename
    - LP: #1295768
  * tg3: Fix deadlock in tg3_change_mtu()
    - LP: #1295768
  * bonding: 802.3ad: make aggregator_identifier bond-private
    - LP: #1295768
  * usbnet: remove generic hard_header_len check
    - LP: #1295768
  * net: sctp: fix sctp_connectx abi for ia32 emulation/compat mode
    - LP: #1295768
  * net: add and use skb_gso_transport_seglen()
    - LP: #1295768
  * net: ip, ipv6: handle gso skbs in forwarding path
    - LP: #1295768
  * net: asix: handle packets crossing URB boundaries
    - LP: #1295768
  * net: asix: add missing flag to struct driver_info
    - LP: #1295768
  * fs/proc/p...


Changed in linux (Ubuntu Quantal):
status: Fix Committed → Fix Released
Aaron Stone (sodabrew) wrote :

Also Fix Released for linux-lts-saucy by https://bugs.launchpad.net/ubuntu/+source/linux-lts-saucy/+bug/1301505 - Thank you!

tafazzi87 (tafazzi-87) wrote :

How can I use this fix? I'm on Trusty with the 3.13 kernel and I have this bug. Is that possible?

Aaron Stone (sodabrew) wrote :

This bug was fixed in upstream Linux and backported to the following Ubuntu linux kernel package revisions:

3.2.0-61
3.5.0-49
3.8.0-39
3.11.0-20

The fix was picked up in the upstream 3.13 tree in March, 2014: http://www.spinics.net/lists/stable/msg37786.html

If you are running a version of the 3.13.0-xx ubuntu package that is from around that time, you should upgrade.
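
As a rough aid, the package revisions listed above can be turned into a small check against a `uname -r` style release string. This is a sketch under assumptions: the function name `has_ioat_fix` is invented here, the table contains only the four series named in this comment, and any other series (including 3.13) is reported as unknown rather than guessed.

```python
import re

# First Ubuntu ABI number carrying the fix, keyed by kernel series.
# Values come from the list in the comment above; other series are not covered.
FIXED_ABI = {
    (3, 2, 0): 61,
    (3, 5, 0): 49,
    (3, 8, 0): 39,
    (3, 11, 0): 20,
}

# Matches the leading "MAJOR.MINOR.PATCH-ABI" part of e.g. "3.2.0-60-generic".
RELEASE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)")


def has_ioat_fix(release):
    """Return True or False for a listed series, or None when the series is unknown.

    `release` is a string in the style of `uname -r`, e.g. "3.2.0-60-generic".
    """
    match = RELEASE.match(release)
    if not match:
        return None
    major, minor, patch, abi = (int(part) for part in match.groups())
    first_fixed = FIXED_ABI.get((major, minor, patch))
    if first_fixed is None:
        return None
    return abi >= first_fixed


print(has_ioat_fix("3.2.0-60-generic"))
print(has_ioat_fix("3.11.0-20-generic"))
```

To check the running kernel, the release string could be taken from `platform.release()` in the Python standard library. A result of None only means the series is not in the table, not that the kernel is affected or unaffected.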

tafazzi87 (tafazzi-87) wrote :

So with 3.13.0-32 this bug is fixed? I ask because I still have this bug...

Aaron Stone (sodabrew) wrote :

Could you describe your symptoms more specifically to make sure that you actually have _this_ bug?

Per my comment #1, do you see the same issue here (and can you post your output):

"""
With no other major CPU usage, but CPU 0 pegged, "perf top" shows this as the top usage:

50.02% [kernel] [k] tasklet_action
23.48% [kernel] [k] __do_softirq
 5.66% [kernel] [k] __raise_softirq_irqoff
"""

tafazzi87 (tafazzi-87) wrote :

Nope, I have different output:
87,12% [kernel] [k] RTMPHandleTxRing8DmaDoneInterrupt
2,52% [kernel] [k] tasklet_action
1,93% [kernel] [k] __do_softirq

So it's another bug, sorry.
