Precise kernel lockup with Intel Corporation 82579LM network interfaces

Bug #1425333 reported by Rafael David Tinoco on 2015-02-25
This bug affects 6 people
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Unassigned
Precise          Undecided    Unassigned

Bug Description

SRU Justification:

Impact: The e1000e driver can lock up on HP Z210 machines.
Fix: All the patches were taken from upstream.
Testcase: Use an HP Z210 machine and wait for the triggering conditions to occur.

####

Original Description:

The following lockup (and error message) was brought to my attention:

Feb 1 21:17:54 hostname kernel: [10461681.674619] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out

"""
Feb 1 21:17:54 hostname kernel: [10461681.674613] WARNING: at /build/buildd/linux-3.2.0/net/sched/sch_generic.c:255 dev_watchdog+0x25a/0x270()
Feb 1 21:17:54 hostname kernel: [10461681.674616] Hardware name: HP Z210 Workstation
Feb 1 21:17:54 hostname kernel: [10461681.674619] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Feb 1 21:17:54 hostname kernel: [10461681.674622] Modules linked in: nvidia(P) btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 openafs(P) bnep rfcomm bluetooth autofs4 parport_pc ppdev nfsd nfs binfmt_misc lockd fscache auth_rpcgss nfs_acl sunrpc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq joydev hp_wmi snd_timer snd_seq_device snd mei(C) psmouse soundcore sparse_keymap serio_raw mac_hid wmi snd_page_alloc lp parport dm_multipath ses enclosure hid_logitech_dj usb_storage usbhid hid e1000e [last unloaded: nvidia]
Feb 1 21:17:54 hostname kernel: [10461681.674683] Pid: 0, comm: swapper/0 Tainted: P C O 3.2.0-69-generic #103-Ubuntu
Feb 1 21:17:54 hostname kernel: [10461681.674684] Call Trace:
Feb 1 21:17:54 hostname kernel: [10461681.674685] <IRQ> [<ffffffff8106844f>] warn_slowpath_common+0x7f/0xc0
Feb 1 21:17:54 hostname kernel: [10461681.674692] [<ffffffff8104f634>] ? __enqueue_entity+0x74/0x80
Feb 1 21:17:54 hostname kernel: [10461681.674694] [<ffffffff81068546>] warn_slowpath_fmt+0x46/0x50
Feb 1 21:17:54 hostname kernel: [10461681.674696] [<ffffffff81025392>] ? x86_pmu_enable+0x1f2/0x270
Feb 1 21:17:54 hostname kernel: [10461681.674699] [<ffffffff815684fa>] dev_watchdog+0x25a/0x270
Feb 1 21:17:54 hostname kernel: [10461681.674701] [<ffffffff81113310>] ? perf_rotate_context+0x110/0x220
Feb 1 21:17:54 hostname kernel: [10461681.674703] [<ffffffff815682a0>] ? qdisc_reset+0x50/0x50
Feb 1 21:17:54 hostname kernel: [10461681.674705] [<ffffffff815682a0>] ? qdisc_reset+0x50/0x50
Feb 1 21:17:54 hostname kernel: [10461681.674708] [<ffffffff81077456>] call_timer_fn+0x46/0x160
Feb 1 21:17:54 hostname kernel: [10461681.674710] [<ffffffff815682a0>] ? qdisc_reset+0x50/0x50
Feb 1 21:17:54 hostname kernel: [10461681.674712] [<ffffffff81078da2>] run_timer_softirq+0x132/0x2a0
Feb 1 21:17:54 hostname kernel: [10461681.674715] [<ffffffff81096655>] ? ktime_get+0x65/0xe0
Feb 1 21:17:54 hostname kernel: [10461681.674717] [<ffffffff8106fce8>] __do_softirq+0xa8/0x210
Feb 1 21:17:54 hostname kernel: [10461681.674719] [<ffffffff8109d664>] ? tick_program_event+0x24/0x30
Feb 1 21:17:54 hostname kernel: [10461681.674723] [<ffffffff8166e4ac>] call_softirq+0x1c/0x30
Feb 1 21:17:54 hostname kernel: [10461681.674725] [<ffffffff81016495>] do_softirq+0x65/0xa0
Feb 1 21:17:54 hostname kernel: [10461681.674727] [<ffffffff810700ce>] irq_exit+0x8e/0xb0
Feb 1 21:17:54 hostname kernel: [10461681.674729] [<ffffffff8166ee5e>] smp_apic_timer_interrupt+0x6e/0x99
Feb 1 21:17:54 hostname kernel: [10461681.674731] [<ffffffff8166cd1e>] apic_timer_interrupt+0x6e/0x80
Feb 1 21:17:54 hostname kernel: [10461681.674732] <EOI> [<ffffffff8101d721>] ? __switch_to_xtra+0x171/0x180
Feb 1 21:17:54 hostname kernel: [10461681.674737] [<ffffffff813713ad>] ? intel_idle+0xed/0x150
Feb 1 21:17:54 hostname kernel: [10461681.674739] [<ffffffff8137138f>] ? intel_idle+0xcf/0x150
Feb 1 21:17:54 hostname kernel: [10461681.674742] [<ffffffff8150e001>] cpuidle_idle_call+0xc1/0x290
Feb 1 21:17:54 hostname kernel: [10461681.674744] [<ffffffff8101322a>] cpu_idle+0xca/0x120
Feb 1 21:17:54 hostname kernel: [10461681.674746] [<ffffffff8162a4fe>] rest_init+0x72/0x74
Feb 1 21:17:54 hostname kernel: [10461681.674750] [<ffffffff81cfcc06>] start_kernel+0x3b5/0x3c2
Feb 1 21:17:54 hostname kernel: [10461681.674752] [<ffffffff81cfc388>] x86_64_start_reservations+0x132/0x136
Feb 1 21:17:54 hostname kernel: [10461681.674754] [<ffffffff81cfc140>] ? early_idt_handlers+0x140/0x140
Feb 1 21:17:54 hostname kernel: [10461681.674756] [<ffffffff81cfc459>] x86_64_start_kernel+0xcd/0xdc
Feb 1 21:17:54 hostname kernel: [10461681.674757] ---[ end trace a84a3dbc98d19bff ]---
Feb 1 21:17:54 hostname kernel: [10461681.674762] e1000e 0000:00:19.0: eth0: Reset adapter
Feb 1 21:17:58 hostname kernel: [10461685.529232] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
"""

This happens with the e1000e module on HP Z210 systems.
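For anyone who wants to check their own logs for the same symptom, here is a minimal Python sketch that scans log lines for the watchdog message quoted above. The regular expression is derived only from the message format shown in this report, and the function name is hypothetical.

```python
import re

# Pattern derived from the NETDEV WATCHDOG line quoted in this report.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)

def find_watchdog_timeouts(lines):
    """Return (iface, driver, queue) for each NETDEV WATCHDOG message found."""
    hits = []
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            hits.append((m.group("iface"), m.group("driver"), int(m.group("queue"))))
    return hits

# Two lines copied from the log in this report.
sample = [
    "Feb 1 21:17:54 hostname kernel: [10461681.674616] Hardware name: HP Z210 Workstation",
    "Feb 1 21:17:54 hostname kernel: [10461681.674619] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out",
]
print(find_watchdog_timeouts(sample))
```

To use it on a real system, pass the lines of a syslog or kern.log file opened with `open()` in place of `sample`.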

Rafael David Tinoco (inaddy) wrote :

I believe this problem may be caused by a hardware problem already described in an Intel erratum:

DOCUMENT: http://goo.gl/fffMyi

23. Packet Loss on Intel® 82579 Gigabit Ethernet Controller

Problem: Systems with Intel 6 Series Chipset and Intel C200 Series Chipset using the Intel 82579 Gigabit Ethernet Controller may experience packet loss at 100 Mbps and 1 Gbps speeds when the link between the Intel 82579 Gigabit Ethernet Controller and the PCH Integrated LAN Controller is exiting the Low Power Link (K1) State.
Implication: Implications are application and Internet Protocol dependent.
Workaround: A BIOS code change has been identified and may be implemented as a workaround for this erratum.
Status: No Plan to Fix.

Worked around by the following kernel commit:

commit 77e61146c67765deae45faa7db088c64a9fbca00
Author: David Ertman <email address hidden>
Date: Tue Apr 22 05:25:53 2014 +0000

e1000e: Workaround for dropped packets in Gig/100 speeds on 82579

Rafael David Tinoco (inaddy) wrote :

I'm providing a hotfixed kernel backporting the following commits:

commit 4ddfadb6625675d2e105257b029621809eb61b62
Author: David Ertman <email address hidden>
Date: Tue Apr 22 05:25:53 2014 +0000

    e1000e: Workaround for dropped packets in Gig/100 speeds on 82579

    This is a workaround for a HW erratum on 82579 devices.
    Erratum is #23 in Intel 6 Series Chipset and Intel C200 Series Chipset
    specification Update June 2013.

    Problem: 82579 parts experience packet loss in Gig and 100 speeds
    when interconnect between PHY and MAC is exiting K1 power saving state.
    This was previously believed to only affect 1Gig speed, but has been observed
    at 100Mbs also.

    Workaround: Disable K1 for 82579 devices at Gig and 100 speeds.

    [Conflicts]
        - Did not include commit 1b41db3 with cosmetic changes

    Signed-off-by: Dave Ertman <email address hidden>
    Tested-by: Aaron Brown <email address hidden>
    Signed-off-by: Jeff Kirsher <email address hidden>
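
The rule stated in the commit message above can be written down as a tiny decision function. This is an illustrative Python model only, not the driver's code (the real logic is C inside e1000e); the constant and function names are hypothetical.

```python
# Illustrative model only; the real logic lives in the e1000e driver (C).
# Link speeds (Mb/s) at which the commit message says K1 is disabled on 82579.
K1_DISABLED_SPEEDS_MBPS = frozenset({100, 1000})

def k1_workaround_applies(link_speed_mbps):
    """Return True if the commit message says K1 is disabled at this speed."""
    return link_speed_mbps in K1_DISABLED_SPEEDS_MBPS

for speed in (10, 100, 1000):
    print(speed, k1_workaround_applies(speed))
```

The commit message says nothing about 10 Mb/s, so the model simply reports that the workaround is not stated for that speed.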

Rafael David Tinoco (inaddy) wrote :

commit e05871885af46af0e8e9e82c612ca3cdd501f5fc
Author: Bruce Allan <email address hidden>
Date: Tue Mar 20 03:47:47 2012 +0000

    e1000e: 82579 packet drop workaround

    In K1 mode (a MAC/PHY interconnect power mode), the 82579 device shuts down
    the Phase Lock Loop (PLL) of the interconnect to save power. When the PLL
    starts working, the 82579 device may start to transfer the packet through
    the interconnect before it is fully functional causing packet drops. This
    workaround disables shutting down the PLL in K1 mode for 1G link speed.

    Signed-off-by: Bruce Allan <email address hidden>
    Tested-by: Jeff Pieper <email address hidden>
    Signed-off-by: Jeff Kirsher <email address hidden>

Rafael David Tinoco (inaddy) wrote :

commit 46f82e6f8895dcdc44c806082b412f9de6211dd6
Author: Bruce Allan <email address hidden>
Date: Thu Apr 12 06:27:03 2012 +0000

    e1000e: issues in Sx on 82577/8/9

    A workaround was previously put in the driver to reset the device when
    transitioning to Sx in order to activate the changed settings of the PHY
    OEM bits (Low Power Link Up, or LPLU, and GbE disable configuration) for
    82577/8/9 devices. After further review, it was found such a reset can
    cause the 82579 to confuse which version of 82579 it actually is and broke
    LPLU on all 82577/8/9 devices. The workaround during an S0->Sx transition
    on 82579 (instead of resetting the PHY) is to restart auto-negotiation
    after the OEM bits are configured; the restart of auto-negotiation
    activates the new OEM bits as does the reset. With 82577/8, the reset is
    changed to a generic reset which fixes the LPLU bits getting set wrong.

    [Conflicts]
        Solved small conflict on removing hw->phy.ops.check_reset_block(hw)
        changed by a patch not backported.

    Signed-off-by: Bruce Allan <email address hidden>
    Tested-by: Aaron Brown <email address hidden>
    Signed-off-by: Jeff Kirsher <email address hidden>

Rafael David Tinoco (inaddy) wrote :

commit 9f82830d0b4f734e0bfb2ebf53b456eaa7f38728
Author: Bruce Allan <email address hidden>
Date: Fri Dec 16 00:46:33 2011 +0000

    e1000e: update workaround for 82579 intermittently disabled during S0->Sx

    The workaround which toggles the LANPHYPC (LAN PHY Power Control) value bit
    to force the MAC-Phy interconnect into PCIe mode from SMBus mode during
    driver load and resume should always be done except if PHY resets are
    blocked by the Manageability Engine (ME). Previously, the toggle was done
    only if PHY resets are blocked and the ME was disabled.

    The rest of the patch is just indentation changes as a consequence of the
    updated workaround.

    Signed-off-by: Bruce Allan <email address hidden>
    Tested-by: Aaron Brown <email address hidden>
    Signed-off-by: Jeff Kirsher <email address hidden>

Rafael David Tinoco (inaddy) wrote :

commit aa21a75313a52f18bbf3e86827133d7d73fe30ae
Author: Bruce Allan <email address hidden>
Date: Fri Dec 16 00:46:06 2011 +0000

    e1000e: 82579: workaround for link drop issue

    When connected to certain switches, the 82579 PHY might drop link
    unexpectedly. Work around the issue by setting the Mean Square Error
    higher than the hardware default.

    Signed-off-by: Bruce Allan <email address hidden>
    Tested-by: Aaron Brown <email address hidden>
    Signed-off-by: Jeff Kirsher <email address hidden>

Changed in linux (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → In Progress
Changed in linux (Ubuntu):
importance: Undecided → Medium
Rafael David Tinoco (inaddy) wrote :

PPA:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1425333

HOWTO:

# add-apt-repository ppa:inaddy/lp1425333
# apt-get update
# apt-get dist-upgrade
(make sure package linux - 3.2.0-77.112hf79117v20150224b1 is installed)
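
After rebooting, a quick way to confirm the hotfix kernel is actually running is to compare the kernel release and build strings. The sketch below is Python and assumes the build string contains the tag from the package version above (3.2.0-77.112hf79117v20150224b1); the function name is hypothetical.

```python
import platform

# Build tag taken from the hotfix package version 3.2.0-77.112hf79117v20150224b1.
HOTFIX_TAG = "112hf79117v20150224b1"

def is_hotfix_kernel(release, version, tag=HOTFIX_TAG):
    """Return True if the release/version strings look like the hotfix kernel."""
    return release.startswith("3.2.0-77") and tag in version

# platform.release() corresponds to `uname -r`, platform.version() to `uname -v`.
print(is_hotfix_kernel(platform.release(), platform.version()))
```

On any other kernel this prints False, which means the PPA kernel is either not installed or not the one that was booted.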

Waiting for user feedback.

Rafael David Tinoco (inaddy) wrote :

I got confirmation that these cherry-picks fix the issue.

I'm going to send the merge proposal to the kernel team.

Thank you

Rafael Tinoco

description: updated
Rafael David Tinoco (inaddy) wrote :

For all those interested,

Submitted fixes to the kernel team mailing list:

[Precise][PATCH 0/5] e1000e fixes for Intel 82579LM
[Precise][PATCH 1/5] e1000e: 82579: workaround for link drop issue
[Precise][PATCH 2/5] e1000e: update workaround for 82579 intermittently disabled during S0->Sx
[Precise][PATCH 3/5] e1000e: issues in Sx on 82577/8/9
[Precise][PATCH 4/5] e1000e: 82579 packet drop workaround
[Precise][PATCH 5/5] e1000e: Workaround for dropped packets in Gig/100 speeds on 82579

Waiting for SRU. If accepted, the fixes will be available in the next stable kernel package.

Brad Figg (brad-figg) on 2015-03-31
Changed in linux (Ubuntu Precise):
status: New → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-precise' to 'verification-done-precise'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-precise
Rafael David Tinoco (inaddy) wrote :

With all those fixes backported we still got a kernel lockup:

Apr 21 11:55:25 kernel: [992864.965769] Pid: 0, comm: swapper/0 Tainted: P C O 3.2.0-77-generic #112hf79117v20150224b1-Ubuntu
Apr 21 11:55:25 kernel: [992864.965772] Call Trace:
Apr 21 11:55:25 kernel: [992864.965773] <IRQ> [<ffffffff8106845f>] warn_slowpath_common+0x7f/0xc0
Apr 21 11:55:25 kernel: [992864.965786] [<ffffffff8104f624>] ? __enqueue_entity+0x74/0x80
Apr 21 11:55:25 kernel: [992864.965790] [<ffffffff81068556>] warn_slowpath_fmt+0x46/0x50
Apr 21 11:55:25 kernel: [992864.965794] [<ffffffff81025712>] ? x86_pmu_enable+0x1f2/0x270
Apr 21 11:55:25 kernel: [992864.965800] [<ffffffff81568e0a>] dev_watchdog+0x25a/0x270
Apr 21 11:55:25 kernel: [992864.965805] [<ffffffff811133f0>] ? perf_rotate_context+0x110/0x220
Apr 21 11:55:25 kernel: [992864.965809] [<ffffffff81568bb0>] ? qdisc_reset+0x50/0x50
Apr 21 11:55:25 kernel: [992864.965813] [<ffffffff81568bb0>] ? qdisc_reset+0x50/0x50
Apr 21 11:55:25 kernel: [992864.965819] [<ffffffff81077466>] call_timer_fn+0x46/0x160
Apr 21 11:55:25 kernel: [992864.965823] [<ffffffff81568bb0>] ? qdisc_reset+0x50/0x50
Apr 21 11:55:25 kernel: [992864.965827] [<ffffffff81078db2>] run_timer_softirq+0x132/0x2a0
Apr 21 11:55:25 kernel: [992864.965833] [<ffffffff81096675>] ? ktime_get+0x65/0xe0
Apr 21 11:55:25 kernel: [992864.965836] [<ffffffff8106fcf8>] __do_softirq+0xa8/0x210
Apr 21 11:55:25 kernel: [992864.965841] [<ffffffff8109d714>] ? tick_program_event+0x24/0x30
Apr 21 11:55:25 kernel: [992864.965845] [<ffffffff8166f06c>] call_softirq+0x1c/0x30
Apr 21 11:55:25 kernel: [992864.965851] [<ffffffff810164e5>] do_softirq+0x65/0xa0
Apr 21 11:55:25 kernel: [992864.965854] [<ffffffff810700de>] irq_exit+0x8e/0xb0
Apr 21 11:55:25 kernel: [992864.965858] [<ffffffff8166fa1e>] smp_apic_timer_interrupt+0x6e/0x99
Apr 21 11:55:25 kernel: [992864.965863] [<ffffffff8166d8de>] apic_timer_interrupt+0x6e/0x80
Apr 21 11:55:25 kernel: [992864.965865] <EOI> [<ffffffff8109c184>] ? clockevents_program_event+0x74/0x100
Apr 21 11:55:25 kernel: [992864.965873] [<ffffffff81371a5d>] ? intel_idle+0xed/0x150
Apr 21 11:55:25 kernel: [992864.965877] [<ffffffff81371a3f>] ? intel_idle+0xcf/0x150
Apr 21 11:55:25 kernel: [992864.965883] [<ffffffff8150e921>] cpuidle_idle_call+0xc1/0x290
Apr 21 11:55:25 kernel: [992864.965887] [<ffffffff8101322a>] cpu_idle+0xca/0x120
Apr 21 11:55:25 kernel: [992864.965892] [<ffffffff8162af5e>] rest_init+0x72/0x74
Apr 21 11:55:25 kernel: [992864.965897] [<ffffffff81cfbc0b>] start_kernel+0x3ba/0x3c7
Apr 21 11:55:25 kernel: [992864.965902] [<ffffffff81cfb388>] x86_64_start_reservations+0x132/0x136
Apr 21 11:55:25 kernel: [992864.965907] [<ffffffff81cfb140>] ? early_idt_handlers+0x140/0x140
Apr 21 11:55:25 kernel: [992864.965911] [<ffffffff81cfb459>] x86_64_start_kernel+0xcd/0xdc
Apr 21 11:55:25 kernel: [992864.965914] ---[ end trace f20333a4474e470d ]---
Apr 21 11:55:25 kernel: [992864.965930] e1000e 0000:00:19.0: eth0: Reset adapter
Apr 21 11:55:29 kernel: [992868.720990] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

It looks like the network interface was restored...
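
To put a number on how long the interface was down after the watchdog fired, the kernel timestamps in the log can be subtracted. A minimal Python sketch, assuming the bracketed value is the kernel uptime in seconds and that the two message strings match those in the log above (the function name is hypothetical):

```python
import re

# Matches the bracketed kernel timestamp, e.g. "[992864.965930]".
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]")

def reset_to_link_up_seconds(lines):
    """Return seconds between 'Reset adapter' and the next 'NIC Link is Up'."""
    reset_ts = None
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue
        ts = float(m.group(1))
        if "Reset adapter" in line:
            reset_ts = ts
        elif "NIC Link is Up" in line and reset_ts is not None:
            return ts - reset_ts
    return None  # no reset followed by a link-up was found

# The last two lines of the log above.
sample = [
    "Apr 21 11:55:25 kernel: [992864.965930] e1000e 0000:00:19.0: eth0: Reset adapter",
    "Apr 21 11:55:29 kernel: [992868.720990] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx",
]
print(round(reset_to_link_up_seconds(sample), 3))  # about 3.755 seconds
```

For the log above this gives roughly 3.8 seconds between the adapter reset and the link coming back.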

tags: added: cts verification-failed-precise
removed: verification-needed-precise
Luis Henriques (henrix) on 2015-04-27
tags: added: verification-reverted-precise
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.2.0-82.119

---------------
linux (3.2.0-82.119) precise; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1449034

  [ Upstream Kernel Changes ]

  * Revert "e1000e: Workaround for dropped packets in Gig/100 speeds on
    82579"
  * Revert "e1000e: 82579 packet drop workaround"
  * Revert "e1000e: issues in Sx on 82577/8/9"
  * Revert "e1000e: update workaround for 82579 intermittently disabled
    during S0->Sx"
  * Revert "e1000e: 82579: workaround for link drop issue"

 -- Luis Henriques <email address hidden> Mon, 27 Apr 2015 14:12:20 +0100

Changed in linux (Ubuntu Precise):
status: Fix Committed → Fix Released
Rafael David Tinoco (inaddy) wrote :

Small clarification: this bug is still open and being worked on. The fixes (cherry-picked and proposed for SRU) did not fix the original issue, so they were reverted (the default policy for SRUs). The investigation continues; if you have this NIC and have suffered from the same issue, please provide comments...

Changed in linux (Ubuntu Precise):
status: Fix Released → In Progress
Andy Whitcroft (apw) on 2015-05-21
Changed in linux (Ubuntu Precise):
status: In Progress → Fix Committed
tags: removed: verification-failed-precise
Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
status: In Progress → Invalid
status: Invalid → Incomplete
Changed in linux (Ubuntu Precise):
status: Fix Committed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Precise) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Precise):
status: Incomplete → Expired