Mellanox driver crash fix in legacy EQ mode

Bug #1354242 reported by Ming Lei on 2014-08-08
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   Undecided    Unassigned
Trusty           Fix Released   Undecided    Ming Lei

Bug Description

1. Reproduction steps:

- ethtool eth0 rx 128
- iperf -s
- on the iperf client side, run:
    iperf -c IP_SRV -P128 -t 120
- the mellanox driver then crashes:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x2a8/0x2b4()
NETDEV WATCHDOG: eth0 (mlx4_core): transmit queue 4 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.11.4-mustang_sw_1.12.10-beta+ #30
Call trace:
[<ffffffc0000882f0>] dump_backtrace+0x0/0x12c
[<ffffffc00008842c>] show_stack+0x10/0x1c
[<ffffffc0006bfd6c>] dump_stack+0x74/0x94
[<ffffffc0000956b8>] warn_slowpath_common+0x84/0xac
[<ffffffc00009572c>] warn_slowpath_fmt+0x4c/0x58
[<ffffffc0005c1dd4>] dev_watchdog+0x2a4/0x2b4
[<ffffffc0000a08f0>] call_timer_fn.isra.33+0x24/0x80
[<ffffffc0000a0ad0>] run_timer_softirq+0x184/0x1f4
[<ffffffc000099c5c>] __do_softirq+0x198/0x208
[<ffffffc000099fc0>] irq_exit+0x8c/0xc0
[<ffffffc0000850ac>] handle_IRQ+0x5c/0xc8
[<ffffffc00008128c>] gic_handle_irq+0x38/0x7c
Exception stack(0xffffffc0009cbdf0 to 0xffffffc0009cbf10)
bde0: 009c8000 ffffffc0 00a31300 ffffffc0
be00: 009cbf30 ffffffc0 000855e0 ffffffc0 0037356e 00000000 00000000 00000000
be20: fff87954 ffffffc7 00010000 00000000 a455f900 ffffffc7 00000000 00000000
be40: 00a35000 ffffffc0 005e1080 ffffffc0 009d4c90 ffffffc0 009cbd40 ffffffc0
be60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
be80: 00000000 00000000 aadd05a0 0000007f 0058f434 ffffffc0 aade3c80 0000007f
bea0: aabe8660 0000007f 009c8000 ffffffc0 00a31300 ffffffc0 00a278d8 ffffffc0
bec0: 009c8000 ffffffc0 00a27022 ffffffc0 00000001 00000000 008cbd20 ffffffc0
bee0: 006c9d70 ffffffc0 00080408 ffffffc0 00080200 00000040 009cbf30 ffffffc0
bf00: 000855dc ffffffc0 009cbf30 ffffffc0
[<ffffffc0000845a8>] el1_irq+0x68/0xc0
[<ffffffc0000d1cbc>] cpu_startup_entry+0x100/0x14c
[<ffffffc0006bb07c>] rest_init+0x6c/0x78
[<ffffffc0009937d4>] start_kernel+0x340/0x358
---[ end trace a66d2f499386c240 ]---
INFO: rcu_sched detected stalls on CPUs/tasks: { 4 5} (detected by 7, t=6002 jiffies, g=1924, c=1923, q=1316)
Task dump for CPU 4:
irqbalance R running task 0 999 1 0x00000000
Call trace:
[<ffffffc0000859b8>] __switch_to+0x74/0x8c
[<ffffffc000091dac>] do_page_fault+0x224/0x378
[<ffffffc000081100>] do_mem_abort+0x38/0x9c
Exception stack(0xffffffc7c3ecbe30 to 0xffffffc7c3ecbf50)
be20: c3ecbe80 ffffffc7 00163f4c ffffffc0
be40: c69e2f00 ffffffc7 a133f000 0000007f 00000000 00000000 a147db7c 0000007f
be60: 60000000 00000000 00000015 00000000 ffffffff ffffffff a148625c 0000007f
be80: fb6e3b70 0000007f 000849ec ffffffc0 00000008 00000000 a133f000 0000007f
bea0: ffffffff ffffffff 3b215828 00000000 53e18915 00000000 00000008 00000000
bec0: fb6e3ae0 0000007f 00000000 00000000 00000003 00000000 a133f000 0000007f
bee0: 00000008 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf00: 00000000 00000000 2f2f2f2f 302f2f2f 00000040 00000000 2e312dff 5e6f6c72
bf20: 7f7f7f7f 7f7f7f7f 01010101 01010101 00000000 00000000 00000000 00000000
bf40: 00000020 00000000 a15075a0 0000007f
Task dump for CPU 5:
swapper/5 R running task 0 0 1 0x00010002
Call trace:
[<ffffffc0000859b8>] __switch_to+0x74/0x8c
INFO: rcu_sched detected stalls on CPUs/tasks: { 4 5} (detected by 7, t=24007 jiffies, g=1924, c=1923, q=1404)
Task dump for CPU 4:
irqbalance R running task 0 999 1 0x00000000
Call trace:
[<ffffffc0000859b8>] __switch_to+0x74/0x8c
[<ffffffc000091dac>] do_page_fault+0x224/0x378
[<ffffffc000081100>] do_mem_abort+0x38/0x9c
Exception stack(0xffffffc7c3ecbe30 to 0xffffffc7c3ecbf50)
be20: c3ecbe80 ffffffc7 00163f4c ffffffc0
be40: c69e2f00 ffffffc7 a133f000 0000007f 00000000 00000000 a147db7c 0000007f
be60: 60000000 00000000 00000015 00000000 ffffffff ffffffff a148625c 0000007f
be80: fb6e3b70 0000007f 000849ec ffffffc0 00000008 00000000 a133f000 0000007f
bea0: ffffffff ffffffff 3b215828 00000000 53e18915 00000000 00000008 00000000
bec0: fb6e3ae0 0000007f 00000000 00000000 00000003 00000000 a133f000 0000007f
bee0: 00000008 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf00: 00000000 00000000 2f2f2f2f 302f2f2f 00000040 00000000 2e312dff 5e6f6c72
bf20: 7f7f7f7f 7f7f7f7f 01010101 01010101 00000000 00000000 00000000 00000000
bf40: 00000020 00000000 a15075a0 0000007f
Task dump for CPU 5:
swapper/5 R running task 0 0 1 0x00010002
Call trace:
[<ffffffc0000859b8>] __switch_to+0x74/0x8c
INFO: rcu_sched detected stalls on CPUs/tasks: { 4 5} (detected by 7, t=42012 jiffies, g=1924, c=1923, q=1553)
Task dump for CPU 4:
irqbalance R running task 0 999 1 0x00000002
Call trace:
[<ffffffc0000859b8>] __switch_to+0x74/0x8c
[<ffffffc000091dac>] do_page_fault+0x224/0x378
[<ffffffc000081100>] do_mem_abort+0x38/0x9c
Exception stack(0xffffffc7c3ecbe30 to 0xffffffc7c3ecbf50)
be20: c3ecbe80 ffffffc7 00163f4c ffffffc0
be40: c69e2f00 ffffffc7 a133f000 0000007f 00000000 00000000 a147db7c 0000007f
be60: 60000000 00000000 00000015 00000000 ffffffff ffffffff a148625c 0000007f
be80: fb6e3b70 0000007f 000849ec ffffffc0 00000008 00000000 a133f000 0000007f
bea0: ffffffff ffffffff 3b215828 00000000 53e18915 00000000 00000008 00000000
bec0: fb6e3ae0 0000007f 00000000 00000000 00000003 00000000 a133f000 0000007f
bee0: 00000008 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf00: 00000000 00000000 2f2f2f2f 302f2f2f 00000040 00000000 2e312dff 5e6f6c72
bf20: 7f7f7f7f 7f7f7f7f 01010101 01010101 00000000 00000000 00000000 00000000
bf40: 00000020 00000000 a15075a0 0000007f
Task dump for CPU 5:
swapper/5 R running task 0 0 1 0x00010002
Call trace:
[<ffffffc0000859b8>] __switch_to+0x74/0x8c
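
The two symptoms in the log above (the TX watchdog timeout and the repeated RCU stall reports) can be picked out of a longer dmesg capture with a small script. The following is only a sketch: the function name `summarize` and the regular expressions are assumptions written against the exact message wording shown in this report, and other kernel versions may word these messages differently.

```python
import re

# Patterns modelled on the messages quoted in this report (assumption:
# the wording matches the kernel that produced the log being scanned).
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)
STALL_RE = re.compile(
    r"rcu_sched detected stalls on CPUs/tasks: \{(?P<cpus>[^}]*)\} "
    r"\(detected by (?P<by>\d+), t=(?P<jiffies>\d+) jiffies"
)


def summarize(log_text):
    """Return (watchdog_timeouts, rcu_stalls) found in a kernel log.

    watchdog_timeouts: list of (interface, driver, queue) tuples.
    rcu_stalls: list of (stalled_cpus, detecting_cpu, jiffies) tuples.
    """
    watchdogs = []
    stalls = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            watchdogs.append(
                (match["iface"], match["driver"], int(match["queue"]))
            )
            continue
        match = STALL_RE.search(line)
        if match:
            cpus = [int(cpu) for cpu in match["cpus"].split()]
            stalls.append((cpus, int(match["by"]), int(match["jiffies"])))
    return watchdogs, stalls
```

Fed the log in this report, such a script would be expected to list one watchdog timeout on eth0 and three stall reports for CPUs 4 and 5 with a growing jiffies count, which shows that the stall persists rather than recovering.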

Ming Lei (tom-leiming) wrote :

With upstream commit 92df54ee3dde385 (net/mlx4_en: Don't use irq_affinity_notifier to track changes in IRQ affinity map), the issue is fixed.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1354242

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Chris J Arges (arges) on 2014-08-08
Changed in linux (Ubuntu Trusty):
assignee: nobody → Ming Lei (tom-leiming)
Changed in linux (Ubuntu):
status: Incomplete → Fix Released
tags: added: bot-stop-nagging trusty
Changed in linux (Ubuntu Trusty):
status: New → In Progress
Tim Gardner (timg-tpi) on 2014-08-11
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Ming Lei (tom-leiming) on 2014-08-20
tags: added: verification-done-trusty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-35.62

---------------
linux (3.13.0-35.62) trusty; urgency=low

  [ Joseph Salisbury ]

  * Release Tracking Bug
    - LP: #1357148

  [ Brad Figg ]

  * Start new release

  [ dann frazier ]

  * SAUCE: (no-up) Fix build failure on arm64
    - LP: #1353657
  * [debian] Allow for package revisions condusive for branching

  [ David Henningsson ]

  * SAUCE: Call broadwell specific functions from the hda driver
    - LP: #1317865

  [ Edward Lin ]

  * SAUCE: (no-up) Add use native backlight quirk for Dell Inspiron
    5547/5447
    - LP: #1332437

  [ Imre Deak ]

  * SAUCE: drm/i915: move power domain init earlier during system resume
    - LP: #1353405

  [ Jani Nikula ]

  * SAUCE: drm/i915: use lane count and link rate from VBT as minimums for
    eDP
    - LP: #1338582
  * SAUCE: drm/i915/dp: force eDP lane count to max available lanes on BDW
    - LP: #1338582
  * SAUCE: drm/i915: provide interface for audio driver to query cdclk
    - LP: #1188091
  * SAUCE: drm/i915: demote opregion excessive timeout WARN_ONCE to
    DRM_INFO_ONCE
    - LP: #1351014

  [ Joseph Salisbury ]

  * [Config] updateconfigs after Linux 3.13.11.6 updates

  [ Luis Henriques ]

  * Revert "[Packaging] linux-udeb-flavour -- standardise on linux prefix"

  [ Ming Lei ]

  * Revert "SAUCE: (no-up) ata: Fix the dma state machine lockup for the
    IDENTIFY DEVICE PIO mode command."
    - LP: #1335645

  [ Paulo Zanoni ]

  * SAUCE: drm/i915: consider the source max DP lane count too
    - LP: #1338582

  [ Tim Gardner ]

  * [Config] CONFIG_GPIO_SYSFS=y
    - LP: #1342153
  * [Config] CONFIG_KEYS_DEBUG_PROC_KEYS=y
    - LP: #1344405
  * [Config] updateconfigs
  * [Config] CONFIG_SCSI_IPR_TRACE=y, CONFIG_SCSI_IPR_DUMP=y
    - LP: #1343109
  * [Config] CONFIG_CONTEXT_TRACKING_FORCE=n
    - LP: #1349028

  [ Timo Aaltonen ]

  * SAUCE: Fix a typo in hda i915_bdw support.
    - LP: #1343140

  [ Upstream Kernel Changes ]

  * Revert "net/mlx4_en: Fix bad use of dev_id"
    - LP: #1347012
  * Revert "ACPI / AC: Remove AC's proc directory."
    - LP: #1356913
  * Revert "mac80211: move "bufferable MMPDU" check to fix AP mode scan"
    - LP: #1356913
  * mm, pcp: allow restoring percpu_pagelist_fraction default
    - LP: #1347088
  * net: Fix permission check in netlink_connect()
    - LP: #1312989
  * netlink: Rename netlink_capable netlink_allowed
    - LP: #1312989
  * net: Move the permission check in sock_diag_put_filterinfo to
    packet_diag_dump
    - LP: #1312989
  * net: Add variants of capable for use on on sockets
    - LP: #1312989
  * net: Add variants of capable for use on netlink messages
    - LP: #1312989
  * net: Use netlink_ns_capable to verify the permisions of netlink
    messages
    - LP: #1312989
  * netlink: Only check file credentials for implicit destinations
    - LP: #1312989
  * igb: fix stats for i210 rx_fifo_errors
    - LP: #1338893
  * HID: use multi input quirk for 22b9:2968
    - LP: #1339567
  * crypto/nx: disable NX on little endian builds
    - LP: #1338666
  * ACPI / video: Add Dell Inspiron 5737 to the blacklist
    - LP: #1250401
  * Input: elantech - deal with clickpads reportin...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
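
When verifying the fix on a Trusty machine, it can help to check whether the installed kernel package is at least 3.13.0-35.62, the version named in the Launchpad Janitor comment above. The sketch below is an assumption-laden illustration: `version_key` and `has_fix` are hypothetical helper names, and the comparison looks only at the numeric fields of the version string. It does not implement full Debian version ordering (epochs, `~` suffixes), for which `dpkg --compare-versions` is the authoritative tool.

```python
import re

# Package version in which this bug was reported fixed (from this report).
FIXED_VERSION = "3.13.0-35.62"


def version_key(version):
    """Turn e.g. '3.13.0-35.62' into (3, 13, 0, 35, 62).

    Simplification: only the numeric fields are compared, so this is
    not a full Debian version comparison.
    """
    return tuple(int(number) for number in re.findall(r"\d+", version))


def has_fix(installed, fixed=FIXED_VERSION):
    """Return True if the installed version is at or after the fixed one."""
    return version_key(installed) >= version_key(fixed)
```

For example, `has_fix("3.13.0-34.60")` is false and `has_fix("3.13.0-35.62")` is true under this simplified ordering.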