ISST-LTE:pNV: system ben is hung during ST (nvme)

Bug #1620317 reported by bugproxy on 2016-09-05
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
High
Unassigned
Xenial
Undecided
Tim Gardner
Yakkety
High
Unassigned

Bug Description

On when we are running I/O intensive tasks and CPU addition/removal, the block may hang stalling the entire machine.

The backtrace below is one of the symptoms:

[12747.111149] ---[ end trace b4d8d720952460b5 ]---
[12747.126885] Trying to free IRQ 357 from IRQ context!
[12747.146930] ------------[ cut here ]------------
[12747.166674] WARNING: at /build/linux-iLHNl3/linux-4.4.0/kernel/irq/manage.c:1438
[12747.184069] Modules linked in: minix nls_iso8859_1 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) mlx4_en(OE) mlx4_core(OE) binfmt_misc xfs joydev input_leds mac_hid ofpart cmdlinepart powernv_flash ipmi_powernv mtd ipmi_msghandler at24 opal_prd powernv_rng ibmpowernv uio_pdrv_genirq uio sunrpc knem(OE) autofs4 btrfs xor raid6_pq hid_generic usbhid hid uas usb_storage nouveau ast bnx2x i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core(OE) ahci drm mdio libcrc32c mlx_compat(OE) libahci vxlan nvme ip6_udp_tunnel udp_tunnel
[12747.349013] CPU: 80 PID: 0 Comm: swapper/80 Tainted: G W OEL 4.4.0-21-generic #37-Ubuntu
[12747.369046] task: c000000f1fab89b0 ti: c000000f1fb6c000 task.ti: c000000f1fb6c000
[12747.404848] NIP: c000000000131888 LR: c000000000131884 CTR: 00000000300303f0
[12747.808333] REGS: c000000f1fb6e550 TRAP: 0700 Tainted: G W OEL (4.4.0-21-generic)
[12747.867658] MSR: 9000000100029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28022222 XER: 20000000
[12747.884783] CFAR: c000000000aea8f4 SOFTE: 1
GPR00: c000000000131884 c000000f1fb6e7d0 c0000000015b4200 0000000000000028
GPR04: c000000f2a409c50 c000000f2a41b4e0 0000000f29480000 00000000000033da
GPR08: 0000000000000007 c000000000f8b27c 0000000f29480000 9000000100001003
GPR12: 0000000000002200 c000000007b6f800 c000000f2a40a938 0000000000000100
GPR16: c000000f11480000 0000000000003a98 0000000000000000 0000000000000000
GPR20: 0000000000000000 d000000009521008 d0000000095146a0 fffffffffffff000
GPR24: c000000004a19ef0 0000000000000000 0000000000000003 000000000000007d
GPR28: 0000000000000165 c000000eefeb1800 c000000eef830600 0000000000000165
[12748.243270] NIP [c000000000131888] __free_irq+0x238/0x370
[12748.254089] LR [c000000000131884] __free_irq+0x234/0x370
[12748.269738] Call Trace:
[12748.286740] [c000000f1fb6e7d0] [c000000000131884] __free_irq+0x234/0x370 (unreliable)
[12748.289687] [c000000f1fb6e860] [c000000000131af8] free_irq+0x88/0xb0
[12748.304594] [c000000f1fb6e890] [d000000009514528] nvme_suspend_queue+0xc8/0x150 [nvme]
[12748.333825] [c000000f1fb6e8c0] [d00000000951681c] nvme_dev_disable+0x3fc/0x400 [nvme]
[12748.340913] [c000000f1fb6e9a0] [d000000009516ae4] nvme_timeout+0xe4/0x260 [nvme]
[12748.357136] [c000000f1fb6ea60] [c000000000548a34] blk_mq_rq_timed_out+0x64/0x110
[12748.383939] [c000000f1fb6ead0] [c00000000054c540] bt_for_each+0x160/0x170
[12748.399292] [c000000f1fb6eb40] [c00000000054d4e8] blk_mq_queue_tag_busy_iter+0x78/0x110
[12748.402665] [c000000f1fb6eb90] [c000000000547358] blk_mq_rq_timer+0x48/0x140
[12748.438649] [c000000f1fb6ebd0] [c00000000014a13c] call_timer_fn+0x5c/0x1c0
[12748.468126] [c000000f1fb6ec60] [c00000000014a5fc] run_timer_softirq+0x31c/0x3f0
[12748.483367] [c000000f1fb6ed30] [c0000000000beb78] __do_softirq+0x188/0x3e0
[12748.498378] [c000000f1fb6ee20] [c0000000000bf048] irq_exit+0xc8/0x100
[12748.501048] [c000000f1fb6ee40] [c00000000001f954] timer_interrupt+0xa4/0xe0
[12748.516377] [c000000f1fb6ee70] [c000000000002714] decrementer_common+0x114/0x180
[12748.547282] --- interrupt: 901 at arch_local_irq_restore+0x74/0x90
[12748.547282] LR = arch_local_irq_restore+0x74/0x90
[12748.574141] [c000000f1fb6f160] [0000000000000001] 0x1 (unreliable)
[12748.592405] [c000000f1fb6f180] [c000000000aedc3c] dump_stack+0xd0/0xf0
[12748.596461] [c000000f1fb6f1c0] [c0000000001006fc] dequeue_task_idle+0x5c/0x90
[12748.611532] [c000000f1fb6f230] [c0000000000f6080] deactivate_task+0xc0/0x130
[12748.627685] [c000000f1fb6f270] [c000000000adcb10] __schedule+0x440/0x990
[12748.654416] [c000000f1fb6f300] [c000000000add0a8] schedule+0x48/0xc0
[12748.670558] [c000000f1fb6f330] [c000000000ae1474] schedule_timeout+0x274/0x350
[12748.673485] [c000000f1fb6f420] [c000000000ade23c] wait_for_common+0xec/0x240
[12748.699192] [c000000f1fb6f4a0] [c0000000000e6908] kthread_stop+0x88/0x210
[12748.718385] [c000000f1fb6f4e0] [d000000009514240] nvme_dev_list_remove+0x90/0x110 [nvme]
[12748.748925] [c000000f1fb6f510] [d000000009516498] nvme_dev_disable+0x78/0x400 [nvme]
[12748.752112] [c000000f1fb6f5f0] [d000000009516ae4] nvme_timeout+0xe4/0x260 [nvme]
[12748.775395] [c000000f1fb6f6b0] [c000000000548a34] blk_mq_rq_timed_out+0x64/0x110
[12748.821069] [c000000f1fb6f720] [c00000000054c540] bt_for_each+0x160/0x170
[12748.851733] [c000000f1fb6f790] [c00000000054d4e8] blk_mq_queue_tag_busy_iter+0x78/0x110
[12748.883093] [c000000f1fb6f7e0] [c000000000547358] blk_mq_rq_timer+0x48/0x140
[12748.918348] [c000000f1fb6f820] [c00000000014a13c] call_timer_fn+0x5c/0x1c0
[12748.934743] [c000000f1fb6f8b0] [c00000000014a5fc] run_timer_softirq+0x31c/0x3f0
[12748.938084] [c000000f1fb6f980] [c0000000000beb78] __do_softirq+0x188/0x3e0
[12748.960815] [c000000f1fb6fa70] [c0000000000bf048] irq_exit+0xc8/0x100
[12748.992175] [c000000f1fb6fa90] [c00000000001f954] timer_interrupt+0xa4/0xe0
[12749.019299] [c000000f1fb6fac0] [c000000000002714] decrementer_common+0x114/0x180
[12749.037168] --- interrupt: 901 at arch_local_irq_restore+0x74/0x90
[12749.037168] LR = arch_local_irq_restore+0x74/0x90
[12749.079044] [c000000f1fb6fdb0] [c000000f2a41d680] 0xc000000f2a41d680 (unreliable)
[12749.081736] [c000000f1fb6fdd0] [c000000000909a28] cpuidle_enter_state+0x1a8/0x410
[12749.127094] [c000000f1fb6fe30] [c000000000119a88] call_cpuidle+0x78/0xd0
[12749.144435] [c000000f1fb6fe70] [c000000000119e5c] cpu_startup_entry+0x37c/0x480
[12749.166156] [c000000f1fb6ff30] [c00000000004563c] start_secondary+0x33c/0x360
[12749.186929] [c000000f1fb6ff90] [c000000000008b6c] start_secondary_prolog+0x10/0x14
[12749.223828] Instruction dump:
[12749.223856] 4e800020 4bf83a5d 60000000 4bffff64 4bf83a51 60000000 4bffffa8 3c62ff7b
[12749.233245] 7f84e378 38630fe0 489b900d 60000000 <0fe00000> 4bfffe20 7d2903a6 387d0118
[12749.298371] ---[ end trace b4d8d720952460b6 ]---

== Comment: #184 - Gabriel Krisman Bertazi <email address hidden> - 2016-07-29 12:55:48 ==
I got it figured out. The nvme driver is not playing nice with the block timeout infrastructure, in a way that the timeout code goes into a live lock, waiting for the queue to be released. CPU hotplug, on the other hand, who is holding the queue freeze lock at the time, is waiting for an outstanding request to timeout (or complete). This request, in turn is stuck in the device, requiring a reset triggered by a timeout, which never happens due to the live lock.

I don't have the reason why the request is stuck inside the device requiring a timeout, but this could even be caused by the Leaf firmware itself. I also see some successful timeouts triggered under normal conditions. In the failure event, we should be able to abort the request normally, but this happens via the timeout infrastructure, which is blocked during cpu hotplug events.

I have a quirk to fully recover after the failure, by forcing a reset of the stucked IO, which allows the cpu hotplug completion and block layer recovery. I have a machine hitting the failure every few minutes in a loop, and recovering from it with my patch.

Patch submitted to linux-block

https://marc.info/?l=linux-block&m=146976739016592&w=2

== Comment: #207 - Gabriel Krisman Bertazi <email address hidden> - 2016-09-05 09:13:51 ==
Canonical,

This is fixed by:

e57690fe009b ("blk-mq: don't overwrite rq->mq_ctx")
0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
71f79fb3179e (" blk-mq: Allow timeouts to run while queue is freezing")

Which will apply cleanly on top of your kernel.

CVE References

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-140718 severity-critical targetmilestone-inin16041

Default Comment by Bridge

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
status: New → Incomplete
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → nobody
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: Confirmed → Triaged
Tim Gardner (timg-tpi) wrote :

Merged in v4.8

Changed in linux (Ubuntu Yakkety):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
status: Triaged → Fix Released
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress

------- Comment From <email address hidden> 2016-09-12 03:10 EDT-------
*** Bug 141963 has been marked as a duplicate of this bug. ***

Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed

------- Comment (attachment only) From <email address hidden> 2016-09-20 02:40 EDT-------

------- Comment (attachment only) From <email address hidden> 2016-09-23 12:09 EDT-------

------- Comment on attachment From <email address hidden> 2016-09-23 14:47 EDT-------

Looking at the blocked tasks, this looks like it could be an issue fixed recently upstream:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/kernel/sched/core.c?id=135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf

Gabriel is building a kernel that has this fix added and we'll kick off a weekend run.

bugproxy (bugproxy) on 2016-09-24
tags: removed: bugnameltc-140718 severity-critical
bugproxy (bugproxy) on 2016-09-24
tags: added: bugnameltc-140718 severity-critical
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial

------- Comment From <email address hidden> 2016-09-27 23:02 EDT-------
Hi Canonical,

Our test teams had problems with booting the Ubuntu-4.4.0-40.60 kernel. They hit an Oops in the NVMe probe path.

My current understanding is that this is already fixed in your kernel by the fixup commit: e9820e415895 (" UBUNTU: (fix) NVMe: Don't unmap controller registers on reset") which will be published in kernel Ubuntu-4.4.0-41.61.

We will try again with that version.

While we are here, I noticed another issue with the backport of 30d6592fce71 (" NVMe: Don't unmap controller registers on reset"). Looks like you are missing the fixup commit that went later into the 4.4.y tree:

81e9a969c441 ('nvme: Call pci_disable_device on the error path") #4.4.y tree.

I opened Bug 146899 to track this. It will be mirrored soon.

tags: added: verification-failed-xenial
removed: verification-needed-xenial
Kamal Mostafa (kamalmostafa) wrote :

Thanks for this information, Gabriel.

Your understanding is correct: the original probe path oops is fixed in 4.4.0-41.61, which will land in the -proposed archive by tomorrow (alternately, available now from our staging PPA: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+packages

And we'll pick up the missing error path fixup commit as well (for the next Xenial cycle).

------- Comment on attachment From <email address hidden> 2016-09-23 14:47 EDT-------

Looking at the blocked tasks, this looks like it could be an issue fixed recently upstream:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/kernel/sched/core.c?id=135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf

Gabriel is building a kernel that has this fix added and we'll kick off a weekend run.

Brad Figg (brad-figg) on 2016-09-30
tags: added: verification-needed-xenial
removed: verification-failed-xenial
bugproxy (bugproxy) on 2016-09-30
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :
Download full text (17.5 KiB)

This bug was fixed in the package linux - 4.4.0-42.62

---------------
linux (4.4.0-42.62) xenial; urgency=low

  * Fix GRO recursion overflow for tunneling protocols (LP: #1631287)
    - tunnels: Don't apply GRO to multiple layers of encapsulation.
    - gro: Allow tunnel stacking in the case of FOU/GUE

  * CVE-2016-7039
    - SAUCE: net: add recursion limit to GRO

linux (4.4.0-41.61) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1628204

  * nvme drive probe failure (LP: #1626894)
    - (fix) NVMe: Don't unmap controller registers on reset

linux (4.4.0-40.60) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1627074

  * Permission denied in CIFS with kernel 4.4.0-38 (LP: #1626112)
    - Fix memory leaks in cifs_do_mount()
    - Compare prepaths when comparing superblocks
    - SAUCE: Fix regression which breaks DFS mounting

  * Backlight does not change when adjust it higher than 50% after S3
    (LP: #1625932)
    - SAUCE: i915_bpo: drm/i915/backlight: setup and cache pwm alternate
      increment value
    - SAUCE: i915_bpo: drm/i915/backlight: setup backlight pwm alternate
      increment on backlight enable

linux (4.4.0-39.59) xenial; urgency=low

  [ Joseph Salisbury ]

  * Release Tracking Bug
    - LP: #1625303

  * thunder: chip errata w/ multiple CQEs for a TSO packet (LP: #1624569)
    - net: thunderx: Fix for issues with multiple CQEs posted for a TSO packet

  * thunder: faulty TSO padding (LP: #1623627)
    - net: thunderx: Fix for HW issue while padding TSO packet

  * CVE-2016-6828
    - tcp: fix use after free in tcp_xmit_retransmit_queue()

  * Sennheiser Officerunner - cannot get freq at ep 0x83 (LP: #1622763)
    - SAUCE: (no-up) ALSA: usb-audio: Add quirk for sennheiser officerunner

  * Backport E3 Skylake Support in ie31200_edac to Xenial (LP: #1619766)
    - EDAC, ie31200_edac: Add Skylake support

  * Ubuntu 16.04 - Full EEH Recovery Support for NVMe devices (LP: #1602724)
    - SAUCE: nvme: Don't suspend admin queue that wasn't created

  * ISST-LTE:pNV: system ben is hung during ST (nvme) (LP: #1620317)
    - blk-mq: Allow timeouts to run while queue is freezing
    - blk-mq: improve warning for running a queue on the wrong CPU
    - blk-mq: don't overwrite rq->mq_ctx

  * lsattr 32bit does not work on 64bit kernel (Inappropriate ioctl error)
    (LP: #1619918)
    - btrfs: bugfix: handle FS_IOC32_{GETFLAGS, SETFLAGS, GETVERSION} in
      btrfs_ioctl

  * radeon: monitor connected to onboard VGA doesn't work with Xenial
    (LP: #1600092)
    - drm/radeon/dp: add back special handling for NUTMEG

  * initramfs includes qle driver, but not firmware (LP: #1623187)
    - qed: add MODULE_FIRMWARE()

  * [Hyper-V] Rebase Hyper-V to 4.7.2 (stable) (LP: #1616677)
    - hv_netvsc: Implement support for VF drivers on Hyper-V
    - hv_netvsc: Fix the list processing for network change event
    - Drivers: hv: vmbus: Introduce functions for estimating room in the ring
      buffer
    - Drivers: hv: vmbus: Use READ_ONCE() to read variables that are volatile
    - Drivers: hv: vmbus: Export the vmbus_set_event() API
    - lcoking/barriers, arch: Use smp barriers...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers