linux-kernel: Freeing IRQ from IRQ context

Bug #1597908 reported by Keith Busch on 2016-06-30
This bug affects 2 people
Affects          Importance   Assigned to
linux (Ubuntu)   High         Joseph Salisbury
Xenial           High         Joseph Salisbury

Bug Description

It looks like Ubuntu 16.04 took the nvme driver from the 4.5 kernel but is missing some critical block-layer updates that the driver depends on. Specifically, it is missing this one, which moves the timeout handler to a work queue instead of an IRQ-context timer task:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=287922eb0b186e2a5bf54fdd04b734c25c90035c

This mismatch causes lots of warnings and errors during recovery from failure.
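
The reason the missing commit matters: a timer callback runs in atomic context, where sleeping is not allowed, while a work item runs in process context, where it is. The following is only a user-space analogy in Python, not kernel code; `run_demo`, `on_timer_expired`, `recover_controller` and the request id 42 are all made up for illustration. The timer callback does nothing but queue the request, and a separate worker thread runs the handler that may block.

```python
import queue
import threading
import time


def run_demo():
    """Simulate a timeout whose blocking handler is deferred to a worker."""
    work_queue = queue.Queue()
    handled = []  # (request_id, name of the thread that ran the handler)

    def recover_controller(request_id):
        # Stands in for a handler that may sleep while it waits on hardware.
        time.sleep(0.01)
        handled.append((request_id, threading.current_thread().name))

    def on_timer_expired(request_id):
        # Timer context: must not block, so it only queues the work.
        work_queue.put(request_id)

    def worker():
        # Worker context: blocking is allowed here.
        while True:
            request_id = work_queue.get()
            if request_id is None:  # sentinel: stop the worker
                return
            recover_controller(request_id)

    worker_thread = threading.Thread(target=worker, name="timeout-worker")
    worker_thread.start()

    timer = threading.Timer(0.01, on_timer_expired, args=(42,))
    timer.start()
    timer.join()          # the timer callback has returned
    work_queue.put(None)  # queued behind request 42
    worker_thread.join()
    return handled


if __name__ == "__main__":
    print(run_demo())
```

The handler is recorded as running on the thread named `timeout-worker`, never on the timer thread, which is the property the workqueue change gives the block layer's timeout path.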

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1597908/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Paul White (paulw2u) on 2016-06-30
affects: ubuntu → linux (Ubuntu)

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1597908

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Keith Busch (keith-busch) wrote :

This system crashes, which makes running apport-collect after the fact impossible, but I can confirm it is a bug. As the upstream nvme driver maintainer, I can recommend either which driver commits to revert or which kernel commit to cherry-pick (preferring the latter :)).

Here is a snippet of the stack trace:

<3>[51827.132142] BUG: scheduling while atomic: swapper/19/0/0x00000100
<4>[51827.242686] Modules linked in: nvme binfmt_misc PlxSvc(OE) ipmi_devintf intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass input_leds joydev sb_edac ipmi_ssif edac_core mei_me mei lpc_ich ioatdma shpchp ipmi_si ipmi_msghandler 8250_fintek acpi_pad acpi_power_meter mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear igb dca ptp ahci crct10dif_pclmul crc32_pclmul hid_generic mxm_wmi aesni_intel aes_x86_64 lrw gf128mul usbhid glue_helper ablk_helper pps_core cryptd hid libahci i2c_algo_bit fjes wmi
<4>[51827.242743] CPU: 19 PID: 0 Comm: swapper/19 Tainted: G W OE 4.4.0-24-generic #43-Ubuntu
<4>[51827.242746] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.11.01.0132.060620160917 06/06/2016
<4>[51827.242748] 0000000000000286 374975818f2884ca ffff88105de43a98 ffffffff813eab23
<4>[51827.242752] ffff88105de56d00 0000000000000000 ffff88105de43aa8 ffffffff810a5ceb
<4>[51827.242762] ffff88105de43af8 ffffffff818217d6 ffff88105de43ac8 3749758100000013
<4>[51827.242765] Call Trace:
<4>[51827.242768] <IRQ> [<ffffffff813eab23>] dump_stack+0x63/0x90
<4>[51827.242781] [<ffffffff810a5ceb>] __schedule_bug+0x4b/0x60
<4>[51827.242788] [<ffffffff818217d6>] __schedule+0x726/0xa30
<4>[51827.242792] [<ffffffff81821b15>] schedule+0x35/0x80
<4>[51827.242797] [<ffffffff81824ba9>] schedule_timeout+0x129/0x270
<4>[51827.242802] [<ffffffff810ec480>] ? trace_event_raw_event_tick_stop+0x120/0x120
<4>[51827.242807] [<ffffffff810ec89d>] msleep+0x2d/0x40
<4>[51827.242813] [<ffffffffc02cd470>] nvme_wait_ready+0x90/0x100 [nvme]
<4>[51827.242818] [<ffffffffc02cee70>] nvme_disable_ctrl+0x40/0x50 [nvme]
<4>[51827.242823] [<ffffffffc02d1b3d>] nvme_disable_admin_queue+0x8d/0x90 [nvme]
<4>[51827.242828] [<ffffffffc02d1dde>] nvme_dev_disable+0x29e/0x2c0 [nvme]
<4>[51827.242833] [<ffffffffc02d03a0>] ? __nvme_process_cq+0x200/0x200 [nvme]
<4>[51827.242838] [<ffffffff8154955c>] ? dev_warn+0x6c/0x90
<4>[51827.242843] [<ffffffffc02d1ff0>] nvme_timeout+0x110/0x1d0 [nvme]
<4>[51827.242847] [<ffffffff813ea92f>] ? cpumask_next_and+0x2f/0x40
<4>[51827.242850] [<ffffffff810bd4bc>] ? load_balance+0x18c/0x980
<4>[51827.242854] [<ffffffff813c5cdf>] blk_mq_rq_timed_out+0x2f/0x70
<4>[51827.242857] [<ffffffff813c5d6e>] blk_mq_check_expired+0x4e/0x80
<4>[51827.242861] [<ffffffff813c86c8>] bt_for_each+0xd8/0xe0
<4>[51827.242864] [<ffffffff813c5d20>] ? blk_mq_rq_timed_out+0x70/0x70
<4>[51827.242868] [<ffffffff813c5d20>] ? blk_mq_rq_timed_out+0x70/0x70
<4>[51827.242871] [<ffffffff813c8ed7>] blk_mq_queue_tag_busy_iter+0x47/0xc0
<4>[51827.24...
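
The trace shows msleep() being reached from nvme_timeout() after the `<IRQ>` marker, which is what triggers "scheduling while atomic". A rough sketch of how such a trace could be scanned automatically follows. The regular expression, the function names and the short list of sleeping functions are assumptions made for this example, and the parser only understands the 4.4-era `[<address>] symbol+0xoff/0xsize [module]` frame format seen above. Frames prefixed with `?` are treated as unreliable and skipped.

```python
import re

# Matches e.g. "[<ffffffffc02cd470>] nvme_wait_ready+0x90/0x100 [nvme]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<func>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)

# Illustrative subset of functions that may sleep; not an exhaustive list.
SLEEPING = {"schedule", "schedule_timeout", "msleep"}


def parse_frames(log_text):
    """Return one dict per stack frame found in the log text."""
    frames = []
    in_irq = False
    for line in log_text.splitlines():
        if "<IRQ>" in line:
            in_irq = True   # frames from here on are on the IRQ stack
        if "<EOI>" in line:
            in_irq = False  # back on the interrupted task's stack
        match = FRAME_RE.search(line)
        if match:
            frames.append({
                "func": match.group("func"),
                "module": match.group("module"),
                "reliable": match.group("unreliable") is None,
                "in_irq": in_irq,
            })
    return frames


def sleeping_calls_in_irq(log_text):
    """List reliable frames on the IRQ stack that are sleeping functions."""
    return [f["func"] for f in parse_frames(log_text)
            if f["in_irq"] and f["reliable"] and f["func"] in SLEEPING]


# A few lines copied from the trace above.
SAMPLE = """\
<4>[51827.242768] <IRQ> [<ffffffff813eab23>] dump_stack+0x63/0x90
<4>[51827.242792] [<ffffffff81821b15>] schedule+0x35/0x80
<4>[51827.242802] [<ffffffff810ec480>] ? trace_event_raw_event_tick_stop+0x120/0x120
<4>[51827.242807] [<ffffffff810ec89d>] msleep+0x2d/0x40
<4>[51827.242813] [<ffffffffc02cd470>] nvme_wait_ready+0x90/0x100 [nvme]
"""

if __name__ == "__main__":
    print(sleeping_calls_in_irq(SAMPLE))
```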

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Medium → High
Changed in linux (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
status: Confirmed → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with a cherry-pick of commit 287922eb0b. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1597908/

Can you test this kernel and see if it resolves this bug?

Keith Busch (keith-busch) wrote :

Thanks for the quick build! I forwarded this on to the original reporter and will send the results when I hear back. We'll get it tested, though the results might not come until next week because of the holiday weekend.

Joe Gruher (joseph-r-gruher) wrote :

Hello, there are a number of packages at http://kernel.ubuntu.com/~jsalisbury/lp1597908/. Should we install the whole set to test this fix, or just specific one(s)?

Joe Gruher (joseph-r-gruher) wrote :

After installing all the packages on a system that was reliably showing the problem I have been unable to reproduce the problem again. Fix seems good in initial testing. No new problems identified.

Kamil Gardziejczyk (belussi-pl) wrote :

Hi,

I have installed these packages, but I still observe this issue:

kamil@host:~$ sudo lspci | grep Volatile
[sudo] password for kamil:
03:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
04:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
05:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
06:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
07:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
08:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
kamil@host:~$ ls /dev/ | grep nvme
nvme0
nvme0n1
nvme1
nvme1n1
nvme2
nvme2n1
nvme3
nvme3n1
nvme4
nvme5
nvme5n1
kamil@host:~$ [ 69.542046] nvme 0000:07:00.0: Identify Controller failed (-4)

kamil@host:~$ sudo lspci | grep Volatile
03:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
04:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
05:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
06:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
08:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
kamil@host:~$ ls /dev/ | grep nvme
nvme0
nvme0n1
nvme1
nvme1n1
nvme2
nvme2n1
nvme3
nvme3n1
nvme5
nvme5n1
kamil@host:~$ uname -a
Linux host 4.4.0-28-generic #47~lp1597908 SMP Thu Jun 30 22:46:45 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Dmesg output:
[ 2.648607] pci 0000:07:00.0: [8086:0953] type 00 class 0x010802
[ 2.649074] pci 0000:07:00.0: reg 0x10: [mem 0xc6a10000-0xc6a13fff 64bit]
[ 2.649668] pci 0000:07:00.0: reg 0x30: [mem 0xc6a00000-0xc6a0ffff pref]
[ 5.211815] pci_bus 0000:07: resource 1 [mem 0xc6a00000-0xc6afffff]
[ 6.365162] pci 0000:07:00.0: Signaling PME through PCIe PME interrupt
...
[ 69.437927] nvme 0000:07:00.0: I/O 0 QID 0 timeout, disable controller
[ 69.542017] nvme 0000:07:00.0: Cancelling I/O 0 QID 0
[ 69.542046] nvme 0000:07:00.0: Identify Controller failed (-4)
[ 69.553756] nvme 0000:07:00.0: Removing after probe failure
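
The two pairs of listings above differ by one PCI function and one device node. A small sketch of that comparison follows, assuming the command output has been pasted into Python lists; the function names are made up for illustration. Matching the vanished node nvme4 to 07:00.0 is an inference that fits the dmesg lines, not something the listings state directly.

```python
import re


def pci_addresses(lspci_lines):
    """Return the PCI addresses (first column) of the given lspci lines."""
    return {line.split()[0] for line in lspci_lines if line.strip()}


def controllers_without_namespace(dev_names):
    """Return controller nodes (nvmeN) that have no namespace node (nvmeNnM)."""
    controllers = set()
    with_namespace = set()
    for name in dev_names:
        if re.fullmatch(r"nvme\d+", name):
            controllers.add(name)
        else:
            match = re.fullmatch(r"(nvme\d+)n\d+", name)
            if match:
                with_namespace.add(match.group(1))
    return sorted(controllers - with_namespace)


# Addresses from the first and second lspci runs above (description trimmed).
LSPCI_BEFORE = [
    "03:00.0 Non-Volatile memory controller",
    "04:00.0 Non-Volatile memory controller",
    "05:00.0 Non-Volatile memory controller",
    "06:00.0 Non-Volatile memory controller",
    "07:00.0 Non-Volatile memory controller",
    "08:00.0 Non-Volatile memory controller",
]
LSPCI_AFTER = [line for line in LSPCI_BEFORE if not line.startswith("07:")]

# Device nodes from the first "ls /dev/ | grep nvme" run above.
DEV_BEFORE = [
    "nvme0", "nvme0n1", "nvme1", "nvme1n1", "nvme2", "nvme2n1",
    "nvme3", "nvme3n1", "nvme4", "nvme5", "nvme5n1",
]

if __name__ == "__main__":
    print(sorted(pci_addresses(LSPCI_BEFORE) - pci_addresses(LSPCI_AFTER)))
    print(controllers_without_namespace(DEV_BEFORE))
```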

Keith Busch (keith-busch) wrote :

Hi Kamil,

It looks like the kernel issue is resolved. Your drive just appears to be in a bad state and the driver reacts accordingly.

I think this Launchpad issue is resolved, and we should request that the fix go forward.

Keith Busch (keith-busch) wrote :

Kamil,

For your issue, this is potentially a bug specific to the 4.5 driver. We used to poll once a second, which would hide legacy irq issues during initialization, but 4.5 removed it. A patch[*] to 4.6 removed all use of legacy interrupts, and should have been applied to 4.5 stable (they were cc'ed on the mailing list).

 * https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=a5229050b69cfffb690b546c357ca5a60434c0c8

Joseph Salisbury (jsalisbury) wrote :

Per the testing results in comment #7, I'll submit an SRU request for commit 287922eb0. Thanks for testing!

I have updated the kernel to 4.7 and everything seems to be working fine.

Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Stefan Bader (smb) wrote :

Taking the test kernel results from comments #8 and #9 as verification.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-36.55

---------------
linux (4.4.0-36.55) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1612305

  * I2C touchpad does not work on AMD platform (LP: #1612006)
    - SAUCE: pinctrl/amd: Remove the default de-bounce time

  * CVE-2016-5696
    - tcp: make challenge acks less predictable

linux (4.4.0-35.54) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1611215

  * [i915_bpo] Sync with v4.7 (LP: #1609742)
    - SAUCE: i915_bpo: Sync with v4.7

  * s390/cio: fix reset of channel measurement block (LP: #1609415)
    - s390/cio: allow to reset channel measurement block

  * in Ubuntu16.10: Hit on Call traces and system goes down when transactional
    memory tests are running in 32TB Brazos system (LP: #1606786)
    - powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
    - powerpc/tm: Fix stack pointer corruption in __tm_recheckpoint()

  * Power Menu does not display after press the Power Button (LP: #1609204)
    - intel-vbtn: new driver for Intel Virtual Button
    - [config] enable CONFIG_INTEL_VBTN=m

  * OptiPlex 7450 AIO hangs when rebooting (LP: #1608762)
    - x86/reboot: Add Dell Optiplex 7450 AIO reboot quirk

  * virtualbox+usb 3.0 breaks boot, -28 kernel works (LP: #1604058)
    - SAUCE: xhci: Fix soft lockup in xhci_pci_probe path when XHCI_STATE_HALTED

  * linux-kernel: Freeing IRQ from IRQ context (LP: #1597908)
    - block: defer timeouts to a workqueue

  * Tunnel offload indications not stripped from encapsulated packets, causing
    performance overhead (LP: #1602755)
    - tunnels: Remove encapsulation offloads on decap.

  * lm-sensors is throwing "ERROR: Can't get value of subfeature temp1_input:
    I/O error" for be2net driver (LP: #1607387)
    - be2net: perform temperature query in adapter regardless of its interface
      state

  * Dell dock MAC Address pass through doesn't work in Ubuntu (LP: #1579984)
    - r8152: Add support for setting pass through MAC address on RTL8153-AD

  * vmxnet3 LRO IPv6 performance issues (stalling TCP) (LP: #1605494)
    - Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets

  * ISST-LTE:pVM:monklp5:Ubuntu16.04.1:system crashed at
    lpfc_sli4_scmd_to_wqidx_distr (LP: #1597974)
    - SAUCE: lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from
      lpfc_send_taskmgmt()

  * Backport cxlflash shutdown patch to Xenial SRU (LP: #1605405)
    - SAUCE: cxlflash: Verify problem state area is mapped before notifying
      shutdown

  * Xenial update to v4.4.16 stable release (LP: #1607404)
    - mac80211: fix fast_tx header alignment
    - mac80211: mesh: flush mesh paths unconditionally
    - mac80211_hwsim: Add missing check for HWSIM_ATTR_SIGNAL
    - mac80211: Fix mesh estab_plinks counting in STA removal case
    - EDAC, sb_edac: Fix rank lookup on Broadwell
    - IB/cm: Fix a recently introduced locking bug
    - IB/mlx4: Properly initialize GRH TClass and FlowLabel in AHs
    - powerpc/pseries: Fix IBM_ARCH_VEC_NRCORES_OFFSET since POWER8NVL was added
    - powerpc/tm: Always reclaim in start_thread() for exec() class syscalls
    - usb: dwc2: fix reg...
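
To check whether a given Xenial machine already runs a kernel at or above the release named in this message, the kernel release string can be compared against 4.4.0-36. This is a minimal sketch: the function names are made up, and the comparison is only meaningful for stock Xenial 4.4.0-N kernels. A test build such as the 4.4.0-28 `lp1597908` kernel from earlier in this report carries the fix despite its lower number and would be reported as unfixed.

```python
import re

# From "This bug was fixed in the package linux - 4.4.0-36.55".
FIXED_RELEASE = (4, 4, 0, 36)


def parse_release(release):
    """Turn a string like '4.4.0-24-generic' into (4, 4, 0, 24)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def has_released_fix(release):
    """True if the release is at or above the fixed Xenial kernel."""
    return parse_release(release) >= FIXED_RELEASE


if __name__ == "__main__":
    import platform
    print(has_released_fix(platform.release()))
```

On a live system, `platform.release()` returns the same string as `uname -r`.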

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc