CPU hard lockup with rigorous writes to NVMe drive

Bug #1810998 reported by Mauricio Faria de Oliveira
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Invalid
Medium
Mauricio Faria de Oliveira
Bionic
Fix Released
Medium
Unassigned
Cosmic
Fix Released
Medium
Unassigned

Bug Description

[Impact]

 * Users may experience cpu hard lockups when performing
   rigorous writes to NVMe drives.

 * The fix addresses an scheduling issue in the original
   implementation of wbt/writeback throttling

 * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
   contention and thundering herd issue in wbt_wait"),
   plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
   waking of tasks").

 * Plus a few dependency commits for each fix.

 * Backports are trivial: mainly replace rq_wait_inc_below()
   with the equivalent atomic_inc_below(), and maintain the
   __wbt_done() signature, both due to the lack of commit
   a79050434b45 ("blk-rq-qos: refactor out common elements
   of blk-wbt"), that changes lots of other/unrelated code.

[Test Case]

 * This command has been reported to reproduce the problem:

   $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128

 * It generates stack traces as below in the original kernel,
   and does not generate them in the modified/patched kernel.

 * The user/reporter verified the test kernel with these patches
   resolved the problem.

 * The developer verified in 2 systems (4-core and 24-core but
   no NVMe) for regressions, and no error messages were logged
   to dmesg.

[Regression Potential]

 * The regression potential is contained within writeback
   throttling mechanism (block/blk-wbt.*).

 * The commits have been verified for fixes in later commits in
   linux-next as of 2019-01-08 and all known fix commits are in.

[Other Info]

 * The problem has been introduced with the blk-wbt mechanism,
   in v4.10-rc1, and the fix commits in v4.19-rc1 and -rc2,
   so only Bionic and Cosmic needs this.

[Stack Traces]

[ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
...
[ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 393.628720] Call Trace:
[ 393.628721] <IRQ>
[ 393.628724] enqueue_task_fair+0x6c/0x7f0
[ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628731] activate_task+0x57/0xc0
[ 393.628735] ? sched_clock+0x9/0x10
[ 393.628736] ? sched_clock+0x9/0x10
[ 393.628738] ttwu_do_activate+0x49/0x90
[ 393.628739] try_to_wake_up+0x1df/0x490
[ 393.628741] default_wake_function+0x12/0x20
[ 393.628743] autoremove_wake_function+0x12/0x40
[ 393.628744] __wake_up_common+0x73/0x130
[ 393.628745] __wake_up_common_lock+0x80/0xc0
[ 393.628746] __wake_up+0x13/0x20
[ 393.628749] __wbt_done.part.21+0xa4/0xb0
[ 393.628749] wbt_done+0x72/0xa0
[ 393.628753] blk_mq_free_request+0xca/0x1a0
[ 393.628755] blk_mq_end_request+0x48/0x90
[ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
[ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 393.628764] __blk_mq_complete_request+0xd2/0x140
[ 393.628766] blk_mq_complete_request+0x18/0x20
[ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
[ 393.628768] nvme_irq+0x23/0x50 [nvme]
[ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
[ 393.628773] handle_irq_event_percpu+0x32/0x80
[ 393.628774] handle_irq_event+0x3b/0x60
[ 393.628778] handle_edge_irq+0x7c/0x190
[ 393.628779] handle_irq+0x20/0x30
[ 393.628783] do_IRQ+0x46/0xd0
[ 393.628784] common_interrupt+0x84/0x84
[ 393.628785] </IRQ>
...
[ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
[ 393.628796] cpuidle_enter+0x17/0x20
[ 393.628797] call_cpuidle+0x23/0x40
[ 393.628798] do_idle+0x18c/0x1f0
[ 393.628799] cpu_startup_entry+0x73/0x80
[ 393.628802] start_secondary+0x1a6/0x200
[ 393.628804] secondary_startup_64+0xa5/0xb0
[ 393.628805] Code: ...

[ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled

[ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
[ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
[ 435.617025] Sending NMI from CPU 8 to CPUs 30:
[ 435.617029] NMI backtrace for cpu 30
[ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 435.617047] Call Trace:
[ 435.617048] <IRQ>
[ 435.617051] enqueue_entity+0x9f/0x6b0
[ 435.617053] enqueue_task_fair+0x6c/0x7f0
[ 435.617056] activate_task+0x57/0xc0
[ 435.617059] ? sched_clock+0x9/0x10
[ 435.617060] ? sched_clock+0x9/0x10
[ 435.617061] ttwu_do_activate+0x49/0x90
[ 435.617063] try_to_wake_up+0x1df/0x490
[ 435.617065] default_wake_function+0x12/0x20
[ 435.617067] autoremove_wake_function+0x12/0x40
[ 435.617068] __wake_up_common+0x73/0x130
[ 435.617069] __wake_up_common_lock+0x80/0xc0
[ 435.617070] __wake_up+0x13/0x20
[ 435.617073] __wbt_done.part.21+0xa4/0xb0
[ 435.617074] wbt_done+0x72/0xa0
[ 435.617077] blk_mq_free_request+0xca/0x1a0
[ 435.617079] blk_mq_end_request+0x48/0x90
[ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
[ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 435.617088] __blk_mq_complete_request+0xd2/0x140
[ 435.617090] blk_mq_complete_request+0x18/0x20
[ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
[ 435.617093] nvme_irq+0x23/0x50 [nvme]
[ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
[ 435.617097] handle_irq_event_percpu+0x32/0x80
[ 435.617098] handle_irq_event+0x3b/0x60
[ 435.617101] handle_edge_irq+0x7c/0x190
[ 435.617102] handle_irq+0x20/0x30
[ 435.617106] do_IRQ+0x46/0xd0
[ 435.617107] common_interrupt+0x84/0x84
[ 435.617108] </IRQ>
...
[ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
[ 435.617118] cpuidle_enter+0x17/0x20
[ 435.617119] call_cpuidle+0x23/0x40
[ 435.617121] do_idle+0x18c/0x1f0
[ 435.617122] cpu_startup_entry+0x73/0x80
[ 435.617125] start_secondary+0x1a6/0x200
[ 435.617127] secondary_startup_64+0xa5/0xb0
[ 435.617128] Code: ...

Changed in linux (Ubuntu):
assignee: nobody → Mauricio Faria de Oliveira (mfo)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1810998

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
description: updated
description: updated
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
description: updated
description: updated
description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Patch series posted to kernel-team mailing list:

[SRU B][PATCH 00/13] blk-wbt: fix for LP#1810998
https://lists.ubuntu.com/archives/kernel-team/2019-January/097675.html

[SRU C][PATCH 0/8] blk-wbt: fix for LP#1810998
https://lists.ubuntu.com/archives/kernel-team/2019-January/097689.html

Stefan Bader (smb)
Changed in linux (Ubuntu Cosmic):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu):
importance: Undecided → Medium
description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Patch v2 series posted to the kernel-team mailing list:

[SRU B][PATCH v2 0/7] blk-wbt: fix for LP#1810998
https://lists.ubuntu.com/archives/kernel-team/2019-January/097831.html

[SRU C][PATCH v2 0/6] blk-wbt: fix for LP#1810998
https://lists.ubuntu.com/archives/kernel-team/2019-January/097840.html

Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: New → Fix Committed
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-cosmic' to 'verification-done-cosmic'. If the problem still exists, change the tag 'verification-needed-cosmic' to 'verification-failed-cosmic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-cosmic
tags: added: verification-needed-bionic
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Verification done on Cosmic (no errors seen on dmesg).
Waiting for verification on Bionic by the reporter.

root@shuckle:~# uname -a
Linux shuckle 4.18.0-14-generic #15-Ubuntu SMP Mon Jan 14 09:01:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

root@shuckle:~# fdisk /dev/nvme0n1 # create one partition
root@shuckle:~# mkfs.ext4 /dev/nvme0n1p1
root@shuckle:~# mkdir /nvme
root@shuckle:~# mount /dev/nvme0n1p1 /nvme/
root@shuckle:~# cd /nvme

root@shuckle:/nvme# iozone -R -s 2G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128

root@shuckle:/nvme# dmesg | tail
<...>
[ 586.338241] nvme0n1: p1
[ 586.366049] nvme0n1: p1
[ 609.549600] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)

tags: added: verification-done-cosmic
removed: verification-needed-cosmic
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Verification done on Bionic (no errors seen on dmesg).
Still waiting on verification by the reporter (different hardware), but this verification shows no regression in dmesg.

Same steps as described in the previous comment.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

The reporter verified the Bionic kernel successfully as well.

Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (47.0 KiB)

This bug was fixed in the package linux - 4.15.0-44.47

---------------
linux (4.15.0-44.47) bionic; urgency=medium

  * linux: 4.15.0-44.47 -proposed tracker (LP: #1811419)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * CPU hard lockup with rigorous writes to NVMe drive (LP: #1810998)
    - blk-wbt: pass in enum wbt_flags to get_rq_wait()
    - blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
    - blk-wbt: move disable check into get_limit()
    - blk-wbt: use wq_has_sleeper() for wq active check
    - blk-wbt: fix has-sleeper queueing check
    - blk-wbt: abstract out end IO completion handler
    - blk-wbt: improve waking of tasks

  * To reduce the Realtek USB cardreader power consumption (LP: #1811337)
    - mmc: sdhci: Disable 1.8v modes (HS200/HS400/UHS) if controller can't support
      1.8v
    - mmc: core: Introduce MMC_CAP_SYNC_RUNTIME_PM
    - mmc: rtsx_usb_sdmmc: Don't runtime resume the device while changing led
    - mmc: rtsx_usb: Use MMC_CAP2_NO_SDIO
    - mmc: rtsx_usb: Enable MMC_CAP_ERASE to allow erase/discard/trim requests
    - mmc: rtsx_usb_sdmmc: Re-work runtime PM support
    - mmc: rtsx_usb_sdmmc: Re-work card detection/removal support
    - memstick: rtsx_usb_ms: Add missing pm_runtime_disable() in probe function
    - misc: rtsx_usb: Use USB remote wakeup signaling for card insertion detection
    - memstick: Prevent memstick host from getting runtime suspended during card
      detection
    - memstick: rtsx_usb_ms: Use ms_dev() helper
    - memstick: rtsx_usb_ms: Support runtime power management

  * Support non-strict iommu mode on arm64 (LP: #1806488)
    - iommu/io-pgtable-arm: Fix race handling in split_blk_unmap()
    - iommu/arm-smmu-v3: Implement flush_iotlb_all hook
    - iommu/dma: Add support for non-strict mode
    - iommu: Add "iommu.strict" command line option
    - iommu/io-pgtable-arm: Add support for non-strict mode
    - iommu/arm-smmu-v3: Add support for non-strict mode
    - iommu/io-pgtable-arm-v7s: Add support for non-strict mode
    - iommu/arm-smmu: Support non-strict mode

  * ELAN900C:00 04F3:2844 touchscreen doesn't work (LP: #1811335)
    - pinctrl: cannonlake: Fix community ordering for H variant
    - pinctrl: cannonlake: Fix HOSTSW_OWN register offset of H variant

  * Add Cavium ThunderX2 SoC UNCORE PMU driver (LP: #1811200)
    - perf: Export perf_event_update_userpage
    - Documentation: perf: Add documentation for ThunderX2 PMU uncore driver
    - drivers/perf: Add Cavium ThunderX2 SoC UNCORE PMU driver
    - [Config] New config CONFIG_THUNDERX2_PMU=m

  * Update hisilicon SoC-specific drivers (LP: #1810457)
    - SAUCE: Revert "net: hns3: Updates RX packet info fetch in case of multi BD"
    - Revert "UBUNTU: SAUCE: {topost} net: hns3: separate roce from nic when
      resetting"
    - Revert "UBUNTU: SAUCE: {topost} net: hns3: Use roce handle when calling roce
      callback function"
    - Revert "UBUNTU: SAUCE: {topost} net: hns3: Add calling roce callback
      function when link status change"
    - Revert "UBUNTU: SAUCE: {topost} net: hns3: optimize the process of notifying
      roce client"
    - Revert "UBUNTU: S...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (56.3 KiB)

This bug was fixed in the package linux - 4.18.0-14.15

---------------
linux (4.18.0-14.15) cosmic; urgency=medium

  * linux: 4.18.0-14.15 -proposed tracker (LP: #1811406)

  * CPU hard lockup with rigorous writes to NVMe drive (LP: #1810998)
    - blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
    - blk-wbt: move disable check into get_limit()
    - blk-wbt: use wq_has_sleeper() for wq active check
    - blk-wbt: fix has-sleeper queueing check
    - blk-wbt: abstract out end IO completion handler
    - blk-wbt: improve waking of tasks

  * To reduce the Realtek USB cardreader power consumption (LP: #1811337)
    - mmc: core: Introduce MMC_CAP_SYNC_RUNTIME_PM
    - mmc: rtsx_usb_sdmmc: Don't runtime resume the device while changing led
    - mmc: rtsx_usb_sdmmc: Re-work runtime PM support
    - mmc: rtsx_usb_sdmmc: Re-work card detection/removal support
    - memstick: rtsx_usb_ms: Add missing pm_runtime_disable() in probe function
    - misc: rtsx_usb: Use USB remote wakeup signaling for card insertion detection
    - memstick: Prevent memstick host from getting runtime suspended during card
      detection
    - memstick: rtsx_usb_ms: Use ms_dev() helper
    - memstick: rtsx_usb_ms: Support runtime power management

  * Support non-strict iommu mode on arm64 (LP: #1806488)
    - iommu/io-pgtable-arm: Fix race handling in split_blk_unmap()
    - iommu/arm-smmu-v3: Implement flush_iotlb_all hook
    - iommu/dma: Add support for non-strict mode
    - iommu: Add "iommu.strict" command line option
    - iommu/io-pgtable-arm: Add support for non-strict mode
    - iommu/arm-smmu-v3: Add support for non-strict mode
    - iommu/io-pgtable-arm-v7s: Add support for non-strict mode
    - iommu/arm-smmu: Support non-strict mode

  * [Regression] crashkernel fails on HiSilicon D05 (LP: #1806766)
    - efi: honour memory reservations passed via a linux specific config table
    - efi/arm: libstub: add a root memreserve config table
    - efi: add API to reserve memory persistently across kexec reboot
    - irqchip/gic-v3-its: Change initialization ordering for LPIs
    - irqchip/gic-v3-its: Simplify LPI_PENDBASE_SZ usage
    - irqchip/gic-v3-its: Split property table clearing from allocation
    - irqchip/gic-v3-its: Move pending table allocation to init time
    - irqchip/gic-v3-its: Keep track of property table's PA and VA
    - irqchip/gic-v3-its: Allow use of pre-programmed LPI tables
    - irqchip/gic-v3-its: Use pre-programmed redistributor tables with kdump
      kernels
    - irqchip/gic-v3-its: Check that all RDs have the same property table
    - irqchip/gic-v3-its: Register LPI tables with EFI config table
    - irqchip/gic-v3-its: Allow use of LPI tables in reserved memory
    - arm64: memblock: don't permit memblock resizing until linear mapping is up
    - efi/arm: Defer persistent reservations until after paging_init()
    - efi: Permit calling efi_mem_reserve_persistent() from atomic context
    - efi: Prevent GICv3 WARN() by mapping the memreserve table before first use

  * ELAN900C:00 04F3:2844 touchscreen doesn't work (LP: #1811335)
    - pinctrl: cannonlake: Fix community ordering for H variant
    - pinctrl: c...

Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
status: Fix Released → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.