Intel NVMe drives timeout when nvme format is attempted

Bug #1797587 reported by Sujith Pandel
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Medium
Joseph Salisbury
Bionic
Fix Released
Medium
Joseph Salisbury

Bug Description

== SRU Justification ==
Dell reports that every time they run a nvme format on an Intel
NVMe drive (whether specifying the secure/crypto erase options or not), the
format apparently works (drive is wiped clean). The drive's admin queue
(queue 0) times out, making the nvme device driver flag an error and reset the controller.

This bug is fixed by commit 62843c2e4226, which is in mainline as of
v4.17-rc1.

== Fix ==
62843c2e4226 ("nvme: Use admin command effects for admin commands")

== Regression Potential ==
Low. Limited to nvme driver.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

Every time you run an nvme format on a Intel NVMe drive (whether specifying the secure/crypto erase options or not), the format apparently works (drive is wiped clean) but the drive's admin queue (queue 0) times out, making the nvme device driver flag an error and reset the controller.

Here's an example:

nvme-cli-1.5 # ./nvme format -l 0 /dev/nvme1
NVME Admin command error:ABORT_REQ(7)
nvme-cli-1.5 # dmesg|tail -1
[79725.404910] nvme nvme1: I/O 151 QID 0 timeout, reset controller

Patch required:
https://github.com/torvalds/linux/commit/62843c2e4226057c83f520c74fe9c81a1891c331

Request you to include the patch for LTS kernel 4.15 of Ubuntu 18.04

information type: Public → Private
Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → Medium
tags: added: bionic
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Changed in linux (Ubuntu):
status: Triaged → In Progress
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit 62843c2e42. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1797587

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!

Revision history for this message
Sujith Pandel (sujithpandel) wrote :

@jsalisbury -
Update -
Since there was no difference in the output with and without the patch, we are approaching Intel for more details on the fix.
Will keep you guys updated once we get a response.

Revision history for this message
Sujith Pandel (sujithpandel) wrote :

@jsalisbury -
Inclusion of the above patch is enough from the kernel side. Rest aspects will be taken care by the userspace tools.
Please include the requested patch.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can I now changed this bug to Public? That will allow me to submit the SRU request. If we need, we can removed any sensitive information.

Revision history for this message
Sujith Pandel (sujithpandel) wrote :

@Joseph - Yes, You can open this to public.

information type: Private → Public
description: updated
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Revision history for this message
Jerry Clement (jerry-clement) wrote :

Brad, Sujith is on vacation until 1/22/2019. We will try to find someone to verify. But if we are not able to locate, What is the plan to mitigate waiting for Sujith to return?
--jerry

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (47.0 KiB)

This bug was fixed in the package linux - 4.15.0-44.47

---------------
linux (4.15.0-44.47) bionic; urgency=medium

  * linux: 4.15.0-44.47 -proposed tracker (LP: #1811419)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * CPU hard lockup with rigorous writes to NVMe drive (LP: #1810998)
    - blk-wbt: pass in enum wbt_flags to get_rq_wait()
    - blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
    - blk-wbt: move disable check into get_limit()
    - blk-wbt: use wq_has_sleeper() for wq active check
    - blk-wbt: fix has-sleeper queueing check
    - blk-wbt: abstract out end IO completion handler
    - blk-wbt: improve waking of tasks

  * To reduce the Realtek USB cardreader power consumption (LP: #1811337)
    - mmc: sdhci: Disable 1.8v modes (HS200/HS400/UHS) if controller can't support
      1.8v
    - mmc: core: Introduce MMC_CAP_SYNC_RUNTIME_PM
    - mmc: rtsx_usb_sdmmc: Don't runtime resume the device while changing led
    - mmc: rtsx_usb: Use MMC_CAP2_NO_SDIO
    - mmc: rtsx_usb: Enable MMC_CAP_ERASE to allow erase/discard/trim requests
    - mmc: rtsx_usb_sdmmc: Re-work runtime PM support
    - mmc: rtsx_usb_sdmmc: Re-work card detection/removal support
    - memstick: rtsx_usb_ms: Add missing pm_runtime_disable() in probe function
    - misc: rtsx_usb: Use USB remote wakeup signaling for card insertion detection
    - memstick: Prevent memstick host from getting runtime suspended during card
      detection
    - memstick: rtsx_usb_ms: Use ms_dev() helper
    - memstick: rtsx_usb_ms: Support runtime power management

  * Support non-strict iommu mode on arm64 (LP: #1806488)
    - iommu/io-pgtable-arm: Fix race handling in split_blk_unmap()
    - iommu/arm-smmu-v3: Implement flush_iotlb_all hook
    - iommu/dma: Add support for non-strict mode
    - iommu: Add "iommu.strict" command line option
    - iommu/io-pgtable-arm: Add support for non-strict mode
    - iommu/arm-smmu-v3: Add support for non-strict mode
    - iommu/io-pgtable-arm-v7s: Add support for non-strict mode
    - iommu/arm-smmu: Support non-strict mode

  * ELAN900C:00 04F3:2844 touchscreen doesn't work (LP: #1811335)
    - pinctrl: cannonlake: Fix community ordering for H variant
    - pinctrl: cannonlake: Fix HOSTSW_OWN register offset of H variant

  * Add Cavium ThunderX2 SoC UNCORE PMU driver (LP: #1811200)
    - perf: Export perf_event_update_userpage
    - Documentation: perf: Add documentation for ThunderX2 PMU uncore driver
    - drivers/perf: Add Cavium ThunderX2 SoC UNCORE PMU driver
    - [Config] New config CONFIG_THUNDERX2_PMU=m

  * Update hisilicon SoC-specific drivers (LP: #1810457)
    - SAUCE: Revert "net: hns3: Updates RX packet info fetch in case of multi BD"
    - Revert "UBUNTU: SAUCE: {topost} net: hns3: separate roce from nic when
      resetting"
    - Revert "UBUNTU: SAUCE: {topost} net: hns3: Use roce handle when calling roce
      callback function"
    - Revert "UBUNTU: SAUCE: {topost} net: hns3: Add calling roce callback
      function when link status change"
    - Revert "UBUNTU: SAUCE: {topost} net: hns3: optimize the process of notifying
      roce client"
    - Revert "UBUNTU: S...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in dell-poweredge:
status: New → Fix Committed
Po-Hsu Lin (cypressyew)
Changed in dell-poweredge:
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.