NVMe drives in Amazon AWS instance fail to initialize

Bug #1648449 reported by Dan Streetman on 2016-12-08
This bug affects 1 person
Affects         Importance  Assigned to
linux (Ubuntu)  Undecided   Dan Streetman
Xenial          Undecided   Dan Streetman
Yakkety         Undecided   Dan Streetman
Zesty           Undecided   Dan Streetman

Bug Description

[Impact]

On an Amazon AWS instance that has NVMe drives, the NVMe drives fail to initialize, and so aren't usable by the system. If one of the NVMe drives contains the root filesystem, the instance won't boot.

[Test Case]

Boot an AWS instance with an NVMe drive. With an unpatched kernel, the NVMe drive(s) fail to initialize, and errors appear in the system log (if the system boots at all). With a patched kernel, all NVMe drives are initialized, enumerated, and work properly.
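
One way to check the "initialized and enumerated" part of this test case is to list the NVMe block devices the kernel exposes. The following is a minimal sketch, not part of the original report: the helper name list_nvme_block_devices is hypothetical, and it assumes NVMe namespaces appear under /sys/block with names starting with "nvme" (e.g. nvme0n1).

```python
import os


def list_nvme_block_devices(sysfs_block="/sys/block"):
    """Return the sorted names of block devices whose name starts with "nvme".

    Hypothetical helper: sysfs_block is a parameter only so the function
    can be pointed at a different directory for testing.
    """
    try:
        entries = os.listdir(sysfs_block)
    except FileNotFoundError:
        # No sysfs block directory (e.g. not running on Linux).
        return []
    return sorted(name for name in entries if name.startswith("nvme"))


if __name__ == "__main__":
    devices = list_nvme_block_devices()
    if devices:
        print("NVMe block devices found:", ", ".join(devices))
    else:
        print("No NVMe block devices enumerated")
```

An empty result on an instance type that should have NVMe drives would be consistent with the failure described here; the system log remains the place to look for the actual errors.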

[Regression Potential]

Patching the NVMe driver may cause problems on other systems using NVMe drives.

[Other Info]

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1648449

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Dan Streetman (ddstreet) wrote :

This is a problem only in xenial and later (yakkety, zesty, upstream). During initialization, the driver configures an MSIX interrupt to use with the admin queue. Then, when setting up its I/O queues, it releases the MSIX interrupt for the admin queue and immediately configures many MSIX interrupts for its I/O queues (the admin queue then shares the MSIX interrupt of the first I/O queue). This MSIX release/request is what causes the failure; if the MSIX interrupts are configured only once, the driver initializes all controllers without error.

Before the xenial kernel (e.g. in trusty), the driver works, but only because it uses a polling kthread instead of an MSIX interrupt for the admin queue. It therefore never requests the initial single MSIX interrupt, so this specific problem isn't present there. Later kernels, starting with xenial, do not use the polling kthread (as it was never actually needed) and use MSIX interrupts for both the admin queue and the I/O queues.
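
The two initialization orders described above can be compared with a toy model. This is only an illustrative sketch, not driver code: ToyController, init_release_and_rerequest, and init_setup_once are hypothetical names, and the assumption that the device rejects a second MSIX enable is a simplification made for the example.

```python
class ToyController:
    """Hypothetical model of a controller's MSIX state (not real driver code)."""

    def __init__(self, tolerates_reenable):
        self.tolerates_reenable = tolerates_reenable
        self.enable_count = 0
        self.vectors = 0

    def enable_msix(self, nvec):
        # Assumption for this sketch: a fragile device fails on any
        # MSIX enable after the first one.
        self.enable_count += 1
        if self.enable_count > 1 and not self.tolerates_reenable:
            raise RuntimeError("MSIX re-enable failed")
        self.vectors = nvec

    def disable_msix(self):
        self.vectors = 0


def init_release_and_rerequest(ctrl, io_queues):
    """Order described for xenial and later: enable, release, enable again."""
    ctrl.enable_msix(1)          # single vector for the admin queue
    ctrl.disable_msix()          # released when I/O queue setup starts
    ctrl.enable_msix(io_queues)  # vectors for the I/O queues
    # The admin queue shares the vector of the first I/O queue.
    return {"admin": 0, "io": list(range(io_queues))}


def init_setup_once(ctrl, io_queues):
    """Workaround order: configure all MSIX vectors a single time."""
    ctrl.enable_msix(io_queues)
    return {"admin": 0, "io": list(range(io_queues))}
```

With a controller created as ToyController(tolerates_reenable=False), the first function raises while the second succeeds, which mirrors the observation that configuring the MSIX interrupts only once avoids the failure.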

Changed in linux (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
Tim Gardner (timg-tpi) on 2016-12-08
Changed in linux (Ubuntu Xenial):
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Dan Streetman (ddstreet)
Brad Figg (brad-figg) on 2016-12-08
Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Dan Streetman (ddstreet) wrote :

I verified kernel linux-image-generic 4.4.0.57.60 (uname 4.4.0-57-generic #78-Ubuntu) does fix this.
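
When repeating this verification on several instances, the running kernel's release string can be compared against the verified one. The sketch below is an assumption-laden helper, not part of the original verification: ubuntu_kernel_abi and at_least are hypothetical names, and the code assumes the release string has the "4.4.0-57-generic" shape shown above. It only compares version numbers; since the workaround was later reverted (see the later comments), a newer release string does not by itself show that the patch is present.

```python
import platform
import re


def ubuntu_kernel_abi(release):
    """Parse a release string like "4.4.0-57-generic" into (4, 4, 0, 57)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def at_least(release, minimum):
    """Return True if the parsed release is >= the minimum version tuple."""
    return ubuntu_kernel_abi(release) >= minimum


if __name__ == "__main__":
    running = platform.release()  # same value as `uname -r`
    try:
        print(running, "meets 4.4.0-57:", at_least(running, (4, 4, 0, 57)))
    except ValueError as exc:
        print(exc)
```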

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-57.78

---------------
linux (4.4.0-57.78) xenial; urgency=low

  * Release Tracking Bug
    - LP: #1648867

  * Miscellaneous Ubuntu changes
    - SAUCE: Do not build the xr-usb-serial driver for s390

linux (4.4.0-56.77) xenial; urgency=low

  * Release Tracking Bug
    - LP: #1648867

  * Release Tracking Bug
    - LP: #1648579

  * CONFIG_NR_CPUS=256 is too low (LP: #1579205)
    - [Config] Increase the NR_CPUS to 512 for amd64 to support systems with a
      large number of cores.

  * NVMe drives in Amazon AWS instance fail to initialize (LP: #1648449)
    - SAUCE: (no-up) NVMe: only setup MSIX once

linux (4.4.0-55.76) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1648503

  * NVMe driver accidentally reverted to use GSI instead of MSIX (LP: #1647887)
    - (fix) NVMe: restore code to always use MSI/MSI-x interrupts

linux (4.4.0-54.75) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1648017

  * Update hio driver to 2.1.0.28 (LP: #1646643)
    - SAUCE: hio: update to Huawei ES3000_V2 (2.1.0.28)

  * linux: Enable live patching for all supported architectures (LP: #1633577)
    - [Config] CONFIG_LIVEPATCH=y for s390x

  * Botched backport breaks level triggered EOIs in QEMU guests with --machine
    kernel_irqchip=split (LP: #1644394)
    - kvm/irqchip: kvm_arch_irq_routing_update renaming split

  * Xenial update to v4.4.35 stable release (LP: #1645453)
    - x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems
    - KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
    - KVM: Disable irq while unregistering user notifier
    - fuse: fix fuse_write_end() if zero bytes were copied
    - mfd: intel-lpss: Do not put device in reset state on suspend
    - can: bcm: fix warning in bcm_connect/proc_register
    - i2c: mux: fix up dependencies
    - kbuild: add -fno-PIE
    - scripts/has-stack-protector: add -fno-PIE
    - x86/kexec: add -fno-PIE
    - kbuild: Steal gcc's pie from the very beginning
    - ext4: sanity check the block and cluster size at mount time
    - crypto: caam - do not register AES-XTS mode on LP units
    - drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5)
    - clk: mmp: pxa910: fix return value check in pxa910_clk_init()
    - clk: mmp: pxa168: fix return value check in pxa168_clk_init()
    - clk: mmp: mmp2: fix return value check in mmp2_clk_init()
    - rtc: omap: Fix selecting external osc
    - iwlwifi: pcie: fix SPLC structure parsing
    - mfd: core: Fix device reference leak in mfd_clone_cell
    - uwb: fix device reference leaks
    - PM / sleep: fix device reference leak in test_suspend
    - PM / sleep: don't suspend parent when async child suspend_{noirq, late}
      fails
    - IB/mlx4: Check gid_index return value
    - IB/mlx4: Fix create CQ error flow
    - IB/mlx5: Use cache line size to select CQE stride
    - IB/mlx5: Fix fatal error dispatching
    - IB/core: Avoid unsigned int overflow in sg_alloc_table
    - IB/uverbs: Fix leak of XRC target QPs
    - IB/cm: Mark stale CM id's whenever the mad agent was unregistered
    - netfilter: nft_dynset: fix element timeou...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Luis Henriques (henrix) on 2017-01-09
Changed in linux (Ubuntu Yakkety):
status: New → Fix Committed

The verification of the Stable Release Update for linux-aws has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Dan Streetman (ddstreet) on 2017-01-13
Changed in linux (Ubuntu Zesty):
status: Incomplete → Won't Fix
Dan Streetman (ddstreet) wrote :

It appears these workaround patches introduced NVMe initialization problems on some non-Xen systems; see bug 1626894. I sent patches to revert them for xenial and yakkety. Instead, this problem should be fixed by a patch to the Xen kernel code, tracked in bug 1656831.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in linux (Ubuntu Zesty):
status: Won't Fix → Invalid
John Donnelly (jpdonnelly) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Dan Streetman (ddstreet) wrote :

Verified with the 4.8.0-36-generic kernel on an AWS i3 instance; all the NVMe drives initialized successfully. I checked the git log and verified that the initial commit had been fully reverted (see comment 7).

tags: added: verification-done-yakkety
removed: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-37.39

---------------
linux (4.8.0-37.39) yakkety; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1659381

  * Mouse cursor invisible or does not move (LP: #1646574)
    - drm/nouveau/disp/nv50-: split chid into chid.ctrl and chid.user
    - drm/nouveau/disp/nv50-: specify ctrl/user separately when constructing
      classes
    - drm/nouveau/disp/gp102: fix cursor/overlay immediate channel indices

 -- Benjamin M Romer <email address hidden> Wed, 25 Jan 2017 16:12:02 -0200

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released