NVMe driver accidentally reverted to use GSI instead of MSIX

Bug #1647887 reported by Dan Streetman on 2016-12-07
28
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Unassigned
Xenial
Undecided
Unassigned

Bug Description

[Impact]

These commit ids are in the xenial kernel branch.

Commit 90c9712fbb388077b5e53069cae43f1acbb0102a ("NVMe: Always use MSI/MSI-x
interrupts") changed the NVMe driver to always use MSI/MSI-x interrupts.
However, later commit 30d6592fce71beabe18460252c3823747c4742f6 ("NVMe: Don't
unmap controller registers on reset") as well as commit
e9820e415895bdd9cfd21f87e80e3e0a10f131f0 ("UBUNTU: (fix) NVMe: Don't unmap
controller registers on reset") accidentally reverted part of the original
commit, which reverted the NVMe driver to using GSI interrupts instead of
always using MSI/MSI-x interrupts.

The original commit was added because GSI interrupts do not always work for NVMe on all systems, while MSI/MSIX interrupts do work. The accidental reversion of the code to use MSI/MSIX causes the NVMe driver to not work on some systems with NVMe drives.

[Test Case]

On a system with NVMe drive(s) that do not work (because the NVMe driver is using the non-working GSI interrupt, instead of MSI/MSIX interrupt), test the current xenial kernel, and some or all of the NVMe drives will fail to initialize. Then test with the fixed kernel, and all the NVMe drives will initialize.

[Regression Potential]

If MSI/MSIX interrupts do not work with NVMe drives on some systems, this change would break those systems. Those systems would also be broken by the upstream kernel, however; additionally, without MSI/MSIX support, such NVMe controllers would be significantly slower due to reliance on a single GSI for notification of completion of all I/O. So it's very unlikely there are any systems with NVMe controllers that do not support MSI/MSI-x.

[Other Info]

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1647887

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Triaged
Dan Streetman (ddstreet) wrote :

Note: this problem isn't in yakkety or zesty, no backporting of these particular NVMe patches was needed for those kernels, as the commits were already included in the kernel they are based on. This is an issue only in the xenial kernel. (of course the precise/trusty kernels are far too old for this problem to be present)

Dan Streetman (ddstreet) wrote :

Note: bug 1639920 is also affected by this.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Xenial):
status: New → Confirmed
Luis Henriques (henrix) on 2016-12-08
Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Dan Streetman (ddstreet) wrote :

I verified kernel linux-image-generic 4.4.0.57.60 (uname 4.4.0-57-generic #78-Ubuntu) does fix this.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :
Download full text (17.0 KiB)

This bug was fixed in the package linux - 4.4.0-57.78

---------------
linux (4.4.0-57.78) xenial; urgency=low

  * Release Tracking Bug
    - LP: #1648867

  * Miscellaneous Ubuntu changes
    - SAUCE: Do not build the xr-usb-serial driver for s390

linux (4.4.0-56.77) xenial; urgency=low

  * Release Tracking Bug
    - LP: #1648867

  * Release Tracking Bug
    - LP: #1648579

  * CONFIG_NR_CPUS=256 is too low (LP: #1579205)
    - [Config] Increase the NR_CPUS to 512 for amd64 to support systems with a
      large number of cores.

  * NVMe drives in Amazon AWS instance fail to initialize (LP: #1648449)
    - SAUCE: (no-up) NVMe: only setup MSIX once

linux (4.4.0-55.76) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1648503

  * NVMe driver accidentally reverted to use GSI instead of MSIX (LP: #1647887)
    - (fix) NVMe: restore code to always use MSI/MSI-x interrupts

linux (4.4.0-54.75) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1648017

  * Update hio driver to 2.1.0.28 (LP: #1646643)
    - SAUCE: hio: update to Huawei ES3000_V2 (2.1.0.28)

  * linux: Enable live patching for all supported architectures (LP: #1633577)
    - [Config] CONFIG_LIVEPATCH=y for s390x

  * Botched backport breaks level triggered EOIs in QEMU guests with --machine
    kernel_irqchip=split (LP: #1644394)
    - kvm/irqchip: kvm_arch_irq_routing_update renaming split

  * Xenial update to v4.4.35 stable release (LP: #1645453)
    - x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems
    - KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
    - KVM: Disable irq while unregistering user notifier
    - fuse: fix fuse_write_end() if zero bytes were copied
    - mfd: intel-lpss: Do not put device in reset state on suspend
    - can: bcm: fix warning in bcm_connect/proc_register
    - i2c: mux: fix up dependencies
    - kbuild: add -fno-PIE
    - scripts/has-stack-protector: add -fno-PIE
    - x86/kexec: add -fno-PIE
    - kbuild: Steal gcc's pie from the very beginning
    - ext4: sanity check the block and cluster size at mount time
    - crypto: caam - do not register AES-XTS mode on LP units
    - drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5)
    - clk: mmp: pxa910: fix return value check in pxa910_clk_init()
    - clk: mmp: pxa168: fix return value check in pxa168_clk_init()
    - clk: mmp: mmp2: fix return value check in mmp2_clk_init()
    - rtc: omap: Fix selecting external osc
    - iwlwifi: pcie: fix SPLC structure parsing
    - mfd: core: Fix device reference leak in mfd_clone_cell
    - uwb: fix device reference leaks
    - PM / sleep: fix device reference leak in test_suspend
    - PM / sleep: don't suspend parent when async child suspend_{noirq, late}
      fails
    - IB/mlx4: Check gid_index return value
    - IB/mlx4: Fix create CQ error flow
    - IB/mlx5: Use cache line size to select CQE stride
    - IB/mlx5: Fix fatal error dispatching
    - IB/core: Avoid unsigned int overflow in sg_alloc_table
    - IB/uverbs: Fix leak of XRC target QPs
    - IB/cm: Mark stale CM id's whenever the mad agent was unregistered
    - netfilter: nft_dynset: fix element timeou...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for linux-aws has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers