Ubuntu 18.04 - Kernel crash on nvme subsystem-reset /dev/nvme0 (Bolt / NVMe)

Bug #1753371 reported by bugproxy on 2018-03-05
This bug affects 1 person
Affects                            Importance   Assigned to
The Ubuntu-power-systems project   Critical     Canonical Kernel Team
linux (Ubuntu)                     Critical     Joseph Salisbury
  Bionic                           Critical     Joseph Salisbury

Bug Description

== Comment: #1 - NAVEED A. UPPINANGADY SALIH <email address hidden> - 2018-02-25 23:45:13 ==

== Comment: #6 - Wen Xiong <email address hidden> - 2018-02-27 10:41:23 ==
hi Naveed,

nvme subsystem-reset calls EEH recovery path. Is EEH recovery working with Bolt on this machine?

Thanks,
Wendy

== Comment: #11 - Wen Xiong <email address hidden> - 2018-03-02 16:03:46 ==
The following patch should fix the issue.

http://lists.infradead.org/pipermail/linux-nvme/2018-February/015745.html

It should be accepted by the community soon. Keith has agreed to queue it up for 4.16.

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-165124 severity-critical targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → kernel-package (Ubuntu)

------- Comment From <email address hidden> 2018-03-05 02:04 EDT-------
(In reply to comment #14)
> == Comment: #1 - NAVEED A. UPPINANGADY SALIH <email address hidden> -
> 2018-02-25 23:45:13 ==
>
>
> == Comment: #6 - Wen Xiong <email address hidden> - 2018-02-27 10:41:23 ==
> hi Naveed,
>
> nvme subsystem-reset calls EEH recovery path. Is EEH recovery working with
> Bolt on this machine?
>
> Thanks,
> Wendy
As discussed earlier, EEH worked on this adapter.

>
> == Comment: #11 - Wen Xiong <email address hidden> - 2018-03-02 16:03:46 ==
> The following patch should fix the issue.
>
> http://lists.infradead.org/pipermail/linux-nvme/2018-February/015745.html
>
> It should be accepted by the community soon. Keith has agreed to queue it up
> for 4.16.
>
> Default Comment by Bridge


Frank Heimes (fheimes) on 2018-03-05
no longer affects: linux (Ubuntu)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with the patch mentioned in the description. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753371

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Frank Heimes (fheimes) on 2018-03-05
Changed in ubuntu-power-systems:
status: New → Triaged

------- Comment From <email address hidden> 2018-03-05 10:10 EDT-------
I tested the kernel that I built from the following git tree. It is a 4.15.3 kernel.

git://kernel.ubuntu.com/ubuntu/ubuntu-bionic.git

I am not sure whether it works with the 4.15.0-10 kernel.

Why did I get a different kernel level from the git tree? Is 4.15.3 the Ubuntu 18.04 kernel?

Thanks,
Wendy

Joseph Salisbury (jsalisbury) wrote :

4.15.3 is an upstream stable kernel. 4.15.0-10 from the git repo you mention is an Ubuntu kernel.
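
The distinction above can be sketched in a few lines of Python. This is only a heuristic, assuming that Ubuntu kernel releases look like `4.15.0-10` or `4.15.0-10-generic` (base version, ABI number, optional flavour) and that plain upstream releases look like `4.15.3`. The function name and the patterns are illustrative and not part of any Ubuntu tool.

```python
import re

# Assumed shape of an Ubuntu kernel release: <base>-<ABI>[-<flavour>],
# for example "4.15.0-10" or "4.15.0-10-generic".
UBUNTU_RELEASE = re.compile(r"^\d+\.\d+\.\d+-\d+(?:-[a-z0-9-]+)?$")

# Assumed shape of an upstream release: "4.15.3", "4.16" or "4.16-rc1".
UPSTREAM_RELEASE = re.compile(r"^\d+\.\d+(?:\.\d+)?(?:-rc\d+)?$")


def classify_kernel_release(release):
    """Return "ubuntu", "upstream" or "unknown" for a release string."""
    if UBUNTU_RELEASE.match(release):
        return "ubuntu"
    if UPSTREAM_RELEASE.match(release):
        return "upstream"
    # Locally built kernels with custom suffixes end up here.
    return "unknown"
```

Calling `classify_kernel_release("4.15.3")` yields `"upstream"`, while `"4.15.0-10"` is classified as `"ubuntu"`.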

Were you able to test the kernel posted in comment #4?

Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Frank Heimes (fheimes) on 2018-03-12
Changed in ubuntu-power-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-14 09:47 EDT-------
The patch has been accepted upstream in Linus' tree as git commit
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=651438bb0af5213f1f70d66e75bf11d08cb5537a
("nvme-pci: Fix EEH failure on ppc")

Please build the kernel and try again.

Assign to backport team.

Joseph Salisbury (jsalisbury) wrote :

A new test kernel with commit 651438bb0af is also available here:
http://kernel.ubuntu.com/~jsalisbury/lp1753371

Seth Forshee (sforshee) on 2018-03-15
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Frank Heimes (fheimes) on 2018-03-15
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-13.14

---------------
linux (4.15.0-13.14) bionic; urgency=medium

  * linux: 4.15.0-13.14 -proposed tracker (LP: #1756408)

  * devpts: handle bind-mounts (LP: #1755857)
    - SAUCE: devpts: hoist out check for DEVPTS_SUPER_MAGIC
    - SAUCE: devpts: resolve devpts bind-mounts
    - SAUCE: devpts: comment devpts_mntget()
    - SAUCE: selftests: add devpts selftests

  * [bionic][arm64] d-i: add hisi_sas_v3_hw to scsi-modules (LP: #1756103)
    - d-i: add hisi_sas_v3_hw to scsi-modules

  * [Bionic][ARM64] enable ROCE and HNS3 driver support for hip08 SoC
    (LP: #1756097)
    - RDMA/hns: Refactor eq code for hip06
    - RDMA/hns: Add eq support of hip08
    - RDMA/hns: Add detailed comments for mb() call
    - RDMA/hns: Add rq inline data support for hip08 RoCE
    - RDMA/hns: Update the usage of sr_max and rr_max field
    - RDMA/hns: Set access flags of hip08 RoCE
    - RDMA/hns: Filter for zero length of sge in hip08 kernel mode
    - RDMA/hns: Fix QP state judgement before sending work requests
    - RDMA/hns: Assign dest_qp when deregistering mr
    - RDMA/hns: Fix endian problems around imm_data and rkey
    - RDMA/hns: Assign the correct value for tx_cqn
    - RDMA/hns: Create gsi qp in hip08
    - RDMA/hns: Add gsi qp support for modifying qp in hip08
    - RDMA/hns: Fill sq wqe context of ud type in hip08
    - RDMA/hns: Assign zero for pkey_index of wc in hip08
    - RDMA/hns: Update the verbs of polling for completion
    - RDMA/hns: Set the guid for hip08 RoCE device
    - net: hns3: Refactor of the reset interrupt handling logic
    - net: hns3: Add reset service task for handling reset requests
    - net: hns3: Refactors the requested reset & pending reset handling code
    - net: hns3: Add HNS3 VF IMP(Integrated Management Proc) cmd interface
    - net: hns3: Add mailbox support to VF driver
    - net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support
    - net: hns3: Add HNS3 VF driver to kernel build framework
    - net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC
    - net: hns3: Add mailbox support to PF driver
    - net: hns3: Change PF to add ring-vect binding & resetQ to mailbox
    - net: hns3: Add mailbox interrupt handling to PF driver
    - net: hns3: add support to query tqps number
    - net: hns3: add support to modify tqps number
    - net: hns3: change the returned tqp number by ethtool -x
    - net: hns3: free the ring_data structrue when change tqps
    - net: hns3: get rss_size_max from configuration but not hardcode
    - net: hns3: add a mask initialization for mac_vlan table
    - net: hns3: add vlan offload config command
    - net: hns3: add ethtool related offload command
    - net: hns3: add handling vlan tag offload in bd
    - net: hns3: cleanup mac auto-negotiation state query
    - net: hns3: fix for getting auto-negotiation state in hclge_get_autoneg
    - net: hns3: add support for set_pauseparam
    - net: hns3: add support to update flow control settings after autoneg
    - net: hns3: add Asym Pause support to phy default features
    - net: hns3: add support for querying advertised pause frame by ethtool ethx
    - net:...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
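
A quick way to tell whether a Bionic system already runs a kernel at or beyond the fixed package is to compare the ABI number in the release string with 13, taken from "linux - 4.15.0-13.14" above. The sketch below assumes that, within the 4.15.0 series, a higher ABI number always includes the fixes of a lower one; the helper names are hypothetical.

```python
import re

FIXED_BASE = (4, 15, 0)  # base version of the Bionic kernel in this report
FIXED_ABI = 13           # ABI number of the fixed package 4.15.0-13.14


def parse_release(release):
    """Split "4.15.0-14-generic" into ((4, 15, 0), 14); None if no match."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        return None
    major, minor, patch, abi = (int(group) for group in match.groups())
    return (major, minor, patch), abi


def has_nvme_eeh_fix(release):
    """True/False for 4.15.0-based Ubuntu kernels, None when undecidable."""
    parsed = parse_release(release)
    if parsed is None:
        return None  # not an Ubuntu-style release string
    base, abi = parsed
    if base != FIXED_BASE:
        return None  # other series: this simple comparison does not apply
    return abi >= FIXED_ABI
```

On a live system the input would typically come from `platform.release()`, which returns the same string as `uname -r`.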

Default Comment by Bridge

Andrew Cloke (andrew-cloke) wrote :

The SOS report in comment #11 has been posted against a bug that has already been closed. Is the observed issue still occurring with the new kernel? Does the bug need to be re-opened?
Thanks.

Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released

------- Comment From <email address hidden> 2018-04-04 03:22 EDT-------
Looks like the kernel 4.15.0-14-generic did not crash when "nvme subsystem-reset /dev/nvme0" was run.
We see EEH which is expected during subsystem reset?

After the device recovers from EEH, nvme list works fine:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S3RVNA0J600206       PCIe3 1.6TB NVMe Flash Adapter II x8     1         737.28 GB / 737.28 GB      4 KiB + 0 B      MN12MN12
/dev/nvme0n2     S3RVNA0J600206       PCIe3 1.6TB NVMe Flash Adapter II x8     2         737.28 GB / 737.28 GB      4 KiB + 0 B      MN12MN12
======================
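
The check described above (namespaces still listed after the reset) could be automated roughly as follows. This is a minimal sketch that assumes the text output of `nvme list` keeps the device node in the first column, as in the listing in this comment; the function names are made up for illustration.

```python
import subprocess

# Sample taken from the listing in this report (header and two namespaces).
SAMPLE = """\
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- --------- --------
/dev/nvme0n1 S3RVNA0J600206 PCIe3 1.6TB NVMe Flash Adapter II x8 1 737.28 GB / 737.28 GB 4 KiB + 0 B MN12MN12
/dev/nvme0n2 S3RVNA0J600206 PCIe3 1.6TB NVMe Flash Adapter II x8 2 737.28 GB / 737.28 GB 4 KiB + 0 B MN12MN12
"""


def list_namespace_nodes(nvme_list_output):
    """Return the /dev/nvme* nodes found in the first column of the output."""
    nodes = []
    for line in nvme_list_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/dev/nvme"):
            nodes.append(fields[0])
    return nodes


def missing_after_reset(before, after):
    """Return nodes that were listed before the reset but not afterwards."""
    return sorted(set(before) - set(after))


def read_nvme_list():
    """Run `nvme list` (needs nvme-cli and usually root) and return its text."""
    result = subprocess.run(
        ["nvme", "list"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

One would capture `list_namespace_nodes(read_nvme_list())` before running `nvme subsystem-reset /dev/nvme0`, capture it again once EEH recovery has finished, and expect `missing_after_reset` to return an empty list.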

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-05 16:57 EDT-------
Based on comment #34, subsystem-reset works fine with the latest kernel. Close the bug.
