Ubuntu 18.04 - Kernel crash on nvme subsystem-reset /dev/nvme0 (Bolt / NVMe)

Bug #1753371 reported by bugproxy
This bug affects 1 person
Affects                           Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project  Fix Released  Critical    Canonical Kernel Team
linux (Ubuntu)                    Fix Released  Critical    Joseph Salisbury
  Bionic                          Fix Released  Critical    Joseph Salisbury

Bug Description

== Comment: #1 - NAVEED A. UPPINANGADY SALIH <email address hidden> - 2018-02-25 23:45:13 ==

== Comment: #6 - Wen Xiong <email address hidden> - 2018-02-27 10:41:23 ==
hi Naveed,

nvme subsystem-reset calls the EEH recovery path. Is EEH recovery working with Bolt on this machine?

Thanks,
Wendy

== Comment: #11 - Wen Xiong <email address hidden> - 2018-03-02 16:03:46 ==
The following patch should fix the issue.

http://lists.infradead.org/pipermail/linux-nvme/2018-February/015745.html

It should be accepted by the upstream community soon. Keith has agreed to queue it up for 4.16.

bugproxy (bugproxy) wrote : sos report

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-165124 severity-critical targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → kernel-package (Ubuntu)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-03-05 02:04 EDT-------
(In reply to comment #14)
> == Comment: #1 - NAVEED A. UPPINANGADY SALIH <email address hidden> -
> 2018-02-25 23:45:13 ==
>
>
> == Comment: #6 - Wen Xiong <email address hidden> - 2018-02-27 10:41:23 ==
> hi Naveed,
>
> nvme subsystem-reset calls the EEH recovery path. Is EEH recovery working with
> Bolt on this machine?
>
> Thanks,
> Wendy
As discussed earlier, EEH worked on this adapter.

>
> == Comment: #11 - Wen Xiong <email address hidden> - 2018-03-02 16:03:46 ==
> The following patch should fix the issue.
>
> http://lists.infradead.org/pipermail/linux-nvme/2018-February/015745.html
>
> It should be accepted by the upstream community soon. Keith has agreed to
> queue it up for 4.16.
>
> Default Comment by Bridge

bugproxy (bugproxy) wrote : sos report

Default Comment by Bridge

Frank Heimes (fheimes)
no longer affects: linux (Ubuntu)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with the patch mentioned in the description. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1753371

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.
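
The requirement above can be checked before installing. The following is a minimal Python sketch, assuming the downloaded files follow the usual "linux-image-..." and "linux-image-extra-..." naming; the function name and the example file names are hypothetical and are not taken from the download directory.

```python
def missing_test_packages(filenames):
    """Return the required package kinds that have no matching .deb file.

    Assumption: the test kernel ships as files whose names start with
    "linux-image-" and "linux-image-extra-". Adjust the prefixes if the
    download directory uses different names.
    """
    debs = [name for name in filenames if name.endswith(".deb")]

    has_extra = any(name.startswith("linux-image-extra-") for name in debs)
    # "linux-image-extra-" also starts with "linux-image-", so exclude it here.
    has_image = any(
        name.startswith("linux-image-")
        and not name.startswith("linux-image-extra-")
        for name in debs
    )

    missing = []
    if not has_image:
        missing.append("linux-image")
    if not has_extra:
        missing.append("linux-image-extra")
    return missing
```

An empty list means both packages are present and can be installed together, for example with dpkg -i.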

Thanks in advance!

Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-03-05 10:10 EDT-------
I tested the kernel that I got from the following git tree. It is a 4.15.3 kernel.

git://kernel.ubuntu.com/ubuntu/ubuntu-bionic.git

I am not sure whether it works with the 4.15.0-10 kernel.

Why did I get a different kernel level from the git tree? Is 4.15.3 the Ubuntu 18.04 kernel?

Thanks,
Wendy

Joseph Salisbury (jsalisbury) wrote :

4.15.3 is an upstream stable kernel. 4.15.0-10 from the git repo you mention is an Ubuntu kernel.
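
The distinction can be told from the version string alone. Below is a rough Python sketch of that heuristic; the regular expressions and the function name are assumptions made for illustration, based only on the version strings that appear in this bug (4.15.3, 4.15.0-10, 4.15.0-13.14, 4.15.0-14-generic).

```python
import re

# Assumed patterns, derived from the strings seen in this bug report:
#   Ubuntu kernel:   <major>.<minor>.<patch>-<abi>[.<upload>][-<flavour>]
#                    e.g. 4.15.0-10, 4.15.0-13.14, 4.15.0-14-generic
#   upstream stable: <major>.<minor>.<patch>
#                    e.g. 4.15.3
UBUNTU_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)(?:\.(\d+))?(?:-([a-z0-9]+))?$")
UPSTREAM_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def classify_kernel_version(version):
    """Return "ubuntu", "upstream", or "unknown" for a kernel version string."""
    if UBUNTU_RE.match(version):
        return "ubuntu"
    if UPSTREAM_RE.match(version):
        return "upstream"
    return "unknown"
```

This only looks at the shape of the string. It does not prove where a kernel was built from.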

Were you able to test the kernel posted in comment #4?

Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-14 09:47 EDT-------
The patch has been accepted upstream in Linus' tree as git commit
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=651438bb0af5213f1f70d66e75bf11d08cb5537a
("nvme-pci: Fix EEH failure on ppc")

Please build the kernel and try again.

Assign to backport team.

Joseph Salisbury (jsalisbury) wrote :

A new test kernel with commit 651438bb0af is also available here:
http://kernel.ubuntu.com/~jsalisbury/lp1753371

Seth Forshee (sforshee)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-13.14

---------------
linux (4.15.0-13.14) bionic; urgency=medium

  * linux: 4.15.0-13.14 -proposed tracker (LP: #1756408)

  * devpts: handle bind-mounts (LP: #1755857)
    - SAUCE: devpts: hoist out check for DEVPTS_SUPER_MAGIC
    - SAUCE: devpts: resolve devpts bind-mounts
    - SAUCE: devpts: comment devpts_mntget()
    - SAUCE: selftests: add devpts selftests

  * [bionic][arm64] d-i: add hisi_sas_v3_hw to scsi-modules (LP: #1756103)
    - d-i: add hisi_sas_v3_hw to scsi-modules

  * [Bionic][ARM64] enable ROCE and HNS3 driver support for hip08 SoC
    (LP: #1756097)
    - RDMA/hns: Refactor eq code for hip06
    - RDMA/hns: Add eq support of hip08
    - RDMA/hns: Add detailed comments for mb() call
    - RDMA/hns: Add rq inline data support for hip08 RoCE
    - RDMA/hns: Update the usage of sr_max and rr_max field
    - RDMA/hns: Set access flags of hip08 RoCE
    - RDMA/hns: Filter for zero length of sge in hip08 kernel mode
    - RDMA/hns: Fix QP state judgement before sending work requests
    - RDMA/hns: Assign dest_qp when deregistering mr
    - RDMA/hns: Fix endian problems around imm_data and rkey
    - RDMA/hns: Assign the correct value for tx_cqn
    - RDMA/hns: Create gsi qp in hip08
    - RDMA/hns: Add gsi qp support for modifying qp in hip08
    - RDMA/hns: Fill sq wqe context of ud type in hip08
    - RDMA/hns: Assign zero for pkey_index of wc in hip08
    - RDMA/hns: Update the verbs of polling for completion
    - RDMA/hns: Set the guid for hip08 RoCE device
    - net: hns3: Refactor of the reset interrupt handling logic
    - net: hns3: Add reset service task for handling reset requests
    - net: hns3: Refactors the requested reset & pending reset handling code
    - net: hns3: Add HNS3 VF IMP(Integrated Management Proc) cmd interface
    - net: hns3: Add mailbox support to VF driver
    - net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support
    - net: hns3: Add HNS3 VF driver to kernel build framework
    - net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC
    - net: hns3: Add mailbox support to PF driver
    - net: hns3: Change PF to add ring-vect binding & resetQ to mailbox
    - net: hns3: Add mailbox interrupt handling to PF driver
    - net: hns3: add support to query tqps number
    - net: hns3: add support to modify tqps number
    - net: hns3: change the returned tqp number by ethtool -x
    - net: hns3: free the ring_data structrue when change tqps
    - net: hns3: get rss_size_max from configuration but not hardcode
    - net: hns3: add a mask initialization for mac_vlan table
    - net: hns3: add vlan offload config command
    - net: hns3: add ethtool related offload command
    - net: hns3: add handling vlan tag offload in bd
    - net: hns3: cleanup mac auto-negotiation state query
    - net: hns3: fix for getting auto-negotiation state in hclge_get_autoneg
    - net: hns3: add support for set_pauseparam
    - net: hns3: add support to update flow control settings after autoneg
    - net: hns3: add Asym Pause support to phy default features
    - net: hns3: add support for querying advertised pause frame by ethtool ethx
    - net:...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote : sos report

Default Comment by Bridge

Andrew Cloke (andrew-cloke) wrote :

The SOS report in comment #11 has been posted against a bug that has already been closed. Is the observed issue still occurring with the new kernel? Does the bug need to be re-opened?
Thanks.

Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-04-04 03:22 EDT-------
It looks like kernel 4.15.0-14-generic did not crash when "nvme subsystem-reset /dev/nvme0" was run.
We do see EEH; is that expected during a subsystem reset?

After the device recovers from EEH, "nvme list" works fine:

# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S3RVNA0J600206       PCIe3 1.6TB NVMe Flash Adapter II x8     1         737.28 GB / 737.28 GB      4 KiB + 0 B      MN12MN12
/dev/nvme0n2     S3RVNA0J600206       PCIe3 1.6TB NVMe Flash Adapter II x8     2         737.28 GB / 737.28 GB      4 KiB + 0 B      MN12MN12
======================
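
Before retesting, it may help to confirm that the running kernel is at or beyond the package version in which this bug was marked fixed (linux 4.15.0-13.14). The Python sketch below assumes a Bionic-style release string such as 4.15.0-14-generic; the helper names are hypothetical, and the comparison only looks at the base version and ABI number.

```python
import re

# From the Launchpad Janitor comment: fixed in the package linux 4.15.0-13.14.
# Only the base version and the ABI number (13) are compared here.
FIXED_IN = (4, 15, 0, 13)


def parse_release(release):
    """Parse "4.15.0-14-generic" into (4, 15, 0, 14).

    Assumption: the string starts with <major>.<minor>.<patch>-<abi>,
    as Ubuntu kernel release strings in this report do.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def has_fix(release):
    """Return True if the release is at or after the fixed package version."""
    return parse_release(release) >= FIXED_IN
```

On a live system the release string could be taken from platform.release() in the Python standard library, which reports the same value as uname -r.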

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-05 16:57 EDT-------
Based on comment #34, subsystem-reset works fine with the latest kernel. Close the bug.
