kernel crash when NVMe drive inserted in one slot

Bug #1661131 reported by jeffrey leung
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   High         Unassigned
Xenial           Fix Released   High         Unassigned

Bug Description

I'm opening this on behalf of one of my colleagues at Cisco. We're seeing an issue on our new S-series S3260 server that causes the kernel to crash.

If an NVMe device is inserted into one of the two drive slots, the kernel crashes, but only with Ubuntu. With an NVMe drive in the bad slot, other operating systems work fine. If we move the NVMe drive out of the bad slot and into the other slot, everything works as expected. We tested with HGST and Intel NVMe drives and were able to reproduce the issue with both. HGST reviewed some logs and does not currently believe the issue is with the NVMe drives.

We're hoping someone from Canonical can take a look to understand what the difference is between the working and failing slots. The data collection was done with the NVMe drive inserted in the working slot so we could access the OS.

I had a connection timeout when trying to use ubuntu-bug, so I saved the apport file and will attach it to the bug. I have collected the kernel log and syslog as well, but they are ~9GB. I found a call trace in the kernel log that starts at Jan 25 06:02:54 and floods the logs afterwards. I will include the call trace in a separate text file as an attachment.
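
With logs this large, the first trace is usually the one worth attaching. A minimal sketch of pulling it out, assuming GNU grep and that the trace begins with the kernel's usual "Call Trace:" marker. The function name, the log path in the usage comment, and the 40-line default are assumptions, not details from this report:

```shell
# Print the first call trace found in a kernel log, plus some trailing lines.
# Usage: first_call_trace LOGFILE [LINES_AFTER]
first_call_trace() {
    log="$1"
    after="${2:-40}"   # number of lines to print after the marker (assumed default)
    # -m1 stops at the first match, so a multi-gigabyte log is not scanned in full;
    # GNU grep still prints the trailing context requested with -A.
    grep -m1 -A "$after" 'Call Trace:' "$log"
}

# Example (path is an assumption; adjust to where the log actually lives):
#   first_call_trace /var/log/kern.log 60 > call-trace.txt
```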

jeffrey leung (jefleung) wrote :
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1661131

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.10 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10-rc6

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
jeffrey leung (jefleung) wrote :

I've asked the engineer to try the latest v4.10-rc6 kernel. This is their first test with 16.04.1, so it is unknown whether it works with previous kernel versions. I've asked them to check older versions to see if the issue exists there.

I'll update the tag and comment when I hear back from them. Thanks for the recommendation.

jeffrey leung (jefleung) wrote :

The mainline kernel seems to have resolved the issue so far. We're running some IO just to make sure it's stable. I will update the tag once we have confirmation.

jeffrey leung (jefleung)
tags: added: kernel-fixed-upstream
Joseph Salisbury (jsalisbury) wrote :

Can you also test the latest upstream 4.4 kernel? That will tell us if the fix in mainline was already cc'd to stable or if we need to perform a "Reverse" bisect.
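
For reference, a "reverse" bisect looks for the first commit where the bug is fixed rather than the first where it was introduced. A toy sketch using git's custom bisect terms (--term-old / --term-new, supported by reasonably recent git versions), run on a throwaway repository rather than the kernel tree. The repository contents, file names, and the broken/fixed terms are all made up for illustration:

```shell
# Toy "reverse" bisect: find the first commit in which a bug is FIXED.
# Builds a scratch repository with six commits; commits 1-3 are "broken",
# commits 4-6 are "fixed". Prints the subject of the first fixed commit.
reverse_bisect_demo() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.email "demo@example.com"
        git config user.name "Demo"
        git config commit.gpgsign false
        for i in 1 2 3 4 5 6; do
            if [ "$i" -ge 4 ]; then echo fixed > state; else echo broken > state; fi
            echo "$i" > counter          # keeps every commit non-empty
            git add state counter
            git commit -q -m "commit $i"
        done
        oldest=$(git rev-list --max-parents=0 HEAD)
        # Swap the usual good/bad wording: old = broken, new = fixed.
        git bisect start --term-old=broken --term-new=fixed >/dev/null
        git bisect broken "$oldest" >/dev/null
        git bisect fixed HEAD >/dev/null
        # With "bisect run", exit 0 marks a commit with the old term (broken)
        # and exit 1 marks it with the new term (fixed).
        git bisect run sh -c 'grep -q broken state' >/dev/null
        git log -1 --format=%s refs/bisect/fixed
    )
    rc=$?
    rm -rf "$repo"
    return "$rc"
}
```

On a real kernel tree the test step is a build, boot, and reproduction attempt per commit rather than a one-line grep, which is why checking the latest stable release first is the cheaper option.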

The 4.4.47 kernel can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.47/

jeffrey leung (jefleung) wrote :

We tried the 4.4.47 kernel, but on boot it gives up waiting for the root device. We are troubleshooting to try to get it to boot.

jeffrey leung (jefleung) wrote :

Hi Joseph,

To confirm: in order to correctly install the v4.4.47 kernel, we need to do the git clone and apply the patches before installing the kernel .deb packages, correct? My colleague was only copying the .deb packages, which may be why we are seeing the root device timeout. I was able to install the v4.4.47 kernel in my own environment with no issues by following the instructions you provided.

Thanks,
Jeff

jeffrey leung (jefleung) wrote :

I was able to git clone the v4.4.47 kernel onto the machine having issues. When applying the patches after 0001, patch is unable to locate the file to patch for some of them. I am seeing the same error, with the same files, on another machine that I had previously upgraded to the 4.4.47 kernel successfully.

root@savbu-qa-colusa3-24-2:~# patch -p1 < 0002-UBUNTU-SAUCE-add-vmlinux.strip-to-BOOT_TARGETS1-on-p.patch
can't find file to patch at input line 16
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|From 87f65999aab113d4ecf0d7fedfa5a4cc9c1141b5 Mon Sep 17 00:00:00 2001
|From: Andy Whitcroft <email address hidden>
|Date: Fri, 9 Sep 2016 14:02:29 +0100
|Subject: [PATCH 2/6] UBUNTU: SAUCE: add vmlinux.strip to BOOT_TARGETS1 on
| powerpc
|
|Signed-off-by: Andy Whitcroft <email address hidden>
|---
| arch/powerpc/Makefile | 2 +-
| 1 file changed, 1 insertion(+), 1 deletion(-)
|
|diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
|index 96efd82..96f49dd 100644
|--- a/arch/powerpc/Makefile
|+++ b/arch/powerpc/Makefile
--------------------------
File to patch: ^C
root@savbu-qa-colusa3-24-2:~#

root@savbu-qa-colusa3-24-2:~# patch -p1 < 0003-UBUNTU-SAUCE-tools-hv-lsvmbus-add-manual-page.patch
patching file tools/hv/lsvmbus.8

root@savbu-qa-colusa3-24-2:~# patch -p1 < 0004-UBUNTU-SAUCE-no-up-disable-pie-when-gcc-has-it-enabl.patch
can't find file to patch at input line 37
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|From 5d5e1ec9dcd36f965aee2285f678b7fcc3553c2f Mon Sep 17 00:00:00 2001
|From: Steve Beattie <email address hidden>
|Date: Tue, 10 May 2016 12:44:04 +0100
|Subject: [PATCH 4/6] UBUNTU: SAUCE: (no-up) disable -pie when gcc has it
| enabled by default
|
|In Ubuntu 16.10, gcc's defaults have been set to build Position
|Independent Executables (PIE) on amd64 and ppc64le (gcc was configured
|this way for s390x in Ubuntu 16.04 LTS). This breaks the kernel build on
|amd64. The following patch disables pie for x86 builds (though not yet
|verified to work with gcc configured to build PIE by default i386 --
|we're not planning to enable it for that architecture).
|
|The intent is for this patch to go upstream after expanding it to
|additional architectures where needed, but I wanted to ensure that
|we could build 16.10 kernels first. I've successfully built kernels
|and booted them with this patch applied using the 16.10 compiler.
|
|Patch is against yakkety.git, but also applies with minor movement
|(no fuzz) against current linus.git.
|
|Signed-off-by: Steve Beattie <email address hidden>
|[<email address hidden>: shifted up so works in arch/<arch/Makefile.]
|BugLink: http://bugs.launchpad.net/bugs/1574982
|Signed-off-by: Andy Whitcroft <email address hidden>
|Acked-by: Tim Gardner <email address hidden>
|Acked-by: Stefan Bader <email address hidden>
|Signed-off-by: Kamal Mostafa <email address hidden>
|---
| Makefile | 6 ++++++
| 1 file changed, 6 insertions(+)
|
|diff --git a/Makefile b/Makefile
|index 7b233ac..3c6e704 100644
|--- a/Makefile
|+++ b/Makefile
--------------------------
File to patch:
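
One possible reading of the output above, offered as a guess rather than a confirmed diagnosis: the shell prompt shows "~" as the working directory, and the one patch that applied (0003) appears to only add a new file, which is what would be expected if patch were run outside the top of the kernel source tree. A small sketch for checking this before applying anything; the function name is hypothetical, and it assumes git-style patches whose paths contain no whitespace:

```shell
# List files that a -p1 patch expects to already exist but that are missing
# from the current directory. Prints nothing and returns 0 if all are present.
# Usage: missing_patch_targets PATCHFILE
missing_patch_targets() {
    patchfile="$1"
    status=0
    # "--- a/<path>" names the file being modified; dropping the leading "a/"
    # mirrors what -p1 does. New files use "--- /dev/null" and are skipped.
    for path in $(sed -n 's|^--- a/\([^[:space:]]*\).*|\1|p' "$patchfile"); do
        if [ ! -e "$path" ]; then
            echo "$path"
            status=1
        fi
    done
    return "$status"
}

# Example, run from the directory where patch would be invoked:
#   missing_patch_targets 0002-UBUNTU-SAUCE-add-vmlinux.strip-to-BOOT_TARGETS1-on-p.patch
```

GNU patch also accepts --dry-run (for example, patch -p1 --dry-run < file.patch), which reports what would happen without modifying any files.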

jeffrey leung (jefleung) wrote :

v4.4.47 kernel has the fix and is working with the NVMe drive in the "bad" slot.

Joseph Salisbury (jsalisbury) wrote :

The proposed Xenial kernel now has the 4.4.47 updates and is in the -proposed repository.

Would it be possible for you to test this latest kernel and post back if it resolves this bug?
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.
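
As a rough sketch of what enabling -proposed amounts to, assuming the default archive.ubuntu.com mirror and the usual component list (the wiki page above is the authoritative source; the function name and the proposed.list file name are illustrative):

```shell
# Print the APT source line for the -proposed pocket of a given release.
# Usage: proposed_source_line CODENAME
proposed_source_line() {
    codename="$1"
    echo "deb http://archive.ubuntu.com/ubuntu/ ${codename}-proposed restricted main multiverse universe"
}

# To apply it (requires root, so shown here as comments only):
#   proposed_source_line xenial | sudo tee /etc/apt/sources.list.d/proposed.list
#   sudo apt-get update          # refreshes package lists only
#   sudo apt-get dist-upgrade    # installs updates, including everything in -proposed unless pinned
#   sudo reboot                  # the new kernel is only used after a reboot
```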

Thank you in advance!

Changed in linux (Ubuntu):
status: Incomplete → Triaged
jeffrey leung (jefleung) wrote :

Sorry for the delay; I had to wait for another issue to be debugged before I could take over the hardware again. I enabled -proposed and ran apt-get update, and the fix is working.

jeffrey leung (jefleung) wrote :

I may have spoken too soon: I found the server had gone into a kernel panic when I got back from lunch. There are some complaints about the controller, so I am going to try the 4.4.47 kernel again to be sure it's not the hardware.

Is there anything that needs to be run besides "apt-get update" after enabling -proposed? What can I run to be sure the fix was applied after enabling -proposed?
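
On the question above: "apt-get update" only refreshes the package lists, so the new kernel still has to be installed (for example with an upgrade) and booted into. One way to check the result, sketched under the assumption of an Ubuntu kernel, is /proc/version_signature, whose last field is the upstream stable version the running kernel is based on. The function names and the signature string in the comment are illustrative:

```shell
# Return 0 if dotted version $1 is >= version $2 (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Read a version signature line on stdin and print its last field, e.g.
# "Ubuntu 4.4.0-NN.MM-generic 4.4.47" (illustrative format) -> "4.4.47".
upstream_version() {
    awk '{ print $NF }'
}

# Report whether the running kernel is based on at least the required version.
# Usage: check_running_kernel REQUIRED [SIGNATURE_FILE]
check_running_kernel() {
    required="$1"
    sig_file="${2:-/proc/version_signature}"   # Ubuntu-specific file
    if [ ! -r "$sig_file" ]; then
        echo "cannot read $sig_file"
        return 2
    fi
    running=$(upstream_version < "$sig_file")
    if version_ge "$running" "$required"; then
        echo "ok: based on upstream $running"
    else
        echo "too old: based on upstream $running"
        return 1
    fi
}

# Example: check_running_kernel 4.4.47
```

Comparing "uname -r" before and after the reboot is a quicker sanity check that a different kernel is actually running.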

jeffrey leung (jefleung) wrote :

Got it working on the second try; enabling -proposed in Xenial resolved the issue.

Changed in linux (Ubuntu Xenial):
importance: Undecided → High
status: New → Fix Committed
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released