NVMe detection failed during bootup

Bug #1639920 reported by dexterb
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Triaged
Importance: High
Assigned to: Dan Streetman

Bug Description

I've been running an on-off test on a couple of POWER8 systems and have been getting a failure to detect the NVMe drives on Ubuntu 16.04.1 only. I've run the same test on RHEL 7.2 and have not encountered this problem. Once the problem occurs, the OS stops booting and a message appears:

Welcome to emergency mode! After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to try again to boot into default mode.
Give root password for maintenance (or press Control-D to continue):

ProblemType: Crash
DistroRelease: Ubuntu 16.04
Package: apport 2.20.1-0ubuntu2.1
ProcVersionSignature: Ubuntu 4.4.0-45.66-generic 4.4.21
Uname: Linux 4.4.0-45-generic ppc64le
ApportVersion: 2.20.1-0ubuntu2.1
Architecture: ppc64el
CrashReports:
 640:0:116:12295:2016-11-07 11:11:27.995107650 -0800:2016-11-07 11:31:31.733627559 -0800:/var/crash/_usr_bin_apport-bug.0.crash
 644:0:116:0:2016-11-07 11:11:28.931104586 -0800:2016-11-07 11:11:28.931104586 -0800:/var/crash/_usr_bin_apport-cli.0.None.hanging
Date: Mon Nov 7 11:11:28 2016
ExecutablePath: /usr/bin/apport-bug
InstallationDate: Installed on 2016-11-05 (2 days ago)
InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Release ppc64el (20160719)
InterpreterPath: /usr/bin/python3.5
PackageArchitecture: all
ProcCmdline: /usr/bin/python3 /usr/bin/apport-cli --hanging
ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcLoadAvg: 0.30 0.41 0.19 1/1132 2428
ProcLocks:

ProcSwaps:
 Filename Type Size Used Priority
 /dev/sda3 partition 157914048 0 -1
ProcVersion: Linux version 4.4.0-45-generic (buildd@bos01-ppc64el-030) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #66-Ubuntu SMP Wed Oct 19 14:13:11 UTC 2016
PythonArgs: ['/usr/bin/ubuntu-bug', '--hanging']
SourcePackage: apport
Title: apport-bug crashed with TypeError in run_hang(): int() argument must be a string, a bytes-like object or a number, not 'NoneType'
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_smt: SMT=8
mtime.conffile..etc.apport.crashdb.conf: 2016-11-07T10:38:18.528739

Revision history for this message
dexterb (dexterbermudez) wrote :
dexterb (dexterbermudez)
information type: Private → Public
affects: apport (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-11-14 11:17 EDT-------
dougmill-ibm commented 6 days ago
It appears that one of the NVMe drives failed to function correctly during boot/probe. The 'lspci' output does not show that drive, which means it was captured after the failure but before recovery (reboot), so we don't have full info on that drive.

From a high-level look at the code, the vfree error seems to be for freeing the PCI BAR. The code path through the initialization might allow for a failure before/during allocation of BAR and not account for that during device removal. The stack trace is not much help because the "remove dead controller" routine is invoked as "work" on a kthread, and so we do not have the stack trace of the thread that actually encountered the original failure (I/O timeout).

So, there are two problems shown here. One is the vfree WARNING which indicates that the error paths are not quite right. The other is why the NVMe drive failed to function correctly - which is the primary issue for this test case. NOTE: the vfree message is only a WARNING and should not cause any sort of permanent problem with the running kernel.

------- Comment From <email address hidden> 2016-11-14 16:42 EDT-------
Debug continues.

tags: added: architecture-ppc64le bugnameltc-148618 severity-critical targetmilestone-inin1604
Revision history for this message
Dan Streetman (ddstreet) wrote :

I bisected to the commit that introduces the failing behavior, which is df00a7683b417a15a6e695f9a3b200a799b787fa ("nvme: only add a controller to dev_list after it's been fully initialized").

bugproxy (bugproxy)
tags: added: severity-high
removed: severity-critical
Revision history for this message
Dan Streetman (ddstreet) wrote :

I should also note that the latest upstream kernel has similar timeout problems while probing, so this isn't restricted to the xenial kernel.

Revision history for this message
dexterb (dexterbermudez) wrote :

Hi Canonical,
Is there any update on this issue?

Thanks,
Dexter

Revision history for this message
bugproxy (bugproxy) wrote : ProcPpc64.tar.gz

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Traceback.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : modified.conffile..etc.apport.crashdb.conf.txt

Default Comment by Bridge

Revision history for this message
Dan Streetman (ddstreet) wrote :

> Is there any update on this issue?

I'm still debugging, but can give a summary of the problem, at least on the system I'm working on.

First, to address:
> One is the vfree WARNING which indicates that the error paths are not quite right

This is a trivial fix to the error handling path; it's fixed upstream by commit f58944e265d4ebe47216a5d7488aee3928823d30 ('NVMe: Simplify device reset failure'), or alternatively it can be fixed more simply by removing the unmap call from the error path, since the later removal also unmaps (which is where the error comes from: unmapping already-unmapped memory). I'll SRU a fix for this, but of course this is just a minor annoyance, since the real problem is why the device timed out during probing.
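
To make that concrete, here is a tiny stand-alone user-space model of the flow (this is not the actual nvme code; all names are invented for illustration): the probe error path unmaps the BAR, and the later "remove dead controller" work unmaps it again, which in the kernel shows up as the vfree WARNING.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_ctrl {
	void *bar;		/* stands in for the ioremap'd PCI BAR */
	bool bar_mapped;
};

static void ctrl_unmap(struct fake_ctrl *c)
{
	if (!c->bar_mapped) {
		/* in the kernel this is roughly where the vfree WARNING fires */
		fprintf(stderr, "WARN: BAR unmapped twice\n");
		return;
	}
	free(c->bar);
	c->bar = NULL;
	c->bar_mapped = false;
}

/* probe fails (admin command timed out); the buggy error path unmaps here */
static int ctrl_probe_buggy(struct fake_ctrl *c)
{
	c->bar = malloc(4096);
	c->bar_mapped = true;
	ctrl_unmap(c);		/* dropping this call is the simpler fix */
	return -1;
}

/* runs later as "remove dead controller" work and unmaps a second time */
static void remove_dead_ctrl(struct fake_ctrl *c)
{
	ctrl_unmap(c);
}

int main(void)
{
	struct fake_ctrl c = { NULL, false };

	if (ctrl_probe_buggy(&c))
		remove_dead_ctrl(&c);
	return 0;
}

Removing the ctrl_unmap() call from the probe error path, as described above, leaves the single unmap in the removal path and makes the warning go away.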

> why the NVMe drive failed to function correctly - which is the primary issue for this test case

The problem here, at least in debugging on the system I have access to, is that the nvme driver isn't able to obtain the interrupt for the NVMe controller admin queue, so the requests sent to the admin queue (to configure the controller) are never completed and eventually time out, leading to the device probing being aborted.

The reason this used to work (e.g., with stock 4.4.x) is that the nvme driver used to use a kthread to poll the admin queue and complete any work directly. The kthread polling was removed upstream with commit 79f2b358c9ba373943a9284be2861fde58291c4e ('nvme: don't poll the CQ from the kthread'), which was pulled into the xenial kernel at 4.4.0-35; but probe-time kthread polling was removed by upstream commit 7385014c073263b077442439299fad013edd4409 ('nvme: only add a controller to dev_list after it's been fully initialized'), which came into xenial very early, at 4.4.0-3. So if the IRQ doesn't work, NVMe probing times out.
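
To sketch why probing now depends entirely on the interrupt, here is a small stand-alone model (not kernel code; every name in it is invented for illustration) of an admin command waiting for completion with and without a periodic poller:

#include <stdbool.h>
#include <stdio.h>

struct cq {
	bool completion_posted;	/* the device has written a completion entry */
	bool completed;		/* the driver has reaped it */
};

/* what the IRQ handler would do; it never runs if the IRQ can't be obtained */
static void irq_handler(struct cq *q)
{
	if (q->completion_posted)
		q->completed = true;
}

/* what the old polling kthread did periodically for every queue */
static void poll_cq(struct cq *q)
{
	if (q->completion_posted)
		q->completed = true;
}

static bool wait_for_admin_command(struct cq *q, bool irq_works, bool have_poller)
{
	int ticks;

	q->completion_posted = true;	/* the device finishes the command */

	for (ticks = 0; ticks < 60; ticks++) {	/* stand-in for the admin timeout */
		if (irq_works)
			irq_handler(q);
		if (have_poller)
			poll_cq(q);
		if (q->completed)
			return true;
	}
	return false;	/* probe gives up: the I/O timeout seen in the logs */
}

int main(void)
{
	struct cq q = { false, false };

	/* broken IRQ, but the old polling kthread still reaps the completion */
	printf("with poller:    %s\n",
	       wait_for_admin_command(&q, false, true) ? "completed" : "timed out");

	q = (struct cq){ false, false };
	/* broken IRQ and no poller (current code): the probe times out */
	printf("without poller: %s\n",
	       wait_for_admin_command(&q, false, false) ? "completed" : "timed out");
	return 0;
}

With the poller the command completes even though the interrupt handler never runs; without it, the admin timeout expires, which matches the probe-time timeouts described above.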

As to why the nvme driver can't obtain the IRQ, I can't determine yet; that's what I'm still debugging. If you have a test system I can access, I can debug there as well, or at least provide a debug kernel to try (but I'm still adding more debug). I'm not working on a ppc system, so it's possible something else is going on there.

Revision history for this message
bugproxy (bugproxy) wrote : Dependencies.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : JournalErrors.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcMaps.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcMisc.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcPpc64.tar.gz

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcStatus.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Traceback.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : modified.conffile..etc.apport.crashdb.conf.txt

Default Comment by Bridge

Revision history for this message
dexterb (dexterbermudez) wrote :

Dan thanks for the update.

Revision history for this message
Dan Streetman (ddstreet) wrote :

Dexter,

Is it possible to get access to your ppc system to debug this? Or at least, can you boot a debug kernel and provide the dmesg? The other nvme probe timeout issue I'm debugging is looking like a problem that wouldn't exist on ppc, so I'd like to look at it on your system directly.

Changed in linux (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
Revision history for this message
dexterb (dexterbermudez) wrote :

Hi Dan,

  Please email me directly <email address hidden>

Thanks,
Dexter

Revision history for this message
Dan Streetman (ddstreet) wrote :

Hi Dexter,

Before I jump onto the system to debug, can you try this test kernel:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1639920

I think that might fix the problem.

tags: removed: need-duplicate-check
Revision history for this message
dexterb (dexterbermudez) wrote :

Hi Dan,

  The test kernel is working. I'm not seeing the issue anymore. Here's what I've tried with our on-off test:

 1.) 1U system with (2x) 2-port NVMe controller with over 500 cycles = no issue
 2.) 1U system with (1x) 4-port NVMe controller with over 300 cycles = no issue
 3.) 2U system with (1x) 4-port NVMe controller with over 250 cycles = no issue

Please let me know if you need any more info from me.
I'd also like to know what we need to do in order to proceed with getting this fix into GA.

thank you so much!!

regards,
Dexter

Revision history for this message
bugproxy (bugproxy) wrote :

Default Comment by Bridge

Revision history for this message
Dan Streetman (ddstreet) wrote :

> I'd also like to know how or what we need to do in order to proceed and getting this fix for GA?

I opened bug 1647887 separately for this (I'm working on this problem for a separate situation and forgot to use this bug to track the patch, oops). The kernel patch to fix this is in the process of being included in the xenial kernel, so it's just a matter of waiting for the kernel update process until a fixed kernel is released.

Revision history for this message
bugproxy (bugproxy) wrote : .sys.firmware.opal.msglog.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Dependencies.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : JournalErrors.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcMaps.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcMisc.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcPpc64.tar.gz

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcStatus.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Traceback.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : modified.conffile..etc.apport.crashdb.conf.txt

Default Comment by Bridge

Revision history for this message
Dan Streetman (ddstreet) wrote :

Please note, this is fixed in kernel 4.4.0.57.60, which is in -proposed right now. See bug 1647887 for more details.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-09-24 16:08 EDT-------
Closing this, as there seems to be nothing left to do. If the problem still exists, we should see new bugs opened for it.

Brad Figg (brad-figg)
tags: added: cscc