NVMe detection failed during bootup
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Triaged | High | Dan Streetman | |
Bug Description
I've been running an on-off test on a couple of Power8 systems and have been getting NVMe drive detection failures on Ubuntu 16.04.1 only. I've run the same test on RHEL 7.2 and have not encountered this problem. Once the problem occurs, the OS stops booting and this message appears:
Welcome to emergency mode! After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to try again to boot into default mode.
Give root password for maintenance (or press Control-D to continue):
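For an on-off test like this one, a per-boot check that the expected number of NVMe controllers enumerated can flag the failure right away. The sketch below is only an illustration and is not part of the original test. It assumes `lspci` is installed and that NVMe controllers appear under the class name "Non-Volatile memory controller"; the function names and the `expected` parameter are hypothetical.

```python
import subprocess

# lspci's class name for NVMe controllers (PCI class 0108).
NVME_CLASS_NAME = "Non-Volatile memory controller"


def count_nvme_controllers(lspci_output):
    """Count lines of lspci output that describe an NVMe controller."""
    return sum(1 for line in lspci_output.splitlines()
               if NVME_CLASS_NAME in line)


def check_boot(expected):
    """Run lspci and report whether all expected NVMe controllers are visible.

    `expected` is an assumption supplied by the tester: the number of
    NVMe drives physically installed in the system under test.
    """
    result = subprocess.run(["lspci"], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    found = count_nvme_controllers(result.stdout)
    print("NVMe controllers found: {} (expected {})".format(found, expected))
    return found == expected
```

Calling `check_boot(n)` after each power cycle, with `n` set to the number of installed drives, returns `False` on a boot where a drive failed to enumerate. Parsing is kept separate from the `lspci` call so the counting logic can be exercised on saved output.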
ProblemType: Crash
DistroRelease: Ubuntu 16.04
Package: apport 2.20.1-0ubuntu2.1
ProcVersionSign
Uname: Linux 4.4.0-45-generic ppc64le
ApportVersion: 2.20.1-0ubuntu2.1
Architecture: ppc64el
CrashReports:
640:0:
644:0:
Date: Mon Nov 7 11:11:28 2016
ExecutablePath: /usr/bin/apport-bug
InstallationDate: Installed on 2016-11-05 (2 days ago)
InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Release ppc64el (20160719)
InterpreterPath: /usr/bin/python3.5
PackageArchitec
ProcCmdline: /usr/bin/python3 /usr/bin/apport-cli --hanging
ProcEnviron:
TERM=linux
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcLoadAvg: 0.30 0.41 0.19 1/1132 2428
ProcLocks:
ProcSwaps:
Filename Type Size Used Priority
/dev/sda3 partition 157914048 0 -1
ProcVersion: Linux version 4.4.0-45-generic (buildd@
PythonArgs: ['/usr/
SourcePackage: apport
Title: apport-bug crashed with TypeError in run_hang(): int() argument must be a string, a bytes-like object or a number, not 'NoneType'
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_smt: SMT=8
mtime.conffile.
information type: | Private → Public |
affects: | apport (Ubuntu) → linux (Ubuntu) |
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
status: | New → Triaged |
tags: | added: severity-high removed: severity-critical |
tags: | removed: need-duplicate-check |
tags: | added: cscc |
------- Comment From <email address hidden> 2016-11-14 11:17 EDT-------
dougmill-ibm commented 6 days ago
It appears that one of the NVMe drives failed to function correctly during boot/probe. The 'lspci' output does not show that drive, which means it was taken after the failure but before recovery (reboot), so we don't have full info on that drive.
From a high-level look at the code, the vfree error seems to be for freeing the PCI BAR. The code path through the initialization might allow for a failure before/during allocation of BAR and not account for that during device removal. The stack trace is not much help because the "remove dead controller" routine is invoked as "work" on a kthread, and so we do not have the stack trace of the thread that actually encountered the original failure (I/O timeout).
So, there are two problems shown here. One is the vfree WARNING which indicates that the error paths are not quite right. The other is why the NVMe drive failed to function correctly - which is the primary issue for this test case. NOTE: the vfree message is only a WARNING and should not cause any sort of permanent problem with the running kernel.
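The suspected error-path problem can be illustrated with a small userspace model. This is a sketch of the general pattern only, not the nvme driver's code: the `Controller` class, its methods, and the `fail_before_map` flag are hypothetical, and a plain Python object stands in for the mapped BAR. It shows that removal should release only what initialization actually acquired.

```python
class Controller:
    """Toy model of a device whose init can fail before the BAR is mapped."""

    def __init__(self):
        self.bar = None      # stands in for the mapped PCI BAR
        self.released = []   # records what remove() actually freed

    def probe(self, fail_before_map=False):
        if fail_before_map:
            # Early failure: nothing was mapped, so self.bar stays None.
            raise TimeoutError("I/O timeout during init")
        self.bar = object()

    def remove(self):
        # Guard: unmap only if a mapping exists. An unconditional teardown
        # here would try to free something that was never allocated.
        if self.bar is not None:
            self.released.append("bar")
            self.bar = None
```

With the guard in place, calling `remove()` after a failed `probe()` is a no-op, and calling it twice after a successful `probe()` releases the mapping only once.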
------- Comment From <email address hidden> 2016-11-14 16:42 EDT-------
Debug continues.