On VMware ESXi with PCI passthru enabled for Intel NVMe Ubuntu Xenial VM does not boot

Bug #1695780 reported by Govindarajan Soundararajan
Affects          Status    Importance   Assigned to   Milestone
linux (Ubuntu)   Invalid   High         Unassigned

Bug Description

My setup is a Cisco UCS c240 server with an Intel NVMe 1.6TB drive running VMware ESXi version 6.0U2. The NVMe device is made available as a PCI passthru device and is not claimed by the ESXi kernel. When this NVMe device is added to an Ubuntu 16.04.2 VM running kernel version 4.4.0-62 or above, the VM does not boot: the kernel hangs for a while before the VM powers off. However, with kernel versions 4.4.0-43, 4.4.0-53, 4.4.0-57, and 4.4.0-59 everything works as expected. Earlier versions of the kernel also do not work. Here is a short list of the kernels I tested with:

4.4.0-31 -> kernel panic (different issue)
4.4.0-43 -> works
4.4.0-53 -> works
4.4.0-57 -> works
4.4.0-59 -> works
4.4.0-62 -> fail
4.4.0-64 -> fail
4.4.0-75 -> fail
4.4.0-77 -> fail
4.8.0-51 -> fail
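
The list above narrows the regression to the gap between two adjacent releases. A minimal sketch of reading that boundary off mechanically, assuming the results are supplied in ascending version order as "version result" pairs (the find_boundary helper and the works/fail/panic labels are illustrative and not part of any Ubuntu tooling):

```shell
# Hypothetical helper: reads "version result" lines (ascending order) on stdin
# and prints the last passing version before the first failure.
find_boundary() {
    awk '
        $2 == "works" && first_bad == "" { last_good = $1 }
        $2 == "fail"  && first_bad == "" { first_bad = $1 }
        END {
            print "last good: " last_good
            print "first bad: " first_bad
        }
    '
}

# Results as reported in this bug description.
find_boundary <<'EOF'
4.4.0-31 panic
4.4.0-43 works
4.4.0-53 works
4.4.0-57 works
4.4.0-59 works
4.4.0-62 fail
4.4.0-64 fail
4.4.0-75 fail
4.4.0-77 fail
4.8.0-51 fail
EOF
```

With the data above this reports 4.4.0-59 as the last good kernel and 4.4.0-62 as the first bad one, which is the range a bisect would start from.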

~# cat /proc/version_signature
Ubuntu 4.4.0-75.96-generic 4.4.59

Govindarajan Soundararajan (govind-rajan) wrote :
affects: linux (Ubuntu) → linux-lts-xenial (Ubuntu)
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream stable kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.4 stable kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.70
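
For readers following along, a rough sketch of how such a mainline test might be run. The mainline_url helper is hypothetical and only reproduces the URL pattern quoted in this report; the package file name in the comments is a placeholder for whatever .deb is downloaded from that directory.

```shell
# Hypothetical helper: build the mainline-build directory URL for a version,
# following the pattern of the URLs quoted in this report.
mainline_url() {
    printf 'http://kernel.ubuntu.com/~kernel-ppa/mainline/v%s/\n' "$1"
}

mainline_url 4.4.70

# After downloading the amd64 linux-image .deb from that directory,
# install it and reboot into it:
#   sudo dpkg -i linux-image-*.deb
#   sudo reboot
# Then confirm which kernel is running:
#   uname -r
```

The outcome of that boot then decides which of the three tags above applies.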

affects: linux-lts-xenial (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key performing-bisect xenial
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1695780

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Govindarajan Soundararajan (govind-rajan) wrote :

I just tried mainline http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.70/linux-image-4.4.70-040470-generic_4.4.70-040470.201705251131_amd64.deb and it works fine with PCI passthru Intel NVMe and HGST NVMe drives.

With the way our VM is built I do not have apport installed but I'll be happy to pull out any logs that you may need.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-fixed-upstream
Joseph Salisbury (jsalisbury) wrote :

We can perform a reverse bisect to identify the commit that fixes this in mainline. Can you test the following kernels:

4.8 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
4.9 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9/
4.10 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/
4.11 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11/
4.12-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

 You don't have to test every kernel, just up until the first kernel that does not have this bug.

Thanks in advance!

Govindarajan Soundararajan (govind-rajan) wrote :

Just to confirm, mainline kernel linux-image-4.4.70-040470-generic_4.4.70-040470.201705251131 works fine. Looks like the bug is fixed in mainline. Should I still continue with the bisect task with 4.8+ kernels?

Joseph Salisbury (jsalisbury) wrote :

Thanks for the testing. Can you test upstream 4.4.41? This will tell us if it is an Ubuntu specific bug, or upstream as well.

The 4.4.41 kernel can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.41/

Govindarajan Soundararajan (govind-rajan) wrote :

4.4.41 works too!

~# lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 10)
00:0f.0 VGA compatible controller: VMware SVGA II Adapter
00:10.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 PCI bridge: VMware PCI bridge (rev 02)
00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.7 PCI bridge: VMware PCI Express Root Port (rev 01)
02:01.0 SATA controller: VMware SATA AHCI controller
02:02.0 Ethernet controller: Intel Corporation 82545EM Gigabit Ethernet Controller (Copper) (rev 01)
03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3108 PCI-Express Fusion-MPT SAS-3 (rev 02)
04:00.0 Non-Volatile memory controller: HGST, Inc. Device 0023 (rev 02)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)

~# uname -r
4.4.41-040441-generic
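
A long lspci listing like the one above can be checked for the passthrough device without reading it by eye. A minimal sketch, assuming the NVMe controller is reported with the PCI class string "Non-Volatile memory controller" as it is here (has_nvme is an illustrative name):

```shell
# Hypothetical helper: succeeds if lspci-style text on stdin lists an NVMe
# controller (PCI class "Non-Volatile memory controller").
has_nvme() {
    grep -qi 'Non-Volatile memory controller'
}

# Demonstration with one line taken from the listing above.
if echo '04:00.0 Non-Volatile memory controller: HGST, Inc. Device 0023 (rev 02)' | has_nvme; then
    echo "NVMe visible"
else
    echo "NVMe missing"
fi

# On the guest itself this would be:
#   lspci | has_nvme && echo "NVMe visible" || echo "NVMe missing"
```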

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. This bug sounds like it was introduced by an Ubuntu specific SAUCE patch. Per your bug description 4.4.0-59 is good and 4.4.0-62 is bad. Can you test the following two kernels:

4.4.0-60: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/10948005
4.4.0-61: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/10960867

Note, with these two kernels, you need to install both the linux-image and linux-image-extra .deb packages.

I can start a kernel bisect once we know the last good kernel version and the first bad one.

Govindarajan Soundararajan (govind-rajan) wrote :

Thanks for taking time to look at this bug. Due to this nvme issue I am not able to go to newer kernels that have important security fixes. Here are the results.

4.4.0-60 (linux-image-4.4.0-40-generic-4.4.0-40.60): Boots fine but no nvme devices are listed under 'lspci'. I did check that nvme kernel module was loaded fine.

4.4.0-61 (linux-image-4.4.0-41-generic_4.4.0-41.61): Boots fine and nvme devices are shown with lspci and 'nvme list' commands.

Looks like the last known good kernel version is 4.4.0-61.

Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between Ubuntu-4.4.0-61 and Ubuntu-4.4.0-62. The kernel bisect will require testing of about 3-4 test kernels.

I built the first test kernel, up to the following commit:
7392f29e5ee5cb8509f20e5ce6cc1e360c486c91

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1695780/7392f29e5ee5cb8509f20e5ce6cc1e360c486c91

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance
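
The estimate of "about 3-4 test kernels" reflects how bisection works: each test kernel roughly halves the range of candidate commits. A small sketch of that arithmetic follows. The commit counts passed in are assumptions for illustration only, since the thread does not say how many commits separate the two tags, and bisect_steps is a hypothetical helper.

```shell
# Hypothetical helper: approximate worst-case number of bisect rounds needed
# to isolate one bad commit among N candidates, i.e. ceil(log2(N)).
bisect_steps() (
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))   # each test kernel halves the remaining range
        steps=$(( steps + 1 ))
    done
    echo "$steps"
)

bisect_steps 8    # prints 3
bisect_steps 16   # prints 4

# The bisect itself would be driven with git in the Ubuntu kernel tree, e.g.:
#   git bisect start
#   git bisect good Ubuntu-4.4.0-61   # tag names as written in this report
#   git bisect bad  Ubuntu-4.4.0-62
#   ...build and boot the commit git checks out, then mark it:
#   git bisect good    # or: git bisect bad
```

Because every round depends on the previous result, one wrong good/bad report sends the search into the wrong half of the range.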

Govindarajan Soundararajan (govind-rajan) wrote :

360c486c91 works fine. Time to move to the next commit.

~# uname -a
Linux nvme-dev 4.4.0-62-generic #83~lp1695780Commit7392f29 SMP Thu Jun 8 20:09:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

~# nvme list
Node             SN                   Model                                    Version  Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- -------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     CVFT65260034400GGN   INTEL SSDPE2MD400G4K                     1.0      1         400.09 GB / 400.09 GB      512 B + 0 B      8DV1CP02
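
When several test kernels share the same version number, it helps to confirm which build is actually booted before reporting a result. A minimal sketch, assuming the test builds keep the "lp1695780Commit<id>" naming visible in the uname output above (test_kernel_commit is an illustrative name):

```shell
# Hypothetical helper: pull the abbreviated commit id out of a build string
# such as "#83~lp1695780Commit7392f29" read from stdin.
test_kernel_commit() {
    sed -n 's/.*Commit\([0-9a-f]\{7,\}\).*/\1/p'
}

echo 'Linux nvme-dev 4.4.0-62-generic #83~lp1695780Commit7392f29 SMP Thu Jun 8 20:09:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux' | test_kernel_commit
# prints 7392f29

# On the guest itself this would be:
#   uname -v | test_kernel_commit
```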

Govindarajan Soundararajan (govind-rajan) wrote :

Just to note, my earlier hardware is tied up with some other work. I moved my testing to another hardware with Intel NVMe instead of the earlier HGST one. But the issue was seen on both Intel and HGST, the only two vendors' NVMe we have in-house. Both servers are Cisco UCS C-240.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
c173e45ac6939f124d6645369e8981c3b4c4c75b

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1695780/c173e45ac6939f124d6645369e8981c3b4c4c75b

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Govindarajan Soundararajan (govind-rajan) wrote :

I need a little more time to test the kernel. The hardware I was working on is tied up with something else. I am working on getting that freed up or get another identical one. Once I get hold of one I'll have the results.

Govindarajan Soundararajan (govind-rajan) wrote :

I got the original hardware back and ran a new round of testing on it (with both Intel and HGST NVMe devices).

Kernel 4.4.0-62-generic, commit 7392f29e5ee5cb8509f20e5ce6cc1e360c486c91, exhibits the issue and fails to boot. Ignore my earlier result with the same kernel. This is the commit id we have to focus on.

Here is a complete list of all the kernels requested for testing in this bug.

4.4.0-59-generic:
* Boot up passed
* NVMe devices listed in both lspci and "nvme list" commands

4.4.70-040470-generic Mainline
* Boot up passed
* NVMe devices listed in both lspci and "nvme list" commands

4.4.41-040441-generic:
* Boot up passed
* NVMe devices listed in both lspci and "nvme list" commands

4.4.0-60-generic:
* Boot up passed
* NVMe devices NOT listed in either lspci or "nvme list" output

4.4.0-61-generic:
* Boot up passed
* NVMe devices listed in both lspci and "nvme list" commands

4.4.0-62-generic
commit: 7392f29e5ee5cb8509f20e5ce6cc1e360c486c91

* Boot up FAIL. Kernel does not boot. This is the original issue. Looks like this commit has the bug.

Joseph Salisbury (jsalisbury) wrote :

I restarted the bisect between Ubuntu-4.4.0-61 and Ubuntu-4.4.0-62. I marked commit 7392f29e5ee5cb8509f20e5ce6cc1e360c486c91 as bad. The bisect reported the following commit to test next:

362a958e1fa81a134a18fba0af39375b9ae5d238

I built a test kernel up to this commit.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1695780/362a958e1fa81a134a18fba0af39375b9ae5d238

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Govindarajan Soundararajan (govind-rajan) wrote :

Kernel with commit 362a958e1fa81a134a18fba0af39375b9ae5d238 Fails to boot.

Govindarajan Soundararajan (govind-rajan) wrote :

Hi Joseph,
    Any updates on this? Let me know if you had a chance to generate another kernel after a new bisect.

Dimitrenko (paviliong6)
Changed in linux (Ubuntu):
status: Confirmed → Opinion
status: Opinion → Confirmed
status: Confirmed → Fix Released
Govindarajan Soundararajan (govind-rajan) wrote :

Is there a new kernel with the fix that I can try on my hardware? Please do let me know where I can download it.

Changed in linux (Ubuntu):
status: Fix Released → Fix Committed
status: Fix Committed → Fix Released
Joseph Salisbury (jsalisbury) wrote :

Sorry for the delay. I can build the next test kernel. However, this bug was marked as "Fix Released". Was that a mistake? If so, I'll change the status back and build the kernel.

Govindarajan Soundararajan (govind-rajan) wrote :

Hi Joseph,
    I am not sure if this is fixed. Dimitrenko changed the status to "Fix Released". I had asked if this was as intended and if there is indeed a kernel with the fix but have not gotten a response. I think we should proceed with further bisection.

Govindarajan Soundararajan (govind-rajan) wrote :

Hi Joseph,
    Is it possible to provide me with multiple kernels from different bisect points? I can then run through them in quick sequence and provide you an update. This way we can minimize turn-around time. Due to this bug we are unable to move to newer kernels that have security fixes for high-importance vulnerabilities.

Govindarajan Soundararajan (govind-rajan) wrote :

I checked the latest released Ubuntu kernel (4.4.0-83) and it does _not_ have a fix. So I am changing the status of the bug back to "confirmed". If a fix has indeed been released please let me know where I can pick up a new in-development kernel.

Changed in linux (Ubuntu):
status: Fix Released → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Before continuing the bisect, can you see if this bug happens with the latest Artful kernel? It can be downloaded from:

https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/13078726

You need to install both the linux-image and linux-image-extra .deb packages.

Govindarajan Soundararajan (govind-rajan) wrote :

Is the linux-image-extra mandatory for all kernels? I am asking this because 4.4.0-59 boots cleanly without the extra package.

Govindarajan Soundararajan (govind-rajan) wrote :

I tried 4.11.0-11 kernel from artful and it too fails to boot.

Govindarajan Soundararajan (govind-rajan) wrote :

Sorry to be a bother. Any luck in further bisection?

Joseph Salisbury (jsalisbury) wrote :

Sorry, I missed the update to this bug. Are you still seeing this issue? If so, can you test the following kernel:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.1/

Govindarajan Soundararajan (govind-rajan) wrote :

I'll verify this new kernel in a day and provide an update.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Govindarajan Soundararajan (govind-rajan) wrote :

Thank you for your patience. I now have the servers back with me. I just tried the mainline 4.13.1 kernel as suggested, and it boots fine. All NVMe passthrough devices come up fine.

Govindarajan Soundararajan (govind-rajan) wrote :

Would you be able to point me at all the ubuntu specific patches that are applied on top of mainline kernel? I can run some bisection on my side to understand what might cause the issue.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Govindarajan Soundararajan (govind-rajan) wrote :

This was confirmed to be a VMware ESXi bug in their MSI-X code path for PCI passthru devices. VMware has fixed this bug in their latest patch.

Changed in linux (Ubuntu):
status: Expired → Invalid