pcie hotplug not working in linux-generic-hwe-18.04 5.4.0.135.152~18

Bug #1998224 reported by Sven Kieske
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

situation: ubuntu 18.04 server install on a supermicro x64 host.

hot plug NVME ssd into NVME U.2 HotSwap Slot.

problem: hot plug does not work; the NVMe is not recognised after insertion.

how to test:

echo 1 > /sys/bus/pci/rescan

dmesg output:

[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.1: bridge window [io 0x1000-0x0fff] to [bus 41] add_size 1000
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.2: bridge window [io 0x1000-0x0fff] to [bus 42] add_size 1000
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.1: BAR 13: no space for [io size 0x1000]
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.1: BAR 13: failed to assign [io size 0x1000]
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.2: BAR 13: no space for [io size 0x1000]
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.2: BAR 13: failed to assign [io size 0x1000]
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.2: BAR 13: no space for [io size 0x1000]
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.2: BAR 13: failed to assign [io size 0x1000]
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.1: BAR 13: no space for [io size 0x1000]
[Mon Nov 28 15:46:33 2022] pcieport 0000:40:01.1: BAR 13: failed to assign [io size 0x1000]
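The repeated messages above can be condensed into one line per port. The following is only a minimal sketch: `bar_failures` is a hypothetical helper name, and it assumes the log lines have the same shape as the dmesg output shown here.

```shell
#!/bin/sh
# Summarise "failed to assign" resource messages per PCI device.
# Reads kernel log text on stdin and prints "<count> <pci-address>" lines.
# bar_failures is a hypothetical helper, not part of any existing tool.
bar_failures() {
    # Matches lines such as:
    #   pcieport 0000:40:01.1: BAR 13: failed to assign [io size 0x1000]
    # and keeps only the PCI address (domain:bus:device.function).
    sed -n 's/.* \([0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-7]\): BAR [0-9]*: failed to assign.*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

Usage would be `dmesg | bar_failures`, run after the rescan.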

Kernel Version:

5.4.0-107-generic #121~18.04.1-Ubuntu SMP Thu Mar 24 17:21:33 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

hardware information: tried with micron and intel NVME, e.g:

INTEL SSDPE2KE016T8

after a reboot, the NVMe is recognized, so the drive itself is not faulty.

if you need additional debug information, feel free to ask.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1998224

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Sven Kieske (s-kieske) wrote : Re: pcie hotplug not working in linux-generic-hwe-18.04 5.4.0.107.121~18

apport-collect is not installed on this system.

I added some manually collected logs. If you need further debug data, please specify what exactly you need and I will try to provide it.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Does this work on older/newer kernel?

Sven Kieske (s-kieske) wrote :

I haven't been able to test this yet. I have limited time for debugging, and since this is a production system I need to schedule maintenance first, so it might take some time.

Do you know whether this is supposed to work on this kernel?

I saw some patches, afaik from 2017-2019, that addressed problems in this area, but I don't know yet whether they made it into the Ubuntu kernel.

I will report back if a newer kernel fixes this. We are only using official HWE kernels from Ubuntu LTS releases at the moment, though.

Sven Kieske (s-kieske) wrote :

Our vendor told us that this is a generic problem with the following supermicro board/system and many nvme ssd devices:

Board: H12SSW-NT

Server type: Supermicro AS -1114S-WTRT https://www.supermicro.com/Aplus/system/1U/1114/AS-1114S-WTRT.cfm

so it seems this is not kernel related, afaik.

Changed in linux (Ubuntu):
status: Confirmed → Invalid
Sven Kieske (s-kieske) wrote :

my vendor told me this should work with updated bios and firmware, which it does not.

in fact, I can't even find the pci hotplug kernel module: it is neither loaded nor present under /lib/modules/*.

so could you please reopen, so we can double check I'm not missing anything from the ubuntu side?

I have now upgraded to the official 18.04 HWE kernel 5.4.0-135-generic #152

when I e.g. grep for hotplug on a fedora test laptop I get:

grep -i hotplug /lib/modules/6.1.8-100.fc36.x86_64/modules.builtin
kernel/drivers/pci/hotplug/shpchp.ko
kernel/drivers/pci/hotplug/acpiphp.ko

but the ubuntu system returns:

root@ceph-osd01:~# grep -i hotplug /lib/modules/5.4.0-135-generic/modules.builtin
root@ceph-osd01:~#

Changed in linux (Ubuntu):
status: Invalid → New
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.


Changed in linux (Ubuntu):
status: New → Incomplete
Sven Kieske (s-kieske) wrote : Re: pcie hotplug not working in linux-generic-hwe-18.04 5.4.0.107.121~18

as already stated above I did provide log files manually, as apport-collect is not installed on this system.

I'm happy to provide further logfiles to debug this issue.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Sven Kieske (s-kieske) wrote :

okay, it seems that both acpi_pci_hotplug and pci_hotplug are enabled for this kernel:

grep "CONFIG_HOTPLUG_PCI_ACPI=" /boot/config-`uname -r`
CONFIG_HOTPLUG_PCI_ACPI=y
grep "CONFIG_HOTPLUG_PCI=" /boot/config-`uname -r`
CONFIG_HOTPLUG_PCI=y
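The two checks above can be folded into one helper that reports an option's state. This is a minimal sketch: `kconfig_state` is a hypothetical name, and the default path assumes the config file lives at /boot/config-<release>, as it does on this system. An option built with =y does not necessarily appear in modules.builtin, so the empty grep there may not mean the code is missing.

```shell
#!/bin/sh
# Print the state of a kernel config option: "y", "m", or "n" when the
# option is unset or absent. kconfig_state is a hypothetical helper.
#   $1  option name, e.g. CONFIG_HOTPLUG_PCI
#   $2  config file (assumed default: /boot/config-<running kernel>)
kconfig_state() {
    option=$1
    config=${2:-/boot/config-$(uname -r)}
    # Anchoring on "NAME=" keeps CONFIG_HOTPLUG_PCI from also matching
    # CONFIG_HOTPLUG_PCI_ACPI and similar longer names.
    value=$(sed -n "s/^${option}=//p" "$config" | head -n 1)
    printf '%s\n' "${value:-n}"
}
```

For example, `kconfig_state CONFIG_HOTPLUG_PCI_PCIE` would show whether the native PCIe hotplug driver (pciehp) is enabled. Whether that driver or the ACPI one handles these U.2 slots on this board is not something the output above establishes.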

but still, hotplug is not working.

the current situation is: I unplugged a correctly recognized Intel NVMe and replaced it with a Micron NVMe. A second Intel NVMe is still installed.

nvme list shows the intel nvme which remained in the system:

nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     PHLN130100QJ1P6AGN   INTEL SSDPE2KE016T8                      1         1.60 TB / 1.60 TB          512 B + 0 B      VDV10184

but when I look into /sys/, I still find leftover entries for the other Intel NVMe, which was just unplugged:

find /sys/devices | egrep "nvme[0-9][0-9]?$"
/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/nvme/nvme0
/sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1
/sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0
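To tell a live entry from a leftover one, it may help to check whether the entries under nvme-subsystem still resolve. This is a minimal sketch under one assumption: that the nvmeN entries below each nvme-subsysN directory are symlinks to the controller device, so a removed controller would leave a dangling link. `stale_nvme_links` is a hypothetical helper name.

```shell
#!/bin/sh
# List symlinks below the NVMe subsystem directory whose target no longer
# exists. stale_nvme_links is a hypothetical helper.
#   $1  sysfs mount point (defaults to /sys)
stale_nvme_links() {
    root=${1:-/sys}
    dir=$root/devices/virtual/nvme-subsystem
    [ -d "$dir" ] || return 0
    for link in "$dir"/nvme-subsys*/nvme*; do
        # -L: the entry is a symlink; ! -e: its target cannot be resolved.
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

Running `stale_nvme_links` with no argument inspects /sys. If it prints nothing, the entries are either not symlinks or still point at an existing device, and the assumption above does not explain the leftover nvme1.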

also, nvme list-subsys throws an error, I suppose because the old subsystem nvme-subsys1 no longer exists but is still referenced:

nvme list-subsys
free(): double free detected in tcache 2
Aborted

googling the above error leads me to this bug report: https://github.com/linux-nvme/nvme-cli/issues/1707

when rescanning the pci bus via:

echo 1 > /sys/bus/pci/rescan

I get the following messages in dmesg:

[Wed Feb 8 10:20:49 2023] pci 0000:c3:00.0: PCI bridge to [bus c4]
[Wed Feb 8 10:20:49 2023] pci 0000:c3:00.0: bridge window [io 0xf000-0xffff]
[Wed Feb 8 10:20:49 2023] pci 0000:c3:00.0: bridge window [mem 0xb8000000-0xb90fffff]
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.1: bridge window [io 0x1000-0x0fff] to [bus 41] add_size 1000
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.2: bridge window [io 0x1000-0x0fff] to [bus 42] add_size 1000
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.1: BAR 13: no space for [io size 0x1000]
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.1: BAR 13: failed to assign [io size 0x1000]
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.2: BAR 13: no space for [io size 0x1000]
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.2: BAR 13: failed to assign [io size 0x1000]
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.2: BAR 13: no space for [io size 0x1000]
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.2: BAR 13: failed to assign [io size 0x1000]
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.1: BAR 13: no space for [io size 0x1000]
[Wed Feb 8 10:20:49 2023] pcieport 0000:40:01.1: BAR 13: failed to assign [io size 0x1000]

these are the same BAR 13 failures as in the original bug report.

how can I get rid of the now defunct nvme subsystem? anything else I could try?

summary: - pcie hotplug not working in linux-generic-hwe-18.04 5.4.0.107.121~18
+ pcie hotplug not working in linux-generic-hwe-18.04 5.4.0.135.152~18
Sven Kieske (s-kieske) wrote :

I was finally able to test this with the mainline kernel:

6.1.10-060110-generic #202302060840

taken from: https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.10/amd64/linux-image-unsigned-6.1.10-060110-generic_6.1.10-060110.202302060840_amd64.deb

and it works!

to be clear, this is specifically about hot swap.

I suspect the following commits, which would be needed to make this work, are missing from the 18.04 HWE kernel:

2baa85d6927d11b8d946da2e4ad00dddca5b8da2 (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2baa85d6927d11b8d946da2e4ad00dddca5b8da2)

and 85ae3970a0e393cbb07ec30ac99d82cfd6c3f922 (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=85ae3970a0e393cbb07ec30ac99d82cfd6c3f922)

Sven Kieske (s-kieske) wrote :

The above comment might not be correct, because it turned out one of the micron nvme devices had a defect and was not recognized by any hardware at all, even when not hot swapping.

There is a chance that this is related to the bios/firmware combination on this supermicro model.

I will test this again with a known working micron nvme and the following scenarios:

- hotswap intel -> micron
- hotswap micron -> micron

both will first be tested on the linux-generic-hwe-18.04 5.4.0.135.152~18 kernel.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete