Nova fails to reuse mdev vgpu devices

Bug #1981631 reported by Colby Walsworth
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Opinion
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description:
============================
Hello, we are experiencing a strange issue: Nova creates the mdev devices from virtual functions when none exist yet, but it will not reuse them once they have all been created and the vGPU instances are deleted.

I believe part of this issue is related to the UUID issue from this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1701281

Manually applying the latest patch partially fixed the issue (placement stopped reporting that no hosts were available); now the error is on the hypervisor side, saying 'no vGPU resources available'.

If I manually remove the mdev device with a command like the following:
echo "1" > /sys/bus/mdev/devices/150c155c-da0b-45a6-8bc1-a8016231b100/remove

then I'm able to spin up an instance again.
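
For reference, a minimal sketch of that workaround applied to every mdev defined on the host; this assumes all of them are stale and none is still attached to a running guest, otherwise the removal will fail or break the instance:

# remove every mdev device currently defined on the host
# only safe when no running instance is still using them
for dev in /sys/bus/mdev/devices/*; do
    [ -e "$dev" ] || continue    # no mdevs present
    echo "removing $(basename "$dev")"
    echo 1 > "$dev/remove"
done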

All mdev devices match between mdevctl list and virsh nodedev-list.

Steps to reproduce:
================================
1) freshly set up hypervisor with no mdev devices created yet
2) spin up vGPU instances until all the mdevs that will fit on the physical GPU(s) have been created (see the boot sketch after this list)
3) delete the vGPU instances
4) try to spin up new vGPU instances
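
For step 2, a minimal sketch of how such an instance can be booted; the flavor, image and network names below are placeholders, only the resources:VGPU=1 extra spec is the standard way a Nova flavor requests a vGPU:

# create a flavor that requests one vGPU via the placement VGPU resource class
openstack flavor create vgpu.small --vcpus 4 --ram 8192 --disk 40
openstack flavor set vgpu.small --property resources:VGPU=1

# boot an instance with that flavor (image/network names are examples)
openstack server create --flavor vgpu.small --image centos8-stream --network private colby_gpu_test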

Expected Result:
=====================================
Instances spin up and reuse the existing mdev vGPU devices.

Actual Result:
=====================================
Build error from Nova API:
Error: Failed to perform requested operation on instance "colby_gpu_test23", the instance has an error status: Please try again later [Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance c18565f9-da37-42e9-97b9-fa33da5f1ad0.].

Error in hypervisor logs:
nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: vGPU resource is not available

mdevctl output:
cdc98056-8597-4531-9e55-90ab44a71b4e 0000:21:00.7 nvidia-563 manual
298f1e4b-784d-42a9-b3e5-bdedd0eeb8e1 0000:21:01.2 nvidia-563 manual
2abee89e-8cb4-4727-ac2f-62888daab7b4 0000:21:02.4 nvidia-563 manual
32445186-57ca-43f4-b599-65a455fffe65 0000:21:04.2 nvidia-563 manual
0c4f5d07-2893-49a1-990e-4c74c827083b 0000:81:00.7 nvidia-563 manual
75d1b78c-b097-42a9-b736-4a8518b02a3d 0000:81:01.2 nvidia-563 manual
a54d33e0-9ddc-49bb-8908-b587c72616a9 0000:81:02.5 nvidia-563 manual
cd7a49a8-9306-41bb-b44e-00374b1e623a 0000:81:03.4 nvidia-563 manual

virsh nodedev-list --cap mdev:
mdev_0c4f5d07_2893_49a1_990e_4c74c827083b_0000_81_00_7
mdev_298f1e4b_784d_42a9_b3e5_bdedd0eeb8e1_0000_21_01_2
mdev_2abee89e_8cb4_4727_ac2f_62888daab7b4_0000_21_02_4
mdev_32445186_57ca_43f4_b599_65a455fffe65_0000_21_04_2
mdev_75d1b78c_b097_42a9_b736_4a8518b02a3d_0000_81_01_2
mdev_a54d33e0_9ddc_49bb_8908_b587c72616a9_0000_81_02_5
mdev_cd7a49a8_9306_41bb_b44e_00374b1e623a_0000_81_03_4
mdev_cdc98056_8597_4531_9e55_90ab44a71b4e_0000_21_00_7

nvidia-smi vgpu output:
Wed Jul 13 20:15:16 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06 Driver Version: 510.73.06 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA A40 | 00000000:21:00.0 | 0% |
| 3251635106 NVIDIA A40-12Q | 2786... instance-00014520 | 0% |
| 3251635117 NVIDIA A40-12Q | 6dc4... instance-0001452f | 0% |
+---------------------------------+------------------------------+------------+
| 1 NVIDIA A40 | 00000000:81:00.0 | 0% |
| 3251635061 NVIDIA A40-12Q | 0d95... instance-000144de | 0% |
| 3251635094 NVIDIA A40-12Q | 40a0... instance-0001450e | 0% |
| 3251635112 NVIDIA A40-12Q | 776e... instance-00014529 | 0% |
+---------------------------------+------------------------------+------------+

Environment:
===========================================
CentOS 8 Stream
OpenStack Victoria (Nova 22.4.0-1)
libvirt 8.0.0-6
qemu-kvm 6.2.0-12
Nvidia A40 GPUs

Tags: vgpu
Sylvain Bauza (sylvain-bauza) wrote:

This bug seems to be a libvirt regression: libvirt isn't catching the mdev creation on the fly.

I filed a BZ against the libvirt team for investigation: https://bugzilla.redhat.com/show_bug.cgi?id=2109450

Changed in nova:
status: New → Confirmed
Dylan McCulloch (dylan-mcculloch) wrote:

I think I've run into something similar with A100 GPUs on Victoria. We're running libvirt 6.0.0, so I don't think it's due to a regression, but I may be wrong.
In our case I think it's due to the way the A100s have their inventory presented to placement. There appears to be a mismatch between the vgpu capacity of the host when certain vgpu types are selected and the number of resource providers that are created in placement which correspond to each GPU PCI Bus:Device.Function address.
e.g. We have a host with 2 x A100s. If we configure nova to enable the nvidia-471 (A100-10C) vgpu_type we can use 4 VGPUs per physical card (i.e. we can launch a total of 8 instances with VGPU=1 on that host).
The problem is that there are 32 GPU PCI Bus:Device.Function addresses on the host (16 for each card), and a resource provider is created in placement for each GPU PCI BDF address with a VGPU=1 inventory.

So, placement thinks there are 32 VGPUs available but the enabled nvidia vgpu type only allows 8.
When an instance is spawned on the host an mdev is created for a specific BDF. So, we launch 8 instances and 8 mdevs are created, each corresponding to a different PCI address. Launching a 9th instance will pass placement and schedule, but fail spawning due to lack of vgpu capacity on the host.
After deleting one or more instances and attempting to boot a new instance, placement will allocate one of the 32 resource providers, and the spawn only succeeds if that resource provider corresponds to the BDF of an existing and available mdev.
To work around this, I've set a custom trait on 8 of the 32 resource providers (those corresponding to 4 BDF addresses on each of the two cards in the host) and updated the relevant flavors to require that trait.
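
A quick way to see the mismatch described above is to inspect the per-VF resource providers directly in placement. A sketch using the osc-placement CLI; <rp_uuid> is a placeholder for any of the provider UUIDs the first command returns:

# list every resource provider that can serve a VGPU=1 request
openstack resource provider list --resource VGPU=1

# inspect the VGPU inventory of one of the per-VF providers
openstack resource provider inventory list <rp_uuid>

With nvidia-471 enabled on two A100s you would expect 8 usable vGPUs in total, yet 32 providers show up, one per VF.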

Sylvain Bauza (sylvain-bauza) wrote:

OK, I may have mistriaged this bug report, as this is specific to the Ampere architecture with SR-IOV support, so never mind comment #2.

FWIW, this hardware support is very special, as you indeed need to enable VFs, as described in the NVIDIA docs:
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#creating-sriov-vgpu-device-red-hat-el-kvm

Indeed, 32 VFs would be configured, *but* if you point enabled_vgpu_types at the right nvidia-471 type for the PCI address, then the VGPU inventory for this PCI device will have a total of 4, not 32 as I tested earlier.

Anyway, this whole Ampere support is fragile, as it is not fully supported upstream, so I'm about to set this bug to Opinion, since Ampere GPUs can't be tested upstream.

Please do further testing to identify whether something is missing in the current vGPU support we have in Nova that would break Ampere support, but please understand that upstream support is absolutely hardware-independent and must not be NVIDIA-specific.

tags: added: vgpu
Changed in nova:
status: Confirmed → Opinion
Dylan McCulloch (dylan-mcculloch) wrote:

Thanks Sylvain. Just to clarify, are you suggesting that we could specify a subset of the PCI addresses in the vgpu_type group in nova.conf to supply placement with the correct inventory (i.e. instead of using custom traits to work around the issue)?
e.g.
[devices]
enabled_vgpu_types = nvidia-471

[vgpu_nvidia-471]
device_addresses = 0000:41:00.4,0000:a1:00.4

I had initially tried that, but it didn't work. It seems that when a single enabled_vgpu_type is specified on the host, it is assumed that all discovered GPU devices are of the enabled type and any pgpu_type_mapping is ignored. As a result, all of the GPU PCI addresses are added as inventory in placement rather than the subset specified in device_addresses.
The attached patch seems to fix the issue. I've only tested an equivalent patch on Victoria. We're unable to test against master with this hardware currently, though the issue still appears to exist in the code on master. Happy to send up a review if that looks sane.

Dylan McCulloch (dylan-mcculloch) wrote:

Ignore my patch in comment #5. It obviously doesn't work for the case in which a single vgpu type is enabled and no vgpu_type group is specified with device_addresses in nova.conf.
e.g. only specifying:
enabled_vgpu_types=nvidia-471

OpenStack Infra (hudson-openstack) wrote: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/858012

Marcus Boden (marcusboden) wrote:

We're running into the same issue. I've used the same workaround using traits as Dylan mentioned in #3.
For future travellers, these were my steps:
openstack trait create CUSTOM_VGPU_PLACEMENT

# the VFs/BDFs I'm grepping for are the ones already allocated to an mdev on the machine
for uuid in $(openstack resource provider list |
              grep -e 0000_25_00_6 -e 0000_25_01_3 -e 0000_25_03_0 \
                   -e 0000_25_03_3 -e 0000_25_03_6 -e 0000_25_04_1 |
              awk '{print $2}'); do
    openstack resource provider trait set --trait CUSTOM_VGPU_PLACEMENT $uuid
done

openstack flavor set <my_flavor> --property trait:CUSTOM_VGPU_PLACEMENT=required
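
To double-check that the trait and flavor changes landed, something like the following should work (a sketch; <rp_uuid> and <my_flavor> are placeholders):

# confirm the trait is set on a given resource provider
openstack resource provider trait list <rp_uuid>

# confirm the flavor now requires the trait
openstack flavor show <my_flavor> -c properties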
