Inventories of SR-IOV GPU VFs are impacted by allocations for other VFs

Bug #2041519 reported by Sylvain Bauza
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

It's hard to summarize this problem in a bug report title, my bad.

Long story short, the problem occurs if you start using NVIDIA's SR-IOV next-gen GPUs like the A100, which create virtual functions (VFs) on the host, each of them supporting the same GPU types but reporting an available_instances value of 1, i.e. only one mediated device can be created per VF.
If you're using other GPUs (like the V100) and you're not running NVIDIA's sriov-manage to expose the VFs, never mind this bug, you should not be impacted.

So, say you have an A100 GPU card. Before configuring Nova, you have to run the aforementioned sriov-manage script, which will allocate 16 virtual functions for the GPU. Each of those PCI addresses will correspond to a Placement resource provider (if you configure Nova accordingly) with a VGPU inventory of total=1.

Example:
https://paste.opendev.org/show/bVxrVLW3yOR3TPV2Lz3A/

Sysfs shows the exact same thing for the nvidia-472 type I configured:
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
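
For context, the Nova configuration behind such a setup looks roughly like this (a sketch only; the PCI addresses are placeholders, the real list being the 16 VF addresses created by sriov-manage):

    [devices]
    enabled_mdev_types = nvidia-472

    [mdev_nvidia-472]
    # one entry per SR-IOV virtual function exposed by the A100 (16 in total)
    device_addresses = 0000:41:00.4,0000:41:00.5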

Now, the problem arises when you exhaust the number of mediated devices you can create.
In the case of nvidia-472, which corresponds to NVIDIA's GRID A100-20C, you can create up to 2 VGPUs, i.e. mediated devices.

Accordingly, when Nova automatically creates those 2 mediated devices while booting instances (which it does whenever *no* free mediated device is available to reuse), *all the other* VFs that don't host those 2 mediated devices end up with an available_instances value of 0:

[stack@lenovo-sr655-01 nova]$ openstack server create --image cirros-0.6.2-x86_64-disk --flavor c1g --key-name mykey --network public vm1
(skipped)
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
[stack@lenovo-sr655-01 nova]$ openstack server create --image cirros-0.6.2-x86_64-disk --flavor c1g --key-name mykey --network public vm2
(skipped)
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

Now, when we look at the inventories for all the VFs, it's normal to see 2 resource providers with their total at 1 (since we created a mdev, it's counted) and their usage at 1; however, it's not normal to see the *other* VFs with a total of 1 and a usage of 0.

[stack@lenovo-sr655-01 nova]$ for uuid in $(openstack resource provider list -f value -c uuid); do openstack resource provider inventory list $uuid -f value -c resource_class -c total -c used; done | grep VGPU
VGPU 1 1
VGPU 1 1
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0

I eventually dug into the code and found the culprit:

https://github.com/openstack/nova/blob/9c9cd3d9b6d1d1e6f62012cd8a86fd588fb74dc2/nova/virt/libvirt/driver.py#L9110-L9111

Before this method is called, we correctly calculate the totals we get from libvirt, and all the unused VFs have total=0, but because we enter this conditional, we skip updating them.
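
Roughly, the pattern there boils down to something like this (a paraphrased sketch, not the literal nova code; the inventory dict shape follows what Placement expects for a VGPU inventory):

    def update_vgpu_inventories(provider_tree, inventories):
        # `inventories` maps each VF resource provider name to the VGPU total
        # computed from libvirt's available_instances.
        for rp_name, total in inventories.items():
            if total == 0:
                # Skipping here leaves the provider's stale total=1 inventory
                # in Placement instead of updating (or removing) it.
                continue
            provider_tree.update_inventory(rp_name, {
                'VGPU': {'total': total, 'reserved': 0, 'min_unit': 1,
                         'max_unit': total, 'step_size': 1,
                         'allocation_ratio': 1.0},
            })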

There are different ways to solve this problem:
 - we stop automatically creating mediated devices and ask operators to pre-allocate all mediated devices before starting nova-compute, but this has a big operator impact (and they need to add some tooling)
 - we blindly remove the RP from the PlacementTree and let the update_resource_providers() call in the compute manager try to update Placement with this new view (see the sketch below). In that very particular case, we're sure that none of the RPs with total=0 have allocations against them, so it shouldn't fail, but this logic can be error-prone if we try to reproduce it elsewhere.
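
A minimal sketch of the second option, assuming nova's ProviderTree exists()/remove() helpers and that the compute manager later syncs the tree to Placement (illustrative names, not the merged fix):

    def prune_empty_vgpu_providers(provider_tree, vf_totals):
        """Drop VF resource providers whose recomputed VGPU total is 0.

        Safe in this particular case because an RP with total=0 cannot have
        VGPU allocations against it, so the later Placement sync should not
        fail.
        """
        for rp_name, total in vf_totals.items():
            if total == 0 and provider_tree.exists(rp_name):
                provider_tree.remove(rp_name)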

Tags: vgpu
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/899625

Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/899625
Committed: https://opendev.org/openstack/nova/commit/60851e44649e463bfda25d9dea84443467e4a30c
Submitter: "Zuul (22348)"
Branch: master

commit 60851e44649e463bfda25d9dea84443467e4a30c
Author: Sylvain Bauza <email address hidden>
Date: Mon Oct 30 18:11:46 2023 +0100

    libvirt: Cap with max_instances GPU types

    We want to cap the maximum number of mdevs we can create.
    If some type has enough capacity, then the other GPUs won't be used and
    their existing ResourceProviders will be deleted.

    Closes-Bug: #2041519
    Change-Id: I069879a333152bb849c248b3dcb56357a11d0324
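
For operators, the cap from this change is set per mdev type group; a configuration along these lines (option name taken from the commit title, value illustrative) would limit Nova to two mediated devices of that type:

    [devices]
    enabled_mdev_types = nvidia-472

    [mdev_nvidia-472]
    # cap the number of mediated devices Nova may create for this type
    max_instances = 2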

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/902084
Committed: https://opendev.org/openstack/nova/commit/d445eaf9dd94a42916688b05a28d3aa1f9970ede
Submitter: "Zuul (22348)"
Branch: master

commit d445eaf9dd94a42916688b05a28d3aa1f9970ede
Author: Sylvain Bauza <email address hidden>
Date: Tue Nov 28 11:52:57 2023 +0100

    vgpu: Allow device_addresses to not be set

    Sometimes, some GPU may have a long list of PCI addresses (say an SR-IOV
    GPU) or operators may have a long list of GPUs. In order to make their
    lives easier, let's allow device_addresses to be optional.

    This means that a valid configuration could be:

        [devices]
        enabled_mdev_types = nvidia-35, nvidia-36

        [mdev_nvidia-35]

        [mdev_nvidia-36]

    NOTE(sbauza): we have a slight coverage gap for testing what happens
    if the groups aren't set, but I'll add it in a follow-up patch

    Related-Bug: #2041519
    Change-Id: I73762a0295212ee003db2149d6a9cf701023464f

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 29.0.0.0rc1

This issue was fixed in the openstack/nova 29.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/916089
