Inventories of SR-IOV GPU VFs are impacted by allocations for other VFs

Bug #2041519 reported by Sylvain Bauza
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

It's hard to summarize this problem in a bug report title, my bad.

Long story short, the problem occurs if you start using NVIDIA's SR-IOV next-gen GPUs like the A100, which create virtual functions (VFs) on the host, each of them supporting the same GPU types but reporting an available_instances value of 1, i.e. only one mediated device can be created per VF.
If you're using other GPUs (like the V100) and you're not running NVIDIA's sriov-manage to expose the VFs, never mind this bug, you should not be impacted.

So, say you have an A100 GPU card. Before configuring Nova, you have to run the aforementioned sriov-manage script, which will allocate 16 virtual functions for the GPU. Each of those PCI addresses will correspond to a Placement resource provider (if you configure Nova accordingly) with a VGPU inventory of total=1.

Example:
https://paste.opendev.org/show/bVxrVLW3yOR3TPV2Lz3A/

Sysfs shows the exact same thing for the nvidia-472 type I configured:
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
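
For context, the Nova configuration behind such a setup looks roughly like this (a sketch only; the PCI addresses are placeholders, the real list being the 16 VF addresses created by sriov-manage):

    [devices]
    enabled_mdev_types = nvidia-472

    [mdev_nvidia-472]
    # one entry per SR-IOV virtual function exposed by the A100 (16 in total)
    device_addresses = 0000:41:00.4,0000:41:00.5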

Now, the problem arises when you exhaust the number of mediated devices you can create.
In the case of nvidia-472, which corresponds to NVIDIA's GRID A100-20C, you can create up to 2 VGPUs, i.e. mediated devices.

Accordingly, when Nova automatically creates those 2 mediated devices while booting instances (which it does whenever *no* free mediated device is available to reuse), *all the other* VFs that don't host those 2 mediated devices end up with an available_instances value of 0:

[stack@lenovo-sr655-01 nova]$ openstack server create --image cirros-0.6.2-x86_64-disk --flavor c1g --key-name mykey --network public vm1
(skipped)
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
[stack@lenovo-sr655-01 nova]$ openstack server create --image cirros-0.6.2-x86_64-disk --flavor c1g --key-name mykey --network public vm2
(skipped)
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

Now, when we look at the inventories for all the VFs, it's normal to see 2 resource providers with their total at 1 (since we created a mdev, it's counted) and their usage at 1; however, it's not normal to see the *other* VFs with a total of 1 and a usage of 0.

[stack@lenovo-sr655-01 nova]$ for uuid in $(openstack resource provider list -f value -c uuid); do openstack resource provider inventory list $uuid -f value -c resource_class -c total -c used; done | grep VGPU
VGPU 1 1
VGPU 1 1
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0

I eventually dug into the code and found the culprit:

https://github.com/openstack/nova/blob/9c9cd3d9b6d1d1e6f62012cd8a86fd588fb74dc2/nova/virt/libvirt/driver.py#L9110-L9111

Before this method is called, we correctly calculate the totals we get from libvirt, and all the unused VFs have total=0, but because we enter this conditional, we skip updating them.
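
Roughly, the pattern there boils down to something like this (a paraphrased sketch, not the literal nova code; the inventory dict shape follows what Placement expects for a VGPU inventory):

    def update_vgpu_inventories(provider_tree, inventories):
        # `inventories` maps each VF resource provider name to the VGPU total
        # computed from libvirt's available_instances.
        for rp_name, total in inventories.items():
            if total == 0:
                # Skipping here leaves the provider's stale total=1 inventory
                # in Placement instead of updating (or removing) it.
                continue
            provider_tree.update_inventory(rp_name, {
                'VGPU': {'total': total, 'reserved': 0, 'min_unit': 1,
                         'max_unit': total, 'step_size': 1,
                         'allocation_ratio': 1.0},
            })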

There are different ways to solve this problem:
 - we stop automatically creating mediated devices and ask operators to pre-allocate all mediated devices before starting nova-compute, but this has a big operator impact (and they need to add some tooling)
 - we blindly remove the RP from the PlacementTree and let the update_resource_providers() call in the compute manager try to update Placement with this new view (see the sketch below). In that very particular case, we're sure that none of the RPs with total=0 have allocations against them, so it shouldn't fail, but this logic can be error-prone if we try to reproduce it elsewhere.
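
A minimal sketch of the second option, assuming nova's ProviderTree exists()/remove() helpers and that the compute manager later syncs the tree to Placement (illustrative names, not the merged fix):

    def prune_empty_vgpu_providers(provider_tree, vf_totals):
        """Drop VF resource providers whose recomputed VGPU total is 0.

        Safe in this particular case because an RP with total=0 cannot have
        VGPU allocations against it, so the later Placement sync should not
        fail.
        """
        for rp_name, total in vf_totals.items():
            if total == 0 and provider_tree.exists(rp_name):
                provider_tree.remove(rp_name)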

Tags: vgpu
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/899625

Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/899625
Committed: https://opendev.org/openstack/nova/commit/60851e44649e463bfda25d9dea84443467e4a30c
Submitter: "Zuul (22348)"
Branch: master

commit 60851e44649e463bfda25d9dea84443467e4a30c
Author: Sylvain Bauza <email address hidden>
Date: Mon Oct 30 18:11:46 2023 +0100

    libvirt: Cap with max_instances GPU types

    We want to cap the maximum number of mdevs we can create.
    If some type has enough capacity, then the other GPUs won't be used and
    their existing ResourceProviders will be deleted.

    Closes-Bug: #2041519
    Change-Id: I069879a333152bb849c248b3dcb56357a11d0324
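
For operators, the cap from this change is set per mdev type group; a configuration along these lines (option name taken from the commit title, value illustrative) would limit Nova to two mediated devices of that type:

    [devices]
    enabled_mdev_types = nvidia-472

    [mdev_nvidia-472]
    # cap the number of mediated devices Nova may create for this type
    max_instances = 2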

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/902084
Committed: https://opendev.org/openstack/nova/commit/d445eaf9dd94a42916688b05a28d3aa1f9970ede
Submitter: "Zuul (22348)"
Branch: master

commit d445eaf9dd94a42916688b05a28d3aa1f9970ede
Author: Sylvain Bauza <email address hidden>
Date: Tue Nov 28 11:52:57 2023 +0100

    vgpu: Allow device_addresses to not be set

    Sometimes, some GPU may have a long list of PCI addresses (say an SR-IOV
    GPU) or operators may have a long list of GPUs. In order to make their
    lives easier, let's allow device_addresses to be optional.

    This means that a valid configuration could be:

        [devices]
        enabled_mdev_types = nvidia-35, nvidia-36

        [mdev_nvidia-35]

        [mdev_nvidia-36]

    NOTE(sbauza): we have a slight coverage gap for testing what happens
    if the groups aren't set, but I'll add it in a follow-up patch

    Related-Bug: #2041519
    Change-Id: I73762a0295212ee003db2149d6a9cf701023464f

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 29.0.0.0rc1

This issue was fixed in the openstack/nova 29.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/916089
