Inventories of SR-IOV GPU VFs are impacted by allocations for other VFs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | Unassigned |
Bug Description
It's hard to summarize this problem in a bug report title, my bad.
Long story short, the problem arises if you start using NVIDIA's SR-IOV next-gen GPUs like the A100, which create Virtual Functions on the host, each of them supporting the same GPU types but with available_instances (the number of mediated devices that can be created on it) equal to 1.
If you're using other GPUs (like the V100) and you're not running NVIDIA's sriov-manage to expose the VFs, please disregard this bug; you should not be impacted.
So, say you have an A100 GPU card: before configuring Nova, you have to run the aforementioned sriov-manage script, which will allocate 16 virtual functions for the GPU. Each of those PCI addresses will correspond to a Placement resource provider (if you configure Nova accordingly) with a VGPU inventory of total=1.
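For reference, here is a minimal Python sketch of how those VFs show up in sysfs; the PF PCI address below is a made-up example, and the virtfn* symlinks are the standard SR-IOV sysfs layout:

    import os

    # Hypothetical PCI address of the physical A100; adjust to your host.
    PF_ADDR = "0000:81:00.0"
    pf_path = f"/sys/bus/pci/devices/{PF_ADDR}"

    # Each virtfnN symlink points at one of the 16 VFs created by sriov-manage.
    vfs = sorted(e for e in os.listdir(pf_path) if e.startswith("virtfn"))
    for vf in vfs:
        vf_addr = os.path.basename(os.readlink(os.path.join(pf_path, vf)))
        print(vf, "->", vf_addr)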
Example: https:/
Sysfs shows the exact same thing for the nvidia-472 type I configured:
[stack@
(output: each of the 16 VFs reports available_instances = 1)
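That listing can be reproduced with a small sketch like the following; the nvidia-472 type comes from my setup above, and the PF address is again a placeholder:

    import glob
    import os

    MDEV_TYPE = "nvidia-472"  # the type configured above
    PF_ADDR = "0000:81:00.0"  # hypothetical physical function address

    for link in sorted(glob.glob(f"/sys/bus/pci/devices/{PF_ADDR}/virtfn*")):
        vf_addr = os.path.basename(os.readlink(link))
        path = (f"/sys/bus/pci/devices/{vf_addr}/mdev_supported_types/"
                f"{MDEV_TYPE}/available_instances")
        with open(path) as f:
            print(vf_addr, f.read().strip())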
Now, the problem arises once you exhaust the number of mediated devices you can create.
In the case of nvidia-472, which corresponds to NVIDIA's GRID A100-20C, you can create up to 2 VGPUs, i.e. mediated devices.
Accordingly, when booting an instance, Nova automatically creates a mediated device if *no* free one is found; once the 2 mediated devices have been created this way, *all the other* VFs that don't hold those 2 mediated devices end up with an available_instances value equal to 0:
[stack@
(skipped)
[stack@
(output: after the first mediated device is created, the VF that holds it reports available_instances = 0 while the 15 other VFs still report 1)
[stack@
(skipped)
[stack@
(output: after the second mediated device is created, all 16 VFs report available_instances = 0)
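For context, creating a mediated device on a VF is just a sysfs write, which is essentially what Nova does under the hood; a sketch, with a placeholder VF address:

    import uuid

    MDEV_TYPE = "nvidia-472"
    VF_ADDR = "0000:81:00.4"  # hypothetical VF address

    # Writing a UUID to the "create" node instantiates the mediated device.
    # Once 2 mdevs of this type exist on the physical GPU, the driver reports
    # available_instances = 0 on every VF, not just the two that are in use.
    mdev_uuid = str(uuid.uuid4())
    create = (f"/sys/bus/pci/devices/{VF_ADDR}/mdev_supported_types/"
              f"{MDEV_TYPE}/create")
    with open(create, "w") as f:
        f.write(mdev_uuid)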
Now, when we look at the inventories for all VFs, we see that while it's normal for the 2 resource providers whose mdevs we created to have total=1 (since an mdev exists there, it's counted) and usage=1, it's *not* normal to see the *other VFs* still having a total of 1 with a usage of 0:
[stack@
(columns: resource class / total / usage)
VGPU 1 1
VGPU 1 1
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
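Those total/usage columns can also be read straight from the Placement API; a minimal sketch using requests, where the endpoint, token and resource provider UUID are placeholders:

    import requests

    PLACEMENT = "http://controller/placement"  # placeholder endpoint
    TOKEN = "..."                               # placeholder auth token
    RP_UUID = "..."                             # one VF's resource provider UUID

    headers = {"X-Auth-Token": TOKEN,
               "OpenStack-API-Version": "placement 1.17"}
    inv = requests.get(f"{PLACEMENT}/resource_providers/{RP_UUID}/inventories",
                       headers=headers).json()
    use = requests.get(f"{PLACEMENT}/resource_providers/{RP_UUID}/usages",
                       headers=headers).json()
    print("total:", inv["inventories"]["VGPU"]["total"],
          "usage:", use["usages"].get("VGPU", 0))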
I eventually dug into the code and found the culprit:
Before this method is called, we correctly compute the numbers we get from libvirt, and all the unused VFs end up with total=0, but since we enter this conditional, we skip updating them.
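Schematically, the logic behaves like this (a paraphrase for illustration, not the actual Nova code; the names are mine):

    def _update_vgpu_inventory(provider_tree, rp_name, computed_total):
        # Paraphrase of the culprit: when the resource provider already
        # exists in the tree, the freshly computed totals from libvirt
        # (0 for the now-unusable VFs) are never written back, so
        # Placement keeps the stale total=1 inventory.
        if provider_tree.exists(rp_name):
            return  # skips the update, leaving total=1 in Placement
        provider_tree.update_inventory(
            rp_name, {"VGPU": {"total": computed_total}})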
There are different ways to solve this problem (a sketch of the second option follows the list):
- we stop automatically creating mediated devices and ask operators to pre-allocate all mediated devices before starting nova-compute, but that has a big operator impact (and they need to add some tooling)
- we blindly remove the RP from the PlacementTree and let update_
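To illustrate that second option, here is a rough sketch (the method name and structure are mine, not the proposed fix):

    def _reshape_vgpu_providers(provider_tree, vf_totals):
        # Illustrative only: drop resource providers whose recomputed total
        # is 0, so the tree pushed to Placement stops advertising VGPU
        # inventory the driver can no longer honor, and let the next
        # update recreate them once capacity frees up.
        for rp_name, total in vf_totals.items():
            if total == 0 and provider_tree.exists(rp_name):
                provider_tree.remove(rp_name)
            elif total > 0:
                provider_tree.update_inventory(
                    rp_name, {"VGPU": {"total": total}})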
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/899625