Comment 8 for bug 2008883

Pavlo Shchelokovskyy (pshchelo) wrote (last edit ):

Reviving this discussion a bit.

Running OpenStack Yoga on an Ubuntu 20.04 host, NVIDIA driver 470.103.02, pGPU NVIDIA A100 PCIe 40GB.

With proper configuration I cannot reproduce this issue; however, I suspect I know why it appeared (see the end of this comment).

My configuration:

nvidia mig:
~# nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name          Profile  Instance   Placement     |
|                       ID       ID       Start:Size    |
|=======================================================|
|   0  MIG 1g.5gb       19        9          2:1        |
+-------------------------------------------------------+
|   0  MIG 1g.5gb       19       10          3:1        |
+-------------------------------------------------------+
|   0  MIG 2g.10gb      14        3          0:2        |
+-------------------------------------------------------+
|   0  MIG 3g.20gb       9        2          4:4        |
+-------------------------------------------------------+

nova config:

[devices]
enabled_mdev_types = nvidia-474,nvidia-475,nvidia-476
[mdev_nvidia-474]
device_addresses = 0000:42:00.4,0000:42:00.5
mdev_class = CUSTOM_MIG_1G_5GB
[mdev_nvidia-475]
device_addresses = 0000:42:00.6
mdev_class = CUSTOM_MIG_2G_10GB
[mdev_nvidia-476]
device_addresses = 0000:42:00.7
mdev_class = CUSTOM_MIG_3G_20GB
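
As an aside, this mapping can be sanity-checked on the host through the standard mdev sysfs interface (a quick sketch using the VF addresses and type names from the config above; the 'name' attribute should show the corresponding vGPU profile, e.g. GRID A100-1-5C for nvidia-474):

~# ls /sys/class/mdev_bus/0000:42:00.4/mdev_supported_types/
~# cat /sys/class/mdev_bus/0000:42:00.4/mdev_supported_types/nvidia-474/name
~# cat /sys/class/mdev_bus/0000:42:00.4/mdev_supported_types/nvidia-474/available_instances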

resource providers (3 computes, but only one has a single A100 GPU in MIG mode); the value columns below are resource class, allocation ratio, min unit, max unit, reserved, step size, total, used:
# openstack resource provider list -f value -c uuid | xargs -l1 openstack resource provider inventory list -f value
VCPU 8.0 1 16 0 1 16 0
MEMORY_MB 1.0 1 31999 512 1 31999 0
DISK_GB 1.6 1 1360 0 1 1360 0
VCPU 8.0 1 16 0 1 16 0
MEMORY_MB 1.0 1 31999 512 1 31999 0
DISK_GB 1.6 1 1360 0 1 1360 0
VCPU 8.0 1 32 0 1 32 13
MEMORY_MB 1.0 1 193449 512 1 193449 16384
DISK_GB 1.6 1 1360 0 1 1360 20
CUSTOM_MIG_2G_10GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_3G_20GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_1G_5GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_1G_5GB 1.0 1 1 0 1 1 1

relevant parts of flavors:
for i in 474 475 476; do openstack flavor show nvidia-$i -c properties -f value; done
{'resources:CUSTOM_MIG_1G_5GB': '1'}
{'resources:CUSTOM_MIG_2G_10GB': '1'}
{'resources:CUSTOM_MIG_3G_20GB': '1'}
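
For reference, such a property is set on the flavor with something like:
# openstack flavor set nvidia-474 --property resources:CUSTOM_MIG_1G_5GB=1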

I can boot all 4 instances, consuming all the available vGPUs:
# openstack server list -c ID -c Flavor -c Status
+--------------------------------------+--------+------------+
| ID | Status | Flavor |
+--------------------------------------+--------+------------+
| d906c0de-22cc-4409-bbc8-b57e49874a40 | ACTIVE | nvidia-476 |
| a77d1acb-24f5-46a7-92d6-12c5ac9efb6b | ACTIVE | nvidia-475 |
| 66592cfd-0789-4186-a307-79bd54d0f121 | ACTIVE | nvidia-474 |
| 0e7f2b9d-ba24-4400-b2c9-9d55402b4791 | ACTIVE | nvidia-474 |
+--------------------------------------+--------+------------+

# nvidia-smi vgpu
Tue Jun 20 15:53:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.02 Driver Version: 470.103.02 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA A100-PCIE-40GB | 00000000:42:00.0 | N/A |
| 3251636855 GRID A100-1-5C | 0e7f... instance-000005b0 | N/A |
| 3251636860 GRID A100-1-5C | 6659... instance-000005b3 | N/A |
| 3251636865 GRID A100-2... | a77d... instance-000005b6 | N/A |
| 3251636870 GRID A100-3... | d906... instance-000005b9 | N/A |
+---------------------------------+------------------------------+------------+

What I suspect has happened here:

This configuration is not 'greenfield': the environment was re-configured from another setup that had different MIG profiles created and configured in nova, and the old state was not cleaned up manually.

There are several problems in Nova currently that may lead to this:

- nova never deletes the mdev devices it has created; instead it re-uses them. If you change enabled_mdev_types and their mapping to PCI devices, and do not manually delete the unused mdevs created earlier by booting instances, the PCI device will report that it does not support the new mdev type (available_instances: 0), because it already has an mdev of another type on top of it. As a result, nova will not update the resource provider with the new resource class or total capacity (see the sysfs sketch after this list).
TBH I am not sure why nova does not delete mdevs, maybe for performance reasons, but at least with nvidia, deleting and creating an mdev is basically instant.

- nova does not clean up redundant resource providers automatically, which is a more pronounced problem with MIG-enabled nvidia pGPUs (because now we have many PCI devices, each supporting a single vGPU). Say you had 3x A100-1-5C and 2x A100-2-10C before, and now move to 2x A100-1-5C, 1x A100-2-10C and 1x A100-3-20C. Even if you try to reuse the PCI device addresses, there will necessarily be at least one device that is no longer mapped to an mdev type in nova but still has its resource provider left in placement, which throws the resource tracking off. If placement chooses that provider to supply the resource class requested in your flavor, the boot will fail on the compute node, as nova is not configured to handle this mdev type on that device. Currently the orphan/redundant resource providers must be cleaned up manually after reconfiguration (see the placement CLI sketch after this list).

- nova does not handle the dynamic device configuration groups ([mdev_...]) when there is only one mdev type enabled. Thus if you enable only 1 mdev type and have, say, one A100 card in MIG mode, nova will create 16 child providers in placement (the number of SR-IOV VFs created for the pGPU), while in fact you can have at most 7 vGPUs on it. If you then reconfigure to some other MIG partitioning, all those 16 providers will be left in place (see the previous point), throwing off the resource capacity from placement's point of view (the sysfs sketch below shows how to compare the VF count with the actual capacity).
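
To illustrate the first and third points, here is a rough sketch of how this stale state can be inspected and cleaned up by hand on the compute host. These are the standard mdev/SR-IOV sysfs paths; 0000:42:00.4 is just the example VF from the config above, 0000:42:00.0 is the pGPU itself, and <uuid> stands for whatever mdev UUID the listing shows:

~# ls /sys/bus/mdev/devices/
(any stale mdevs created for a previously enabled type show up here)
~# cat /sys/class/mdev_bus/0000:42:00.4/mdev_supported_types/nvidia-474/available_instances
(reports 0 while an mdev of another type still sits on the VF)
~# echo 1 > /sys/bus/mdev/devices/<uuid>/remove
(removes the stale mdev; with the nvidia driver this is basically instant)
~# ls -d /sys/bus/pci/devices/0000:42:00.0/virtfn* | wc -l
(shows the 16 SR-IOV VFs created for the pGPU, even though at most 7 MIG-backed vGPUs fit on it)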

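For the second point, the leftover providers currently have to be found and removed by hand with the placement CLI, roughly along these lines (a sketch rather than an exact procedure; <provider-uuid> is a placeholder for whichever child provider is no longer backed by a configured device, and it must have no allocations left before it can be deleted):

# openstack resource provider list
# openstack resource provider inventory list <provider-uuid>
# openstack resource provider delete <provider-uuid>
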
I plan to file all of the above as separate Nova issues, with more details (I already have patches for 2 of them).