VGPU - I can only use one MIG profile in Nova

Bug #2008883 reported by Piotr Łasak
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Wishlist
Assigned to: Unassigned

Bug Description

I have one Nvidia A100 card and can only use one MIG profile.

I divided the card into four MIG instances using three profiles: 2x A100-1-5C, 1x A100-2-10C, and 1x A100-3-20C, as shown below.

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |     10MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    3   0   1  |       6MiB / 9984MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |    4739MiB / 4864MiB | 14      0 |  1   0    0    0    0 |
|                  |       0MiB / 8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   3  |    4739MiB / 4864MiB | 14      0 |  1   0    0    0    0 |
|                  |       0MiB / 8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

My nova configuration: /etc/nova/nova.conf

[devices]
enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
[vgpu_nvidia-474]
device_addresses = 0000:61:00.4,0000:61:01.0
[vgpu_nvidia-475]
device_addresses = 0000:61:01.7
[vgpu_nvidia-476]
device_addresses = 0000:61:00.6
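
To make the intended mapping explicit, here is a minimal Python sketch that parses the same options and lists which vGPU type each VF address is assigned to. The helper name `vf_to_type` is hypothetical and not part of Nova; the sketch only reads the text, it does not reproduce how nova-compute interprets it.

```python
import configparser

# Configuration text copied from the report above.
NOVA_CONF = """
[devices]
enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
[vgpu_nvidia-474]
device_addresses = 0000:61:00.4,0000:61:01.0
[vgpu_nvidia-475]
device_addresses = 0000:61:01.7
[vgpu_nvidia-476]
device_addresses = 0000:61:00.6
"""


def vf_to_type(conf_text):
    """Map each configured PCI address to its vGPU type (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    enabled = parser["devices"]["enabled_vgpu_types"].split(",")
    mapping = {}
    for vgpu_type in (t.strip() for t in enabled):
        section = "vgpu_" + vgpu_type
        if not parser.has_section(section):
            continue  # type enabled without its own device_addresses group
        for addr in parser[section]["device_addresses"].split(","):
            addr = addr.strip()
            if addr in mapping:
                raise ValueError(f"{addr} is listed for more than one type")
            mapping[addr] = vgpu_type
    return mapping


if __name__ == "__main__":
    for addr, vgpu_type in sorted(vf_to_type(NOVA_CONF).items()):
        print(addr, vgpu_type)
```

The duplicate-address check is there because an address listed under two types would make the intended mapping ambiguous.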

# openstack resource provider list
+--------------------------------------+----------------------------------------------+------------+--------------------------------------+--------------------------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+----------------------------------------------+------------+--------------------------------------+--------------------------------------+
| a0269b89-d43d-4042-a64e-3c832f0bb23f | gpu-a01.example.os-tests.com | 104 | a0269b89-d43d-4042-a64e-3c832f0bb23f | None |
| f2e5a4e0-479e-4ee3-b504-36371ded49f5 | gpu-a01.example.os-tests.com_pci_0000_61_01_4 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| a513c661-6dd2-4462-b719-9fbf7b70c409 | gpu-a01.example.os-tests.com_pci_0000_61_01_2 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 9124d3a8-00fb-475e-a0f8-892ccf5d255e | gpu-a01.example.os-tests.com_pci_0000_61_00_7 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 5f443da2-3c75-45c6-9d8a-05ca8a487802 | gpu-a01.example.os-tests.com_pci_0000_61_02_1 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 20da8814-c5f0-4575-a785-579e9abdbb1d | gpu-a01.example.os-tests.com_pci_0000_61_01_3 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 37014baa-fba6-4f14-8be8-17084b3aad36 | gpu-a01.example.os-tests.com_pci_0000_61_01_1 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| a9e2f509-fb03-45a0-9ff0-7c50143c1a9c | gpu-a01.example.os-tests.com_pci_0000_61_02_2 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 202462de-7d01-45c7-b197-1f2ca5c9c7ae | gpu-a01.example.os-tests.com_pci_0000_61_02_3 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0 | gpu-a01.example.os-tests.com_pci_0000_61_01_7 | 43 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 315b5205-26ac-4ec6-b5d2-623cafc18f39 | gpu-a01.example.os-tests.com_pci_0000_61_01_6 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 208f395d-d9c7-4108-8f83-e48cbea0b637 | gpu-a01.example.os-tests.com_pci_0000_61_02_0 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 7d5abf99-3c42-4c62-ba33-15682c6cfc5b | gpu-a01.example.os-tests.com_pci_0000_61_00_4 | 18 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 315b2c35-401c-4165-add8-2b025961b9a0 | gpu-a01.example.os-tests.com_pci_0000_61_01_5 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| c43be42e-e564-46be-9025-4f00a1f7454e | gpu-a01.example.os-tests.com_pci_0000_61_00_5 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 3cd4dbc7-2c2a-448d-a041-27c8fd685950 | gpu-a01.example.os-tests.com_pci_0000_61_01_0 | 14 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 58fbbedb-9845-4397-bd20-f559ba68daee | gpu-a01.example.os-tests.com_pci_0000_61_00_6 | 27 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
+--------------------------------------+----------------------------------------------+------------+--------------------------------------+--------------------------------------+

Created flavors:

openstack --os-placement-api-version 1.6 trait create CUSTOM_N_1
openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_N_1 3cd4dbc7-2c2a-448d-a041-27c8fd685950
openstack flavor create --private --description "vgpu-test" --ram $((8*1024)) --disk 0 --vcpus 8 vgpu-1 --project vgpu --property resources:VGPU=1 --property trait:CUSTOM_N_1=required

openstack --os-placement-api-version 1.6 trait create CUSTOM_N_2
openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_N_2 7d5abf99-3c42-4c62-ba33-15682c6cfc5b
openstack flavor create --private --description "vgpu-test" --ram $((8*1024)) --disk 0 --vcpus 8 vgpu-2 --project vgpu --property resources:VGPU=1 --property trait:CUSTOM_N_2=required

openstack --os-placement-api-version 1.6 trait create CUSTOM_N_3
openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_N_3 5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0
openstack flavor create --private --description "vgpu-test" --ram $((8*1024)) --disk 0 --vcpus 8 vgpu-3 --project vgpu --property resources:VGPU=1 --property trait:CUSTOM_N_3=required

openstack --os-placement-api-version 1.6 trait create CUSTOM_N_4
openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_N_4 58fbbedb-9845-4397-bd20-f559ba68daee
openstack flavor create --private --description "vgpu-test" --ram $((8*1024)) --disk 0 --vcpus 8 vgpu-4 --project vgpu --property resources:VGPU=1 --property trait:CUSTOM_N_4=required
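
As a cross-check of which vGPU type each trait should select, the following Python sketch joins three pieces of data already shown in this report: the trait commands, the resource provider list, and nova.conf. The function names are hypothetical, and the result is only the expected outcome under this configuration, not what Nova actually does.

```python
# Trait -> resource provider UUID, copied from the trait commands above.
TRAIT_TO_RP = {
    "CUSTOM_N_1": "3cd4dbc7-2c2a-448d-a041-27c8fd685950",
    "CUSTOM_N_2": "7d5abf99-3c42-4c62-ba33-15682c6cfc5b",
    "CUSTOM_N_3": "5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0",
    "CUSTOM_N_4": "58fbbedb-9845-4397-bd20-f559ba68daee",
}

# Provider UUID -> provider name, copied from the resource provider list.
RP_NAMES = {
    "3cd4dbc7-2c2a-448d-a041-27c8fd685950": "gpu-a01.example.os-tests.com_pci_0000_61_01_0",
    "7d5abf99-3c42-4c62-ba33-15682c6cfc5b": "gpu-a01.example.os-tests.com_pci_0000_61_00_4",
    "5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0": "gpu-a01.example.os-tests.com_pci_0000_61_01_7",
    "58fbbedb-9845-4397-bd20-f559ba68daee": "gpu-a01.example.os-tests.com_pci_0000_61_00_6",
}

# PCI address -> vGPU type, copied from nova.conf.
ADDR_TO_TYPE = {
    "0000:61:00.4": "nvidia-474",
    "0000:61:01.0": "nvidia-474",
    "0000:61:01.7": "nvidia-475",
    "0000:61:00.6": "nvidia-476",
}


def pci_address(provider_name):
    """Turn '<host>_pci_0000_61_01_0' into '0000:61:01.0'."""
    domain, bus, slot, func = provider_name.rsplit("_pci_", 1)[1].split("_")
    return f"{domain}:{bus}:{slot}.{func}"


def expected_types():
    """Return the vGPU type each trait is expected to select."""
    return {
        trait: ADDR_TO_TYPE[pci_address(RP_NAMES[rp])]
        for trait, rp in TRAIT_TO_RP.items()
    }


if __name__ == "__main__":
    for trait, vgpu_type in expected_types().items():
        print(trait, vgpu_type)
```

Under this reading, vgpu-1 and vgpu-2 are expected to get nvidia-474, vgpu-3 nvidia-475, and vgpu-4 nvidia-476.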

gpu-a01:~/nvidia-dev-ctl# ls /sys/class/mdev_bus/*/mdev_supported_types
'/sys/class/mdev_bus/0000:61:00.4/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:00.5/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:00.6/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:00.7/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:01.0/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:01.1/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:01.2/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:01.3/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:01.4/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:01.5/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:01.6/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:01.7/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:02.0/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:02.1/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:02.2/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706

'/sys/class/mdev_bus/0000:61:02.3/mdev_supported_types':
nvidia-468 nvidia-469 nvidia-470 nvidia-471 nvidia-472 nvidia-473 nvidia-474 nvidia-475 nvidia-476 nvidia-477 nvidia-478 nvidia-706
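
The listing above only shows type IDs. A minimal Python sketch, assuming the standard mdev sysfs layout where each type directory holds `name` and `available_instances` files, can also show the human-readable name and how many instances each VF can still create. The function name `list_mdev_types` is hypothetical.

```python
from pathlib import Path


def list_mdev_types(root="/sys/class/mdev_bus"):
    """Return {pci_address: {type_id: (name, available_instances)}}."""
    result = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return result  # no mdev-capable devices, or not a Linux host
    for dev in sorted(root_path.iterdir()):
        types_dir = dev / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        types = {}
        for type_dir in sorted(types_dir.iterdir()):
            name_file = type_dir / "name"
            avail_file = type_dir / "available_instances"
            # Either file may be missing, so fall back to placeholders.
            name = name_file.read_text().strip() if name_file.exists() else ""
            avail = int(avail_file.read_text()) if avail_file.exists() else None
            types[type_dir.name] = (name, avail)
        result[dev.name] = types
    return result


if __name__ == "__main__":
    for addr, types in list_mdev_types().items():
        for type_id, (name, avail) in types.items():
            print(addr, type_id, name, avail)
```

On a MIG-partitioned card, a type whose `available_instances` is 0 on a given VF cannot be created there, which helps when choosing `device_addresses`.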

Problem:
===========
I can create only two instances, both with A100-1-5C (nvidia-474). The nvidia-475 and nvidia-476 types are ignored and I cannot use them.

If I edit /etc/nova/nova.conf and replace
enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
with
enabled_vgpu_types = nvidia-475,nvidia-476
I can use only the A100-2-10C (nvidia-475) type.

Packages:
============

Version: Ussuri

gpu-a01:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

gpu-a01:~# dpkg -l | grep nova
ii nova-common 2:21.2.4-0ubuntu1 all OpenStack Compute - common files
ii nova-compute 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:21.2.4-0ubuntu1 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:17.0.0-0ubuntu1 all client library for OpenStack Compute API - 3.x

gpu-a01:~# uname -a
Linux compgpu-a01 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Tags: vgpu
Piotr Łasak (plasec)
tags: added: vgpu
Piotr Łasak (plasec)
no longer affects: nova (Ubuntu)
sean mooney (sean-k-mooney) wrote :

Nova does not impose any limits on the MIG profiles or mdev types that can be used.

This is a hardware limitation of the NVIDIA GPUs, not a Nova limitation.

You are using an A100, which does support more than one type, but only in specific combinations that NVIDIA documents in its product docs.

If you find that you cannot use the selected type, this is likely the result of using an unsupported combination.

If they have documented that this should work, you should file a bug with NVIDIA support.

You can find their documentation here:

https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#a100-profiles

I would suggest starting with one of the example topologies before trying your own, to ensure it works.

https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#create-gi

The Nova community only provides enablement of the mdev-based vGPU feature;
we do not provide support for configuring the NVIDIA hardware and software.

Changed in nova:
status: New → Invalid
Piotr Łasak (plasec) wrote :

This is not a problem with the card.
The profiles are created correctly with nvidia-mig-parted, and I can create instances with these profiles.

mdev_supported_types -> name
nvidia-474 -> A100-1-5C
nvidia-475 -> A100-2-10C
nvidia-476 -> A100-3-20C

config: /etc/nova/nova.conf
-------------------------------------------
[devices]
enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
[vgpu_nvidia-474]
device_addresses = 0000:61:00.4,0000:61:01.0
[vgpu_nvidia-475]
device_addresses = 0000:61:01.7
[vgpu_nvidia-476]
device_addresses = 0000:61:00.6
-------------------------------------------

With this config I can only create servers with nvidia-474 (A100-1-5C), even though the trait I specified targets the VF at 0000:61:01.7.

As the logs below show, the instance gets vgpu_type_id=474 on the VF at 0:61:0.7:
Logs:
Mar 2 14:41:07 gpu-a01 nvidia-vgpu-mgr[118045]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 300f8753-3031-4729-8b53-ce43c6859489 GPU PCI id 00:61:00.7 config params vgpu_type_id=474
Mar 2 14:41:07 compgpu-a01 nvidia-vgpu-mgr[118045]: notice: vmiop_log: (0x0): detected a VF at 0:61:0.7

However, if I remove nvidia-474 from /etc/nova/nova.conf and restart the nova-compute process, I can create instances with a different profile.

config: /etc/nova/nova.conf
-------------------------------------------
[devices]
enabled_vgpu_types = nvidia-475,nvidia-476

[vgpu_nvidia-475]
device_addresses = 0000:61:01.7

[vgpu_nvidia-476]
device_addresses = 0000:61:00.6
-------------------------------------------

As the logs below show, after deleting and re-creating my instance it gets vgpu_type_id=475 on the VF at 00:61:01.7.

Logs:
Mar 2 14:52:04 compgpu-a01 nvidia-vgpu-mgr[119109]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid d425c1ee-9232-475a-9489-b0c01b1422bd GPU PCI id 00:61:01.7 config params vgpu_type_id=475
Mar 2 14:52:04 compgpu-a01 nvidia-vgpu-mgr[119109]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=475
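
To compare what the driver actually started against what nova.conf intended, the start-call lines can be parsed mechanically. This is a minimal Python sketch; the regular expression is derived only from the two log lines quoted in this comment, so other driver versions may format the message differently, and the function name `parse_start_calls` is hypothetical.

```python
import re

# Pattern derived from the nvidia-vgpu-mgr lines quoted above (an assumption
# about the format, not a documented log schema).
START_CALL = re.compile(
    r"mdev uuid (?P<uuid>[0-9a-f-]{36}) "
    r"GPU PCI id (?P<pci>[0-9a-f:.]+) "
    r"config params vgpu_type_id=(?P<type_id>\d+)"
)


def parse_start_calls(lines):
    """Return (mdev_uuid, pci_id, vgpu_type_id) for every start-call line."""
    calls = []
    for line in lines:
        match = START_CALL.search(line)
        if match:
            calls.append((match["uuid"], match["pci"], int(match["type_id"])))
    return calls


if __name__ == "__main__":
    sample = [
        "Mar 2 14:52:04 compgpu-a01 nvidia-vgpu-mgr[119109]: notice: "
        "vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio "
        "module: mdev uuid d425c1ee-9232-475a-9489-b0c01b1422bd "
        "GPU PCI id 00:61:01.7 config params vgpu_type_id=475",
    ]
    for uuid, pci, type_id in parse_start_calls(sample):
        print(uuid, pci, type_id)
```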

Something seems to be wrong in how Nova maps vgpu_type_id to VFs.

I can have instances with different MIG profiles, but that requires editing nova.conf and restarting nova-compute every time I want to create an instance of a different type.

gpu-a01:~# nvidia-smi vgpu -v
Thu Mar 2 15:03:43 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.07 Driver Version: 525.85.07 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA A100-PCIE-40GB | 00000000:61:00.0 | N/A |
| 3251650404 GRID A100-2... | d53e... instance-00003074 | N/A |
| 3251651234 GRID A100-1-5C | 9d3d... instance-00003077 | N/A |
+---------------------------------+------------------------------+------------+

Logs:
Mar 2 14:52:04 gpu-a01 nvidia-vgpu-mgr[119109]: notice: vmiop_env_log: (0x0): R...


sean mooney (sean-k-mooney) wrote :

I'm torn between considering this a wishlist bug or a feature request.

I think this is perhaps related to the resource provider mappings.

With this configuration:
[devices]
enabled_vgpu_types = nvidia-474,nvidia-475,nvidia-476
[vgpu_nvidia-474]
device_addresses = 0000:61:00.4,0000:61:01.0
[vgpu_nvidia-475]
device_addresses = 0000:61:01.7
[vgpu_nvidia-476]
device_addresses = 0000:61:00.6

I would expect four resource providers to be created, each with an inventory of 1 VGPU.

Going by the report, those are:

3cd4dbc7-2c2a-448d-a041-27c8fd685950
7d5abf99-3c42-4c62-ba33-15682c6cfc5b
5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0
58fbbedb-9845-4397-bd20-f559ba68daee

Can you do an inventory show on each and confirm that?
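
For convenience, the check requested here can be scripted. This minimal Python sketch only builds the `openstack resource provider inventory list` command line for each of the four UUIDs listed above; it does not run anything, and the helper name `inventory_commands` is hypothetical.

```python
import shlex

# The four resource providers listed above.
PROVIDERS = [
    "3cd4dbc7-2c2a-448d-a041-27c8fd685950",
    "7d5abf99-3c42-4c62-ba33-15682c6cfc5b",
    "5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0",
    "58fbbedb-9845-4397-bd20-f559ba68daee",
]


def inventory_commands(uuids):
    """Build one inventory-list argv per resource provider UUID."""
    return [
        ["openstack", "resource", "provider", "inventory", "list", uuid]
        for uuid in uuids
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in inventory_commands(PROVIDERS):
        print(shlex.join(argv))
```

Each provider is expected to report a single VGPU row with a total of 1 if the mapping is as intended.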

Looking at the flavors, you appear to have added the correct trait requests to have them target the appropriate RPs.

The approach you are taking was replaced by the generic mdev feature in Xena.
https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/generic-mdevs.html

There, instead of tagging the RP manually with a trait, you would use a different resource class per mdev type.

You are essentially trying to use this feature:

https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/vgpu-stein.html

but instead of having multiple physical GPUs, you are trying to use MIG to partition the GPU into VFs first.

That was intended to be enabled by
https://specs.openstack.org/openstack/nova-specs/specs/ussuri/implemented/vgpu-multiple-types.html

However, when that feature was implemented, no released GPU supported MIG or multiple mdev types on the same card.

As such, it was only ever tested with multiple mdev types on the same host, but with one pGPU per mdev type.

Changed in nova:
importance: Undecided → Wishlist
status: Invalid → Incomplete
sean mooney (sean-k-mooney) wrote :

Just following up on my last comment: when we implemented
https://specs.openstack.org/openstack/nova-specs/specs/ussuri/implemented/vgpu-multiple-types.html
the spec linked to the fact that the NVIDIA GRID drivers at the time did not support multiple mdev types on the same physical GPU.
https://docs.nvidia.com/grid/10.0/grid-vgpu-user-guide/index.html#homogeneous-grid-vgpus

That changed with the A100 and the introduction of MIG.

I think what we ended up merging in Ussuri was support for multiple vGPU types with one vGPU type per physical GPU, and I suspect MIG support alone is not enough for Nova's current implementation to model different GPU types on the same card.

Piotr Łasak (plasec) wrote :

Provider list:
===================

# openstack resource provider list
+--------------------------------------+----------------------------------------------+------------+--------------------------------------+--------------------------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+----------------------------------------------+------------+--------------------------------------+--------------------------------------+
| a0269b89-d43d-4042-a64e-3c832f0bb23f | gpu-a01.example.os-tests.com | 104 | a0269b89-d43d-4042-a64e-3c832f0bb23f | None |
| f2e5a4e0-479e-4ee3-b504-36371ded49f5 | gpu-a01.example.os-tests.com_pci_0000_61_01_4 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| a513c661-6dd2-4462-b719-9fbf7b70c409 | gpu-a01.example.os-tests.com_pci_0000_61_01_2 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 9124d3a8-00fb-475e-a0f8-892ccf5d255e | gpu-a01.example.os-tests.com_pci_0000_61_00_7 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 5f443da2-3c75-45c6-9d8a-05ca8a487802 | gpu-a01.example.os-tests.com_pci_0000_61_02_1 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 20da8814-c5f0-4575-a785-579e9abdbb1d | gpu-a01.example.os-tests.com_pci_0000_61_01_3 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 37014baa-fba6-4f14-8be8-17084b3aad36 | gpu-a01.example.os-tests.com_pci_0000_61_01_1 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| a9e2f509-fb03-45a0-9ff0-7c50143c1a9c | gpu-a01.example.os-tests.com_pci_0000_61_02_2 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 202462de-7d01-45c7-b197-1f2ca5c9c7ae | gpu-a01.example.os-tests.com_pci_0000_61_02_3 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 5e26d9e8-b59a-47b3-879c-c2c50ab7f1f0 | gpu-a01.example.os-tests.com_pci_0000_61_01_7 | 43 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 315b5205-26ac-4ec6-b5d2-623cafc18f39 | gpu-a01.example.os-tests.com_pci_0000_61_01_6 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 208f395d-d9c7-4108-8f83-e48cbea0b637 | gpu-a01.example.os-tests.com_pci_0000_61_02_0 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 7d5abf99-3c42-4c62-ba33-15682c6cfc5b | gpu-a01.example.os-tests.com_pci_0000_61_00_4 | 18 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 315b2c35-401c-4165-add8-2b025961b9a0 | gpu-a01.example.os-tests.com_pci_0000_61_01_5 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| c43be42e-e564-46be-9025-4f00a1f7454e | gpu-a01.example.os-tests.com_pci_0000_61_00_5 | 1 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 3cd4dbc7-2c2a-448d-a041-27c8fd685950 | gpu-a01.example.os-tests.com_pci_0000_61_01_0 | 14 | a0269b89-d43d-4042-a64e-3c832f0bb23f | a0269b89-d43d-4042-a64e-3c832f0bb23f |
| 58fb...


Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Pavlo Shchelokovskyy (pshchelo) wrote :

Reviving this discussion a bit.

Running OpenStack Yoga, host Ubuntu 20.04, NVIDIA drivers 470.103.02, pGPU NVIDIA A100 PCIe 40GB.

With proper configuration I cannot reproduce this issue; however, I suspect I know why it appeared (see the end).

My configuration:

nvidia mig:
~# nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 1g.5gb 19 9 2:1 |
+-------------------------------------------------------+
| 0 MIG 1g.5gb 19 10 3:1 |
+-------------------------------------------------------+
| 0 MIG 2g.10gb 14 3 0:2 |
+-------------------------------------------------------+
| 0 MIG 3g.20gb 9 2 4:4 |
+-------------------------------------------------------+
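
The Placement column can be sanity-checked for overlaps. This is a minimal Python sketch, assuming each `Start:Size` value describes a half-open range of slot indexes; that reading of the column header is an assumption, not something stated in this thread, and the function name `occupied_slots` is hypothetical.

```python
# (profile name, placement start, placement size), copied from the listing above.
INSTANCES = [
    ("MIG 1g.5gb", 2, 1),
    ("MIG 1g.5gb", 3, 1),
    ("MIG 2g.10gb", 0, 2),
    ("MIG 3g.20gb", 4, 4),
]


def occupied_slots(instances):
    """Return {slot: profile}; raise if two instances claim the same slot."""
    slots = {}
    for name, start, size in instances:
        for slot in range(start, start + size):
            if slot in slots:
                raise ValueError(f"slot {slot}: {name} overlaps {slots[slot]}")
            slots[slot] = name
    return slots


if __name__ == "__main__":
    for slot, name in sorted(occupied_slots(INSTANCES).items()):
        print(slot, name)
```

With the four instances above, the ranges do not overlap and together cover slot indexes 0 through 7.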

nova config:

[devices]
enabled_mdev_types = nvidia-474,nvidia-475,nvidia-476
[mdev_nvidia-474]
device_addresses = 0000:42:00.4,0000:42:00.5
mdev_class = CUSTOM_MIG_1G_5GB
[mdev_nvidia-475]
device_addresses = 0000:42:00.6
mdev_class = CUSTOM_MIG_2G_10GB
[mdev_nvidia-476]
device_addresses = 0000:42:00.7
mdev_class = CUSTOM_MIG_3G_20GB
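
To show how this configuration relates to the flavors, here is a minimal Python sketch that derives the `resources:<class>=1` flavor property for each enabled mdev type from the config text above. The helper name `flavor_properties` is hypothetical, and the fallback to `VGPU` when `mdev_class` is absent is an assumption about Nova's default.

```python
import configparser

# Configuration text copied from this comment.
NOVA_CONF = """
[devices]
enabled_mdev_types = nvidia-474,nvidia-475,nvidia-476
[mdev_nvidia-474]
device_addresses = 0000:42:00.4,0000:42:00.5
mdev_class = CUSTOM_MIG_1G_5GB
[mdev_nvidia-475]
device_addresses = 0000:42:00.6
mdev_class = CUSTOM_MIG_2G_10GB
[mdev_nvidia-476]
device_addresses = 0000:42:00.7
mdev_class = CUSTOM_MIG_3G_20GB
"""


def flavor_properties(conf_text):
    """Return {mdev_type: flavor extra spec requesting one device of it}."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    props = {}
    for mdev_type in parser["devices"]["enabled_mdev_types"].split(","):
        mdev_type = mdev_type.strip()
        section = parser["mdev_" + mdev_type]
        # Assumed default: VGPU when no custom resource class is configured.
        resource_class = section.get("mdev_class", fallback="VGPU")
        props[mdev_type] = {"resources:" + resource_class: "1"}
    return props


if __name__ == "__main__":
    for mdev_type, spec in flavor_properties(NOVA_CONF).items():
        print(mdev_type, spec)
```

Because each type has its own resource class, a flavor selects a MIG profile by resource class alone, with no manual traits on the providers.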

Resource providers (3 computes, but only one has a single A100 GPU in MIG mode):
# openstack resource provider list -f value -c uuid | xargs -l1 openstack resource provider inventory list -f value
VCPU 8.0 1 16 0 1 16 0
MEMORY_MB 1.0 1 31999 512 1 31999 0
DISK_GB 1.6 1 1360 0 1 1360 0
VCPU 8.0 1 16 0 1 16 0
MEMORY_MB 1.0 1 31999 512 1 31999 0
DISK_GB 1.6 1 1360 0 1 1360 0
VCPU 8.0 1 32 0 1 32 13
MEMORY_MB 1.0 1 193449 512 1 193449 16384
DISK_GB 1.6 1 1360 0 1 1360 20
CUSTOM_MIG_2G_10GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_3G_20GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_1G_5GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_1G_5GB 1.0 1 1 0 1 1 1
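
The `-f value` output above has no header row. This minimal Python sketch sums total and used per custom resource class, assuming the columns are resource_class, allocation_ratio, min_unit, max_unit, reserved, step_size, total, used. That column order is an assumption inferred from the values shown, and the function name `summarize_custom` is hypothetical.

```python
from collections import defaultdict

# Assumed column order of `inventory list -f value` (inferred, not confirmed).
COLUMNS = (
    "resource_class", "allocation_ratio", "min_unit", "max_unit",
    "reserved", "step_size", "total", "used",
)

# Rows of the GPU compute node, copied from the output above.
LINES = """\
VCPU 8.0 1 32 0 1 32 13
MEMORY_MB 1.0 1 193449 512 1 193449 16384
DISK_GB 1.6 1 1360 0 1 1360 20
CUSTOM_MIG_2G_10GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_3G_20GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_1G_5GB 1.0 1 1 0 1 1 1
CUSTOM_MIG_1G_5GB 1.0 1 1 0 1 1 1
"""


def summarize_custom(text):
    """Return {resource_class: (total, used)} for CUSTOM_* classes only."""
    sums = defaultdict(lambda: [0, 0])
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        if not row["resource_class"].startswith("CUSTOM_"):
            continue
        sums[row["resource_class"]][0] += int(row["total"])
        sums[row["resource_class"]][1] += int(row["used"])
    return {rc: tuple(values) for rc, values in sums.items()}


if __name__ == "__main__":
    for rc, (total, used) in sorted(summarize_custom(LINES).items()):
        print(rc, total, used)
```

Under the assumed column order, every MIG resource class in this output is fully consumed, which matches the four ACTIVE servers reported in this comment.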

relevant parts of flavors:
for i in 474 475 476; do openstack flavor show nvidia-$i -c properties -f value; done
{'resources:CUSTOM_MIG_1G_5GB': '1'}
{'resources:CUSTOM_MIG_2G_10GB': '1'}
{'resources:CUSTOM_MIG_3G_20GB': '1'}

I can boot all 4 instances, consuming all available vGPUs:
# openstack server list -c ID -c Flavor -c Status
+--------------------------------------+--------+------------+
| ID | Status | Flavor |
+--------------------------------------+--------+------------+
| d906c0de-22cc-4409-bbc8-b57e49874a40 | ACTIVE | nvidia-476 |
| a77d1acb-24f5-46a7-92d6-12c5ac9efb6b | ACTIVE | nvidia-475 |
| 66592cfd-0789-4186-a307-79bd54d0f121 | ACTIVE | nvidia-474 |
| 0e7f2b9d-ba24-4400-b2c9-9d55402b4791 | ACTIVE | nvidia-474 |
+--------------------------------------+--------+------------+

# nvidia-smi vgpu
Tue Jun 20 15:53:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.02 Driver Version: 470.103.02 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id ...

