On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load second group

Bug #1922264 reported by Rodolphe Plachesi
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Low
Assigned to: Sylvain Bauza

Bug Description

Description
===========
We have multiple compute nodes, each with multiple NVIDIA GPU cards (RTX 8000/RTX 6000).
Nodes with a mix of RTX 8000 and RTX 6000 cards have two vGPU groups configured in nova.conf, but nova-compute only creates resource providers for the first group.

Steps to reproduce
==================

For example, on a node with two RTX 8000 cards and one RTX 6000 card:

$ lspci | grep -i nvidia
21:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
81:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
e2:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

$ nvidia-smi
Thu Apr 1 17:22:53 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.04 Driver Version: 460.32.04 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 8000 On | 00000000:21:00.0 Off | 0 |
| N/A 30C P8 27W / 250W | 285MiB / 46079MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro RTX 8000 On | 00000000:81:00.0 Off | 0 |
| N/A 30C P8 27W / 250W | 285MiB / 46079MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Quadro RTX 6000 On | 00000000:E2:00.0 Off | 0 |
| N/A 30C P8 24W / 250W | 150MiB / 23039MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

Extract from nova.conf:
...
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387

[vgpu_nvidia-428]
device_addresses = 0000:21:00.0,0000:81:00.0

[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
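
The intent of this configuration can be sketched with a small parser (a hypothetical diagnostic, not nova's own code) that resolves each enabled type to its dynamic `[vgpu_<type>]` group, which is the mapping nova-compute is expected to build:

```python
import configparser

# The [devices]/[vgpu_*] extract above, verbatim.
NOVA_CONF = """
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387

[vgpu_nvidia-428]
device_addresses = 0000:21:00.0,0000:81:00.0

[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
"""

def vgpu_mapping(conf_text):
    """Map each enabled vGPU type to its configured PCI addresses."""
    cp = configparser.ConfigParser()
    cp.read_string(conf_text)
    types = [t.strip() for t in cp["devices"]["enabled_vgpu_types"].split(",")]
    mapping = {}
    for vgpu_type in types:
        group = f"vgpu_{vgpu_type}"  # dynamic group name, e.g. [vgpu_nvidia-428]
        addrs = cp[group]["device_addresses"].split(",")
        mapping[vgpu_type] = [a.strip() for a in addrs]
    return mapping

print(vgpu_mapping(NOVA_CONF))
# Both groups resolve cleanly here; the bug is that nova-compute
# nevertheless only acts on the first type in the list.
```
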

When nova-compute starts, the log shows:
2021-04-01 17:15:25.454 7 WARNING nova.virt.libvirt.driver [req-bebc8637-d231-435c-a6cc-4613e14e2f76 - - - - -] The vGPU type 'nvidia-428' was listed in '[devices] enabled_vgpu_types' but no corresponding '[vgpu_nvidia-428]' group or '[vgpu_nvidia-428] device_addresses' option was defined. Only the first type 'nvidia-428' will be used.

And a listing of the resource providers on this node shows that only the nvidia-428 GPUs were registered:
$ openstack resource provider list --os-placement-api-version 1.14 --in-tree f5d35bdc-b4b7-4764-a9d0-41f67fd95385
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
| f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | cloud-lyse-cmp-02 | 32 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | None |
| 21a4a16e-8d33-4a23-a924-b00f8c31f0d0 | cloud-lyse-cmp-02_pci_0000_81_00_0 | 4 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 |
| 76e1ee94-fbf2-410e-9711-fba71c709388 | cloud-lyse-cmp-02_pci_0000_21_00_0 | 2 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 |
+--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
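
As a quick cross-check (a hypothetical helper, not part of any OpenStack tool), the child provider names above encode the PCI address, so they can be diffed against the host's GPU list to show which card never got a resource provider:

```python
def missing_gpus(lspci_addrs, provider_names):
    """Return PCI addresses that have no matching child resource provider.

    Provider names follow the "<host>_pci_<addr with underscores>" pattern
    seen in the listing above.
    """
    covered = set()
    for name in provider_names:
        if "_pci_" in name:
            # e.g. "cloud-lyse-cmp-02_pci_0000_81_00_0" -> "0000:81:00.0"
            domain, bus, slot, func = name.split("_pci_", 1)[1].split("_")
            covered.add(f"{domain}:{bus}:{slot}.{func}")
    return [a for a in lspci_addrs if a.lower() not in covered]

host_gpus = ["0000:21:00.0", "0000:81:00.0", "0000:e2:00.0"]
providers = [
    "cloud-lyse-cmp-02_pci_0000_81_00_0",
    "cloud-lyse-cmp-02_pci_0000_21_00_0",
]
print(missing_gpus(host_gpus, providers))
# ['0000:e2:00.0'] -- the RTX 6000 from the nvidia-387 group
```
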

In nova.conf, if I swap nvidia-428 and nvidia-387 in enabled_vgpu_types, only nvidia-387 is loaded.

Expected result
===============
All vGPU groups should be loaded (as stated in the docs).

Actual result
=============
Only the first vGPU group is loaded.

Environment
===========
OpenStack Victoria was deployed with kolla-ansible.
NVIDIA GRID KVM drivers: 12.1 (latest)
System: Ubuntu 20.04.2
nova-compute version: 22.2.1

Hypervisor: libvirt+KVM (libvirt 6.0.0, QEMU/KVM 4.2.1)
Storage: Dell EMC Storage Center (7.3.20.19)
Network: neutron with OVN/OVS

Tags: vgpu
summary: changed from "on a compute node with 3 GPUs et 2 gpu groups, nova fails to load second group config" to "On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load second group"
tags: added: vgpu
Sylvain Bauza (sylvain-bauza) wrote:

Looks very similar to https://bugs.launchpad.net/nova/+bug/1900006 which was fixed in Wallaby but not backported into stable/victoria.

Accepting it as valid.

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
assignee: nobody → Sylvain Bauza (sylvain-bauza)
Sylvain Bauza (sylvain-bauza) wrote:

Marking this bug report as duplicate, so we can directly backport the change down to stable/victoria.
