Nova fails to reuse mdev vgpu devices
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Opinion | Undecided | Unassigned |
Bug Description
Description:
=======
Hello, we are experiencing a weird issue where Nova creates the mdev devices from virtual functions when none exist yet, but then will not reuse them once they have all been created and the vGPU instances have been removed.
I believe part of this issue was the uuid issue from this bug:
https:/
Manually applying the latest patch partially fixed the issue (placement stopped reporting no hosts available); now the error is on the hypervisor side, saying 'no vgpu resources available'.
If I manually remove the mdev device with a command like the following:
echo "1" > /sys/bus/
then I'm able to spin up an instance again.
All mdev devices match between mdevctl list and virsh nodedev-list.
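The comparison between the two listings can be scripted rather than done by eye. Below is a minimal sketch, assuming that each `mdevctl list` line starts with the device UUID and that libvirt names mdev node devices `mdev_<uuid with underscores>`, optionally followed by a parent address suffix. The function names are hypothetical and are not part of Nova or libvirt.

```python
import re

# Canonical lower-case UUID, e.g. 11111111-2222-3333-4444-555555555555
UUID_RE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def uuids_from_mdevctl(text):
    """Collect the UUID that starts each line of `mdevctl list` output."""
    found = set()
    for line in text.splitlines():
        match = UUID_RE.match(line.strip().lower())
        if match:
            found.add(match.group(0))
    return found


def uuids_from_virsh(text):
    """Collect UUIDs from node device names like mdev_<uuid>[_<parent>]."""
    found = set()
    for line in text.splitlines():
        name = line.strip().lower()
        if not name.startswith("mdev_"):
            continue
        # The first five underscore-separated groups form the UUID;
        # anything after them (assumed parent address) is ignored.
        parts = name[len("mdev_"):].split("_")
        candidate = "-".join(parts[:5])
        if UUID_RE.fullmatch(candidate):
            found.add(candidate)
    return found


def compare_listings(mdevctl_text, virsh_text):
    """Return (only in mdevctl, only in virsh) as two sorted lists."""
    from_mdevctl = uuids_from_mdevctl(mdevctl_text)
    from_virsh = uuids_from_virsh(virsh_text)
    return sorted(from_mdevctl - from_virsh), sorted(from_virsh - from_mdevctl)
```

Feeding it the captured text of both commands returns two lists; both being empty means the listings agree, which is what we see on the affected hypervisor.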
Steps to reproduce:
=======
1) freshly setup hypervisor with no mdev devices created yet
2) spin up vgpu instances until all mdevs are created that will fit on physical gpu(s)
3) delete vgpu instances
4) try and spin up new vgpu instances
Expected Result:
=======
Instances spin up and reuse the existing mdev vGPU devices.
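To make the expectation concrete: an mdev that exists on the host but is not attached to any guest should be handed to the next instance, and a new mdev should only be created while the physical GPU still has room. The sketch below is an illustration of that expectation only, not Nova's actual allocation code; `pick_mdev` and its arguments are hypothetical.

```python
def pick_mdev(existing, in_use, capacity):
    """Decide what should happen when a new vGPU instance is requested.

    existing: UUIDs of all mdevs currently present on the host
    in_use:   UUIDs of mdevs attached to running guests
    capacity: maximum number of mdevs the physical GPU(s) can hold
    Returns ("reuse", uuid), ("create", None) or ("fail", None).
    """
    free = sorted(set(existing) - set(in_use))
    if free:
        # Expected path after step 3: an unattached mdev is reused.
        return ("reuse", free[0])
    if len(set(existing)) < capacity:
        # Fresh hypervisor: create a new mdev on demand.
        return ("create", None)
    # Everything is created and attached: genuinely out of vGPU resources.
    return ("fail", None)
```

In the scenario from the steps above, every mdev exists and none is attached after the instances are deleted, so the expected outcome is "reuse"; what we observe instead corresponds to the "fail" branch.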
Actual Result:
=======
Build error from Nova API:
Error: Failed to perform requested operation on instance "colby_gpu_test23", the instance has an error status: Please try again later [Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance c18565f9-
Error in hypervisor logs:
nova.exception.
mdevctl output:
cdc98056-
298f1e4b-
2abee89e-
32445186-
0c4f5d07-
75d1b78c-
a54d33e0-
cd7a49a8-
virsh nodedev-list --cap mdev:
mdev_0c4f5d07_
mdev_298f1e4b_
mdev_2abee89e_
mdev_32445186_
mdev_75d1b78c_
mdev_a54d33e0_
mdev_cd7a49a8_
mdev_cdc98056_
nvidia-smi vgpu output:
Wed Jul 13 20:15:16 2022
+------
| NVIDIA-SMI 510.73.06 Driver Version: 510.73.06 |
|------
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|======
| 0 NVIDIA A40 | 00000000:21:00.0 | 0% |
| 3251635106 NVIDIA A40-12Q | 2786... instance-00014520 | 0% |
| 3251635117 NVIDIA A40-12Q | 6dc4... instance-0001452f | 0% |
+------
| 1 NVIDIA A40 | 00000000:81:00.0 | 0% |
| 3251635061 NVIDIA A40-12Q | 0d95... instance-000144de | 0% |
| 3251635094 NVIDIA A40-12Q | 40a0... instance-0001450e | 0% |
| 3251635112 NVIDIA A40-12Q | 776e... instance-00014529 | 0% |
+------
Environment:
=======
CentOS Stream 8
OpenStack Victoria (Nova 22.4.0-1)
libvirt 8.0.0-6
qemu-kvm 6.2.0-12
NVIDIA A40 GPUs
This bug seems to be a libvirt regression: libvirt isn't catching the mdev creation on the fly.
I filed a BZ against the libvirt team for investigation: https://bugzilla.redhat.com/show_bug.cgi?id=2109450