Unused pre-existing mediated devices not "available" on instance creation

Bug #2015892 reported by Benedikt Steinbusch
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========

We are running Yoga, deployed via Kolla-Ansible, on a number of compute hosts equipped with NVIDIA A100s. Since we recently redeployed images to pull in this fix (https://review.opendev.org/c/openstack/nova/+/866154), nova refuses to consider pre-existing mediated devices on instance creation, even though they are not in use by any instance. This happens both when we pre-create the mdevs (in an attempt to make their UUIDs persistent) and when we leave the creation of mediated devices to nova. In the latter case, instance (and mdev) creation succeeds, but the mdevs are not cleaned up when instances are destroyed. After a while the hardware fills up with unused mdevs left behind by past instances, and new instances fail to be created.
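
One way to confirm that leftover mdevs are really unused is to compare the mdev UUIDs present on the host with those referenced by libvirt domain definitions. The following is a minimal sketch, not nova's own logic. The function names are hypothetical, the host UUIDs are assumed to have been collected separately (for example by listing /sys/bus/mdev/devices), and the domain XML is assumed to come from `virsh dumpxml`.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Set


def mdev_uuids_in_domain(domain_xml: str) -> Set[str]:
    """Return the mdev UUIDs referenced by <hostdev type='mdev'> elements."""
    root = ET.fromstring(domain_xml)
    uuids = set()
    for hostdev in root.iter("hostdev"):
        if hostdev.get("type") != "mdev":
            continue
        # libvirt stores the mdev UUID in <source><address uuid='...'/></source>
        for address in hostdev.findall("./source/address"):
            uuid = address.get("uuid")
            if uuid:
                uuids.add(uuid.lower())
    return uuids


def unused_mdevs(host_mdev_uuids: Iterable[str],
                 domain_xmls: Iterable[str]) -> Set[str]:
    """Return host mdev UUIDs that no given domain definition references."""
    in_use = set()
    for xml_text in domain_xmls:
        in_use |= mdev_uuids_in_domain(xml_text)
    return {u.lower() for u in host_mdev_uuids} - in_use
```

Any UUID returned by `unused_mdevs` exists on the host but is not attached to a defined guest, which matches the "unused pre-existing mdev" situation described above.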

Steps to reproduce
==================

1. Configure nova according to the "Attaching virtual GPU devices to guests" guide (https://docs.openstack.org/nova/latest/admin/virtual-gpu.html)
2. Pre-create mediated devices until no free mdev slots (available instances) remain on the host
3. Create an instance with a vGPU attached
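
The host state produced by step 2 can be inspected through the kernel's mdev sysfs interface. The sketch below assumes the usual layout under /sys/class/mdev_bus, where each supported type exposes an `available_instances` file and a `devices` directory. The function name and the `sysfs_root` parameter are illustrative and not part of nova.

```python
import os


def mdev_inventory(sysfs_root="/sys"):
    """Map (parent device, mdev type) to free slots and existing mdev UUIDs."""
    bus_dir = os.path.join(sysfs_root, "class", "mdev_bus")
    inventory = {}
    if not os.path.isdir(bus_dir):
        return inventory
    for parent in sorted(os.listdir(bus_dir)):
        types_dir = os.path.join(bus_dir, parent, "mdev_supported_types")
        if not os.path.isdir(types_dir):
            continue
        for mdev_type in sorted(os.listdir(types_dir)):
            type_dir = os.path.join(types_dir, mdev_type)
            # Number of further mdevs of this type the parent can still host.
            with open(os.path.join(type_dir, "available_instances")) as f:
                available = int(f.read().strip())
            # Each entry under devices/ is named after an existing mdev UUID.
            devices_dir = os.path.join(type_dir, "devices")
            existing = (sorted(os.listdir(devices_dir))
                        if os.path.isdir(devices_dir) else [])
            inventory[(parent, mdev_type)] = {
                "available_instances": available,
                "existing_mdevs": existing,
            }
    return inventory
```

After step 2, every entry would be expected to show `available_instances` equal to 0 together with a non-empty `existing_mdevs` list.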

Expected result
===============

Instance is created with vGPU attached.

Actual result
=============

Instance creation fails with "Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance". Nova logs show:

2023-04-11 17:37:40.156 7 INFO nova.virt.libvirt.driver [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] Available mdevs at: set().
2023-04-11 17:37:40.253 7 INFO nova.virt.libvirt.driver [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] Failed to create mdev. No free space found among the following devices: ['pci_0000_01_02_1', 'pci_0000_a1_01_4', 'pci_0000_81_01_2', 'pci_0000_a1_01_3', 'pci_0000_81_02_5', 'pci_0000_81_02_2', 'pci_0000_01_01_2', 'pci_0000_81_01_5', 'pci_0000_81_00_4', 'pci_0000_01_02_0', 'pci_0000_a1_00_5', 'pci_0000_a1_01_0', 'pci_0000_01_02_4', 'pci_0000_81_02_4', 'pci_0000_a1_02_4', 'pci_0000_81_01_4', 'pci_0000_81_02_0', 'pci_0000_01_02_3', 'pci_0000_01_01_6', 'pci_0000_81_02_3', 'pci_0000_a1_00_4', 'pci_0000_a1_01_5', 'pci_0000_81_01_6', 'pci_0000_a1_02_1', 'pci_0000_01_01_3', 'pci_0000_a1_02_0', 'pci_0000_a1_02_3', 'pci_0000_01_02_7', 'pci_0000_81_02_7', 'pci_0000_81_00_7', 'pci_0000_81_00_6', 'pci_0000_01_02_6', 'pci_0000_a1_01_6', 'pci_0000_a1_02_5', 'pci_0000_a1_01_2', 'pci_0000_81_01_1', 'pci_0000_a1_01_7', 'pci_0000_a1_02_7', 'pci_0000_01_00_6', 'pci_0000_81_00_5', 'pci_0000_01_00_5', 'pci_0000_01_01_1', 'pci_0000_01_01_7', 'pci_0000_81_01_0', 'pci_0000_01_02_5', 'pci_0000_81_02_6', 'pci_0000_a1_00_7', 'pci_0000_81_01_3', 'pci_0000_a1_02_6', 'pci_0000_01_00_7', 'pci_0000_01_01_4', 'pci_0000_01_02_2', 'pci_0000_a1_02_2', 'pci_0000_01_01_0', 'pci_0000_01_00_4', 'pci_0000_a1_00_6', 'pci_0000_a1_01_1', 'pci_0000_01_01_5', 'pci_0000_81_01_7', 'pci_0000_81_02_1'].
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [req-a1d78938-b249-48c0-a874-1d52f321fea7 a5d54379b17e4bc29d680afee6141281 afe6237e16cb40e4aa26a9d0b00a114e - default default] [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] Traceback (most recent call last):
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2748, in _build_resources
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] yield resources
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2512, in _build_and_run_instance
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] accel_info=accel_info)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 4315, in spawn
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] mdevs = self._allocate_mdevs(allocations)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 391, in inner
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] return f(*args, **kwargs)
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8300, in _allocate_mdevs
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] reason='mdev-capable resource is not available')
2023-04-11 17:37:40.254 7 ERROR nova.compute.manager [instance: 6c497840-e46c-497a-9e9f-1cf976a6cb34] nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: mdev-capable resource is not available.
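
The device names in the log above are libvirt node device names. To cross-check them against sysfs, it can help to convert them to PCI addresses. This is a minimal sketch that assumes the `pci_<domain>_<bus>_<slot>_<function>` naming seen in the log; the helper names are hypothetical.

```python
import re

# Matches names such as 'pci_0000_01_02_1' (domain, bus, slot, function).
_NODEDEV_PATTERN = r"pci_([0-9a-f]{4})_([0-9a-f]{2})_([0-9a-f]{2})_([0-7])"


def nodedev_to_pci_address(name):
    """Convert a libvirt node device name to a PCI address string."""
    match = re.fullmatch(_NODEDEV_PATTERN, name)
    if match is None:
        raise ValueError("not a PCI node device name: %r" % name)
    domain, bus, slot, function = match.groups()
    return "%s:%s:%s.%s" % (domain, bus, slot, function)


def pci_addresses_from_log_line(line):
    """Extract all quoted node device names from a log line, in order."""
    names = re.findall(r"'(pci_[0-9a-f_]+)'", line)
    return [nodedev_to_pci_address(name) for name in names]
```

The resulting addresses correspond to the directory names found under /sys/class/mdev_bus on the compute host.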

Environment
===========

1. OpenStack version

Not sure how to tell from the kolla source images. nova seems to be version 25.1.1. The top entries in the ChangeLog are:

CHANGES
=======

* Handle mdev devices in libvirt 7.7+
* Reproducer for bug 1951656
* ignore deleted server groups in validation
* add repoducer test for bug 1890244
* Add a workaround to skip hypervisor version check on LM

25.1.0
------

* [stable-only][cve] Check VMDK create-type against an allowed list
* Improving logging at '_allocate_mdevs'
* Gracefully ERROR in _init_instance if vnic_type changed
* Reproduce bug 1981813 in func env
* Adapt websocketproxy tests for SimpleHTTPServer fix
* enable blocked VDPA move operations
* Add compute restart capability for libvirt func tests

2. Hypervisor

libvirt/KVM

3. Storage

NFS

4. Networking

Neutron with OVN

Logs & Configs
==============

I will attach a nova log in DEBUG mode of a failed instance creation. Additional data can be provided if necessary.

Tags: vgpu
Benedikt Steinbusch (bsteinb) wrote :
tags: added: vgpu
Sylvain Bauza (sylvain-bauza) wrote :

Closing as a duplicate of https://bugs.launchpad.net/nova/+bug/2041519 (sorry, this bug report is the older one, but for tracking purposes let's keep the one already in use)
