Placement error when using GPUs that utilize SR-IOV for vGPU

Bug #1906494 reported by Lars Erik Pedersen
This bug affects 2 people
Affects                    Importance  Assigned to
OpenStack Compute (nova)   Medium      melanie witt
Ussuri                     Medium      melanie witt
Victoria                   Medium      Lars Erik Pedersen

Bug Description

I'm experiencing some weird bugs/brokenness when trying to use the Nvidia A100 for vGPU in nova. It's a little bit involved, but I think I have an idea of what's going on.

My setup:
Dell R7525 with 2xNvidia A100
CentOS 8.2
Openstack Ussuri (openstack-nova-compute.noarch 1:21.1.1-1.el8)
Nvidia GRID 11.2

A little preface
================
The Nvidia A100 uses SR-IOV to create mdev files for vGPUs. The Nvidia driver ships a tool to enable SR-IOV on the physical cards, which creates 16 VFs per card, all of which appear in /sys/class/mdev_bus. Each VF can create only one mdev (regardless of which GRID profile you are using). When nova-compute starts, it automatically populates placement with (in my case) all 32 VFs, each with an inventory of 1 VGPU.

In my case, I create VMs with the GRID A100-20C profile (so, two VGPUs per GPU). I've configured nova.conf with the correct "devices/enabled_vgpu_types" (nvidia-472), and four of the 32 VFs as valid in "vgpu_nvidia-472/device_addresses". In addition, I've set a custom trait on the resource providers that corresponds to the devices listed in nova.conf, and of course listed this trait in my flavor spec.
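For reference, the nova.conf pieces described above would look roughly like this (the PCI addresses below are hypothetical placeholders; only the nvidia-472 type name comes from this report):

```ini
[devices]
enabled_vgpu_types = nvidia-472

[vgpu_nvidia-472]
# Hypothetical VF PCI addresses; list the four VFs nova may use.
device_addresses = 0000:41:00.4,0000:41:00.5,0000:c1:00.4,0000:c1:00.5
```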

The problem
===========
When a physical card is full of vGPUs, nova-compute seemingly tries to tell placement that all of the remaining VFs from that card now have 0 VGPUs in their inventory. This fails JSON validation, since all of the resource providers were created with a minimum of 1. Here's the Python stacktrace from nova-compute.log: https://paste.ubuntu.com/p/TwfTkyMvqR/

The UUID from the stacktrace is _not_ one of the resource providers that was successfully used to create a VGPU for the two VMs already created. It's the UUID of the last VF on the same physical card when listed with 'openstack resource provider list'.
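The validation failure can be sketched like this (a simplified stand-in, by assumption, for placement's JSON-schema check; not the real placement code):

```python
def validate_inventory_total(total):
    # Simplified stand-in (assumption) for placement's inventory schema,
    # which requires an integer total with a minimum of 1.
    if not isinstance(total, int) or total < 1:
        raise ValueError("%r is less than the minimum of 1" % total)
    return total

validate_inventory_total(1)  # a VF with one available mdev: accepted

# A sibling VF on a fully used physical card: total=0 is rejected,
# which is the failure seen in the stacktrace.
try:
    validate_inventory_total(0)
except ValueError:
    pass
```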

This error also stops me from:
* Creating more VMs with a VGPU (even though there are still physical cards with free capacity)
* Deleting the existing VMs (they end up in an error state, but can then be deleted if I reboot the host)

The stacktrace persists through nova-compute restarts, but it disappears on a host reboot when the existing VMs are in the error state after the deletion attempt.

Expected result
===============
Well, I expect nova to handle this properly. I have no idea how or why this happens, but I'm confident that you clever people will find a fix!

tags: added: vgpu
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Seems like a valid bug. The libvirt driver vGPU code assumes that the total value of a placement resource inventory can be set to 0, but placement does not allow that.
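One way to sketch the idea this points toward (hypothetical helper name and data shapes, not nova's actual code): when every mdev on a physical GPU is consumed, the sibling VFs' VGPU inventories would need to be removed rather than zeroed, since deleting an inventory is allowed while reporting total=0 is not.

```python
def drop_exhausted_vf_inventories(inventories, exhausted_vfs):
    # Hypothetical helper: keep inventories only for VFs that can still
    # host an mdev; sibling VFs on a full card lose their VGPU inventory
    # entirely instead of having total set to 0 (which placement rejects).
    return {rp_uuid: inv
            for rp_uuid, inv in inventories.items()
            if rp_uuid not in exhausted_vfs}

invs = {
    "vf-1": {"VGPU": {"total": 1}},  # VF on a card with free capacity
    "vf-2": {"VGPU": {"total": 1}},  # VF on a fully used card
}
invs = drop_exhausted_vf_inventories(invs, exhausted_vfs={"vf-2"})
```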

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
tags: added: libvirt placement
Revision history for this message
melanie witt (melwitt) wrote :

I think this is related to a bug I'm working on:

https://bugs.launchpad.net/nova/+bug/1901120

Revision history for this message
melanie witt (melwitt) wrote :

I have a patch proposed here: https://review.opendev.org/759348

Lars, if you might be able to try it out it would be helpful!

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: Confirmed → In Progress
Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

I tried to manually apply that patch in my environment. It seems to fix the bug!

Revision history for this message
melanie witt (melwitt) wrote :

Awesome, thank you Lars!

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

The fix merged to master: https://review.opendev.org/c/openstack/nova/+/759348
stable/victoria backport has been proposed: https://review.opendev.org/c/openstack/nova/+/766177

Revision history for this message
melanie witt (melwitt) wrote :

stable/ussuri change is proposed at https://review.opendev.org/c/openstack/nova/+/766825 for merge after the stable/victoria change merges.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.1.0

This issue was fixed in the openstack/nova 22.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 21.1.2

This issue was fixed in the openstack/nova 21.1.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.
