vGPUs are not recreated on host reboot

Bug #1900800 reported by Lars Erik Pedersen
This bug affects 14 people

Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Low
Assigned to: Sylvain Bauza
Milestone: (none)

Bug Description

Description
===========
In Ussuri, when a compute node providing vGPUs (Nvidia GRID in my case) is rebooted, the mdevs for the vGPUs are not recreated, and a libvirt.libvirtError traceback is thrown.

https://paste.ubuntu.com/p/4t4NvTHGd8/

As far as I understand, this should have been fixed in https://review.opendev.org/#/c/715489/, but it seems to fail even before it tries to recreate the mdev.

Expected result
===============
Upon host reboot, the mdevs should be recreated and the VMs should be restarted.

Actual result
=============
nova-compute throws the aforementioned error, the mdevs are not recreated, and the VMs are left in an unrecoverable state.

Environment
===========
# dnf list installed | grep nova
openstack-nova-common.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
openstack-nova-compute.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
python3-nova.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
python3-novaclient.noarch 1:17.0.0-1.el8 @centos-openstack-ussuri

# dnf list installed | grep libvirt
libvirt-bash-completion.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-client.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-config-nwfilter.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-interface.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-network.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-nodedev.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-nwfilter.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-qemu.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-secret.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-core.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-disk.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-gluster.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-iscsi.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-iscsi-direct.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-logical.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-mpath.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-rbd.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-scsi.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-kvm.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-libs.x86_64 6.0.0-25.2.el8 @advanced-virtualization
python3-libvirt.x86_64 6.0.0-1.el8 @advanced-virtualization

Tags: libvirt vgpu
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Thanks for reporting!

Oh shit, you're right, we can't look up the existing mdev info [1] to find its parent PCI device, since the mdev disappeared after rebooting...

[1] https://github.com/openstack/nova/blob/450213f/nova/virt/libvirt/driver.py#L816

So, honestly, there is no way to know the parent PCI device, since we don't persist mdevs; and persisting them is not something Nova should do, since it's a KVM/kernel issue.

Leaving the bug open until we figure out a good way to either document or fix this.

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
assignee: nobody → Sylvain Bauza (sylvain-bauza)
Revision history for this message
Eigil Obrestad (obrestad) wrote :

I guess in that case we should simply free all GPU resources registered in placement for a given compute node when it boots, and give the VMs new GPUs when they boot?

I guess it is not an issue that they do not get the exact same GPU after a reboot, as the GPU loses its state at reboot anyway?

Revision history for this message
Mohammed Naser (mnaser) wrote :

Sylvain, Sean and I had a discussion about this and some of the possible solutions (long term and short term):

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2021-04-15.log.html#t2021-04-15T13:12:29

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

A possible workaround for the trivial case where all GPUs are set with a unique type could be:
 - stop the compute service
 - look up all the existing guest domains and fetch each of their mdev UUIDs
 - for each UUID, ask sysfs to recreate the mdev by issuing this command:

echo $UUID > /sys/bus/pci/devices/$PCI_ADDR/mdev_supported_types/$THE_RIGHT_TYPE/create

where $UUID is the mdev UUID, $PCI_ADDR is the PCI address of the parent GPU, and $THE_RIGHT_TYPE is the mdev type this pGPU was using.
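
As a minimal sketch of that loop (assuming bash, a single parent GPU, and a single mdev type; $PCI_ADDR and $THE_RIGHT_TYPE must be set beforehand, and the XML scraping is naive):

for dom in $(virsh list --all --name); do
  # extract every mdev UUID referenced by the guest's hostdev entries
  for uuid in $(virsh dumpxml "$dom" | sed -n "s/.*<address uuid='\([^']*\)'.*/\1/p"); do
    echo "$uuid" > /sys/bus/pci/devices/$PCI_ADDR/mdev_supported_types/$THE_RIGHT_TYPE/create
  done
done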

Hope that helps. I'll amend the upstream docs with this limitation, and I'll work on a patch that looks up the allocations to find the matching Resource Provider, so I'll know the physical GPU and can then look up the config option to find the type.

Revision history for this message
melanie witt (melwitt) wrote :

Sylvain, I had been thinking a possible workaround might be shelve offloading the server and then unshelving it. Would that work or no?

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

Melanie: That is exactly what we're doing to work around this in our environment. It works nicely for planned reboots, but of course, in the event of an unexpected reboot, we're still faced with a major manual cleanup job.
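
For planned reboots the flow is roughly this (a hedged sketch; <server> is a placeholder, and depending on [DEFAULT]/shelved_offload_time the offload may need an explicit "nova shelve-offload"):

openstack server shelve <server>     # once offloaded, the mdev allocation is freed
# ... reboot the compute host ...
openstack server unshelve <server>   # reschedules the guest and provides a fresh mdev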

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Yeah, shelving before rebooting does work, as the unshelve will look at the allocations and then provide mdevs back to the instance.
But as Lars said, the problem here is that the nova-compute service can't be restarted after rebooting unless you recreate the mdevs or remove the guests.

Revision history for this message
melanie witt (melwitt) wrote :

Hm, OK, I was thinking about shelve offloading after an unexpected reboot. AFAICT the placement allocations should remain after a reboot, so if you shelve offload, it will remove the placement allocations, and I was thinking the unshelve would then recreate the placement allocations and mdevs when recreating the virt domain.

I take it that this is not actually what happens in practice?

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

Correct, Melanie. Shelving after an unexpected reboot won't work, because it requires nova-compute to have actually started successfully.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

I had other things to do, so I stopped working on this bugfix, but I'm now working on using init_instance() as we discussed here: https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2021-05-05.log.html

Revision history for this message
norman shen (jshen28) wrote :

IMO this issue deserves higher importance...

Revision history for this message
James Gibson (jamesgibo) wrote :

I also think this should have a higher importance, especially as the bug means that the nova-compute service fails to start.

Revision history for this message
James Gibson (jamesgibo) wrote :

Related bug report about persisting mediated devices on reboot:
https://bugzilla.redhat.com/show_bug.cgi?id=1699274

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/810220

Revision history for this message
norman shen (jshen28) wrote :

1. libvirt 7.3 allows defining nodedev devices.
2. libvirt 6.5 started using mdevctl to handle mdev devices.
3. libvirt 7.6 adds support for autostarting nodedev devices, but it requires udev events, which might need some work for containerized deployments.
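
For illustration, either tool can persist an mdev across reboots; a hedged sketch with placeholder UUID/parent/type values (not taken from this report):

# with mdevctl directly (what libvirt >= 6.5 drives under the hood):
mdevctl define -u $UUID -p $PCI_ADDR -t $TYPE --auto   # --auto recreates it at boot
# with libvirt >= 7.3, persisting the mdev as a node device instead:
virsh nodedev-define mdev.xml   # mdev.xml describes the parent, mdev type, and UUID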

Revision history for this message
zrsolis (zrsolis) wrote :

I've run into this same issue using an Nvidia RTX A5000; however, the Ampere cards present an additional issue: the SR-IOV VFs that the mdevs use are not persisted after reboot either. I've worked around this by having a systemd service run before nova-compute that runs the sriov-manage command provided by Nvidia to enable the VFs. I used "mdevctl define" to persist the mdev definitions, and this has worked for me.
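
A hedged sketch of such a unit (the file name and ordering are assumptions; the sriov-manage path is the one mentioned later in this thread):

# /etc/systemd/system/nvidia-sriov.service
[Unit]
Description=Enable NVIDIA SR-IOV VFs before nova-compute starts
Before=openstack-nova-compute.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

[Install]
WantedBy=multi-user.target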

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

@zrsolis: We had to work around the SR-IOV issue as well, with the A100. We did it even more simply, with just:

@reboot /usr/lib/nvidia/sriov-manage -e ALL

in root's crontab.

Works like a charm. I've actually been in contact with Nvidia Enterprise Support about persisting VFs over a reboot, and this was the only solution they came up with (which we had actually implemented before we asked them anyway...)

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

But sriov-manage -e ALL doesn't re-create the mdev devices with the same UUIDs - it just activates the virtual functions?

At least I don't have mdev devices after sriov-manage, so I need to re-create them on the A10. Maybe this does work with MIG mode, though...

Revision history for this message
sean mooney (sean-k-mooney) wrote :

So the expectation here is that nova will create the mdevs automatically on host reboot; there is just a bug in our current code.
You as an operator should not have to use mdevctl or any other host tooling to persist them.

We also don't support the use of mdevctl to pre-create the mdevs today.
If we want to move to that model, we need to significantly change how we manage mdevs.

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

@noonedeadpunk: sriov-manage does not create mdevs at all. It creates the SR-IOV VFs on which you can later create mdevs, so it's a prerequisite that, imho, nova should just assume is in place. I once tried to ask Nvidia Enterprise Support for a "fix" for this, but all they came up with was exactly the same @reboot cronjob hack I already had in place.

MIG mode is still something nova does not support, I guess?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/851924

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
sean mooney (sean-k-mooney) wrote :

"MIG mode is still something nova does not support, I guess?"

no nova support mig mode but we expect that you use udev or systemd services or other tolling to precreate the VFs sriov-manage or other tooling liek nvidia's mig-parted tool.

we have tested using mig backed mdevs downstream.

nova will create teh mdevs on teh VFs but you need to list the VFs not the PF in nova's config
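
For illustration, with the Ussuri-era option names that would look roughly like this (the mdev type and VF addresses are placeholders; newer releases spell these enabled_mdev_types and [mdev_<type>]):

[devices]
enabled_vgpu_types = nvidia-474

[vgpu_nvidia-474]
# list the VF addresses here, not the parent PF
device_addresses = 0000:af:00.4,0000:af:00.5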

Revision history for this message
John Garbutt (johngarbutt) wrote :

It would be nice if the error message here described the details of what is missing, and said that it should be a udev rule, etc.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/864418
