vGPUs are not recreated on host reboot

Bug #1900800 reported by Lars Erik Pedersen
This bug affects 14 people

Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Low
Assigned to: Sylvain Bauza
Milestone: (none)

Bug Description

Description
===========
In Ussuri, when a compute node providing vGPUs (Nvidia GRID in my case) is rebooted, the mdevs for the vGPUs are not recreated, and a libvirt.libvirtError traceback is thrown.

https://paste.ubuntu.com/p/4t4NvTHGd8/

As far as I understand, this should have been fixed in https://review.opendev.org/#/c/715489/, but it seems to fail even before it tries to recreate the mdev.

Expected result
===============
Upon host reboot, the mdevs should be recreated and the VMs should be restarted.

Actual result
=============
nova-compute throws the aforementioned error, the mdevs are not recreated, and the VMs are left in an unrecoverable state.

Environment
===========
# dnf list installed | grep nova
openstack-nova-common.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
openstack-nova-compute.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
python3-nova.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
python3-novaclient.noarch 1:17.0.0-1.el8 @centos-openstack-ussuri

# dnf list installed | grep libvirt
libvirt-bash-completion.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-client.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-config-nwfilter.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-interface.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-network.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-nodedev.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-nwfilter.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-qemu.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-secret.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-core.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-disk.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-gluster.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-iscsi.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-iscsi-direct.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-logical.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-mpath.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-rbd.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-scsi.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-kvm.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-libs.x86_64 6.0.0-25.2.el8 @advanced-virtualization
python3-libvirt.x86_64 6.0.0-1.el8 @advanced-virtualization

Tags: libvirt vgpu
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Thanks for reporting!

Oh shit, you're right, we can't look up the existing mdev info [1] to find its parent PCI device, since the mdev disappeared after rebooting...

[1] https://github.com/openstack/nova/blob/450213f/nova/virt/libvirt/driver.py#L816

So, honestly, there is no way to know the parent PCI device, since we don't persist mdevs; and persisting them is not something Nova should do, since it's a KVM/kernel issue.

Leaving the bug open until we figure out a good way to either document or fix this.

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
assignee: nobody → Sylvain Bauza (sylvain-bauza)
Revision history for this message
Eigil Obrestad (obrestad) wrote :

I guess in that case we should simply free all GPU resources registered in placement for a given compute node when it boots, and give the VMs new GPUs when they boot?

I guess it is not an issue that they do not get the exact same GPU after a reboot, as the GPU loses its state at reboot anyway?

Revision history for this message
Mohammed Naser (mnaser) wrote :

Sylvain, Sean and I had a discussion about this and some of the possible solutions (long term and short term):

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2021-04-15.log.html#t2021-04-15T13:12:29

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

A possible workaround for the trivial case where all GPUs are set with a unique type could be:
 - stop the compute service
 - look up all the existing guest domains and fetch each of their mdev UUIDs
 - for each UUID, ask sysfs to recreate the mdev by issuing this command:

echo $UUID > /sys/bus/pci/devices/$PCI_ADDR/mdev_supported_types/$THE_RIGHT_TYPE/create

where $UUID is the mdev UUID, $PCI_ADDR is the PCI address of the parent GPU, and $THE_RIGHT_TYPE is the mdev type this pGPU was using.
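
As a minimal sketch of that loop (assuming bash, a single parent GPU, and a single mdev type; $PCI_ADDR and $THE_RIGHT_TYPE must be set beforehand, and the XML scraping is naive):

for dom in $(virsh list --all --name); do
  # extract every mdev UUID referenced by the guest's hostdev entries
  for uuid in $(virsh dumpxml "$dom" | sed -n "s/.*<address uuid='\([^']*\)'.*/\1/p"); do
    echo "$uuid" > /sys/bus/pci/devices/$PCI_ADDR/mdev_supported_types/$THE_RIGHT_TYPE/create
  done
done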

Hope that helps. I'll amend the upstream docs with this limitation, and I'll work on a patch that looks up the allocations to find the matching Resource Provider, so I'll know the physical GPU and can then look up the config option to find the type.

Revision history for this message
melanie witt (melwitt) wrote :

Sylvain, I had been thinking a possible workaround might be shelve offloading the server and then unshelving it. Would that work or no?

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

Melanie: That is exactly what we're doing to work around this in our environment. It works nicely for planned reboots, but of course, in the event of an unexpected reboot, we're still faced with a major manual cleanup job.
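
For planned reboots the flow is roughly this (a hedged sketch; <server> is a placeholder, and depending on [DEFAULT]/shelved_offload_time the offload may need an explicit "nova shelve-offload"):

openstack server shelve <server>     # once offloaded, the mdev allocation is freed
# ... reboot the compute host ...
openstack server unshelve <server>   # reschedules the guest and provides a fresh mdev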

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Yeah, shelving before rebooting does work, as the unshelve will look at the allocations and then provide mdevs back to the instance.
But as Lars said, the problem here is that the nova-compute service can't be restarted after rebooting unless you recreate the mdevs or remove the guests.

Revision history for this message
melanie witt (melwitt) wrote :

Hm, OK, I was thinking about shelve offloading after an unexpected reboot. AFAICT the placement allocations should remain after a reboot, so if you shelve offload, it will remove the placement allocations, and I was thinking the unshelve would then recreate the placement allocations and mdevs when recreating the virt domain.

I take it that this is not actually what happens in practice?

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

Correct, Melanie. Shelving after an unexpected reboot won't work, because it requires nova-compute to have actually started successfully.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

I had other things to do, so I stopped working on this bugfix, but I'm now working on using init_instance() as we discussed here: https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2021-05-05.log.html

Revision history for this message
norman shen (jshen28) wrote :

IMO this issue deserves higher importance...

Revision history for this message
James Gibson (jamesgibo) wrote :

I also think this should have a higher importance, especially as the bug means that the nova-compute service fails to start.

Revision history for this message
James Gibson (jamesgibo) wrote :

Related bug report about persisting mediated devices on reboot:
https://bugzilla.redhat.com/show_bug.cgi?id=1699274

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/810220

Revision history for this message
norman shen (jshen28) wrote :

1. libvirt 7.3 allows defining nodedev devices.
2. libvirt 6.5 started using mdevctl to handle mdev devices.
3. libvirt 7.6 adds support for autostarting nodedev devices, but it requires udev events, which might need some work for containerized deployments.
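
For illustration, either tool can persist an mdev across reboots; a hedged sketch with placeholder UUID/parent/type values (not taken from this report):

# with mdevctl directly (what libvirt >= 6.5 drives under the hood):
mdevctl define -u $UUID -p $PCI_ADDR -t $TYPE --auto   # --auto recreates it at boot
# with libvirt >= 7.3, persisting the mdev as a node device instead:
virsh nodedev-define mdev.xml   # mdev.xml describes the parent, mdev type, and UUID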

Revision history for this message
zrsolis (zrsolis) wrote :

I've run into this same issue using an Nvidia RTX A5000; however, the Ampere cards present an additional issue: the SR-IOV VFs that the mdevs use are not persisted after reboot either. I've worked around this by having a systemd service run before nova-compute that runs the sriov-manage command provided by Nvidia to enable the VFs. I used "mdevctl define" to persist the mdev definitions, and this has worked for me.
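
A hedged sketch of such a unit (the file name and ordering are assumptions; the sriov-manage path is the one mentioned later in this thread):

# /etc/systemd/system/nvidia-sriov.service
[Unit]
Description=Enable NVIDIA SR-IOV VFs before nova-compute starts
Before=openstack-nova-compute.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

[Install]
WantedBy=multi-user.target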

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

@zrsolis: We had to work around the SR-IOV issue as well, with the A100. We did it even more simply, with just:

@reboot /usr/lib/nvidia/sriov-manage -e ALL

in root's crontab.

Works like a charm. I've actually been in contact with Nvidia Enterprise Support about persisting VFs over a reboot, and this was the only solution they came up with (which we had actually implemented before we asked them anyway...)

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

But sriov-manage -e ALL doesn't re-create the mdev devices with the same UUIDs - it just activates the virtual functions?

At least I don't have mdev devices after sriov-manage, so I need to re-create them on the A10. Maybe this does work with MIG mode, though...

Revision history for this message
sean mooney (sean-k-mooney) wrote :

So the expectation here is that nova will create the mdevs automatically on host reboot; there is just a bug in our current code.
You as an operator should not have to use mdevctl or any other host tooling to persist them.

We also don't support the use of mdevctl to pre-create the mdevs today.
If we want to move to that model, we need to significantly change how we manage mdevs.

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

@noonedeadpunk: sriov-manage does not create mdevs at all. It creates the SR-IOV VFs on which you can later create mdevs, so it's a prerequisite that, imho, nova should just assume is in place. I once tried to ask Nvidia Enterprise Support for a "fix" for this, but all they came up with was exactly the same @reboot cronjob hack I already had in place.

MIG mode is still something nova does not support, I guess?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/851924

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
sean mooney (sean-k-mooney) wrote :

"MIG mode is still something nova does not support, I guess?"

no nova support mig mode but we expect that you use udev or systemd services or other tolling to precreate the VFs sriov-manage or other tooling liek nvidia's mig-parted tool.

we have tested using mig backed mdevs downstream.

nova will create teh mdevs on teh VFs but you need to list the VFs not the PF in nova's config
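
For illustration, with the Ussuri-era option names that would look roughly like this (the mdev type and VF addresses are placeholders; newer releases spell these enabled_mdev_types and [mdev_<type>]):

[devices]
enabled_vgpu_types = nvidia-474

[vgpu_nvidia-474]
# list the VF addresses here, not the parent PF
device_addresses = 0000:af:00.4,0000:af:00.5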

Revision history for this message
John Garbutt (johngarbutt) wrote :

It would be nice if the error message here described the details of what is missing, and said that it should be a udev rule, etc.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/864418
