Attaching virtual GPU devices to guests in nova

Bug #1887380 reported by ryan
This bug affects 1 person

Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description


- [X] This is a doc addition request.

Hi, a problem came up when we were using nova (Queens) configured with the vGPU feature to create several instances. It seems that multiple instances preempt the same vGPU resource; in our case, nova tried to attach a mediated device that another instance had already acquired. Here is the error reported in the log:

"libvirt.libvirtError: Requested operation is not valid: mediated device /sys/bus/mdev/devices/xxx is in use by driver QEMU, domain xxx"

Apparently, nova is trying to allocate a vGPU resource that is already in use by another instance. We also ruled out the possibility that the host did not have enough vGPU resources: in our case, 25% of the instances went into an error state even though the instances being created needed only 50% of all available vGPU resources. From our perspective, the problem lies with the nova-scheduler. Any idea how to work this out?
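As a side note for anyone reproducing this: it can help to compare the mediated devices that actually exist on the host against what nova believes is allocated. This small helper is not part of nova; it is just a diagnostic sketch that reads the mdev bus in sysfs (the `base` parameter exists so it can be pointed at a test directory):

```python
from pathlib import Path

def list_mdevs(base="/sys/bus/mdev/devices"):
    """Return {mdev_uuid: driver_name_or_None} for every mediated
    device registered under `base`.

    The 'driver' entry of each device is a symlink to the driver it
    is currently bound to (typically vfio_mdev on the host); comparing
    this inventory against the mdev UUIDs referenced by each guest's
    domain XML shows which devices are genuinely free.
    """
    devices = {}
    root = Path(base)
    if not root.is_dir():  # host has no mdev support or no devices yet
        return devices
    for dev in sorted(root.iterdir()):
        driver_link = dev / "driver"
        driver = driver_link.resolve().name if driver_link.exists() else None
        devices[dev.name] = driver
    return devices
```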

Thanks

Ruien Zhang
<email address hidden>

-----------------------------------
Release: 21.1.0.dev214 on 2020-04-28 20:09:00
SHA: d19f1ac47b0a5fe1dd80b7187087e5810501f16c
Source: https://opendev.org/openstack/nova/src/doc/source/admin/virtual-gpu.rst
URL: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html

Revision history for this message
Artom Lifshitz (notartom) wrote :

Hi, thanks for the bug report.

To better understand what's going on, we need additional details:

1. Your nova.conf file (specifically your [devices] section, as well as each specific device section)

2. Hardware details (which GPU model you're using)

3. nova-compute and nova-scheduler logs

4. The flavor(s) the instances were created with.

I've set this bug as incomplete for now, please set it back to NEW when you reply to make sure it gets looked at.

Thanks!

Changed in nova:
status: New → Incomplete
Revision history for this message
ryan (ryanzh) wrote :

Hi, here's the detailed information. In our test, we used Tesla V100 and Tesla T4 as our GPU hardware resources:
1.
[devices]
#
# A list of the vGPU types enabled in the compute node.
#
# Some pGPUs (e.g. NVIDIA GRID K1) support different vGPU types. User can use
# this option to specify a list of enabled vGPU types that may be assigned to a
# guest instance. But please note that Nova only supports a single type in the
# Queens release. If more than one vGPU type is specified (as a comma-separated
# list), only the first one will be used. An example is as the following:
# [devices]
# enabled_vgpu_types = GRID K100,Intel GVT-g,MxGPU.2,nvidia-11
# (list value)
enabled_vgpu_types = nvidia-317
# enabled_vgpu_types = nvidia-320, for Tesla T4
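(A note for later readers, not applicable to Queens: if I recall correctly, newer nova releases, Ussuri onward, lift the single-type restriction by letting you pin each enabled vGPU type to specific pGPU PCI addresses via dynamic `[vgpu_<type>]` groups. A sketch of that config follows; the PCI addresses are placeholders for illustration only:

```ini
[devices]
enabled_vgpu_types = nvidia-317, nvidia-320

[vgpu_nvidia-317]
device_addresses = 0000:84:00.0

[vgpu_nvidia-320]
device_addresses = 0000:85:00.0
```

With a mapping like this, each type is served only by the pGPUs listed for it.)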

2. Tesla V100 * 8 / Tesla T4 * 8 per node

3. In nova-compute.log:
Here's the typical output when nova fails to allocate the vGPU resource to an instance.

2020-07-18 16:03:29.071 1 INFO nova.compute.claims [req-1138997e-23cb-4b26-a325-143251cc2b90 ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] [instance: 000ef195-f46b-4ac3-b0c1-a684a95ee8f3] Total vcpu: 92 VCPU, used: 88.00 VCPU
2020-07-18 16:03:29.071 1 INFO nova.compute.claims [req-1138997e-23cb-4b26-a325-143251cc2b90 ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] [instance: 000ef195-f46b-4ac3-b0c1-a684a95ee8f3] vcpu limit not specified, defaulting to unlimited
2020-07-18 16:03:29.071 1 INFO nova.compute.claims [req-1138997e-23cb-4b26-a325-143251cc2b90 ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] [instance: 000ef195-f46b-4ac3-b0c1-a684a95ee8f3] vcpu limit not specified, defaulting to unlimited
2020-07-18 16:03:29.072 1 INFO nova.compute.claims [req-1138997e-23cb-4b26-a325-143251cc2b90 ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] [instance: 000ef195-f46b-4ac3-b0c1-a684a95ee8f3] Claim successful on node xxx

: libvirt.libvirtError: Requested operation is not valid: mediated device /sys/bus/mdev/devices/24ccc9fc-1fd5-446b-b193-474d8b875a15 is in use by driver QEMU, domain instance-0000a567
Error: Requested operation is not valid: mediated device /sys/bus/mdev/devices/24ccc9fc-1fd5-446b-b193-474d8b875a15 is in use by driver QEMU, domain instance-0000a567
2020-07-18 16:03:29.388 1 ERROR nova.virt.libvirt.guest [req-56336f58-9ab8-47cb-8690-6d6429e3360c ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] Error launching a defined domain with XML:
...
  <uuid>d8e207c2-44fc-49d9-83c4-5e1386c709ff</uuid>
   ...
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='24ccc9fc-1fd5-446b-b193-474d8b875a15'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
2020-07-18 16:03:29.541 1 ERROR nova.compute.manager [instance: d8e207c2-44fc-49d9-83c4-5e1386c709ff] Traceback (most recent call last):
2020-07-18 16:03:29.541 1 ERROR nova.compute.manager [instance: d8e207c2-44fc-49d9-83c4-5e1386c709ff] File "/openstack/nova/nova/compute/manager.py", line 22...
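To find out which defined domain is holding the conflicting mediated device, one can dump each domain's XML (e.g. with `virsh dumpxml`) and extract the mdev source UUIDs. This is not nova code, just a stdlib-only diagnostic sketch:

```python
import xml.etree.ElementTree as ET

def mdev_uuids(domain_xml):
    """Return the mdev source UUIDs referenced by a libvirt domain XML.

    Matches <hostdev type='mdev'> elements and reads the uuid attribute
    of their <source><address/> child, as seen in the error above.
    """
    root = ET.fromstring(domain_xml)
    return [addr.get("uuid")
            for hostdev in root.findall(".//hostdev[@type='mdev']")
            for addr in hostdev.findall("./source/address")]
```

Running this over every domain and cross-referencing the results against the UUID in the libvirt error shows which instance already owns the device nova tried to reuse.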

Changed in nova:
status: Incomplete → New
tags: added: compute libvirt vgpu
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Looks like something went wrong when trying to determine which mdevs are available.

So, you have two different GPU profiles, T4 and V100. For Queens, Nova only supports a single vGPU type, so you need to find a vGPU type that is supported by both of them.

So, when you say that you use either

enabled_vgpu_types = nvidia-317
# enabled_vgpu_types = nvidia-320, for Tesla T4

does that mean that the value changes between compute nodes?

Changed in nova:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired