Attaching virtual GPU devices to guests in nova

Bug #1887380 reported by ryan
This bug affects 1 person

Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description


- [X] This is a doc addition request.

Hi, a problem came up when we were using nova (Queens) configured with the vGPU feature to create several instances. It seems that multiple instances preempt the same vGPU resource; in our case, nova tried to attach a mediated device that another instance had already acquired. Here is the error reported in the log:

"libvirt.libvirtError: Requested operation is not valid: mediated device /sys/bus/mdev/devices/xxx is in use by driver QEMU, domain xxx"

Apparently, nova is trying to allocate a vGPU resource that is already in use by another instance. We also ruled out the possibility that the host did not have enough vGPU resources: in our case, 25% of the instances went into an error state even though the instances being created needed only 50% of all available vGPU resources. From our perspective, the problem lies with the nova-scheduler. Any idea how to work this out?
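As a side note for anyone reproducing this: it can help to compare the mediated devices that actually exist on the host against what nova believes is allocated. This small helper is not part of nova; it is just a diagnostic sketch that reads the mdev bus in sysfs (the `base` parameter exists so it can be pointed at a test directory):

```python
from pathlib import Path

def list_mdevs(base="/sys/bus/mdev/devices"):
    """Return {mdev_uuid: driver_name_or_None} for every mediated
    device registered under `base`.

    The 'driver' entry of each device is a symlink to the driver it
    is currently bound to (typically vfio_mdev on the host); comparing
    this inventory against the mdev UUIDs referenced by each guest's
    domain XML shows which devices are genuinely free.
    """
    devices = {}
    root = Path(base)
    if not root.is_dir():  # host has no mdev support or no devices yet
        return devices
    for dev in sorted(root.iterdir()):
        driver_link = dev / "driver"
        driver = driver_link.resolve().name if driver_link.exists() else None
        devices[dev.name] = driver
    return devices
```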

Thanks

Ruien Zhang
<email address hidden>

-----------------------------------
Release: 21.1.0.dev214 on 2020-04-28 20:09:00
SHA: d19f1ac47b0a5fe1dd80b7187087e5810501f16c
Source: https://opendev.org/openstack/nova/src/doc/source/admin/virtual-gpu.rst
URL: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html

Revision history for this message
Artom Lifshitz (notartom) wrote :

Hi, thanks for the bug report.

To better understand what's going on, we need additional details:

1. Your nova.conf file (specifically your [devices] section, as well as each specific device section)

2. Hardware details (which GPU model you're using)

3. nova-compute and nova-scheduler logs

4. The flavor(s) the instances were created with.

I've set this bug as incomplete for now, please set it back to NEW when you reply to make sure it gets looked at.

Thanks!

Changed in nova:
status: New → Incomplete
Revision history for this message
ryan (ryanzh) wrote :

Hi, here's the detailed information. In our test, we used Tesla V100 and Tesla T4 as our GPU hardware resources:
1.
[devices]
#
# A list of the vGPU types enabled in the compute node.
#
# Some pGPUs (e.g. NVIDIA GRID K1) support different vGPU types. User can use
# this option to specify a list of enabled vGPU types that may be assigned to a
# guest instance. But please note that Nova only supports a single type in the
# Queens release. If more than one vGPU type is specified (as a comma-separated
# list), only the first one will be used. An example is as the following:
# [devices]
# enabled_vgpu_types = GRID K100,Intel GVT-g,MxGPU.2,nvidia-11
# (list value)
enabled_vgpu_types = nvidia-317
# enabled_vgpu_types = nvidia-320, for Tesla T4
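(A note for later readers, not applicable to Queens: if I recall correctly, newer nova releases, Ussuri onward, lift the single-type restriction by letting you pin each enabled vGPU type to specific pGPU PCI addresses via dynamic `[vgpu_<type>]` groups. A sketch of that config follows; the PCI addresses are placeholders for illustration only:

```ini
[devices]
enabled_vgpu_types = nvidia-317, nvidia-320

[vgpu_nvidia-317]
device_addresses = 0000:84:00.0

[vgpu_nvidia-320]
device_addresses = 0000:85:00.0
```

With a mapping like this, each type is served only by the pGPUs listed for it.)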

2. Tesla V100 * 8 / Tesla T4 * 8 per node

3. In nova-compute.log:
Here's the typical output when nova fails to allocate the vGPU resource to an instance.

2020-07-18 16:03:29.071 1 INFO nova.compute.claims [req-1138997e-23cb-4b26-a325-143251cc2b90 ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] [instance: 000ef195-f46b-4ac3-b0c1-a684a95ee8f3] Total vcpu: 92 VCPU, used: 88.00 VCPU
2020-07-18 16:03:29.071 1 INFO nova.compute.claims [req-1138997e-23cb-4b26-a325-143251cc2b90 ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] [instance: 000ef195-f46b-4ac3-b0c1-a684a95ee8f3] vcpu limit not specified, defaulting to unlimited
2020-07-18 16:03:29.071 1 INFO nova.compute.claims [req-1138997e-23cb-4b26-a325-143251cc2b90 ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] [instance: 000ef195-f46b-4ac3-b0c1-a684a95ee8f3] vcpu limit not specified, defaulting to unlimited
2020-07-18 16:03:29.072 1 INFO nova.compute.claims [req-1138997e-23cb-4b26-a325-143251cc2b90 ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] [instance: 000ef195-f46b-4ac3-b0c1-a684a95ee8f3] Claim successful on node xxx

: libvirt.libvirtError: Requested operation is not valid: mediated device /sys/bus/mdev/devices/24ccc9fc-1fd5-446b-b193-474d8b875a15 is in use by driver QEMU, domain instance-0000a567
Error: Requested operation is not valid: mediated device /sys/bus/mdev/devices/24ccc9fc-1fd5-446b-b193-474d8b875a15 is in use by driver QEMU, domain instance-0000a567
2020-07-18 16:03:29.388 1 ERROR nova.virt.libvirt.guest [req-56336f58-9ab8-47cb-8690-6d6429e3360c ba48dadec84c4c55bac4f65c4a9e626a cf859eb03c8840fe9406de3e4703840f - default default] Error launching a defined domain with XML:
...
  <uuid>d8e207c2-44fc-49d9-83c4-5e1386c709ff</uuid>
   ...
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='24ccc9fc-1fd5-446b-b193-474d8b875a15'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
2020-07-18 16:03:29.541 1 ERROR nova.compute.manager [instance: d8e207c2-44fc-49d9-83c4-5e1386c709ff] Traceback (most recent call last):
2020-07-18 16:03:29.541 1 ERROR nova.compute.manager [instance: d8e207c2-44fc-49d9-83c4-5e1386c709ff] File "/openstack/nova/nova/compute/manager.py", line 22...
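To find out which defined domain is holding the conflicting mediated device, one can dump each domain's XML (e.g. with `virsh dumpxml`) and extract the mdev source UUIDs. This is not nova code, just a stdlib-only diagnostic sketch:

```python
import xml.etree.ElementTree as ET

def mdev_uuids(domain_xml):
    """Return the mdev source UUIDs referenced by a libvirt domain XML.

    Matches <hostdev type='mdev'> elements and reads the uuid attribute
    of their <source><address/> child, as seen in the error above.
    """
    root = ET.fromstring(domain_xml)
    return [addr.get("uuid")
            for hostdev in root.findall(".//hostdev[@type='mdev']")
            for addr in hostdev.findall("./source/address")]
```

Running this over every domain and cross-referencing the results against the UUID in the libvirt error shows which instance already owns the device nova tried to reuse.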

Changed in nova:
status: Incomplete → New
tags: added: compute libvirt vgpu
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Looks like something went wrong when trying to determine which mdevs are available.

So, you have two different GPU profiles, T4 and V100. For Queens, Nova only supports a single vGPU type, so you need to find a vGPU type that is supported by both of them.

So, when you say that you use either

enabled_vgpu_types = nvidia-317
# enabled_vgpu_types = nvidia-320, for Tesla T4

does that mean that the value changes between compute nodes?

Changed in nova:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired