Nova fails creating multiple NVIDIA VGPU instances at the same time

Bug #1797269 reported by parser
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

Hi everyone, I'm using OpenStack Queens deployed with Kolla Ansible.

https://docs.openstack.org/kolla-ansible/latest/

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c08ab2bc6e8e registry.vscaler.com:5000/kolla/centos-binary-neutron-openvswitch-agent:6.0.1 "kolla_start" 11 weeks ago Up 5 weeks neutron_openvswitch_agent
65c7efca89c1 registry.vscaler.com:5000/kolla/centos-binary-openvswitch-vswitchd:6.0.1 "kolla_start" 11 weeks ago Up 5 weeks openvswitch_vswitchd
982fdc925784 registry.vscaler.com:5000/kolla/centos-binary-openvswitch-db-server:6.0.1 "kolla_start" 11 weeks ago Up 5 weeks openvswitch_db
05c2908fcff9 registry.vscaler.com:5000/kolla/centos-binary-nova-compute:6.0.1 "kolla_start" 11 weeks ago Up 3 weeks nova_compute
b22fcbc6b48f registry.vscaler.com:5000/kolla/centos-binary-nova-libvirt:6.0.1 "kolla_start" 11 weeks ago Up 5 weeks nova_libvirt
754fda56368e registry.vscaler.com:5000/kolla/centos-binary-nova-ssh:6.0.1 "kolla_start" 11 weeks ago Up 5 weeks nova_ssh
72ca828b94e8 registry.vscaler.com:5000/kolla/centos-binary-cron:6.0.1 "kolla_start" 11 weeks ago Up 5 weeks cron
d86fd4806efe registry.vscaler.com:5000/kolla/centos-binary-kolla-toolbox:6.0.1 "kolla_start" 11 weeks ago Up 5 weeks kolla_toolbox
d0d7bf199bf1 registry.vscaler.com:5000/kolla/centos-binary-fluentd:6.0.1 "kolla_start" 11 weeks ago Up 5 weeks fluentd

I have 4 NVIDIA Corporation GP100GL cards configured with the NVIDIA driver package NVIDIA-GRID-RHEL-7.5-390.72-390.75-391.81.

nova.conf is configured with:

[devices]
enabled_vgpu_types = nvidia-96

This gives me 16 VGPUs, one per instance. Creating instances one at a time through the Horizon interface works without problems; the problem appears when I select 5 or 6 instances to create at the same time.
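To confirm how many VGPUs of the configured type the compute host still reports as free, the mediated-device sysfs tree can be read directly. The following is a minimal sketch, assuming the standard Linux mdev layout (`/sys/class/mdev_bus/<parent>/mdev_supported_types/<type>/available_instances`); the function name `available_vgpus` is made up for illustration and is not part of Nova.

```python
import glob
import os


def available_vgpus(vgpu_type, sysfs_root="/sys/class/mdev_bus"):
    """Return {parent_device: available_instances} for one mdev type.

    Assumes the kernel's mediated-device sysfs layout; parents that do
    not expose the requested type are simply absent from the result.
    """
    pattern = os.path.join(
        sysfs_root, "*", "mdev_supported_types", vgpu_type,
        "available_instances")
    counts = {}
    for path in glob.glob(pattern):
        # The first path component below sysfs_root is the parent device.
        parent = os.path.relpath(path, sysfs_root).split(os.sep)[0]
        with open(path) as fh:
            counts[parent] = int(fh.read().strip())
    return counts


if __name__ == "__main__":
    # "nvidia-96" is the type from the [devices] section above.
    per_parent = available_vgpus("nvidia-96")
    for parent, free in sorted(per_parent.items()):
        print("%s: %d free" % (parent, free))
    print("total free: %d" % sum(per_parent.values()))
```

Running this before and after a batch of instance creations shows whether the number of free VGPUs drops by the number of instances that actually started.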

Each instance has 8 GB RAM, 4 VCPUs and 1 VGPU, with no volume.

2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2031, in _build_and_run_instance
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] block_device_info=block_device_info)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3089, in spawn
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] destroy_disks_on_failure=True)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5614, in _create_domain_and_network
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] destroy_disks_on_failure)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] self.force_reraise()
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] six.reraise(self.type_, self.value, self.tb)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5583, in _create_domain_and_network
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] post_xml_callback=post_xml_callback)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5502, in _create_domain
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] guest.launch(pause=pause)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 144, in launch
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] self._encoded_xml, errors='ignore')
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] self.force_reraise()
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] six.reraise(self.type_, self.value, self.tb)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 139, in launch
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] return self._domain.createWithFlags(flags)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] result = proxy_call(self._autowrap, f, *args, **kwargs)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] rv = execute(f, *args, **kwargs)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] six.reraise(c, e, tb)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] rv = meth(*args, **kwargs)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8] libvirtError: Requested operation is not valid: mediated device /sys/bus/mdev/devices/8d219724-ba8c-11e8-8c75-0cc47ad9af5c is in use by driver QEMU, domain instance-00001396
2018-10-09 16:53:41.815 7 ERROR nova.compute.manager [instance: bf4b679f-50ec-4ade-a530-a822486999a8]
2018-10-09 16:53:41.818 7 INFO nova.compute.manager [req-15e5d59f-4a08-46e2-b748-94a4fe5daf2f 7e942918147f479e8c07877240de5d16 c5957eec2ab84cd5b45b9324ed9a3b29 - default default] [instance: bf4b679f-50ec-4ade-a530-a822486999a8] Terminating instance

I think this is a race condition: a VGPU is assigned to one instance, and then Nova tries to assign the same VGPU to another instance even though it is already in use (the libvirt error above reports the mediated device as already held by domain instance-00001396).
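The suspected pattern can be illustrated with a toy model. This is only a sketch of a check-then-claim race, not Nova's actual allocation code: the class `MdevPool`, its methods and the device names are hypothetical, and a barrier is used to force the bad interleaving so the result is reproducible.

```python
import threading


class MdevPool(object):
    """Toy pool of free mediated devices (names are made up)."""

    def __init__(self, names):
        self._free = set(names)
        self._lock = threading.Lock()

    def claim_unlocked(self, barrier):
        # Step 1: look for a free device, without any synchronization.
        candidate = min(self._free)
        # Force every concurrent request to finish step 1 before any
        # of them reaches step 2 (the interleaving suspected here).
        barrier.wait()
        # Step 2: mark it as used. Too late: the others chose it too.
        self._free.discard(candidate)
        return candidate

    def claim_locked(self):
        # Look-up and claim happen as one atomic step.
        with self._lock:
            if not self._free:
                raise RuntimeError("no free mediated device")
            candidate = min(self._free)
            self._free.discard(candidate)
            return candidate


def boot_concurrently(pool, count, locked):
    """Start `count` simulated instance builds and return their devices."""
    barrier = threading.Barrier(count)
    claimed = []

    def worker():
        if locked:
            claimed.append(pool.claim_locked())
        else:
            claimed.append(pool.claim_unlocked(barrier))

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed


if __name__ == "__main__":
    names = ["mdev-%02d" % i for i in range(16)]
    racy = boot_concurrently(MdevPool(names), 5, locked=False)
    safe = boot_concurrently(MdevPool(names), 5, locked=True)
    print("unlocked: %d distinct devices for 5 instances" % len(set(racy)))
    print("locked:   %d distinct devices for 5 instances" % len(set(safe)))
```

In the unlocked case all five simulated builds end up with the same device, which matches the "is in use by driver QEMU" error for every instance after the first. Whether the real code path is missing such a lock is an assumption that would need to be checked against the libvirt driver.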

Thanks!

Tags: libvirt
parser (parser)
summary: - Nova fail creating multiple NVIDIA VGPU instances at the same time
+ Nova fails creating multiple NVIDIA VGPU instances at the same time
description: updated
tags: added: libvirt
Mariusz Karpiarz (mkarpiarz) wrote :