VGPU allocation has a race condition

Bug #1836204 reported by Alex Xu on 2019-07-11
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Alex Xu

Bug Description

The VGPU is allocated by this method: https://github.com/openstack/nova/blob/8260979b71b29ce2666d37b3adc7c256482aa16d/nova/virt/libvirt/driver.py#L3235

That method determines which mdevs are already assigned by listing the libvirt domains.

But if two concurrent requests reach this method, they will both see the same set of assigned mdevs, so they may both pick the same free mdev.

So there is a race window between the point where we pick a free mdev:
https://github.com/openstack/nova/blob/8260979b71b29ce2666d37b3adc7c256482aa16d/nova/virt/libvirt/driver.py#L3235

and the point where we create the domain in libvirt:
https://github.com/openstack/nova/blob/8260979b71b29ce2666d37b3adc7c256482aa16d/nova/virt/libvirt/driver.py#L3241
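To make the window concrete, here is a minimal, deterministic sketch of the check-then-act pattern described above. It does not use nova or libvirt; `ALL_MDEVS`, `domains`, `pick_free_mdev` and `create_domain` are hypothetical stand-ins, and the interleaving of two requests is simulated by calling the steps in order rather than by real threads.

```python
# Hypothetical stand-ins for illustration only; none of these names
# come from nova or libvirt.
ALL_MDEVS = {"mdev-a", "mdev-b"}
domains = {}  # instance name -> mdev; stands in for the libvirt domain list


def pick_free_mdev():
    """Step 1: derive the free set from existing domains and pick one."""
    assigned = set(domains.values())
    return sorted(ALL_MDEVS - assigned)[0]


def create_domain(instance, mdev):
    """Step 2: only now does the assignment become visible to step 1."""
    domains[instance] = mdev


# Two requests interleave: both run step 1 before either runs step 2,
# so both observe an empty set of assigned mdevs.
choice_1 = pick_free_mdev()
choice_2 = pick_free_mdev()
create_domain("instance-1", choice_1)
create_domain("instance-2", choice_2)

# Both instances end up holding the same mdev.
print(choice_1, choice_2)
```

Because nothing records the choice between step 1 and step 2, the second request cannot know the first one has already selected an mdev.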

Alex Xu (xuhj) on 2019-07-11
Changed in nova:
assignee: nobody → Alex Xu (xuhj)
Eric Fried (efried) on 2019-07-11
Changed in nova:
status: New → Triaged
importance: Undecided → High
Eric Fried (efried) wrote :

This is of high importance not because the race is particularly likely in current code, but because we need to establish the framework to fix it so we can reuse that framework for other similar types of hardware.

In general, the fix is to claim (earmark for use by a specific instance) specific hardware artifacts [1] on the compute node in instance_claim, which is under COMPUTE_RESOURCE_SEMAPHORE. But only the virt driver can know what needs to be done to effect that claim for its specific hypervisor. And today instance_claim doesn't talk to the virt driver at all.

So the solution discussed in IRC [2] is to establish a new ComputeDriver interface, working title claim_for_instance() (and possibly a corresponding unclaim_for_instance() for rollbacks), which will be invoked from instance_claim (and _move_claim).

Using VGPUs-in-libvirt as an example, claim_for_instance would use an in-memory dict to associate a specific mdev with the specific instance for each VGPU in the allocation. This mapping could then be deleted during spawn, since the information can subsequently be gleaned from the domain XML.

[1] where "hardware" encompasses things like VFs - don't get pedantic on me
[2] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-07-11.log.html#t2019-07-11T12:39:18
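As a rough illustration of the approach described in this comment, the sketch below claims mdevs under a lock and records them in an in-memory dict. It is not the proposed nova code: the `MdevClaims` class, its method signatures, and the `assigned_in_domains` parameter are assumptions made for the example, and a plain `threading.Lock` stands in for COMPUTE_RESOURCE_SEMAPHORE.

```python
import threading


class MdevClaims:
    """Hypothetical in-memory claim tracker; not nova's actual interface."""

    def __init__(self, all_mdevs):
        self._all = set(all_mdevs)
        self._claims = {}  # mdev -> instance uuid, kept in memory only
        # Stands in for COMPUTE_RESOURCE_SEMAPHORE in this sketch.
        self._lock = threading.Lock()

    def claim_for_instance(self, instance_uuid, count, assigned_in_domains):
        """Earmark `count` free mdevs for the instance while holding the lock."""
        with self._lock:
            # Free means: not used by an existing domain and not yet claimed.
            free = self._all - set(assigned_in_domains) - set(self._claims)
            if len(free) < count:
                raise RuntimeError("not enough free mdevs")
            chosen = sorted(free)[:count]
            for mdev in chosen:
                self._claims[mdev] = instance_uuid
            return chosen

    def unclaim_for_instance(self, instance_uuid):
        """Drop the instance's claims, e.g. on rollback or once spawn is done."""
        with self._lock:
            owned = [m for m, i in self._claims.items() if i == instance_uuid]
            for mdev in owned:
                del self._claims[mdev]


claims = MdevClaims(["mdev-a", "mdev-b"])
first = claims.claim_for_instance("instance-1", 1, assigned_in_domains=[])
second = claims.claim_for_instance("instance-2", 1, assigned_in_domains=[])
print(first, second)
```

Since the second claim sees the first one in the dict even though no domain exists yet, the two instances get different mdevs. Dropping the entry after spawn matches the note above that the mapping can then be recovered from the domain XML.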

Related fix proposed to branch: master
Review: https://review.opendev.org/670783

Related fix proposed to branch: master
Review: https://review.opendev.org/670784

Related fix proposed to branch: master
Review: https://review.opendev.org/670785

Related fix proposed to branch: master
Review: https://review.opendev.org/670786

melanie witt (melwitt) on 2019-07-15
tags: added: libvirt

Related fix proposed to branch: master
Review: https://review.opendev.org/671222

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670786

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670785

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670782

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670783

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670784

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/671388

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670787

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/671222
