nova boot of a GPU instance attaches more than one GPU PCI device after a reschedule

Bug #1901170 reported by guolei
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: guolei

Bug Description

Description
===========
When we boot a GPU instance, nova-compute's instance_claim
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/resource_tracker.py#L197-L217
updates the 'pci_devices' attribute of the input instance object from [] to [PciDevice]. The list now includes the GPU PCI device object calculated for the claim.

Now look at the claim's code flow:
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/claims.py#L64
The claim clones the input instance object and stores the clone in self.instance.
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/claims.py#L78-L84
The abort function aborts the instance's claim using self.instance, which is the clone, not the original input instance object.

So if spawning the instance fails, claim.abort is called. It reverts the cloned instance object's 'pci_devices' attribute to [], and the pci_device record in the DB is reverted from allocated to free as well. The original input instance object is not reverted: its 'pci_devices' is still [PciDevice]. That object is sent to nova-conductor for the reschedule, and on the next node, after the claim, instance.pci_devices becomes [PciDevice, PciDevice].

The spawned instance then either gets two GPU PCI devices, or the spawn raises a LibvirtError, "Device xxx is in used".
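
The flow described above can be illustrated with a small toy model. This is only a sketch of the reported behaviour, not nova's code: ToyInstance, ToyClaim, the device names and the "free"/"allocated" strings are all hypothetical, and the clone is made with copy.deepcopy as a stand-in for the clone taken in claims.py.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyInstance:
    """Hypothetical stand-in for the instance object (not nova's class)."""
    pci_devices: list = field(default_factory=list)


class ToyClaim:
    """Hypothetical claim that keeps a clone of the instance, as reported."""

    def __init__(self, instance, device, pool):
        self.instance = copy.deepcopy(instance)  # the claim works on a clone
        self.device = device
        self.pool = pool
        pool[device] = "allocated"
        # In this toy model the device is recorded on both objects.
        instance.pci_devices.append(device)
        self.instance.pci_devices.append(device)

    def abort(self):
        # Only the pool entry and the clone are reverted.
        # The caller's original instance object is left untouched.
        self.pool[self.device] = "free"
        self.instance.pci_devices.clear()


def simulate_reschedule():
    """Claim on node 1, fail and abort, then claim again on node 2."""
    instance = ToyInstance()
    pool = {"gpu-node1": "free", "gpu-node2": "free"}

    first = ToyClaim(instance, "gpu-node1", pool)
    first.abort()  # spawn failed on node 1

    ToyClaim(instance, "gpu-node2", pool)  # reschedule lands on node 2
    return instance.pci_devices, pool


if __name__ == "__main__":
    devices, pool = simulate_reschedule()
    print(devices)  # the stale node-1 device is still in the list
    print(pool)
```

In this model the device pool ends up consistent (the node-1 device is free again), but the original instance object still carries the stale entry, which matches the [PciDevice, PciDevice] state described in this report.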

Steps to reproduce
==================
1. Force a libvirt error on all compute nodes, so that spawn fails and a reschedule is triggered
2. Boot a GPU instance with nova boot
3. Inspect the guest XML in nova-compute.log

Expected result
===============
On the node the instance is rescheduled to, the guest XML has exactly one GPU PCI device.

Actual result
=============
On the node the instance is rescheduled to, the guest XML has more than one GPU PCI device.
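
To compare the expected and actual results, the PCI passthrough entries in the logged guest XML can be counted. The following is a minimal sketch, assuming the guest XML has been copied out of nova-compute.log into a string and that passthrough devices appear as libvirt `<hostdev type='pci'>` elements under `<devices>`. The helper name count_pci_hostdevs and the sample XML (including its PCI addresses) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def count_pci_hostdevs(domain_xml: str) -> int:
    """Count <hostdev type='pci'> elements directly under <devices>."""
    root = ET.fromstring(domain_xml)
    return len(root.findall("./devices/hostdev[@type='pci']"))


# Hypothetical guest XML fragment with two passthrough devices,
# i.e. the faulty state described under "Actual result".
SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x86' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""

if __name__ == "__main__":
    print(count_pci_hostdevs(SAMPLE_XML))
```

For an instance that requested a single GPU, a count above one on the rescheduled node indicates the problem reported here.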

guolei (guolei-5)
Changed in nova:
assignee: nobody → guolei (guolei-5)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/759544

Changed in nova:
status: New → In Progress