nova boot of a GPU instance attaches more than one GPU PCI device when a reschedule happens

Bug #1901170 reported by guolei on 2020-10-23
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)

Bug Description

When we boot a GPU instance, nova-compute's instance_claim updates the 'pci_devices' attribute of the input instance object from [] to [PciDevice]; the list now includes the GPU PCI device object calculated for this host.

Now look at the claim's code flow:
the claim clones the input instance object and stores the clone as self.instance;
the abort function aborts the instance's claim using self.instance, which is the clone, not the original input instance object.

So if spawning the instance fails, claim.abort is called. It reverts the 'pci_devices' attribute of the cloned instance object to [], and the pci_device record in the DB is reverted from allocated to free as well. The original input instance object is not reverted: its 'pci_devices' is still [PciDevice]. That object is sent to nova-conductor for the reschedule, and on the next node, after the claim, instance.pci_devices becomes [PciDevice, PciDevice].

The spawned instance then has two GPU PCI devices, or spawning raises a LibvirtError, "Device xxx is in used".
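
The flow described above can be modelled with a small, self-contained sketch. This is not nova's code: FakeInstance, FakeClaim and build_on_hosts are hypothetical stand-ins, the device names are invented, and the point at which the clone is taken is an assumption. It only illustrates how aborting a claim on a clone leaves the caller's object with a stale device that accumulates across reschedules.

```python
import copy


class FakeInstance:
    """Hypothetical stand-in for nova's Instance object (not the real class)."""

    def __init__(self):
        self.pci_devices = []


class FakeClaim:
    """Simplified model of a claim that keeps a clone of the instance."""

    def __init__(self, instance, device):
        # The allocation is recorded on the caller's instance object.
        instance.pci_devices.append(device)
        # Assumption: the claim keeps its own copy of the instance.
        self.instance = copy.deepcopy(instance)

    def abort(self):
        # Only the clone is reset; the caller's object is left untouched.
        self.instance.pci_devices = []


def build_on_hosts(instance, hosts):
    """Simulate a failed spawn on every host except the last one."""
    for index, host in enumerate(hosts):
        claim = FakeClaim(instance, f"gpu@{host}")
        if index < len(hosts) - 1:
            claim.abort()  # spawn failed, so reschedule to the next host
    return instance.pci_devices


devices = build_on_hosts(FakeInstance(), ["node1", "node2"])
print(devices)  # ['gpu@node1', 'gpu@node2']
```

In this model the instance that reaches the second host already carries the device claimed on the first host, which matches the [PciDevice, PciDevice] state reported above.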

Steps to reproduce
1. Induce a libvirt error on all compute nodes, so that spawn fails and a reschedule is triggered
2. nova boot a GPU instance
3. Check the guest XML in nova-compute.log

Expected result
On the reschedule node, the guest XML has just one GPU PCI device

Actual result
On the reschedule node, the guest XML has more than one GPU PCI device
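
One way to compare the expected and actual results is to count the PCI hostdev elements in the guest XML copied from nova-compute.log. The sketch below assumes the XML has already been extracted from the log as a string; count_pci_hostdevs is a hypothetical helper, and SAMPLE_XML with its PCI addresses is invented for illustration, not taken from a real log.

```python
import xml.etree.ElementTree as ET


def count_pci_hostdevs(guest_xml):
    """Count <hostdev type='pci'> elements under <devices> in a libvirt guest XML."""
    root = ET.fromstring(guest_xml)
    return sum(
        1
        for dev in root.findall("./devices/hostdev")
        if dev.get("type") == "pci"
    )


# Invented sample showing the buggy case: two PCI passthrough devices.
SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x86' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""

print(count_pci_hostdevs(SAMPLE_XML))  # 2
```

A count of 1 corresponds to the expected result; a count greater than 1 corresponds to the actual result reported here.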

guolei (guolei-5) on 2020-10-23
Changed in nova:
assignee: nobody → guolei (guolei-5)

Fix proposed to branch: master

Changed in nova:
status: New → In Progress