PCI device lost when there is an error in the configuration file

Bug #1809040 reported by Marc Gariépy
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

PCI passthrough is lost when you restart nova-compute with the wrong configuration.

I have this issue on the Queens release.

====
Steps to reproduce:
====
1- configure passthrough for a PCI device
2- start a VM with the PCI device
3- change the config in /etc/nova/nova.conf:
[pci]
-passthrough_whitelist = "[{"vendor_id": "8086", "product_id": "1572"}]"
+pci_passthrough_whitelist = "[{"vendor_id": "8086", "product_id": "1572"}]"

4- restart nova-compute
5- the device's row in nova.pci_devices gets deleted from the DB.
6- hard reboot the VM; the PCI devices are no longer in the libvirt config.
7- fix the config in nova.conf and restart nova-compute. New devices are created, but they cannot be assigned to the running VM.
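
The mistake in step 3 could be caught before restarting nova-compute. The following is only a minimal sketch, not part of nova: it assumes, based on the diff above, that `passthrough_whitelist` is the option that works under `[pci]` and that `pci_passthrough_whitelist` in that section is the typo. The function name `check_pci_section` is hypothetical.

```python
import configparser

# Assumption taken from the diff in step 3: under [pci] the working option is
# `passthrough_whitelist`; `pci_passthrough_whitelist` there is the typo.
EXPECTED_OPTION = "passthrough_whitelist"
MISNAMED_OPTION = "pci_passthrough_whitelist"


def check_pci_section(conf_text):
    """Return a list of problems found in the [pci] section of nova.conf text."""
    # default_section is renamed so that [DEFAULT] options are not merged
    # into every other section, which configparser does by default.
    parser = configparser.ConfigParser(
        interpolation=None, strict=False, default_section="__unused__"
    )
    parser.read_string(conf_text)

    if not parser.has_section("pci"):
        return ["no [pci] section found"]

    options = parser.options("pci")
    problems = []
    if MISNAMED_OPTION in options:
        problems.append(
            "[pci] contains %s; expected %s" % (MISNAMED_OPTION, EXPECTED_OPTION)
        )
    if EXPECTED_OPTION not in options:
        problems.append("[pci] does not set %s" % EXPECTED_OPTION)
    return problems


if __name__ == "__main__":
    sample = (
        "[pci]\n"
        'pci_passthrough_whitelist = "[{"vendor_id": "8086", "product_id": "1572"}]"\n'
    )
    for problem in check_pci_section(sample):
        print(problem)
```

To check a real host, the contents of /etc/nova/nova.conf would be read and passed to the function before step 4.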

I had a small issue during an upgrade: I made a typo in the nova.conf file when moving [DEFAULT] > pci_passthrough_whitelist to [pci] > pci_passthrough_whitelist, and I lost all the PCI devices on all the compute nodes.

Tags: pci
Matt Riedemann (mriedem) wrote :

Are you able to work around the issue with the lost PCI devices assigned to the one VM that was rebooted by cold migrating it? I'm not sure why the reboot wouldn't have fixed it, except that I think the PCI devices are more or less "assigned" to the instance when a resource "claim" is made, which happens during server create or a move operation like cold migrate (resize).

tags: added: pci
Marc Gariépy (mgariepy) wrote :

This seems to be a duplicate of this bug: https://bugs.launchpad.net/nova/+bug/1653810

Matt Riedemann (mriedem) wrote :

Marc was able to work around this by shelving and then unshelving the instance (that worked because the instance had nowhere else to go but the same host; otherwise he would have cold migrated it).

Matt Riedemann (mriedem) wrote :

I'm going to mark this as a duplicate of bug 1653810 since, yes, it is an edge case. Maybe we could document some kind of troubleshooting tip for it in the docs, though?

https://docs.openstack.org/nova/latest/admin/support-compute.html

Long-term we want to model pci devices with placement so that we don't have the brittle config-driven inventory code in compute and the pci tracker/manager, so presumably that would help with this.

Matt Riedemann (mriedem) wrote :

Also, that other bug looks like a duplicate of bug 1633120.

In thinking about this more, why do we allow deleting allocated pci devices from the DB and then re-creating them with new records that are now marked 'available' when clearly the physical device is not available? Obviously there is a referential constraint check missing somewhere in that code. Seems we could easily make that a failure by simply checking if a given pci device is assigned to an instance when nova-compute tries to delete it, regardless of the whitelist.
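
The check suggested above can be illustrated in isolation. This is a hedged sketch of the idea only and is not nova's PCI tracker code: `PciDevice`, `DeviceInUseError`, and `prune_devices` are hypothetical names, and a real device record carries far more state than shown here.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class PciDevice:
    # Simplified stand-in for a row in the pci_devices table (assumption).
    address: str
    status: str  # "available" or "allocated" in this sketch
    instance_uuid: Optional[str] = None


class DeviceInUseError(Exception):
    """Raised when a device assigned to an instance would be deleted."""


def prune_devices(tracked: List[PciDevice], whitelisted: Set[str]) -> List[PciDevice]:
    """Drop devices that are no longer whitelisted, unless one is still assigned.

    The whole operation fails if any device that would be removed is assigned
    to an instance, regardless of what the whitelist says.
    """
    in_use = [
        dev for dev in tracked
        if dev.address not in whitelisted and dev.instance_uuid is not None
    ]
    if in_use:
        addresses = ", ".join(dev.address for dev in in_use)
        raise DeviceInUseError(
            "refusing to delete devices assigned to instances: " + addresses
        )
    return [dev for dev in tracked if dev.address in whitelisted]
```

With a guard like this, the restart in step 4 of the report would surface an error instead of silently deleting the allocated device records.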
