Upgrade from X/O -> B/Q breaks pci_devices in mysql for SR-IOV

Bug #1883929 reported by Diko Parvanov
This bug affects 2 people
Affects                        Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)       Expired  Undecided   Unassigned
OpenStack Nova Compute Charm   Expired  Undecided   Unassigned

Bug Description

After upgrading from xenial/ocata to bionic/queens, SR-IOV instance creation (--vnic-type direct) fails with missing devices.

The pci_devices mysql table is filled with wrong PCI entries that do not exist on the server. Restarting the nova-compute and nova-cloud-controller services did not fix this: the proper PCI devices were not rediscovered.
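One way to confirm which entries are stale is to compare the addresses recorded in pci_devices against what the kernel exposes under /sys/bus/pci/devices on the compute node. The following is only a minimal sketch: the function name find_stale_addresses is hypothetical, and the address list is assumed to have been exported from the database by hand.

```python
import os


def find_stale_addresses(db_addresses, sysfs_root="/sys/bus/pci/devices"):
    """Return the addresses in db_addresses that have no entry under sysfs_root.

    db_addresses is assumed to be a list of PCI addresses (for example
    "0000:d8:04.4") copied from the pci_devices table for this host.
    """
    try:
        present = set(os.listdir(sysfs_root))
    except FileNotFoundError:
        # No sysfs PCI tree at all: every recorded address counts as stale.
        present = set()
    return sorted(addr for addr in db_addresses if addr not in present)


if __name__ == "__main__":
    # Addresses taken from the errors in this report.
    recorded = ["0000:d8:04.4", "0000:d8:04.7", "0000:d8:05.0", "0000:d8:05.1"]
    for addr in find_stale_addresses(recorded):
        print("recorded in DB but missing on host:", addr)
```

Any address it prints is recorded in the database but missing on the host, which is the condition behind the errors shown in this report.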

Related errors:

2020-06-17 12:55:19.556 1182599 WARNING nova.pci.utils [req-76b21329-b364-4999-ac86-8c729cb91ac0 - - - - -] No net device was found for VF 0000:d8:05.0: PciDeviceNotFoundById: PCI device 0000:d8:05.0 not found
2020-06-17 12:55:19.603 1182599 WARNING nova.pci.utils [req-76b21329-b364-4999-ac86-8c729cb91ac0 - - - - -] No net device was found for VF 0000:d8:05.1: PciDeviceNotFoundById: PCI device 0000:d8:05.1 not found
2020-06-17 12:55:19.711 1182599 WARNING nova.pci.utils [req-76b21329-b364-4999-ac86-8c729cb91ac0 - - - - -] No net device was found for VF 0000:d8:04.7: PciDeviceNotFoundById: PCI device 0000:d8:04.7 not found
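To get the full list of affected VFs out of a larger nova-compute log, the warnings above can be filtered with a small script. This is a sketch that assumes the message format stays exactly as in the three lines quoted here; the function name missing_vfs and the log path in the usage note are hypothetical.

```python
import re

# Matches the warning text quoted above and captures the VF address
# in domain:bus:slot.function form.
VF_WARNING = re.compile(
    r"No net device was found for VF "
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):"
)


def missing_vfs(log_lines):
    """Return the sorted, de-duplicated VF addresses named in the warnings."""
    found = set()
    for line in log_lines:
        match = VF_WARNING.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)
```

For example, `missing_vfs(open("/var/log/nova/nova-compute.log"))` would scan a whole log file, assuming that is where the log lives on the unit.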

Error on instance creation:

{u'message': u'Device 0000:d8:04.4 not found: could not access /sys/bus/pci/devices/0000:d8:04.4/config: No such file or directory', u'code': 500, u'details': u'Traceback (most recent call last):\n File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1863, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2143, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\nRescheduledException: Build of instance ec163abf-9c7a-460a-9512-4915f47af6b9 was re-scheduled: Device 0000:d8:04.4 not found: could not access /sys/bus/pci/devices/0000:d8:04.4/config: No such file or directory\n', u'created': u'2020-06-17T11:46:11Z'

summary: - Upgrade from X/O -> B/Q brakes pci_devices in mysql for SR-IOV
+ Upgrade from X/O -> B/Q breaks pci_devices in mysql for SR-IOV
tags: added: pci resource-tracker upgrade
Balazs Gibizer (balazs-gibizer) wrote :

"The pci_devices mysql table if filled with wrong PCI entries, that do not exist on the server."
Did those devices exist before the upgrade?

Diko Parvanov (dparv) wrote :

The hardware hasn't been changed; however, the firmware was upgraded. It is possible that the naming/mapping changed during the firmware upgrade. The devices must have been there before, as the machines were running workloads with SR-IOV ports attached.

James Page (james-page) wrote :

How is the whitelist for PCI devices configured? If all of the PCI device naming changed as part of the OS upgrade (or maybe the firmware upgrade), do you also need to update the whitelist configuration for the charm?
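If the whitelist is expressed by address or device name, a renumbering would leave it pointing at nothing, whereas a vendor/product match should survive it. As a rough check, the devices currently present for a given vendor/product pair can be listed from sysfs and compared with the configured whitelist. This is a sketch only: devices_matching is a hypothetical helper, and the IDs in the usage note are placeholders, not values taken from this deployment.

```python
import os


def read_id(path):
    """Read a sysfs ID file such as 'vendor' and return it without the 0x prefix."""
    with open(path) as handle:
        value = handle.read().strip().lower()
    return value[2:] if value.startswith("0x") else value


def devices_matching(vendor_id, product_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses under sysfs_root with the given vendor/product IDs."""
    matches = []
    for address in sorted(os.listdir(sysfs_root)):
        dev_dir = os.path.join(sysfs_root, address)
        try:
            vendor = read_id(os.path.join(dev_dir, "vendor"))
            product = read_id(os.path.join(dev_dir, "device"))
        except OSError:
            # Skip entries that do not expose the expected ID files.
            continue
        if vendor == vendor_id.lower() and product == product_id.lower():
            matches.append(address)
    return matches
```

Calling `devices_matching("8086", "154c")` on a compute node (placeholder IDs) would list the addresses a vendor/product whitelist entry could match today, which can then be compared with what is in pci_devices.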

James Page (james-page) wrote :

I suspect that because the pci_devices are allocated, they can't be cleaned up when they disappear, as they are associated with other records in the DB.

James Page (james-page) wrote :

Updating the pci device whitelist configuration does result in any devices which are no longer whitelisted being marked as deleted (see the deleted_at column).
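To see which rows are live and which have been soft-deleted, the table can be split on deleted_at. The sketch below assumes a Python DB-API cursor connected to the nova database and assumes the pci_devices columns address, status, instance_uuid and deleted_at; split_live_and_deleted is a hypothetical helper name.

```python
# Read-only query; no rows are modified.
QUERY = """
SELECT address, status, instance_uuid, deleted_at
FROM pci_devices
ORDER BY address
"""


def split_live_and_deleted(cursor):
    """Return (live, deleted) lists of (address, status, instance_uuid) tuples.

    A row counts as deleted when its deleted_at column is not NULL.
    """
    cursor.execute(QUERY)
    live, deleted = [], []
    for address, status, instance_uuid, deleted_at in cursor.fetchall():
        row = (address, status, instance_uuid)
        if deleted_at is not None:
            deleted.append(row)
        else:
            live.append(row)
    return live, deleted
```

Live rows that still carry an instance_uuid but whose address no longer exists on the host would be the ones worth including in the table dumps.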

James Page (james-page) wrote :

Marking 'Incomplete' for now until further information is provided - the following would be nice:

 logs from nova-* charm units
 lspci from compute nodes
 table dumps of instances and pci_devices from the nova database

Thanks!

Changed in charm-nova-compute:
status: New → Incomplete
Changed in nova:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack nova-compute charm because there has been no activity for 60 days.]

Changed in charm-nova-compute:
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired