_get_pci_passthrough_devices prone to race condition

Bug #1972028 reported by Mohammed Naser
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Mohammed Naser

Bug Description

At the moment, the `_get_pci_passthrough_devices` function is prone to race conditions.

This specific code calls `listCaps()` on each device; however, it is possible that a device has disappeared by the time the method is called:

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7949-L7959

This results in the following traceback:

2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager [req-51b7c1c4-2b4a-46cc-9baa-8bf61801c48d - - - - -] Error updating resources for node <snip>.: libvirt.libvirtError: Node device not found: no node device with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager Traceback (most recent call last):
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 9946, in _update_available_resource_for_node
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager self.rt.update_available_resource(context, nodename,
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/resource_tracker.py", line 879, in update_available_resource
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager resources = self.driver.get_available_resource(nodename)
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 8937, in get_available_resource
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7663, in _get_pci_passthrough_devices
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager vdpa_devs = [
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7664, in <listcomp>
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager dev for dev in devices.values() if "vdpa" in dev.listCaps()
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.8/site-packages/libvirt.py", line 6276, in listCaps
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager raise libvirtError('virNodeDeviceListCaps() failed')
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager libvirt.libvirtError: Node device not found: no node device with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager

I think the cleaner way is to loop over all the items and skip a device if it raises an error that the device is not found.
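
A minimal sketch of that idea, not the patch that was actually merged. To stay runnable without libvirt installed, it uses hypothetical stand-ins: `NodeDeviceNotFound` plays the role of `libvirt.libvirtError` and `FakeNodeDevice` plays the role of a node device object. The helper name `filter_vdpa_devices` is also an assumption made for illustration.

```python
class NodeDeviceNotFound(Exception):
    """Hypothetical stand-in for the libvirt error raised when a
    node device no longer exists."""


class FakeNodeDevice:
    """Hypothetical stand-in for a libvirt node device object."""

    def __init__(self, name, caps, gone=False):
        self.name = name
        self._caps = caps
        self._gone = gone

    def listCaps(self):
        # Simulate the race: the device vanished after it was listed.
        if self._gone:
            raise NodeDeviceNotFound(
                "no node device with matching name '%s'" % self.name)
        return list(self._caps)


def filter_vdpa_devices(devices):
    """Return devices exposing the "vdpa" capability, skipping any
    device that disappears before its capabilities can be read."""
    vdpa_devs = []
    for dev in devices.values():
        try:
            caps = dev.listCaps()
        except NodeDeviceNotFound:
            # With real libvirt one would presumably catch
            # libvirt.libvirtError and compare get_error_code() to
            # libvirt.VIR_ERR_NO_NODE_DEVICE, re-raising anything else.
            continue
        if "vdpa" in caps:
            vdpa_devs.append(dev)
    return vdpa_devs


devices = {
    "vdpa_0": FakeNodeDevice("vdpa_0", ["vdpa"]),
    "net_tap": FakeNodeDevice("net_tap", ["net"], gone=True),
    "pci_0": FakeNodeDevice("pci_0", ["pci"]),
}
found = [dev.name for dev in filter_vdpa_devices(devices)]
print(found)
```

Catching only the "not found" case, rather than every exception, keeps unrelated libvirt failures visible instead of silently dropping devices.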

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/840993

Changed in nova:
status: New → In Progress
sean mooney (sean-k-mooney) wrote :

Triaging it as medium, as I think the agent will recover the next time the periodic task runs,
but it's valid and we should fix it.

assigning it to mnaser since they already have a patch up for review

Changed in nova:
assignee: nobody → Mohammed Naser (mnaser)
importance: Undecided → Medium
tags: added: compute libvirt pci resource-tracker
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/840993
Committed: https://opendev.org/openstack/nova/commit/8534499b4a76a8aaf39005f251da33a25e95a67c
Submitter: "Zuul (22348)"
Branch: master

commit 8534499b4a76a8aaf39005f251da33a25e95a67c
Author: Mohammed Naser <email address hidden>
Date: Fri May 6 17:18:35 2022 -0400

    Fix race condition in _get_pci_passthrough_devices

    The call to _get_pci_passthrough_devices could fail because a
    network device could have disappeared which would cause a traceback
    in the logs.

    This wraps the function in a safe way to return an empty array
    if it fails, which will clean-up the logs if the device disappears

    Closes-Bug: #1972028
    Change-Id: I46d3bbe122d9f8452f168286391bab67ecea3128

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.0.0.0rc1

This issue was fixed in the openstack/nova 26.0.0.0rc1 release candidate.
