[Victoria] nova-compute won't start on aarch64 - raises PciDeviceNotFoundById

Bug #1915255 reported by Aurelien Lourot
This bug affects 1 person
Affects                    Status         Importance   Assigned to   Milestone
OpenStack Compute (nova)   Fix Released   Medium       sean mooney
Victoria                   Triaged        Medium       Unassigned

Bug Description

Description
===========

When deploying OpenStack Victoria on Ubuntu 20.04 (Focal) on arm64/aarch64, nova-compute 22.0.1 fails to start with (nova-compute.log):

----------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 156, in get_ifname_by_pci_address
    dev_info = os.listdir(dev_path)
FileNotFoundError: [Errno 2] No such file or directory: '/sys/bus/pci/devices/0002:01:00.1/physfn/net'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 9823, in _update_available_resource_for_node
    self.rt.update_available_resource(context, nodename,
  File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 880, in update_available_resource
    resources = self.driver.get_available_resource(nodename)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 8473, in get_available_resource
    data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in _get_pci_passthrough_devices
    pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in <listcomp>
    pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7199, in _get_pcidev_info
    device.update(_get_device_type(cfgdev, address, dev, net_devs))
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7154, in _get_device_type
    parent_ifname = pci_utils.get_ifname_by_pci_address(
  File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 159, in get_ifname_by_pci_address
    raise exception.PciDeviceNotFoundById(id=pci_addr)
nova.exception.PciDeviceNotFoundById: PCI device 0002:01:00.1 not found
----------

This results in an empty `openstack hypervisor list`.

This does not happen with OpenStack Ussuri (nova-compute 21.1.0), and we haven't seen it on other architectures (yet?). The code that raises the exception was introduced between Ussuri and Victoria [0], so the first version containing it is 22.0.0.

$ lspci | grep 0002:01:00.1
0002:01:00.1 Ethernet controller: Cavium, Inc. THUNDERX Network Interface Controller virtual function (rev 09)

Indeed /sys/bus/pci/devices/0002:01:00.1/physfn/ doesn't contain `net` but I'm not sure if that's really a problem or if nova-compute should just catch the exception and move on?
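
To see which devices on a host are affected, one option is to walk sysfs and list the virtual functions whose `physfn` entry has no `net` subdirectory. The following is only a minimal sketch and not nova code: the function name `vfs_without_pf_netdev` and its `sysfs_root` parameter are made up for illustration, and it assumes the sysfs layout seen in the traceback above.

```python
import os


def vfs_without_pf_netdev(sysfs_root="/sys/bus/pci/devices"):
    """Return PCI addresses of VFs whose parent PF exposes no netdev.

    A device is treated as a VF when it has a ``physfn`` entry. The PF is
    considered to have a netdev when ``physfn/net`` is a non-empty directory.
    """
    missing = []
    if not os.path.isdir(sysfs_root):
        return missing
    for addr in sorted(os.listdir(sysfs_root)):
        physfn = os.path.join(sysfs_root, addr, "physfn")
        if not os.path.exists(physfn):
            continue  # not a virtual function
        net_dir = os.path.join(physfn, "net")
        try:
            has_netdev = bool(os.listdir(net_dir))
        except OSError:
            # Missing or unreadable net directory: no PF netdev visible.
            has_netdev = False
        if not has_netdev:
            missing.append(addr)
    return missing
```

Calling `vfs_without_pf_netdev()` with the default root on the affected machine would be expected to include 0002:01:00.1 in its result, given the `lspci` and sysfs observations above.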

A similar issue in the past [1] shows that this might be an issue specific to the Cavium Thunder X NIC.

Related issue: [2]

Steps to reproduce
==================

Install and run nova >= 22.0.0 on an aarch64 machine (with a Cavium Thunder X NIC if possible). I personally use Juju [3] for deploying an entire OpenStack Victoria setup to a lab:

$ git clone https://github.com/openstack-charmers/openstack-bundles
$ cd openstack-bundles/development/openstack-base-focal-victoria/
$ juju deploy ./bundle.yaml

Expected result
===============

`openstack hypervisor list` shows at least one hypervisor.
nova-compute.log doesn't contain nova.exception.PciDeviceNotFoundById

Actual result
=============

`openstack hypervisor list` doesn't show any hypervisor.
nova-compute.log contains nova.exception.PciDeviceNotFoundById
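
To confirm the actual result on a host and see which devices trigger it, the log can be scanned for the exception message. This is a rough sketch: the function name `failing_pci_addresses` is hypothetical, and the pattern is based only on the message format shown in the traceback above.

```python
import re

# Matches e.g. "PciDeviceNotFoundById: PCI device 0002:01:00.1 not found"
_NOT_FOUND = re.compile(
    r"PciDeviceNotFoundById: PCI device (?P<addr>[0-9a-fA-F:.]+) not found")


def failing_pci_addresses(log_lines):
    """Return the unique PCI addresses named in PciDeviceNotFoundById errors,
    in order of first appearance."""
    seen = []
    for line in log_lines:
        match = _NOT_FOUND.search(line)
        if match and match.group("addr") not in seen:
            seen.append(match.group("addr"))
    return seen
```

Any iterable of lines works as input, for example an open file object for nova-compute.log (its location depends on the deployment).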

Environment
===========

$ dpkg -l | grep nova
ii nova-api-metadata 2:22.0.1-0ubuntu1~cloud0 all OpenStack Compute - metadata API frontend
ii nova-common 2:22.0.1-0ubuntu1~cloud0 all OpenStack Compute - common files
ii nova-compute 2:22.0.1-0ubuntu1~cloud0 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:22.0.1-0ubuntu1~cloud0 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:22.0.1-0ubuntu1~cloud0 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:22.0.1-0ubuntu1~cloud0 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:17.2.1-0ubuntu1~cloud0 all client library for OpenStack Compute API - 3.x

# cat /etc/nova/nova-compute.conf
[DEFAULT]
compute_driver=libvirt.LibvirtDriver
[libvirt]
virt_type=kvm

$ dpkg -l | grep libvirt
ii libvirt-clients 6.0.0-0ubuntu8.5 arm64 Programs for the libvirt library
ii libvirt-daemon 6.0.0-0ubuntu8.5 arm64 Virtualization daemon
ii libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.5 arm64 Virtualization daemon QEMU connection driver
ii libvirt-daemon-driver-storage-rbd 6.0.0-0ubuntu8.5 arm64 Virtualization daemon RBD storage driver
ii libvirt-daemon-system 6.0.0-0ubuntu8.5 arm64 Libvirt daemon configuration files
ii libvirt-daemon-system-systemd 6.0.0-0ubuntu8.5 arm64 Libvirt daemon configuration files (systemd)
ii libvirt0:arm64 6.0.0-0ubuntu8.5 arm64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:22.0.1-0ubuntu1~cloud0 all OpenStack Compute - compute node libvirt support
ii python3-libvirt 6.1.0-1 arm64 libvirt Python 3 bindings

This shouldn't be relevant but:

* Ceph 15.2.7 for storage
* Neutron with OVN

Logs & Configs
==============

sosreport attached.

[0] https://opendev.org/openstack/nova/commit/efc27ff84c3
[1] https://bugs.launchpad.net/charm-nova-compute/+bug/1771662
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1724999
[3] https://jaas.ai/openstack-base

Aurelien Lourot (aurelien-lourot) wrote :
tags: added: aarch64 pci
tags: added: compute
Balazs Gibizer (balazs-gibizer) wrote :

Looking at the diff between stable/ussuri and stable/victoria, I found this patch that seems pretty suspicious: https://review.opendev.org/c/openstack/nova/+/739131 Could you try reverting this patch locally in your environment to see if that solves your problem? I have let the author of that patch know about this bug report.

sean mooney (sean-k-mooney) wrote :

This is a real issue because the Cavium ThunderX hardware violates an assumption we have with regard to PFs having netdevs if VFs do.
We just need to re-add the try/except that was removed:
https://review.opendev.org/c/openstack/nova/+/739131/12/nova/virt/libvirt/driver.py#b6957

It was originally removed because we only look at the subset of VFs that are NICs,
but since the Cavium ThunderX does not assign a PF to all VFs
(per https://bugs.launchpad.net/charm-nova-compute/+bug/1771662),
we need to catch the exception in this case as we did before.

This means that minimum-bandwidth-based QoS cannot be implemented on this hardware, as we rely on the PF netdev name to correlate the bandwidth between nova and neutron, but other functionality should work. The only way to support minimum bandwidth QoS on this hardware would be to alter the NIC driver or enhance nova/neutron to support using the PF PCI address instead of the parent netdev name.
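
The pattern described here can be sketched as follows. This is a simplified, self-contained illustration and not the actual nova patch: the exception class is a local stand-in for nova.exception.PciDeviceNotFoundById, the lookup only imitates the behaviour visible in the traceback, and `describe_vf` and the `sysfs_root` parameter are hypothetical names.

```python
import os


class PciDeviceNotFoundById(Exception):
    """Local stand-in for nova.exception.PciDeviceNotFoundById."""


def get_ifname_by_pci_address(pci_addr, pf_interface=False,
                              sysfs_root="/sys/bus/pci/devices"):
    """Simplified lookup modelled on the traceback (not nova's code)."""
    parts = [sysfs_root, pci_addr]
    if pf_interface:
        parts.append("physfn")
    parts.append("net")
    try:
        # First netdev name listed for the device (or for its PF).
        return os.listdir(os.path.join(*parts)).pop()
    except (OSError, IndexError):
        raise PciDeviceNotFoundById(pci_addr)


def describe_vf(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Build a device dict, leaving out parent_ifname if the PF has no netdev."""
    device = {"address": pci_addr, "dev_type": "type-VF"}
    try:
        device["parent_ifname"] = get_ifname_by_pci_address(
            pci_addr, pf_interface=True, sysfs_root=sysfs_root)
    except PciDeviceNotFoundById:
        # Hardware such as the ThunderX NIC: skip the parent interface name
        # instead of letting the exception abort the resource update.
        pass
    return device
```

With this shape, a VF without a PF netdev still gets reported, only without a `parent_ifname`, which is why features that depend on that name (such as the minimum bandwidth QoS mentioned above) remain unavailable on such devices.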

Changed in nova:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → sean mooney (sean-k-mooney)
sean mooney (sean-k-mooney) wrote :

https://review.opendev.org/c/openstack/nova/+/777679
The bot is slow due to a network issue, so I'm just adding the review link.

Changed in nova:
status: Triaged → In Progress
Aurelien Lourot (aurelien-lourot) wrote :

Hi Sean, thanks a lot for your work on this! Anything I can do to help it cross the finish line?

sean mooney (sean-k-mooney) wrote :

Hi, sorry for the delay. I'm now back from PTO and I'll try to work on this on Monday.
If you have time you could also pick up the patch, but I'll see if we can get this moving forward next week.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/777679
Committed: https://opendev.org/openstack/nova/commit/a569a51fedd058fdae2eb0066e087c37688987f8
Submitter: "Zuul (22348)"
Branch: master

commit a569a51fedd058fdae2eb0066e087c37688987f8
Author: Sean Mooney <email address hidden>
Date: Fri May 21 14:45:45 2021 +0100

    fix sr-iov support on Cavium ThunderX hosts.

    This change is a partial revert of
    Ibf8dca4bd57b3bddb39955b53cc03564506f5754
    to reintroduce a try-except which is required for
    some non standard hardware.

    On the Cavium ThunderX platform, it's possible to have
    virtual functions which are netdevs which are not associated
    to a PF. This causes the PF name lookup to fail.
    Prior to Ibf8dca4bd57b3bddb39955b53cc03564506f5754
    when the lookup failed it was caught and we skipped
    populating the parent PF interface name.

    This change restores that behavior.

    Closes-Bug: #1915255
    Change-Id: Ia10ccdd9fbed3870d0592e3cbbff17f292651dd2

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.

Aurelien Lourot (aurelien-lourot) wrote :

Thanks a lot! I confirm that this is fixed for me in Nova 24.0.0-0ubuntu1~cloud0 (OpenStack Xena), and not fixed in Nova 23.0.2-0ubuntu1~cloud0 (OpenStack Wallaby).
