nova-compute service failed to start up

Bug #1926375 reported by Linda Guo
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

   Description
   ===========
   nova compute service crashed and can't start up

   Steps to reproduce
   ==================
   N/A

   Expected result
   ===============
   nova compute service started successfully

   Actual result
   =============
   nova compute service failed to start, the error log looks like

   7fa7cc4856fc),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap9d4120de-9a')
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service [req-7e1e7398-1ba6-440a-8632-2aed9e0d1bf9 - - - - -] Error starting thread.: KeyError: 'pci_slot'
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service Traceback (most recent call last):
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 810, in run_service
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service service.start()
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/service.py", line 172, in start
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service self.manager.init_host()
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1426, in init_host
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service self._init_instance(context, instance)
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1128, in _init_instance
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service self.driver.plug_vifs(instance, net_info)
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1152, in plug_vifs
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service self.vif_driver.plug(instance, vif)
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/virt/libvirt/vif.py", line 767, in plug
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service self.plug_hw_veb(instance, vif)
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/virt/libvirt/vif.py", line 676, in plug_hw_veb
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service set_vf_trusted(vif['profile']['pci_slot'], True)
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service KeyError: 'pci_slot'
2021-03-31 15:32:52.056 59251 ERROR oslo_service.service

   Environment
   ===========
   1. Exact version of OpenStack you are running. See the following
      list for all releases: http://docs.openstack.org/releases/

      $ dpkg -l | grep nova
ii nova-api-metadata 2:21.1.1-0ubuntu2~cloud0 all OpenStack Compute - metadata API frontend
ii nova-common 2:21.1.1-0ubuntu2~cloud0 all OpenStack Compute - common files
ii nova-compute 2:21.1.1-0ubuntu2~cloud0 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:21.1.1-0ubuntu2~cloud0 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:21.1.1-0ubuntu2~cloud0 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:21.1.1-0ubuntu2~cloud0 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:17.0.0-0ubuntu1~cloud0 all client library for OpenStack Compute API - 3.x

   2. Which storage type did you use?
      ceph

   3. Which networking type did you use?
      Neutron with OpenVSwitch

Linda Guo (lihuiguo)
description: updated
sean mooney (sean-k-mooney) wrote :

Based on the fact that the error is in plug_hw_veb and the VIF shows plugin='ovs',port_profile=VIFPortProfileOpenVSwitch, this would imply the port is vnic_type direct and was bound by either the ml2/ovs or ml2/ovn mechanism driver, which means the network backend is hardware-offloaded OVS.

The trusted VF feature is only implemented for standard SR-IOV and is not supported with hardware-offloaded OVS, but the error in this case seems to be caused by the lack of the PCI address in the binding profile. Can you provide the output of openstack port show and confirm that this is hardware-offloaded OVS?

The failure happens because we try to plug the VIFs on agent start, and if the port data is corrupted, as it appears to be in this case, it will fail. I believe the way to fix this would be to identify which VF is claimed in the pci_devices table for the instance and update the binding profile manually.
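
A minimal sketch of that manual repair, assuming the claimed VF has already been looked up in pci_devices and is available as a plain dict with 'address', 'vendor_id' and 'product_id' values. The helper name, the dict shape and all example values are hypothetical, and the "vendor:product" form of pci_vendor_info is an assumption. The sketch only rebuilds the profile; it does not write it back to Neutron.

```python
def build_binding_profile(pci_device, physical_network, existing_profile=None):
    """Rebuild the PCI keys of a port binding profile from a claimed VF.

    pci_device is a hypothetical dict holding the 'address', 'vendor_id'
    and 'product_id' values read from the instance's pci_devices row.
    """
    # Start from the existing profile so keys such as "trusted" survive.
    profile = dict(existing_profile or {})
    profile["pci_slot"] = pci_device["address"]
    # Assumption: pci_vendor_info is stored as "<vendor_id>:<product_id>".
    profile["pci_vendor_info"] = "{}:{}".format(
        pci_device["vendor_id"], pci_device["product_id"]
    )
    profile["physical_network"] = physical_network
    return profile


# Example values are made up for illustration only.
claimed_vf = {"address": "0000:3b:02.1", "vendor_id": "15b3", "product_id": "1018"}
repaired = build_binding_profile(claimed_vf, "physnet1", {"trusted": "true"})
print(repaired)
```

The result should be compared against a known-good port on the same host before anything is updated by hand.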

sean mooney (sean-k-mooney) wrote :

Setting this to Incomplete as we do not have enough info to determine the source of the error.
Can you confirm whether this is hardware-offloaded OVS or standard OVS?
If it's standard OVS, do you also have the SR-IOV NIC agent installed on the same host?
Can you also confirm the versions of nova and neutron, and provide the current PCI device claims for the instance, the instance info cache entry (specifically the network info cache), and the port info returned by neutron, so we can validate the content of the port's binding:profile?

Changed in nova:
status: New → Incomplete
tags: added: network neutron pci
Tamas Erdei (terdei) wrote :

Looks like we ran into the same (or a very similar) issue. We have active direct (SR-IOV) ports where the Neutron database has an incomplete binding profile without PCI info. It just has '{"trusted": "true"}'.
The VM with the port is running on the hypervisor. We wanted to restart the nova-compute service on the node, but it does not start; it throws the KeyError: 'pci_slot' exception with the very same stack trace as in the bug description. I think this is because we do not have the pci_slot key in the port binding profile.
The question is how the binding_profile data became incomplete. It should have at least the pci_slot, pci_vendor_info and physical_network keys with proper values.
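
For illustration, a small sketch of how such a profile could be checked before restarting nova-compute. The helper name and the REQUIRED_KEYS tuple are hypothetical and simply mirror the three keys listed above; the last lines reproduce the same kind of lookup that fails in the traceback.

```python
# Keys the binding profile is expected to carry, per the comment above.
REQUIRED_KEYS = ("pci_slot", "pci_vendor_info", "physical_network")


def missing_profile_keys(profile):
    """Return the expected PCI keys that are absent from a binding profile."""
    return [key for key in REQUIRED_KEYS if key not in profile]


# The incomplete profile found in the Neutron database in this report.
broken_profile = {"trusted": "true"}
print(missing_profile_keys(broken_profile))

# A plain dict lookup on the missing key raises the KeyError from the log.
try:
    broken_profile["pci_slot"]
except KeyError as exc:
    print("KeyError:", exc)
```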

Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Erlon R. Cruz (sombrafam) wrote :

Unexpire
