SRIOV instance Error: Exception during message handling: KeyError: 'pci_slot'

Bug #1912273 reported by Satish Patel
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am running the Stein release of OpenStack with SR-IOV instances. Today a VM died with what looks like a memory issue (possibly OOM), so I tried to start it again with the following command, but it failed with an error on the compute node.

nova start <vm-name>

Error from the nova-compute logs; this has happened to multiple VMs on different compute nodes.

---------

2021-01-18 17:18:31.396 1496 INFO nova.virt.libvirt.imagecache [req-d82a76a2-edd6-41d7-a7be-7245997ac4b1 - - - - -] image c5d0e40e-3b5a-44a0-9ac3-e6e59b3c6276 at (/var/lib/nova/instances/_base/b62c49c5d32aafa9028bd2b518699eab17dd07e0): checking
2021-01-18 17:18:31.638 1496 INFO nova.virt.libvirt.imagecache [req-d82a76a2-edd6-41d7-a7be-7245997ac4b1 - - - - -] Active base files: /var/lib/nova/instances/_base/b62c49c5d32aafa9028bd2b518699eab17dd07e0
2021-01-18 17:19:04.007 1496 INFO nova.virt.libvirt.driver [-] [instance: 041ba2fe-2a0f-4918-820f-e6401ea0255b] Instance destroyed successfully.
2021-01-18 17:19:04.640 1496 INFO nova.compute.manager [req-ac9ab44c-c6bc-421c-8c6d-7a777cb4575e 63847837de444225accd1ae1db2b1f11 6297c04e9593466d9c6747874e379444 - default default] [instance: 041ba2fe-2a0f-4918-820f-e6401ea0255b] Successfully reverted task state from powering-on on failure for instance.
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server [req-ac9ab44c-c6bc-421c-8c6d-7a777cb4575e 63847837de444225accd1ae1db2b1f11 6297c04e9593466d9c6747874e379444 - default default] Exception during message handling: KeyError: 'pci_slot'
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/exception_wrapper.py", line 79, in wrapped
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server function_name, call_dict, binary, tb)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server self.force_reraise()
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, in wrapped
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/compute/manager.py", line 187, in decorated_function
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server "Error: %s", e, instance=instance)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server self.force_reraise()
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/compute/manager.py", line 157, in decorated_function
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/compute/utils.py", line 1323, in decorated_function
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/compute/manager.py", line 215, in decorated_function
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server kwargs['instance'], e, sys.exc_info())
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server self.force_reraise()
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/compute/manager.py", line 203, in decorated_function
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/compute/manager.py", line 2922, in start_instance
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server self._power_on(context, instance)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/compute/manager.py", line 2892, in _power_on
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server block_device_info)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2987, in power_on
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server self._hard_reboot(context, instance, network_info, block_device_info)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2855, in _hard_reboot
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server mdevs=mdevs)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5482, in _get_guest_xml
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server context, mdevs)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5282, in _get_guest_config
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server flavor, virt_type, self._host)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/virt/libvirt/vif.py", line 608, in get_config
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server inst_type, virt_type, host)
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/virt/libvirt/vif.py", line 353, in get_config_hw_veb
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server conf, net_type, profile['pci_slot'],
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server KeyError: 'pci_slot'
2021-01-18 17:19:04.657 1496 ERROR oslo_messaging.rpc.server

---------

Revision history for this message
Satish Patel (satish-txt) wrote :

I am reading this change: https://review.opendev.org/c/openstack/nova/+/605118/

It says "Attaching SR-IOV ports to existing instances is not supported", but this VM was working fine until it died for some reason, and it is currently not running on the compute node. I am only trying to start an existing VM. What would be the solution for this kind of scenario where you need to reboot a VM?

Revision history for this message
sean mooney (sean-k-mooney) wrote :

This looks like something cleared the contents of the Neutron port's binding:profile attribute; we store the PCI address of the device allocated to the VM in its pci_slot key.

The binding profile is admin-only by default and is owned by Nova to pass information to the network backend. If you have relaxed that policy to allow users to set things like trusted_vf in binding:profile, they could have corrupted the data and deleted the pci_slot entry.
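
For example, assuming the standard openstack CLI is available and <port-uuid> stands in for the SR-IOV port attached to the instance, you can check whether pci_slot survived; a healthy SR-IOV port normally carries pci_slot alongside keys such as pci_vendor_info and physical_network in that profile:

openstack port show <port-uuid> -c binding_profile -c binding_vnic_type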

I am not aware of any Nova bug that causes this, so I would look at the Neutron API logs for requests related to the port UUID.

If you find one that is modifying the profile, you could take the request id and see whether it exists in the Nova logs, but I suspect it will turn out to be a direct user request.
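
Something along these lines, assuming default file-based logging (paths differ per deployment, and openstack-ansible installs may log to journald instead; <port-uuid> and <req-id> are placeholders), should surface any API calls that touched the port or that request id:

grep <port-uuid> /var/log/neutron/neutron-server.log*
grep <req-id> /var/log/nova/nova-compute.log*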

https://review.opendev.org/c/openstack/nova/+/605118/ is unrelated to this. We have never supported attaching SR-IOV ports to running instances, but that is not the same behaviour you are seeing now.

That attach would have failed previously with an error on the compute node and been rolled back, since we check whether the port has a PCI claim before we try to bind it and add it to the VM.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Setting this to Incomplete, as we really need logs for the Neutron port to assess this further. This does not immediately look like a Nova bug, but rather a case of corrupted data in the Neutron port binding profile.

This could be the result of a Nova action, but we would need the instance event list to know what actions were performed on the VM before the reboot, and likely the neutron-server and nova-compute logs corresponding to those events.
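
Assuming the standard clients for a Stein deployment (which of these is available depends on the client versions installed), that history can be pulled with either of:

nova instance-action-list <instance-uuid>
openstack server event list <instance-uuid>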

Realistically I am not sure we have time to debug this upstream on your behalf, but if you can investigate further and provide a reproducer, or at least confirm if/when the pci_slot was removed from the port binding profile, we may be able to suggest where to look next.

The fix, although not an easy one, is to repopulate the pci_slot in the binding profile of the affected instance's port, based on the PCI device it has allocated in the Nova pci_devices table.
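
As a rough sketch only, not a supported workflow (verify the column names against your schema, back up the database first, and note that an SR-IOV binding profile normally also carries keys such as pci_vendor_info and physical_network that may need restoring), the allocated address can be read from the Nova cell database and written back into the port with admin credentials:

mysql nova -e "SELECT address FROM pci_devices WHERE instance_uuid = '<instance-uuid>' AND status = 'allocated';"
openstack port set --binding-profile pci_slot=<address> <port-uuid>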

We have discussed downstream, relatively recently, the idea of writing a tool to validate and heal this, but realistically we will not have time to write it this cycle. Nova also does not currently have all the information it needs to reconstruct this without some guesswork: if a VM has multiple PCI devices, we cannot necessarily correlate them all to the Neutron ports, because we do not currently record the port-to-request_id mapping anywhere.

Changed in nova:
status: New → Incomplete
tags: added: libvirt neutron pci
Revision history for this message
Satish Patel (satish-txt) wrote :

When my VM hits OOM and more or less freezes, I cannot SSH in or reach the console, and if I leave the VM in that state for a long time I have noticed OpenStack destroying the config files when I reboot it with "nova reboot <instance_name>": looking at /etc/libvirt/qemu/, the directory is empty and has no configuration file for the VM. This is a dead end; I am not able to recover the VM.

As an experiment I did the following:

When my VM hit OOM and froze, I did not use "nova reboot <instance_name>"; instead I went to the host machine, used kill -9 <pid> to kill the VM's QEMU process, and then ran "nova start <instance_name>", which worked and brought my VM back.
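
Roughly, the steps on the host, assuming libvirt/QEMU (the exact way to find the QEMU PID varies; the instance UUID usually appears on the qemu command line):

virsh list --all                  # find the libvirt domain for the instance
pgrep -af <instance-uuid>         # locate the qemu PID for that instance
kill -9 <pid>
nova start <instance_name>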

Is this a bug, or is it expected behaviour that OpenStack destroys the VM's libvirt configuration when the VM is frozen or stuck?

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired