[Error Code 42] Domain not found when hard-reset is used

Bug #1846027 reported by Orestes Leal Rodriguez
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Not entirely sure if this is a bug, but the underlying logic at least seems to get this wrong.

I have 7 compute nodes in an OpenStack cluster. This issue happens on compute nodes 1 and 5, for two VMs.

When it happens: at hard reboot. Say I have a VM that is blocked for some reason (out of memory, whatever), and I do a hard reboot. When I do that, the underlying nova code closes the iSCSI connection to the cinder storage (I verified this), then tries to restart the domain, failing with:

2019-09-30 11:54:00.366 4484 WARNING nova.virt.libvirt.driver [req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] Error from libvirt while getting description of instance-000002b1: [Error Code 42] Domain not found: no domain with matching uuid '39a02162-7e99-45b8-837c-4db0f20025af' (instance-000002b1): libvirt.libvirtError: Domain not found: no domain with matching uuid '39a02162-7e99-45b8-837c-4db0f20025af' (instance-000002b1)

Let me stop here for a moment. If at this step I go to the compute node and run virsh list --all, the instance is not there at all.
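
For anyone who wants to script that check, this is the equivalent of virsh list --all with the libvirt-python bindings (the UUID is the one from the log above):

import libvirt

# Equivalent of `virsh list --all`: list every running or defined domain.
conn = libvirt.open('qemu:///system')
names = [dom.name() for dom in conn.listAllDomains()]
print('instance-000002b1' in names)   # prints False while the reboot is stuck
conn.close()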

I also get:

{u'message': u'Volume device not found at .', u'code': 500, u'details': u'
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 202, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3512, in reboot_instance
    self._set_instance_obj_error_state(context, instance)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3486, in reboot_instance
    bad_volumes_callback=bad_volumes_callback)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2739, in reboot
    block_device_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2833, in _hard_reboot
    mdevs=mdevs)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5490, in _get_guest_xml
    context, mdevs)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5283, in _get_guest_config
    flavor, guest.os_type)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4093, in _get_guest_storage_config
    self._connect_volume(context, connection_info, instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1276, in _connect_volume
    vol_driver.connect_volume(connection_info, instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/volume/iscsi.py", line 64, in connect_volume
    device_info = self.connector.connect_volume(connection_info['data'])
  File "/usr/lib/python3/dist-packages/os_brick/utils.py", line 137, in trace_logging_wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 518, in connect_volume
    self._cleanup_connection(connection_properties, force=True)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 512, in connect_volume
    return self._connect_single_volume(connection_properties)
  File "/usr/lib/python3/dist-packages/os_brick/utils.py", line 61, in _wrapper
    return r.call(f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 587, in _connect_single_volume
    raise exception.VolumeDeviceNotFound(device='')
', u'created': u'2019-09-29T23:44:32Z'}
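
My reading of the tail of that traceback, as a rough paraphrase (this is not os_brick's real code, and the helper is hypothetical): the iSCSI connector retries the device scan via the retrying library, and when the block device never shows up it raises VolumeDeviceNotFound with an empty device name, which is where the odd "Volume device not found at ." message comes from.

from retrying import retry

class VolumeDeviceNotFound(Exception):
    # stand-in for os_brick.exception.VolumeDeviceNotFound
    def __init__(self, device):
        super().__init__("Volume device not found at %s." % device)

def find_iscsi_device(connection_properties):
    # hypothetical helper: rescan the session and look for the device node
    return None  # simulate the device never appearing

@retry(stop_max_attempt_number=3, wait_fixed=2000)
def connect_single_volume(connection_properties):
    device = find_iscsi_device(connection_properties)
    if device is None:
        # empty device name -> the "not found at ." wording in the message
        raise VolumeDeviceNotFound(device='')
    return device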

And on the nova compute logs I see:

2019-09-30 14:15:21.388 4484 WARNING nova.compute.manager [req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] While synchronizing instance power states, found 33 instances in the database and 34 instances on the hypervisor.
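
As far as I can tell, that warning comes from nova's periodic power-state sync task; the check behind it is roughly this (my paraphrase, not nova's exact code):

import logging

LOG = logging.getLogger(__name__)

def check_power_state_sync(db_instance_count, hypervisor_domain_count):
    # db_instance_count: instances the nova DB places on this host (33 above)
    # hypervisor_domain_count: domains libvirt actually reports (34 above)
    if db_instance_count != hypervisor_domain_count:
        LOG.warning("While synchronizing instance power states, found "
                    "%d instances in the database and %d instances on "
                    "the hypervisor.", db_instance_count,
                    hypervisor_domain_count)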

Something is not synchronizing correctly, and I believe that is why everything else is failing.

My workaround:

When this happens, OpenStack sets the vm_state to ERROR. I change the state to active and then stop the instance. Then I detach the volume (cinder, iSCSI-based), start the VM, shut it down, attach the volume again, and start the VM. This fixes it, but if my user does a hard reset again, it will happen again. The sequence is scriptable; see the sketch below.
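
Here is an untested sketch of the same workaround with openstacksdk; the cloud name, server ID, and volume ID are placeholders:

import openstack

conn = openstack.connect(cloud='mycloud')          # entry in clouds.yaml (assumption)
server = conn.compute.get_server('SERVER_ID')
volume = conn.block_storage.get_volume('VOLUME_ID')

conn.compute.reset_server_state(server, 'active')  # ERROR -> ACTIVE
conn.compute.stop_server(server)
conn.compute.wait_for_server(server, status='SHUTOFF')

conn.detach_volume(server, volume, wait=True)      # cloud-layer helper
conn.compute.start_server(server)
conn.compute.wait_for_server(server, status='ACTIVE')
conn.compute.stop_server(server)
conn.compute.wait_for_server(server, status='SHUTOFF')

conn.attach_volume(server, volume, wait=True)
conn.compute.start_server(server)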

Let me know if you need more information; I would be eager to provide it.

Matt Riedemann (mriedem)
tags: added: libvirt
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

When we hard reset, libvirt redefines the domain, so that's probably why you don't see it.
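
To illustrate with libvirt-python (not nova's actual code): between undefine() and defineXML() the domain simply does not exist, which is the window where virsh list --all comes back empty for that instance.

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-000002b1')
xml = dom.XMLDesc(0)   # stand-in for the guest XML nova regenerates
dom.destroy()          # hard power-off
dom.undefine()         # domain now absent from `virsh list --all`
# nova reconnects every volume while rebuilding the guest config; if
# connect_volume() raises there, the two calls below never happen
new_dom = conn.defineXML(xml)
new_dom.create()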

Could you please give us more logs from the compute node when rebooting the instance?

tags: added: volumes
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Punting the bug to Incomplete status for triage reasons, but please put it back to 'New' once you reply.

Changed in nova:
status: New → Incomplete
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

This bug actually looks related: https://bugs.launchpad.net/nova/+bug/1738297
I still need to understand what happens on your node, hence my request for logs.

Also, could you please tell us which Nova release you are using? This is important, as the bug could have been fixed in a later release.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired