[Error Code 42] Domain not found when hard-reset is used

Bug #1846027 reported by Orestes Leal Rodriguez
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Not entirely sure if this is a bug, but the underlying logic at least seems to get this wrong.

I have 7 compute nodes in an OpenStack cluster. This issue happens on compute nodes 1 and 5, for two VMs.

When it happens: at hard reboot. Say I have a VM that is blocked for some reason (out of memory, whatever), and I do a hard reboot. When I do that, the underlying nova code closes the iSCSI connection to the cinder storage (I verified this), then tries to restart the domain, failing with:

2019-09-30 11:54:00.366 4484 WARNING nova.virt.libvirt.driver [req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] Error from libvirt while getting description of instance-000002b1: [Error Code 42] Domain not found: no domain with matching uuid '39a02162-7e99-45b8-837c-4db0f20025af' (instance-000002b1): libvirt.libvirtError: Domain not found: no domain with matching uuid '39a02162-7e99-45b8-837c-4db0f20025af' (instance-000002b1)

Let me stop here for a moment. If at this step I go to the compute node and run virsh list --all, the instance is not there at all.
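
For anyone who wants to script that check, this is the equivalent of virsh list --all with the libvirt-python bindings (the UUID is the one from the log above):

import libvirt

# Equivalent of `virsh list --all`: list every running or defined domain.
conn = libvirt.open('qemu:///system')
names = [dom.name() for dom in conn.listAllDomains()]
print('instance-000002b1' in names)   # prints False while the reboot is stuck
conn.close()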

I also get:

{u'message': u'Volume device not found at .', u'code': 500, u'details': u'
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 202, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3512, in reboot_instance
    self._set_instance_obj_error_state(context, instance)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3486, in reboot_instance
    bad_volumes_callback=bad_volumes_callback)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2739, in reboot
    block_device_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2833, in _hard_reboot
    mdevs=mdevs)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5490, in _get_guest_xml
    context, mdevs)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5283, in _get_guest_config
    flavor, guest.os_type)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4093, in _get_guest_storage_config
    self._connect_volume(context, connection_info, instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1276, in _connect_volume
    vol_driver.connect_volume(connection_info, instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/volume/iscsi.py", line 64, in connect_volume
    device_info = self.connector.connect_volume(connection_info['data'])
  File "/usr/lib/python3/dist-packages/os_brick/utils.py", line 137, in trace_logging_wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 518, in connect_volume
    self._cleanup_connection(connection_properties, force=True)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 512, in connect_volume
    return self._connect_single_volume(connection_properties)
  File "/usr/lib/python3/dist-packages/os_brick/utils.py", line 61, in _wrapper
    return r.call(f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 587, in _connect_single_volume
    raise exception.VolumeDeviceNotFound(device='')
', u'created': u'2019-09-29T23:44:32Z'}
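
My reading of the tail of that traceback, as a rough paraphrase (this is not os_brick's real code, and the helper is hypothetical): the iSCSI connector retries the device scan via the retrying library, and when the block device never shows up it raises VolumeDeviceNotFound with an empty device name, which is where the odd "Volume device not found at ." message comes from.

from retrying import retry

class VolumeDeviceNotFound(Exception):
    # stand-in for os_brick.exception.VolumeDeviceNotFound
    def __init__(self, device):
        super().__init__("Volume device not found at %s." % device)

def find_iscsi_device(connection_properties):
    # hypothetical helper: rescan the session and look for the device node
    return None  # simulate the device never appearing

@retry(stop_max_attempt_number=3, wait_fixed=2000)
def connect_single_volume(connection_properties):
    device = find_iscsi_device(connection_properties)
    if device is None:
        # empty device name -> the "not found at ." wording in the message
        raise VolumeDeviceNotFound(device='')
    return device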

And on the nova compute logs I see:

2019-09-30 14:15:21.388 4484 WARNING nova.compute.manager [req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] While synchronizing instance power states, found 33 instances in the database and 34 instances on the hypervisor.
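
As far as I can tell, that warning comes from nova's periodic power-state sync task; the check behind it is roughly this (my paraphrase, not nova's exact code):

import logging

LOG = logging.getLogger(__name__)

def check_power_state_sync(db_instance_count, hypervisor_domain_count):
    # db_instance_count: instances the nova DB places on this host (33 above)
    # hypervisor_domain_count: domains libvirt actually reports (34 above)
    if db_instance_count != hypervisor_domain_count:
        LOG.warning("While synchronizing instance power states, found "
                    "%d instances in the database and %d instances on "
                    "the hypervisor.", db_instance_count,
                    hypervisor_domain_count)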

Something is not synchronizing correctly, and I believe that is why everything else is failing.

My workaround:

When this happens, OpenStack sets the vm_state to ERROR. I change the state to active and then stop the instance. Then I detach the volume (cinder, iSCSI-based), start the VM, shut it down, attach the volume again, and start the VM. This fixes it, but if my user does a hard reset again, it will happen again. The sequence is scriptable; see the sketch below.
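
Here is an untested sketch of the same workaround with openstacksdk; the cloud name, server ID, and volume ID are placeholders:

import openstack

conn = openstack.connect(cloud='mycloud')          # entry in clouds.yaml (assumption)
server = conn.compute.get_server('SERVER_ID')
volume = conn.block_storage.get_volume('VOLUME_ID')

conn.compute.reset_server_state(server, 'active')  # ERROR -> ACTIVE
conn.compute.stop_server(server)
conn.compute.wait_for_server(server, status='SHUTOFF')

conn.detach_volume(server, volume, wait=True)      # cloud-layer helper
conn.compute.start_server(server)
conn.compute.wait_for_server(server, status='ACTIVE')
conn.compute.stop_server(server)
conn.compute.wait_for_server(server, status='SHUTOFF')

conn.attach_volume(server, volume, wait=True)
conn.compute.start_server(server)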

Let me know if you need more information; I would be eager to provide it.

Matt Riedemann (mriedem)
tags: added: libvirt
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

When we hard reset, libvirt redefines the domain, so that's probably why you don't see it.
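
To illustrate with libvirt-python (not nova's actual code): between undefine() and defineXML() the domain simply does not exist, which is the window where virsh list --all comes back empty for that instance.

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-000002b1')
xml = dom.XMLDesc(0)   # stand-in for the guest XML nova regenerates
dom.destroy()          # hard power-off
dom.undefine()         # domain now absent from `virsh list --all`
# nova reconnects every volume while rebuilding the guest config; if
# connect_volume() raises there, the two calls below never happen
new_dom = conn.defineXML(xml)
new_dom.create()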

Could you please give us more logs from the compute node when rebooting the instance?

tags: added: volumes
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Punting the bug to Incomplete status for triage reasons, but please put it back to 'New' once you reply.

Changed in nova:
status: New → Incomplete
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

This bug actually looks related: https://bugs.launchpad.net/nova/+bug/1738297
I still need to understand what happens on your node, hence my request for logs.

Also, could you please tell us which Nova release you are using? This is important, as the bug could have been fixed in a later release.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired