Cannot hard reboot a libvirt instance in error state (mdev query fails)

Bug #1764460 reported by Scott Yoder

Affects                   Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released  High        Sylvain Bauza
Queens                    Fix Released  High        Lee Yarwood

Bug Description

Nova version: stable/queens fda768b304e05821f7479f9698c59d18bf3d3516
Hypervisor: Libvirt + KVM

If an instance no longer exists in libvirt (failed live migration, compute container rebuilt, etc.), a hard reboot or start can no longer recreate it. We see this problem happen occasionally for various reasons, and in the past a hard reboot would revive the instance.

A recent commit ("libvirt: pass the mdevs when rebooting the guest") is responsible.

_get_all_assigned_mediated_devices() throws an InstanceNotFound exception when trying to start such an instance.

Adding an instance_exists() check solves the issue:

--- driver.py.orig 2018-04-16 16:11:42.865555972 +0000
+++ driver.py 2018-04-16 16:11:55.901773724 +0000
@@ -5966,6 +5966,8 @@
         """
         allocated_mdevs = {}
         if instance:
+            if not self.instance_exists(instance):
+                return {}
             guest = self._host.get_guest(instance)
             guests = [guest]
         else:

Steps to recreate:
1. Stop an instance
2. Delete the instance-XXXXXXX.xml file from /etc/libvirt/qemu/ (or undefine the domain through libvirt, as sketched after this list)
3. Start the instance
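
A minimal sketch of step 2 using the libvirt Python bindings instead of deleting the XML file by hand (assuming the libvirt-python package is installed; the domain name below is a hypothetical instance-XXXXXXX example):

import libvirt

conn = libvirt.open('qemu:///system')         # local QEMU/KVM hypervisor
dom = conn.lookupByName('instance-0000002a')  # hypothetical instance-XXXXXXX domain name
dom.undefine()                                # drop the persistent definition, same effect as removing the XML
conn.close()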

Expected result: instance running
Actual result: InstanceNotFound error from nova-compute

Logs:
2018-04-16 15:41:09.756 2030272 INFO nova.compute.manager [req-ce2e1036-ab7b-4a98-b343-6ab748326963 32bab887a38f4b6cbcaf83297d4b7812 29e87d21ad14403bb789543e8bc0dab7 - default default] [instance: 0130afdf-f5aa-4ec9-8d0a-71080c70f276] Successfully reverted task state from powering-on on failure for instance.
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server [req-ce2e1036-ab7b-4a98-b343-6ab748326963 32bab887a38f4b6cbcaf83297d4b7812 29e87d21ad14403bb789543e8bc0dab7 - default default] Exception during message handling: InstanceNotFound: Instance 0130afdf-f5aa-4ec9-8d0a-71080c70f276 could not be found.
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/exception_wrapper.py", line 76, in wrapped
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server self.force_reraise()
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/exception_wrapper.py", line 67, in wrapped
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/manager.py", line 186, in decorated_function
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server "Error: %s", e, instance=instance)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server self.force_reraise()
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/manager.py", line 156, in decorated_function
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/utils.py", line 976, in decorated_function
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/manager.py", line 202, in decorated_function
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/manager.py", line 2665, in start_instance
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server self._power_on(context, instance)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/manager.py", line 2635, in _power_on
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server block_device_info)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2908, in power_on
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server self._hard_reboot(context, instance, network_info, block_device_info)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2745, in _hard_reboot
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server mdevs = self._get_all_assigned_mediated_devices(instance)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5969, in _get_all_assigned_mediated_devices
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server guest = self._host.get_guest(instance)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 526, in get_guest
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server return libvirt_guest.Guest(self._get_domain(instance))
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 546, in _get_domain
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server raise exception.InstanceNotFound(instance_id=instance.uuid)
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.rpc.server InstanceNotFound: Instance 0130afdf-f5aa-4ec9-8d0a-71080c70f276 could not be found.

Matt Riedemann (mriedem)
tags: added: libvirt queens-backport-potential
Changed in nova:
status: New → Confirmed
importance: Undecided → High
summary: - Cannot hard reboot an instance in error state
+ Cannot hard reboot a libvirt instance in error state (mdev query fails)
tags: added: vgpu
Changed in nova:
assignee: nobody → Sylvain Bauza (sylvain-bauza)
Revision history for this message
Peter Holden (k-puter-x) wrote :

Confirmed. This is a pretty big bug that occurs quite often in large OpenStack setups.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/564257

Changed in nova:
status: Confirmed → In Progress
Changed in nova:
assignee: Sylvain Bauza (sylvain-bauza) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Sylvain Bauza (sylvain-bauza)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/564454

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/564454
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a5aa8e2eb8d5a90780211ef67b8f2dcc76af64de
Submitter: Zuul
Branch: stable/queens

commit a5aa8e2eb8d5a90780211ef67b8f2dcc76af64de
Author: Sylvain Bauza <email address hidden>
Date: Wed Apr 25 17:28:08 2018 +0200

    libvirt: fix hard reboot issue with mdevs

    Sometimes, a Nova instance may not have an associated libvirt guest,
    like if a live migration fails and the hypervisor on the source node
    is in an inconsistent state.
    The usual way to fix that is to hard reboot the instance, so the
    virt driver recreates the guest.
    Unfortunately, we introduced a regression in Queens with
    Ie6c9108808a461359d717d8a9e9399c8a407bfe9 because we weren't
    checking whether the guest was there or not.

    This fix aims to catch the InstanceNotFound exception that is
    raised and simply report that there are no mediated devices attached
    to that instance (given we can no longer know them).

    Change-Id: Iff30e59ced63a94c5c3377317a2e2a502f96eb91
    Closes-Bug: #1764460
    (cherry picked from commit 8941c79f0924413822b15824a298e081e345e616)
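
To illustrate the approach described in the commit message, a minimal self-contained sketch follows; the class and function names are stand-ins for the real nova and libvirt objects rather than the upstream code, and a guest that is missing from libvirt simply yields an empty mdev mapping:

class InstanceNotFound(Exception):
    """Stand-in for nova.exception.InstanceNotFound."""

class MissingGuestHost(object):
    """Stand-in for nova.virt.libvirt.host.Host when the domain is gone."""
    def get_guest(self, instance):
        raise InstanceNotFound(instance)

def get_assigned_mdevs(host, instance):
    """Return {mdev_uuid: instance_uuid}, or {} when the libvirt domain
    no longer exists (e.g. after a failed live migration)."""
    try:
        guest = host.get_guest(instance)
    except InstanceNotFound:
        # The domain is gone from libvirt, so there is nothing to query;
        # report "no mediated devices" and let hard reboot recreate the guest.
        return {}
    # Hypothetical attribute for this sketch; the real driver parses the
    # guest XML for mdev hostdev entries instead.
    return {mdev: instance for mdev in getattr(guest, 'mdev_uuids', [])}

print(get_assigned_mdevs(MissingGuestHost(), 'instance-uuid'))  # prints {}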

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.4

This issue was fixed in the openstack/nova 17.0.4 release.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

For some reason, the git hook didn't close the LP bug, but the upstream change is merged: https://review.openstack.org/#/c/564257/

Changed in nova:
status: In Progress → Invalid
status: Invalid → Fix Committed
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b2

This issue was fixed in the openstack/nova 18.0.0.0b2 development milestone.
