Cannot hard reboot a libvirt instance in error state (mdev query fails)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | Sylvain Bauza |
Queens | Fix Released | High | Lee Yarwood |
Bug Description
Nova version: stable/queens fda768b304e0582
Hypervisor: Libvirt + KVM
If an instance doesn't exist in libvirt (failed live migration, compute container rebuilt, etc.), a hard reboot or start can no longer recreate it. We see this happen occasionally for various reasons, and in the past a hard reboot would revive the instance.
A recent commit is responsible (libvirt: pass the mdevs when rebooting the guest).
_get_all_
Adding an instance_exists() check solves the issue.
--- driver.py.orig 2018-04-16 16:11:42.865555972 +0000
+++ driver.py 2018-04-16 16:11:55.901773724 +0000
@@ -5966,6 +5966,8 @@
"""
if instance:
+ if not self.instance_
+ return {}
guest = self._host.
guests = [guest]
else:
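The guard in the patch above can be illustrated with a small, self-contained sketch. This is not nova's code: `FakeHost`, `FakeDriver`, `InstanceNotFound` and `get_assigned_mdevs` are hypothetical stand-ins invented for illustration, and only the idea of returning an empty dict when `instance_exists()` is false comes from the patch.

```python
class InstanceNotFound(Exception):
    """Stand-in for the error raised when the hypervisor has no such guest."""


class FakeHost:
    """Hypothetical in-memory substitute for the hypervisor connection."""

    def __init__(self, guests):
        # Maps instance name -> list of mdev UUIDs attached to that guest.
        self._guests = dict(guests)

    def get_guest(self, name):
        try:
            return self._guests[name]
        except KeyError:
            raise InstanceNotFound(name)

    def list_guests(self):
        return list(self._guests)


class FakeDriver:
    """Hypothetical driver showing the existence guard before the mdev query."""

    def __init__(self, host):
        self._host = host

    def instance_exists(self, name):
        try:
            self._host.get_guest(name)
            return True
        except InstanceNotFound:
            return False

    def get_assigned_mdevs(self, name=None):
        """Return {mdev_uuid: instance_name}; empty if the guest is undefined."""
        if name is not None:
            # The guard: a guest missing from the hypervisor has no mdevs,
            # so report none instead of letting the lookup raise.
            if not self.instance_exists(name):
                return {}
            names = [name]
        else:
            names = self._host.list_guests()
        return {uuid: n for n in names for uuid in self._host.get_guest(n)}


driver = FakeDriver(FakeHost({"instance-a": ["mdev-1"]}))
print(driver.get_assigned_mdevs("instance-a"))  # {'mdev-1': 'instance-a'}
print(driver.get_assigned_mdevs("missing"))     # {}
```

Without the guard, the second call would raise `InstanceNotFound`, which corresponds to the failure described in this report: the reboot path aborts before it gets a chance to recreate the guest.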
Steps to recreate:
1. Stop an instance
2. Delete the instance-
3. Start the instance
Expected result: instance running
Actual result: error: instanceNotFound from nova-compute
Logs:
2018-04-16 15:41:09.756 2030272 INFO nova.compute.
2018-04-16 15:41:09.790 2030272 ERROR oslo_messaging.
tags: | added: libvirt queens-backport-potential |
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → High |
summary: |
- Cannot hard reboot an instance in error state
+ Cannot hard reboot a libvirt instance in error state (mdev query fails) |
tags: | added: vgpu |
Changed in nova: | |
assignee: | nobody → Sylvain Bauza (sylvain-bauza) |
Changed in nova: | |
assignee: | Sylvain Bauza (sylvain-bauza) → Matt Riedemann (mriedem) |
Changed in nova: | |
assignee: | Matt Riedemann (mriedem) → Sylvain Bauza (sylvain-bauza) |
Confirmed. This is a pretty big bug that occurs quite often in large OpenStack setups.