instance.host not updated on evacuation
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | Undecided | Artom Lifshitz | |
| Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned | |
| Mitaka | Fix Released | Undecided | Unassigned | |
| nova-powervm | Fix Released | Undecided | Drew Thorstensen | |
| nova (Ubuntu) | Fix Released | Undecided | Unassigned | |
| Xenial | Fix Released | Undecided | Seyeong Kim | |
| Zesty | Fix Released | Undecided | Unassigned | |
| Artful | Fix Released | Undecided | Unassigned | |
Bug Description
[Impact]
I created several VM instances and verified they were all in the ACTIVE state. Immediately afterwards, I shut down nova-compute on their host (to simulate a host failure for this test). I then tried to evacuate the instances to the other host, but the evacuation failed and left them in the ERROR state.
After some testing and analysis, I found that the two commits below are related (please refer to the [Others] section). In this context, migration_context is a DB field used to pass information during a migration or evacuation.
For [1]: this patch reads the destination host info from migration_context, so if migration_context is malformed or empty, the migration fails. With only this patch applied, migration_context is still empty, which is why [2] is also needed. I dropped the self.client.prepare change in rpcapi.py from the original patch, since it was replaced in a newer version and relates to newer functionality; I kept Mitaka's existing function call for this issue.
For [2]: this patch moves the recreation check into the earlier if condition, and calls rebuild_claim to create a migration_context when recreating (evacuating), not only in the scheduled state. I adjusted the test code that surfaced during the backport process and appeared to be needed; anyone backporting or cherry-picking related code will find it already exists.
As the tests showed, applying only one of these patches does not fix the issue.
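To make the routing problem concrete, here is a hypothetical sketch (not actual nova code; all class and attribute names are illustrative stand-ins) of the idea behind the two patches: while a migration_context exists, external events should target the destination host it records rather than the stale instance.host.

```python
# Illustrative stand-ins for nova's objects (NOT real nova code).
class MigrationContext:
    def __init__(self, source_compute, dest_compute):
        self.source_compute = source_compute
        self.dest_compute = dest_compute

class Instance:
    def __init__(self, host, migration_context=None):
        self.host = host                      # still the (dead) source mid-evacuation
        self.migration_context = migration_context

def event_target_host(instance):
    """Pick the host that should receive external events."""
    ctx = instance.migration_context
    if ctx is not None:
        # Mid-evacuation: instance.host is stale, route to the destination.
        return ctx.dest_compute
    return instance.host

# During an evacuation from host-a (down) to host-b:
inst = Instance(host="host-a",
                migration_context=MigrationContext("host-a", "host-b"))
print(event_target_host(inst))  # -> host-b
```

This is why both patches are needed together: [2] ensures the migration_context actually exists during a recreate, and [1] ensures it is consulted when dispatching events.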
[Test case]
In the environment below:
http://
The network configuration is important in this case: I also tested a different configuration and could not reproduce the issue with it.
Reproduction test script (based on juju):
http://
[Regression Potential]
Existing ACTIVE instances and newly created instances are not affected by this change, because the patched code paths are only exercised during a migration or evacuation. If one host has both ACTIVE instances and instances left in the ERROR state by this issue, upgrading to a package containing this fix will not affect any existing instances. After upgrading and evacuating a problematic instance again, its state should change from ERROR back to ACTIVE. I tested this scenario in a simple environment; the possibility of regressions in a complex, crowded environment still needs to be considered.
[Others]
For the test, I had to apply two commits, one from
https:/
Related patches:
[1] https:/
[2] https:/
- https:/
original)
[Original description]
I'm working on the nova-powervm driver for Mitaka and trying to add support for evacuation.
The problem I'm hitting is that instance.host is not updated when the compute driver is called to spawn the instance on the destination host. It is still set to the source host. It's not until after the spawn completes that the compute manager updates instance.host to reflect the destination host.
The nova-powervm driver uses the instance events callback mechanism during VIF plugging to determine when Neutron has finished provisioning the network. The instance events code sends the event to instance.host, and is therefore sending it to the source host (which is down). This causes the spawn to fail and also causes weirdness when the source host receives the events after it is powered back up.
To temporarily work around the problem, I hacked in setting instance.host = CONF.host; instance.save() in the compute driver, but that's not a good solution.
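The hack described above can be sketched as follows (illustrative stand-ins for nova's CONF object and Instance record; not real driver code). Forcing instance.host to the destination makes event routing work, but it mutates the instance record before the spawn has actually succeeded, which is why it is not a good solution.

```python
# Stand-in for nova's oslo.config CONF object (assumption, not real code).
class FakeConf:
    host = "dest-host"

CONF = FakeConf()

# Stand-in for a nova Instance object with a save() that persists to the DB.
class FakeInstance:
    def __init__(self, host):
        self.host = host
        self.saved = False

    def save(self):
        self.saved = True  # would write to the nova DB in reality

def hacked_spawn(instance):
    # The hack: claim the instance for this host early, so external
    # events are routed here instead of to the dead source host.
    instance.host = CONF.host
    instance.save()
    # ... proceed with the real spawn / VIF plugging ...

inst = FakeInstance(host="source-host")
hacked_spawn(inst)
print(inst.host)  # -> dest-host
```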
Changed in nova: | |
status: | New → Confirmed |
tags: | added: libvirt |
tags: | added: compute |
Changed in nova: | |
assignee: | nobody → Wen Zhi Yu (yuywz) |
status: | Confirmed → In Progress |
affects: | nova → nova-powervm |
Changed in nova-powervm: | |
assignee: | nobody → Drew Thorstensen (thorst) |
Changed in nova: | |
assignee: | nobody → Taylor Peoples (tpeoples) |
Changed in nova: | |
assignee: | Taylor Peoples (tpeoples) → nobody |
Changed in nova: | |
assignee: | nobody → Sridhar Venkat (svenkat) |
Changed in nova: | |
assignee: | Sridhar Venkat (svenkat) → Artom Lifshitz (notartom) |
description: | updated |
Changed in cloud-archive: | |
status: | New → Fix Released |
Changed in nova (Ubuntu Xenial): | |
assignee: | nobody → Seyeong Kim (xtrusia) |
Changed in nova (Ubuntu Artful): | |
status: | New → Fix Released |
Changed in nova (Ubuntu Zesty): | |
status: | New → Fix Released |
Changed in nova (Ubuntu Xenial): | |
status: | New → In Progress |
To point out the issue a little more.
The compute manager's virtapi allows the compute driver to wait for external events via the wait_for_instance_event() method. The common use case is for a compute driver to wait for the VIFs to be plugged by Neutron before proceeding through the spawn. The pattern is also present in the libvirt driver: see libvirt's driver.py -> _create_domain_and_network(), where the wait_for_instance_event context manager is used.
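A minimal sketch of that wait-for-event pattern, assuming a fake virtapi (FakeVirtAPI and deliver() are my stand-ins; the real nova implementation differs): the driver arms the expected events before doing the work, then blocks until they arrive or a deadline passes.

```python
import contextlib
import threading

class FakeVirtAPI:
    """Toy model of nova's virtapi event plumbing (assumption, not nova code)."""

    def __init__(self):
        self._latches = {}

    @contextlib.contextmanager
    def wait_for_instance_event(self, instance, event_names, deadline=5.0):
        # Arm one latch per expected event *before* the body runs, so an
        # event arriving mid-spawn is not lost.
        latches = {name: threading.Event() for name in event_names}
        self._latches[instance] = latches
        yield
        for name, latch in latches.items():
            if not latch.wait(deadline):
                raise TimeoutError("never received event %s" % name)

    def deliver(self, instance, event_name):
        # In nova this arrives over RPC, routed via instance.host --
        # which is exactly where the evacuate case goes wrong.
        self._latches[instance][event_name].set()

virtapi = FakeVirtAPI()
with virtapi.wait_for_instance_event("vm-1", ["network-vif-plugged"]):
    # The driver plugs VIFs here; Neutron then reports completion.
    virtapi.deliver("vm-1", "network-vif-plugged")
print("spawn proceeded")  # -> spawn proceeded
```

If the event is routed to a host that is down, the latch is never set and the wait times out, which matches the failure mode described in this bug.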
The flow for the events to come into Nova is through nova/api/openstack/compute/server_external_events.py, which eventually calls compute_api.external_instance_event() to dispatch the events. In external_instance_event() you'll see it uses instance.host to call compute_rpcapi.external_instance_event(), so the RPC message goes to whatever host is currently set. In the case of evacuate, at that point in time (while the new host is spawning the recreated VM) it is still set to the original host, which is down. So the compute driver that initiated the action and is waiting for the event will never get it.
The question was raised as to why libvirt doesn't suffer the same fate. I can't answer that authoritatively, but libvirt has a lot of conditions that have to be met before it will wait for the event. Here is what it currently checks before waiting for a VIF-plug event:
```python
timeout = CONF.vif_plugging_timeout
if (self._conn_supports_start_paused and
        utils.is_neutron() and not
        vifs_already_plugged and power_on and timeout):
    events = self._get_neutron_events(network_info)
else:
    events = []
```
But it does seem (from reading the code) that if all those conditions are met and the operation is an evacuate, it too would fail, though I have not tried it.
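As an illustration, the gating above reduces to a single predicate (a sketch; the function name and parameter names are mine, not libvirt driver code). In many deployments at least one condition is false, so the libvirt driver never blocks waiting for the event and thus never hits this bug.

```python
def should_wait_for_vif_events(conn_supports_start_paused,
                               using_neutron,
                               vifs_already_plugged,
                               power_on,
                               vif_plugging_timeout):
    """Mirror of the libvirt gating shown above: wait only if ALL hold."""
    return bool(conn_supports_start_paused and
                using_neutron and
                not vifs_already_plugged and
                power_on and
                vif_plugging_timeout)

# Example: timeout disabled (set to 0) -> libvirt skips the wait entirely.
print(should_wait_for_vif_events(True, True, False, True, 0))    # -> False
print(should_wait_for_vif_events(True, True, False, True, 300))  # -> True
```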