instance.host not updated on evacuation

Bug #1535918 reported by Kyle L. Henderson on 2016-01-19
This bug affects 6 people
Affects (importance / assigned to):
 - OpenStack Compute (nova): Undecided / Artom Lifshitz
 - Ubuntu Cloud Archive: Undecided / Unassigned
   - Mitaka: Undecided / Unassigned
 - nova-powervm: Undecided / Drew Thorstensen
 - nova (Ubuntu): status tracked in Artful
   - Xenial: Undecided / Seyeong Kim
   - Zesty: Undecided / Unassigned
   - Artful: Undecided / Unassigned

Bug Description

[Impact]

I created several VM instances and confirmed they were all in the ACTIVE state.
Right after checking them, I shut down nova-compute on their host (to simulate a failure in this case).
I then tried to evacuate them to the other host, but the evacuation failed and left the instances in the ERROR state.
After some testing and analysis, I found that the two commits below are related (please refer to the [Others] section).
In this context, migration_context is a DB field used to pass information during a migration or evacuation.

For [1]: this gets the host info from the migration_context. If the migration_context is missing or empty, the migration fails. With only this patch applied, the migration_context is still empty, so [2] is also needed. In the backport I adjusted the self.client.prepare part in rpcapi.py relative to the original patch, which targets a newer RPC version; because that is tied to newer functionality, I kept Mitaka's function call for this issue.

For [2]: this moves the recreation check into the earlier if condition and calls rebuild_claim to create the migration_context when the migration is in the recreate state, not only when it is scheduled. I also adjusted the test code that surfaced during the backport and appeared to be needed. Anyone wanting to backport or cherry-pick related code will find it is already present.
As the tests showed, neither patch alone fixes this issue.
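The interplay between the two patches can be sketched as a toy model (all names below are illustrative, not nova's actual API): patch [1] makes event dispatch consult the migration record, and patch [2] makes the evacuation path actually create that record, which is why neither helps alone.

```python
def target_hosts(instance, migration):
    """Patch [1] (sketch): send events to every relevant host, not just instance.host."""
    hosts = {instance["host"]}  # during a rebuild this is still the down source host
    if migration is not None:
        hosts.add(migration["source_compute"])
        hosts.add(migration["dest_compute"])
    return hosts

def evacuate_event_targets(instance, rebuild_claim_creates_record):
    """Patch [2] (sketch): the rebuild claim must create the migration record."""
    migration = None
    if rebuild_claim_creates_record:
        migration = {"source_compute": instance["host"], "dest_compute": "hostB"}
    return target_hosts(instance, migration)

inst = {"host": "hostA"}  # hostA is down
# With only patch [1], no migration record exists: events still go only to hostA.
assert evacuate_event_targets(inst, rebuild_claim_creates_record=False) == {"hostA"}
# With both patches, the destination host also receives the event.
assert evacuate_event_targets(inst, rebuild_claim_creates_record=True) == {"hostA", "hostB"}
```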

[Test case]

In below env,

http://pastebin.ubuntu.com/25337153/

The network configuration is important in this case; with a different configuration I could not reproduce the issue.
Reproduction test script (based on Juju):

http://pastebin.ubuntu.com/25360805/

[Regression Potential]

Existing ACTIVE instances and newly created instances are not affected by this change, because the patched code paths are only exercised during a migration or evacuation. If a host has both ACTIVE instances and instances left in the ERROR state by this issue, upgrading to a package with this fix will not affect any existing instances. After upgrading and retrying the evacuation of a problematic instance, its state should change from ERROR to ACTIVE. I tested this scenario in a simple environment; the possibility of regressions in a complex, crowded environment still needs to be considered.

[Others]

For testing, I had to apply two commits, one of them from
https://bugs.launchpad.net/nova/+bug/1686041

Related Patches.
[1] https://github.com/openstack/nova/commit/a5b920a197c70d2ae08a1e1335d979857f923b4f
[2] https://github.com/openstack/nova/commit/0f2d87416eff1e96c0fbf0f4b08bf6b6b22246d5 (backported to Newton from the original below)
 - https://github.com/openstack/nova/commit/a2b0824aca5cb4a2ae579f625327c51ed0414d35 (original)

[Original description]

I'm working on the nova-powervm driver for Mitaka and trying to add support for evacuation.

The problem I'm hitting is that instance.host is not updated when the compute driver is called to spawn the instance on the destination host. It is still set to the source host. It's not until after the spawn completes that the compute manager updates instance.host to reflect the destination host.

The nova-powervm driver uses instance events callback mechanism during plug VIF to determine when Neutron has finished provisioning the network. The instance events code sends the event to instance.host and hence is sending the event to the source host (which is down). This causes the spawn to fail and also causes weirdness when the source host gets the events when it's powered back up.

To temporarily work around the problem, I hacked in setting instance.host = CONF.host; instance.save() in the compute driver but that's not a good solution.

Kyle L. Henderson (kyleh) wrote :

To point out the issue a little more.

The compute manager's virtapi allows the compute driver to wait for external events via wait_for_instance_event() method. The common use case is for a compute driver to wait for the vifs to be plugged by neutron before proceeding through the spawn. The pattern is also present in the libvirt driver. See libvirt driver.py -> _create_domain_and_network(). In there you'll see the use of the wait_for_instance_event context manager.

The flow for the events to come into Nova is through nova/api/openstack/compute/server_external_events.py. which eventually calls compute_api.external_instance_event() to dispatch the events. In external_instance_event() you'll see it's using instance.host to call compute_rpcapi.external_instance_event(). So the RPC message will go to whatever host is currently set. In the case of evacuate, at that point in time (while the new host is spawning the recreated VM) it's set to the original host. Which is down. So the compute driver that initiated the action and is waiting for the event will never get it.
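The host selection described above can be reduced to a toy sketch (hypothetical names; the real logic lives in compute_api.external_instance_event): the event goes only to whatever host instance.host names, which during an evacuation is the down source host.

```python
def dispatch_external_event(instance, reachable_hosts, event):
    """Return the host that received the event, or None if it was lost (sketch)."""
    target = instance["host"]  # the only host consulted before the fix
    return target if target in reachable_hosts else None

inst = {"host": "hostA"}   # source host, currently down
up = {"hostB"}             # destination host performing the rebuild
# The waiting compute driver on hostB never sees the event and times out.
assert dispatch_external_event(inst, up, "network-vif-plugged") is None
```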

The question was raised why libvirt doesn't suffer the same fate. I can't answer that authoritatively, but libvirt has a lot of conditions that have to be met before it'll wait for the event. Here's what it's currently checking before waiting for a plug vif event:

        timeout = CONF.vif_plugging_timeout
        if (self._conn_supports_start_paused and
            utils.is_neutron() and not
            vifs_already_plugged and power_on and timeout):
            events = self._get_neutron_events(network_info)
        else:
            events = []

But it does seem (from reading the code) that if all those conditions are met and the operation is an evacuate, it too would fail. Though I have not tried it.

Changed in nova:
status: New → Confirmed
tags: added: libvirt
tags: added: compute
Wenzhi Yu (yuywz) on 2016-01-27
Changed in nova:
assignee: nobody → Wen Zhi Yu (yuywz)
status: Confirmed → In Progress
Drew Thorstensen (thorst) wrote :

We discussed this issue at the mid-cycle. The PowerVM team was asked to re-evaluate, since this works in libvirt, and determine what is different in PowerVM's implementation.

I believe both drivers have the same semantic for rebuild/evacuate. The instance is destroyed on the source system and then the spawn is run on the target host. This is the compute manager's default implementation.

The next question was what was different about our criteria to determine if the vif plug time out should be adhered to.

PowerVM's implementation is pretty simple:

        if (utils.is_neutron() and CONF.vif_plugging_timeout):
            return [('network-vif-plugged', vif['id'])
                    for vif in self.network_info
                    if vif.get('active', True) is False]
        else:
            return []

Libvirt's is:
        timeout = CONF.vif_plugging_timeout
        if (self._conn_supports_start_paused and
            utils.is_neutron() and not
            vifs_already_plugged and power_on and timeout):
            events = self._get_neutron_events(network_info)
        else:
            events = []

In a rebuild scenario, libvirt should hit this:
 - self._conn_supports_start_paused: True if KVM or QEMU
 - utils.is_neutron(): assumed to be true.
 - vifs_already_plugged: False (in that this is a rebuild)
 - power_on: True (in that this is a rebuild)
 - timeout: Assumed to be set to some number.

I guess I'm wondering if libvirt could be affected by this bug? It could be hitting this, but then passing a rebuild test case if CONF.vif_plugging_is_fatal is set to False.
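That masking effect is easy to sketch (a toy model, not nova code): with a non-fatal timeout, the spawn simply continues after the wait expires, so a rebuild test can pass even though the event was lost.

```python
import queue

def wait_for_vif_plugged(events, timeout, fatal):
    """Wait for a network-vif-plugged event; behavior on timeout depends on `fatal`."""
    try:
        events.get(timeout=timeout)  # the event never arrives: it went to the down host
        return "plugged"
    except queue.Empty:
        if fatal:
            raise RuntimeError("VIF plugging timed out")
        return "timed out, continuing anyway"  # non-fatal: spawn proceeds regardless

lost = queue.Queue()  # the network-vif-plugged event was sent elsewhere
assert wait_for_vif_plugged(lost, timeout=0.01, fatal=False).startswith("timed out")
```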

Another reason that libvirt may not be impacted is perhaps they are doing an instance.save elsewhere in the flow, thus inadvertently updating to the right host. But I don't believe this to be the case...it looks like the only places instance.save is called are in cleanup and in the _live_migration_monitor. Also, nothing in the driver is updating the host; that is done solely in the manager (as one would expect).

Kyle - did I mis-interpret the issue?

Kyle L. Henderson (kyleh) wrote :

You documented the issue correctly, Drew.

One correction on the evacuate semantics: the instance is not destroyed on the source system (since it's down and confirmed to be down by the compute API) until the source host is available again (if ever). This would happen after the rebuild (recreate=True) is completed on the destination host.

Drew Thorstensen (thorst) wrote :

The issue with the PowerVM driver is actually in neutron. I set up a libvirt environment, and the difference is that the PowerVM VIF is for some reason in a BUILD state, whereas it is ACTIVE in libvirt.

If the PowerVM VIF was in an ACTIVE state, this wouldn't occur, and no neutron events would need to be waited for.

I'll investigate what's going on with the port state for networking-powervm. The 'up' state is being sent...so this requires some verification.

It is true that the nova instance.host isn't updated until after the spawn in nova. That could be investigated...but this is the root reason why PowerVM is seeing different behavior than Libvirt.

affects: nova → networking-powervm
Changed in networking-powervm:
assignee: Wen Zhi Yu (yuywz) → Drew Thorstensen (thorst)
Drew Thorstensen (thorst) wrote :

I see the issue. The agent does periodic 'get_device_details' calls. It turns out that, within Neutron, getting the device details reverts the port state to BUILD. Neutron then expects an immediate 'UP' request back; the agent doesn't send one.

Will need to add some logic.
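The ping-pong described above can be modeled as a tiny state machine (a hypothetical simplification of neutron's port-status handling): fetching device details flips the port back to BUILD, and only an explicit device-up report returns it to ACTIVE.

```python
class Port:
    """Toy model of a neutron port's status transitions."""
    def __init__(self):
        self.status = "ACTIVE"

    def get_device_details(self):
        self.status = "BUILD"  # neutron expects an immediate 'UP' report to follow
        return {"status": self.status}

    def update_device_up(self):
        self.status = "ACTIVE"

p = Port()
p.get_device_details()       # periodic heal poll
assert p.status == "BUILD"   # stuck here if the agent never reports the device up
p.update_device_up()         # the fix: heal issues a full provision / up report
assert p.status == "ACTIVE"
```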

Reviewed: https://review.openstack.org/273728
Committed: https://git.openstack.org/cgit/openstack/networking-powervm/commit/?id=65f53ab2412f1865f50d8dba701420350a7f68ec
Submitter: Jenkins
Branch: master

commit 65f53ab2412f1865f50d8dba701420350a7f68ec
Author: Drew Thorstensen <email address hidden>
Date: Thu Jan 28 19:59:45 2016 +0000

    Update heal code to ensure device up

    The heal code within the networking-powervm project would ensure that
    the VLAN and client device was routed out to the network. However, due
    to it calling 'get_device_details', the neutron code was changing the
    state back to BUILD.

    Given this behavior, it became apparent that the best path forward was
    to have the heal code call a full provision request for the client
    device. This actually will no-op very quickly if the VLAN is already on
    the client device, but tells Neutron that it is not in fact in a build
    state...but rather is now ACTIVE.

    This allows for a more robust provisioning scheme and allows the neutron
    state to reflect reality. It also updates any existing ports in the
    field that may be affected by this with the next 'heal' cycle.

    Change-Id: I02f2c4cd1d63b7a712e50c273e043e6a7ea5a5e1
    Closes-Bug: 1535918

Changed in networking-powervm:
status: In Progress → Fix Released
Kyle L. Henderson (kyleh) wrote :

I pulled the latest code on my systems with devstack. Removed the work around for the issue from the nova-powervm code base (which was to force the update of the instance.host to the target host) and ran an evacuation. I hit the same problem as seen before. While recreating the instance on the target host, the instance.host is pointing to the old source host and the event that is expected to be received by the target host's compute manager is sent to the source host (which is down.)

Changed in networking-powervm:
status: Fix Released → In Progress
Drew Thorstensen (thorst) wrote :

I looked at Kyle's box. The port is going back to a build state for some reason. Need to figure out why...

Reviewed: https://review.openstack.org/281469
Committed: https://git.openstack.org/cgit/openstack/networking-powervm/commit/?id=9f29aa1ef982a1dd421f55bbb5784c5c36b257e0
Submitter: Jenkins
Branch: master

commit 9f29aa1ef982a1dd421f55bbb5784c5c36b257e0
Author: Drew Thorstensen <email address hidden>
Date: Wed Feb 17 14:15:51 2016 -0500

    Fix the heal code to invoke with the rpc_device

    This resolves a bug in the heal code to correctly pass in the right
    parameter to the _get_nb_and_vlan method.

    Change-Id: Ibda1d3581b56a7a4a1fd163b406d28d32f9dd82c
    Closes-Bug: 1535918

Changed in networking-powervm:
status: In Progress → Fix Released
Taylor Peoples (tpeoples) wrote :

I am able to reproduce this same issue on a multinode devstack running libvirt.

On the source host, the last call to nova/network/base_api.py::update_instance_cache_with_nw_info for a specific instance before the source host crashes has the nw_info passed in as a VIF object with the "active" attribute set to False. This is because the VM has just been deployed and the network was just created. In other words, the last time the instance's InstanceInfoCache's network_info attribute was updated before the source host went down, the VIF was considered not active. In some environments, especially when doing concurrent deploys, it may take a while for the InstanceInfoCache to update the network_info to show as active.

What this boils down to is that Nova's InstanceInfoCache can potentially have a stale network_info active state. This causes the rebuild flow (which is the same as the spawn flow) to potentially end up waiting for the network-vif-plugged event, which will never come because it was sent to the source host instead of the destination. This results in the rebuild to fail because the VIF plugging times out.
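The race can be sketched with the same event-selection pattern the drivers use (a simplified model, not the actual driver code): whether the rebuild waits for network-vif-plugged depends entirely on a possibly stale "active" flag in the cached network info.

```python
def events_to_wait_for(cached_network_info):
    """Only wait for VIFs the cache still believes are inactive (sketch)."""
    return [("network-vif-plugged", vif["id"])
            for vif in cached_network_info
            if vif.get("active", True) is False]

fresh = [{"id": "port-1", "active": True}]   # cache updated before the host died
stale = [{"id": "port-1", "active": False}]  # host died right after the deploy
assert events_to_wait_for(fresh) == []  # no wait: evacuation succeeds
assert events_to_wait_for(stale) == [("network-vif-plugged", "port-1")]  # waits, then times out
```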

Steps:

1) Deploy VM(s) to host A
2) Take host A down (e.g., kill its nova-api and nova-compute processes) once the VM(s) from (1) have finished deploying
3) Try to evacuate VM(s) from host A to host B
4) The evacuation will potentially time out, per the explanation above. It is much easier to reproduce if you do step (2) as soon as possible after the VM(s) finish deploying

stack@controller:~$ glance image-list
+--------------------------------------+---------------------------------+
| ID | Name |
+--------------------------------------+---------------------------------+
| f91197db-16b5-44b2-beb4-72a9e57041c2 | cirros-0.3.4-x86_64-uec |
| 1348de9b-501d-426c-8cb5-e65381208085 | cirros-0.3.4-x86_64-uec-kernel |
| 790ebadb-bc5b-48be-b1f0-95a9214a11ae | cirros-0.3.4-x86_64-uec-ramdisk |
+--------------------------------------+---------------------------------+
stack@controller:~$
stack@controller:~$ neutron net-list
+--------------------------------------+---------+----------------------------------------------------------+
| id | name | subnets |
+--------------------------------------+---------+----------------------------------------------------------+
| 4ba74a3e-e7a8-4ca4-9de5-8a1d9e1042b8 | public | c9210289-4895-481b-946a-b406ba5889b4 2001:db8::/64 |
| | | 9a044095-ab4d-4767-817e-02d81cbe90ef 172.24.4.0/24 |
| d7faf346-1a26-41a0-bb62-b08808f6ba13 | private | f45ab890-a0d6-48c1-906e-9c8f81659d65 fdfd:f0f5:a83a::/64 |
| | | 0e85f797-0270-49e9-9600-6f21b9cf47d0 10.254.1.0/24 |
+--------------------------------------+---------+----------------------------------------------------------+
stack@controller:~$
stack@controller:~$ nova boot tdp-test-vm --flavor 1 --availability-zone nova:hostA --block-device id=f91197db-16b5-44b2-beb4-72a9e57041c2,source=image,dest=volume,size=1,bootind...

affects: networking-powervm → nova
Changed in nova:
assignee: Drew Thorstensen (thorst) → nobody
affects: nova → nova-powervm
Changed in nova-powervm:
assignee: nobody → Drew Thorstensen (thorst)
Changed in nova:
assignee: nobody → Taylor Peoples (tpeoples)

Change abandoned by Drew Thorstensen (<email address hidden>) on branch: master
Review: https://review.openstack.org/315874
Reason: Superseded by https://review.openstack.org/#/c/316417/

Changed in nova:
assignee: Taylor Peoples (tpeoples) → nobody
Sridhar Venkat (svenkat) on 2016-06-18
Changed in nova:
assignee: nobody → Sridhar Venkat (svenkat)
Sridhar Venkat (svenkat) wrote :

The problem is reproducible when more than one evacuation is attempted simultaneously (4 in my devstack environment). If evacuations are attempted one at a time, the problem does not occur.

Sridhar Venkat (svenkat) wrote :

My previous statement needs correction: the problem is reproducible even with one VM. To reproduce, deploy a VM on the source host and shut down the source host before the corresponding VIF is activated. Examine the nova-compute log; searching for "vif_type=" should reveal the VIF's active state. If it is 'false', evacuating such a VM results in the error reported in this bug.

If you wait until the VIF is activated before shutting down the source host, the VM can be evacuated successfully.

Changed in nova:
status: New → In Progress
Changed in nova:
assignee: Sridhar Venkat (svenkat) → Artom Lifshitz (notartom)
Artom Lifshitz (notartom) wrote :

Since the bot doesn't seem to have picked it up:
Fix proposed to nova (master):
https://review.openstack.org/#/c/371048/

Fix proposed to branch: master
Review: https://review.openstack.org/385086

Reviewed: https://review.openstack.org/371048
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a5b920a197c70d2ae08a1e1335d979857f923b4f
Submitter: Jenkins
Branch: master

commit a5b920a197c70d2ae08a1e1335d979857f923b4f
Author: Artom Lifshitz <email address hidden>
Date: Wed Oct 5 14:37:03 2016 -0400

    Send events to all relevant hosts if migrating

    Previously, external events were sent to the instance object's host
    field. This patch fixes the external event dispatching to check for
    migration. If an instance is being migrated, the source and
    destination compute are added to the set of hosts to which the event
    is sent.

    Change-Id: If00736ab36df4a5a3be4f02b0a550e4bcae77b1b
    Closes-bug: 1535918
    Closes-bug: 1624052

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/392219
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5de902a3163c9c079fab22754388bd4e02981298
Submitter: Jenkins
Branch: stable/newton

commit 5de902a3163c9c079fab22754388bd4e02981298
Author: Artom Lifshitz <email address hidden>
Date: Wed Oct 5 14:37:03 2016 -0400

    Send events to all relevant hosts if migrating

    Previously, external events were sent to the instance object's host
    field. This patch fixes the external event dispatching to check for
    migration. If an instance is being migrated, the source and
    destination compute are added to the set of hosts to which the event
    is sent.

    Change-Id: If00736ab36df4a5a3be4f02b0a550e4bcae77b1b
    Closes-bug: 1535918
    Closes-bug: 1624052
    (cherry picked from commit a5b920a197c70d2ae08a1e1335d979857f923b4f)

tags: added: in-stable-newton

This issue was fixed in the openstack/nova 15.0.0.0b1 development milestone.

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/331707
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

This issue was fixed in the openstack/nova 14.0.3 release.

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: master
Review: https://review.openstack.org/385086
Reason: Nothing is technically broken anymore, since the patch that actually fixes the bug has merged. The race is still present I believe, but it doesn't actually affect anything now that event dispatching is fixed.

I am hitting this issue.

1) After a nova evacuate on a two-compute setup, the instance's host parameter is updated to its new host.

2) While adding storage to the instance, it fails, because the RPC call to the compute service (the old host) hits a timeout exception.

A minor correction to comment #26:
1) After a nova evacuate on a two-compute setup, the instance's host parameter is *not* updated to its new host.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/461678
Reason: mitaka is basically end of life

Seyeong Kim (xtrusia) on 2017-08-18
description: updated
Seyeong Kim (xtrusia) on 2017-08-18
description: updated
Seyeong Kim (xtrusia) wrote :
description: updated
tags: added: sts-sru-needed
Seyeong Kim (xtrusia) wrote :
Seyeong Kim (xtrusia) on 2017-08-21
description: updated
Changed in cloud-archive:
status: New → Fix Released
Seyeong Kim (xtrusia) on 2017-08-21
description: updated
Seyeong Kim (xtrusia) on 2017-08-21
description: updated
Eric Desrochers (slashd) on 2017-08-21
Changed in nova (Ubuntu Xenial):
assignee: nobody → Seyeong Kim (xtrusia)
Eric Desrochers (slashd) on 2017-08-21
Changed in nova (Ubuntu Artful):
status: New → Fix Released
Changed in nova (Ubuntu Zesty):
status: New → Fix Released
Changed in nova (Ubuntu Xenial):
status: New → In Progress
Seyeong Kim (xtrusia) on 2017-08-22
description: updated
Eric Desrochers (slashd) wrote :

Uploaded in Xenial upload queue.

Hello Kyle, or anyone else affected,

Accepted nova into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:13.1.4-0ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Seyeong Kim (xtrusia) wrote :

Hello

I tested the -proposed package and it is working fine.

rc nova-api 2:13.1.4-0ubuntu3 all OpenStack Compute - API frontend
ii nova-api-os-compute 2:13.1.4-0ubuntu3 all OpenStack Compute - OpenStack Compute API frontend
ii nova-cert 2:13.1.4-0ubuntu3 all OpenStack Compute - certificate management
ii nova-common 2:13.1.4-0ubuntu3 all OpenStack Compute - common files
ii nova-conductor 2:13.1.4-0ubuntu3 all OpenStack Compute - conductor service
ii nova-consoleauth 2:13.1.4-0ubuntu3 all OpenStack Compute - Console Authenticator
ii nova-novncproxy 2:13.1.4-0ubuntu3 all OpenStack Compute - NoVNC proxy
ii nova-scheduler 2:13.1.4-0ubuntu3 all OpenStack Compute - virtual machine scheduler
ii python-nova 2:13.1.4-0ubuntu3 all OpenStack Compute Python libraries

tags: added: verification-done-xenial
removed: verification-needed-xenial
Seyeong Kim (xtrusia) wrote :

I deployed an OpenStack environment with my script from the [Test case] section of the description and reproduced the error.

I then upgraded nova-cloud-controller and nova-compute.

When I evacuated those ERROR-state VMs again, they went ACTIVE.

James Page (james-page) wrote :

Hello Kyle, or anyone else affected,

Accepted nova into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:13.1.4-0ubuntu3

---------------
nova (2:13.1.4-0ubuntu3) xenial; urgency=medium

  * Fix evacuation error when nova-compute is down just
    after VM is started.

    - d/p/make-sure-to-rebuild-claim-on-recreate.patch
      (backported from newton 0f2d874, upstream a2b0824)

    - d/p/Send-events-to-all-relevant-hosts-if-migrating.patch (LP: #1535918)
      (backported from a5b920)

 -- Seyeong Kim <email address hidden> Fri, 04 Aug 2017 04:46:40 +0900

Changed in nova (Ubuntu Xenial):
status: Fix Committed → Fix Released
Seyeong Kim (xtrusia) wrote :

I ran the same test as on Xenial (this is for the Mitaka UCA), and verification is done.

ii nova-api-os-compute 2:13.1.4-0ubuntu3~cloud0 all OpenStack Compute - OpenStack Compute API frontend
ii nova-cert 2:13.1.4-0ubuntu3~cloud0 all OpenStack Compute - certificate management
ii nova-common 2:13.1.4-0ubuntu3~cloud0 all OpenStack Compute - common files
ii nova-conductor 2:13.1.4-0ubuntu3~cloud0 all OpenStack Compute - conductor service
ii nova-consoleauth 2:13.1.4-0ubuntu3~cloud0 all OpenStack Compute - Console Authenticator
ii nova-novncproxy 2:13.1.4-0ubuntu3~cloud0 all OpenStack Compute - NoVNC proxy
ii nova-scheduler 2:13.1.4-0ubuntu3~cloud0 all OpenStack Compute - virtual machine scheduler
ii python-nova 2:13.1.4-0ubuntu3~cloud0 all OpenStack Compute Python libraries

tags: added: verification-mitaka-done
removed: verification-mitaka-needed