Evacuation fails if the source host returns while the migration is still in progress

Bug #1764883 reported by Lee Yarwood
This bug affects 2 people
Affects                   Status         Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released   Medium      Lee Yarwood
Pike                      Fix Committed  Medium      Lee Yarwood
Queens                    Fix Committed  Medium      Lee Yarwood

Bug Description

Description
===========

If the source host returns while the evacuation migration record is still in a 'pre-migrating' state, the source compute manager does not remove the evacuating instances in question during _destroy_evacuated_instances.

More importantly, the source host coming back online early allows _init_instance to set instance.vm_state to ERROR and instance.task_state to None, thanks to the following failed rebuild logic:

https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L810-L821

As a result, the in-progress rebuild on the destination host fails when it attempts to save the instance while expecting a specific task_state:

https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L3050-L3052

https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L3123
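The interaction between the two hosts can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the dict-based record, the save() helper and the two host functions are hypothetical stand-ins, not nova's actual code, and only the state names are taken from the log in this report and from nova's state constants.

```python
# Toy model of the race. Nothing here imports nova; every helper is hypothetical.
REBUILD_TASK_STATES = (
    "rebuilding",
    "rebuild_block_device_mapping",
    "rebuild_spawning",
)


class UnexpectedTaskStateError(Exception):
    """Stand-in for the conflict raised when task_state changed underneath us."""


def save(record, updates, expected_task_state=None):
    # Compare-and-swap: refuse the write if another writer has already
    # moved task_state away from what the caller expects.
    if (expected_task_state is not None
            and record["task_state"] not in expected_task_state):
        raise UnexpectedTaskStateError(
            "Expected: %s. Actual: %s"
            % (expected_task_state, record["task_state"]))
    record.update(updates)


def source_init_instance(record):
    # The restarted source host sees a rebuild task_state and assumes the
    # rebuild failed while it was down, so it resets the shared record.
    if (record["vm_state"] in ("active", "stopped")
            and record["task_state"] in REBUILD_TASK_STATES):
        save(record, {"vm_state": "error", "task_state": None})


def destination_finish_rebuild(record):
    # The destination host is still rebuilding and expects to own task_state.
    save(record, {"vm_state": "active", "task_state": None},
         expected_task_state=["rebuild_spawning"])


def run_race():
    record = {"vm_state": "active", "task_state": "rebuild_spawning"}
    source_init_instance(record)            # source host returns early
    try:
        destination_finish_rebuild(record)  # destination then tries to save
    except UnexpectedTaskStateError as exc:
        return record, str(exc)
    return record, None
```

Calling run_race() in this model leaves the record in the error state with task_state cleared, and the destination's save is rejected because the expected task_state no longer matches.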

This issue was originally reported downstream while testing an instance high-availability feature that uses a combination of Pacemaker and instance evacuation to keep instances online:

Nova reports overcloud instance in error state after failed double compute failover instance-ha evacuation
https://bugzilla.redhat.com/show_bug.cgi?id=1567606

The downstream report includes an example UnexpectedTaskStateError failure in comment #8:

2018-04-17 11:11:12.999 1 ERROR nova.compute.manager [req-ac20c023-9abf-412f-987f-2981c7837c57 da4d95c480c343c5bf6abe3b789f4c17 d2c2437b7f6642b4a1d5907fa5f373a9 - default default] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Setting instance vm_state to ERROR:
UnexpectedTaskStateError_Remote: Conflict updating instance d9419b05-025e-4193-b3f7-7f0efc23593b. Expected: {'task_state': [u'rebuild_spawning']}. Actual: {'task_state': None}

The Rally-based tests for this feature just happen to use the `b` sysrq-trigger, which immediately reboots the host, allowing it to recover just in time to hit this race.

Steps to reproduce
==================
- Evacuate an instance
- Restart the source compute service before the instance is fully rebuilt

Expected result
===============
The source compute removes the instance and does not attempt to update the instance or task state.

Actual result
=============
The source compute doesn't attempt to remove the instance and attempts to update the instance and task state before the rebuild is complete.

Environment
===========
1. Exact version of OpenStack you are running. See the following

   88adde8bba393b8d08ce21e9e3334a76e853b2e0

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   Local, yet to test with shared storage.

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

See https://bugzilla.redhat.com/show_bug.cgi?id=1567606#c8

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/562072

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/562284

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/562072
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b9213398c785a0b54920754d50190e478293fbe8
Submitter: Zuul
Branch: master

commit b9213398c785a0b54920754d50190e478293fbe8
Author: Lee Yarwood <email address hidden>
Date: Tue Apr 17 22:46:04 2018 +0100

    Add regression test for bug #1764883

    Related-Bug: #1764883
    Change-Id: If71727cde51c29231dbb9a51c5babbcdfc802bdd

Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Lee Yarwood (lyarwood)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/562284
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c4988cdabf311d29cf64af732091068cfabeedaa
Submitter: Zuul
Branch: master

commit c4988cdabf311d29cf64af732091068cfabeedaa
Author: Lee Yarwood <email address hidden>
Date: Wed Apr 18 14:35:07 2018 +0100

    compute: Ensure pre-migrating instances are destroyed during init_host

    Previously _destroy_evacuated_instances would not remove instances
    associated with evacuation migration records in a pre-migrating state.
    This could lead to a race between the original source host and the new
    destination if the source returned early, calling init_instance during
    the evacuation process.

    This change now includes pre-migrating migration records when looking
    for active evacuations. Additionally the dict of evacuating instances is
    then returned to init_host and used to skip running init_instance
    against such instances ensuring no race occurs between the two computes.

    Closes-bug: #1764883
    Change-Id: I379678dfdb2609f12a572d4f99c8e9da4deab803
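
The approach described in this commit message can be sketched in a few lines. This is a minimal illustration, not the merged patch: the function signatures, the dict-based migration records and the set of local guests are hypothetical, and the 'accepted' and 'done' statuses listed next to 'pre-migrating' are an assumption about which other statuses count as an active evacuation.

```python
# Toy model of the fix. All names and data shapes here are hypothetical.

# Assumption: 'accepted' and 'done' were already treated as active
# evacuations; 'pre-migrating' is the status the commit message adds.
EVACUATION_STATUSES = ("accepted", "pre-migrating", "done")


def destroy_evacuated_instances(host, migrations, local_guests):
    """Remove guests evacuated away from this host and report them.

    Returns a dict mapping instance UUID to its evacuation migration record.
    """
    evacuated = {
        m["instance_uuid"]: m
        for m in migrations
        if m["source_compute"] == host
        and m["migration_type"] == "evacuation"
        and m["status"] in EVACUATION_STATUSES
    }
    for uuid in evacuated:
        # Stand-in for destroying the guest on the local hypervisor.
        local_guests.discard(uuid)
    return evacuated


def init_host(host, migrations, local_guests, db_instance_uuids):
    """Return the UUIDs that would still be passed to init_instance."""
    evacuated = destroy_evacuated_instances(host, migrations, local_guests)
    initialised = []
    for uuid in db_instance_uuids:
        if uuid in evacuated:
            # Skip: the destination host owns this instance's state now.
            continue
        initialised.append(uuid)
    return initialised


migrations = [{
    "instance_uuid": "evac-1",
    "source_compute": "src",
    "migration_type": "evacuation",
    "status": "pre-migrating",
}]
local_guests = {"evac-1", "keep-1"}
initialised = init_host("src", migrations, local_guests, ["evac-1", "keep-1"])
```

In this model the evacuating guest is removed from the source and is never handed to the per-instance initialisation step, so the source cannot overwrite the state that the destination is still updating.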

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b3

This issue was fixed in the openstack/nova 18.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/621199

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/621200

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/621204

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/621205

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/621199
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d9084d03e9d21fe249349c84549ede302ed1f652
Submitter: Zuul
Branch: stable/queens

commit d9084d03e9d21fe249349c84549ede302ed1f652
Author: Lee Yarwood <email address hidden>
Date: Tue Apr 17 22:46:04 2018 +0100

    Add regression test for bug #1764883

    Related-Bug: #1764883
    Change-Id: If71727cde51c29231dbb9a51c5babbcdfc802bdd
    (cherry picked from commit b9213398c785a0b54920754d50190e478293fbe8)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/621200
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8bba879141d644b248d246cb2ed196068e20785c
Submitter: Zuul
Branch: stable/queens

commit 8bba879141d644b248d246cb2ed196068e20785c
Author: Lee Yarwood <email address hidden>
Date: Wed Apr 18 14:35:07 2018 +0100

    compute: Ensure pre-migrating instances are destroyed during init_host

    Previously _destroy_evacuated_instances would not remove instances
    associated with evacuation migration records in a pre-migrating state.
    This could lead to a race between the original source host and the new
    destination if the source returned early, calling init_instance during
    the evacuation process.

    This change now includes pre-migrating migration records when looking
    for active evacuations. Additionally the dict of evacuating instances is
    then returned to init_host and used to skip running init_instance
    against such instances ensuring no race occurs between the two computes.

    Closes-bug: #1764883
    Change-Id: I379678dfdb2609f12a572d4f99c8e9da4deab803
    (cherry picked from commit c4988cdabf311d29cf64af732091068cfabeedaa)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/621204
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6380226863ecfcba2792c8ada3b2b92503b1d8ec
Submitter: Zuul
Branch: stable/pike

commit 6380226863ecfcba2792c8ada3b2b92503b1d8ec
Author: Lee Yarwood <email address hidden>
Date: Tue Apr 17 22:46:04 2018 +0100

    Add regression test for bug #1764883

    Related-Bug: #1764883
    Change-Id: If71727cde51c29231dbb9a51c5babbcdfc802bdd
    (cherry picked from commit b9213398c785a0b54920754d50190e478293fbe8)
    (cherry picked from commit d9084d03e9d21fe249349c84549ede302ed1f652)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/621205
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1abdb4ca22436c5eabadeab46b1237154c9cca92
Submitter: Zuul
Branch: stable/pike

commit 1abdb4ca22436c5eabadeab46b1237154c9cca92
Author: Lee Yarwood <email address hidden>
Date: Wed Apr 18 14:35:07 2018 +0100

    compute: Ensure pre-migrating instances are destroyed during init_host

    Previously _destroy_evacuated_instances would not remove instances
    associated with evacuation migration records in a pre-migrating state.
    This could lead to a race between the original source host and the new
    destination if the source returned early, calling init_instance during
    the evacuation process.

    This change now includes pre-migrating migration records when looking
    for active evacuations. Additionally the dict of evacuating instances is
    then returned to init_host and used to skip running init_instance
    against such instances ensuring no race occurs between the two computes.

    NOTE(lyarwood): Conflict due to edeeaf9102eccb78f1a2555c7e18c3d706f07639
    not being present in stable/pike.

    Conflicts:
            nova/tests/functional/regressions/test_bug_1764883.py

    Closes-bug: #1764883
    Change-Id: I379678dfdb2609f12a572d4f99c8e9da4deab803
    (cherry picked from commit c4988cdabf311d29cf64af732091068cfabeedaa)
    (cherry picked from commit 8bba879141d644b248d246cb2ed196068e20785c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.7

This issue was fixed in the openstack/nova 16.1.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.9

This issue was fixed in the openstack/nova 17.0.9 release.
