migration fails if instance build failed on destination host

Bug #1718512 reported by Eric M Gonzalez on 2017-09-20
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Matt Riedemann
Ocata
Medium
Matt Riedemann
Pike
Medium
Matt Riedemann
Queens
Medium
Matt Riedemann

Bug Description

(OpenStack Nova, commit d8b30c3772, per OSA-14.2.7)

if an instance build fails on a hypervisor the "retry" field of the instance's request spec is populated with which host and how many times it attempted to retry the build. this field remains populated during the life-time of the instance.

if a live-migration for the same instance is requested, the conductor loads this request spec and passes it on to the scheduler. the scheduler will fail the migration request on RetryFilter since the target was already known to have failed (albeit, for the build).

with the help of mriedem and melwitt of #openstack-nova, we determined that migration retries are handled separately from build retries. mriedem suggested a patch to ignore the retry field of the instance request spec during migrations. this patch allowed the failing migration to succeed.

it is important to note that it may fail the migration again, however there is still sufficient reason to ignore the build's failures/retries during a migration.

12:55 < mriedem> it does stand to reason that if this instance failed to build originally on those 2 hosts, that live migrating it there might fail too...but we don't know why it originally failed, could have been a resource claim issue at the time
12:58 < melwitt> yeah, often it's a failed claim. and also what if that compute host is eventually replaced over the lifetime of the cluster, making it a fresh candidate for several instances that might still avoid it because they once failed to build there back when it was a different machine

Fix proposed to branch: master
Review: https://review.openstack.org/505771

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem) on 2017-09-20
Changed in nova:
importance: Undecided → Medium
tags: added: live-migration
Matt Riedemann (mriedem) wrote :

Note this will also be a problem for cold migrate, evacuate and unshelve.

Matt Riedemann (mriedem) on 2018-04-05
no longer affects: nova/newton
Matt Riedemann (mriedem) wrote :

I believe this regression goes back to Mitaka: https://review.openstack.org/#/c/202675/

Matt Riedemann (mriedem) wrote :

I wrote a functional regression test for this with a cold migrate scenario and I'm unable to recreate the failure now. This is the scenario I have:

- I've got 3 weighted hosts: host1 > host2 > host3
- Create the instance, first attempt against host1 fails, reschedule to host2.
- Cold migrate the instance; because of the weights, host1 will be picked. Fail that and reschedule to host3.
- The instance should end up on host3 after the cold migrate, but host1 should have been called for the cold migration (which this bug says it won't be).

What I'm seeing is that host1 is called in the cold migrate scenario even though it failed the original build scenario. Looking back on the code, I'm not sure where the RequestSpec is actually getting persisted with the retry information.

Matt Riedemann (mriedem) wrote :

Looking back at mitaka, the api creates the request spec, but doesn't send it to conductor.build_instances, that just builds it's own fake request spec to pass to the scheduler; conductor passes that fake reqspec down to compute, where it's ignored. There are no request_spec.save() calls in mitaka, but there is one in newton for resize:

https://github.com/openstack/nova/blob/newton-eol/nova/conductor/manager.py#L287

So I wonder if this is what happened: built the server on host1, resized - failed on host2, rescheduled to host3, then tried to live migrate to host2 and that was kicked out because the request_spec.save() during the resize reschedule was persisted.

I can change the functional test to try that scenario.

Matt Riedemann (mriedem) wrote :

If that request_spec.save() is the cause of the issue, that was introduced in Newton:

https://review.openstack.org/#/c/347387/

Matt Riedemann (mriedem) wrote :

I got a recreate based on comment 7, so https://review.openstack.org/#/c/505771/ has been updated to be the functional regression recreate test, and a follow up patch with the fix will go on top.

Fix proposed to branch: master
Review: https://review.openstack.org/559447

Reviewed: https://review.openstack.org/505771
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=89448bea577b30c40ce39185d14fe14f9c61a0c2
Submitter: Zuul
Branch: master

commit 89448bea577b30c40ce39185d14fe14f9c61a0c2
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 20 14:24:44 2017 -0400

    Add regression test for persisted RequestSpec.retry from failed resize

    Commit 74ab427d4796d8a386f84a15cc49188c2a60f8f1 in Newton added
    code to persist changes to the RequestSpec during a resize since
    the flavor changes.

    That change inadvertantly also persisted any failed hosts during
    the resize that are stored in the RequestSpec.retry field during
    a reschedule.

    The problem is that later those persisted failed hosts are rejected
    by the RetryFilter, which can be confusing if an admin is trying
    to live migrate or evacate the instance to one of those specific
    hosts.

    This adds a functional regression test to show the failure, which
    will be fixed in a separate change that then modifies the assertions.

    Change-Id: Ib8a23db838b0bbf2cfb8123cf6aaa39d00ff0640
    Related-Bug: #1718512

Reviewed: https://review.openstack.org/559447
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6647f11dc1aba89f9b0e2703f236a43f31d88079
Submitter: Zuul
Branch: master

commit 6647f11dc1aba89f9b0e2703f236a43f31d88079
Author: Matt Riedemann <email address hidden>
Date: Fri Apr 6 20:28:53 2018 -0400

    Don't persist RequestSpec.retry

    During a resize, the RequestSpec.flavor is updated
    to the new flavor. If the resize failed on one host
    and was rescheduled, the RequestSpec.retry is updated
    for that failed host and mistakenly persisted, which
    can affect later move operations, like if an admin
    targets one of those previously failed hosts for a
    live migration or evacuate operation.

    This change fixes the problem by not ever persisting
    the RequestSpec.retry field to the database, since
    retries are per-request/operation and not something
    that needs to be persisted.

    Alternative to this, we could reset the retry field
    in the RequestSpec.reset_forced_destinations method
    but that would be slightly overloading the meaning
    of that method, and the approach taken in this patch
    is arguably cleaner since retries shouldn't ever be
    persisted. It should be noted, however, that one
    advantage to resetting the 'retry' field in the
    RequestSpec.reset_forced_destinations method would
    be to avoid this issue for any existing DB entries
    that have this problem.

    The related functional regression test is updated
    to show the bug is now fixed.

    Change-Id: Iadbf8ec935565a6d4ccf6f36ef630ab6bf1bea5d
    Closes-Bug: #1718512

Changed in nova:
status: In Progress → Fix Released

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/560955

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/560955
Reason: rebase squash failure

Reviewed: https://review.openstack.org/560142
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c2dc902e39eb345ebf674ad47422f1e72ec170e6
Submitter: Zuul
Branch: stable/queens

commit c2dc902e39eb345ebf674ad47422f1e72ec170e6
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 20 14:24:44 2017 -0400

    Add regression test for persisted RequestSpec.retry from failed resize

    Commit 74ab427d4796d8a386f84a15cc49188c2a60f8f1 in Newton added
    code to persist changes to the RequestSpec during a resize since
    the flavor changes.

    That change inadvertantly also persisted any failed hosts during
    the resize that are stored in the RequestSpec.retry field during
    a reschedule.

    The problem is that later those persisted failed hosts are rejected
    by the RetryFilter, which can be confusing if an admin is trying
    to live migrate or evacate the instance to one of those specific
    hosts.

    This adds a functional regression test to show the failure, which
    will be fixed in a separate change that then modifies the assertions.

    Change-Id: Ib8a23db838b0bbf2cfb8123cf6aaa39d00ff0640
    Related-Bug: #1718512
    (cherry picked from commit 89448bea577b30c40ce39185d14fe14f9c61a0c2)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/560143
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=757dbd17cf37aecea005dfdc954bf50bbddedd95
Submitter: Zuul
Branch: stable/queens

commit 757dbd17cf37aecea005dfdc954bf50bbddedd95
Author: Matt Riedemann <email address hidden>
Date: Fri Apr 6 20:28:53 2018 -0400

    Don't persist RequestSpec.retry

    During a resize, the RequestSpec.flavor is updated
    to the new flavor. If the resize failed on one host
    and was rescheduled, the RequestSpec.retry is updated
    for that failed host and mistakenly persisted, which
    can affect later move operations, like if an admin
    targets one of those previously failed hosts for a
    live migration or evacuate operation.

    This change fixes the problem by not ever persisting
    the RequestSpec.retry field to the database, since
    retries are per-request/operation and not something
    that needs to be persisted.

    Alternative to this, we could reset the retry field
    in the RequestSpec.reset_forced_destinations method
    but that would be slightly overloading the meaning
    of that method, and the approach taken in this patch
    is arguably cleaner since retries shouldn't ever be
    persisted. It should be noted, however, that one
    advantage to resetting the 'retry' field in the
    RequestSpec.reset_forced_destinations method would
    be to avoid this issue for any existing DB entries
    that have this problem.

    The related functional regression test is updated
    to show the bug is now fixed.

    NOTE(mriedem): This backport also includes change
    I61f745667f4c003d7e3ca6f2f9a99194930ac892 squashed
    into it in order to not re-introduce that bug.

    Change-Id: Iadbf8ec935565a6d4ccf6f36ef630ab6bf1bea5d
    Closes-Bug: #1718512
    (cherry picked from commit 6647f11dc1aba89f9b0e2703f236a43f31d88079)

Reviewed: https://review.openstack.org/560145
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=004e9acf99964ac78f85d3efbd0a04404bd9a3ef
Submitter: Zuul
Branch: stable/pike

commit 004e9acf99964ac78f85d3efbd0a04404bd9a3ef
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 20 14:24:44 2017 -0400

    Add regression test for persisted RequestSpec.retry from failed resize

    Commit 74ab427d4796d8a386f84a15cc49188c2a60f8f1 in Newton added
    code to persist changes to the RequestSpec during a resize since
    the flavor changes.

    That change inadvertantly also persisted any failed hosts during
    the resize that are stored in the RequestSpec.retry field during
    a reschedule.

    The problem is that later those persisted failed hosts are rejected
    by the RetryFilter, which can be confusing if an admin is trying
    to live migrate or evacate the instance to one of those specific
    hosts.

    This adds a functional regression test to show the failure, which
    will be fixed in a separate change that then modifies the assertions.

    NOTE(mriedem): The confirmResize API post call in this version
    needed the check_response_status=[204] kwarg because commit
    8ec0b4390401ce62cab0ea9b3786dc487e26c9f7 isn't in Pike.

    Change-Id: Ib8a23db838b0bbf2cfb8123cf6aaa39d00ff0640
    Related-Bug: #1718512
    (cherry picked from commit 89448bea577b30c40ce39185d14fe14f9c61a0c2)
    (cherry picked from commit c2dc902e39eb345ebf674ad47422f1e72ec170e6)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/560146
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=878e99d1f82fbeec840b2dbab8e40c27127d88ba
Submitter: Zuul
Branch: stable/pike

commit 878e99d1f82fbeec840b2dbab8e40c27127d88ba
Author: Matt Riedemann <email address hidden>
Date: Fri Apr 6 20:28:53 2018 -0400

    Don't persist RequestSpec.retry

    During a resize, the RequestSpec.flavor is updated
    to the new flavor. If the resize failed on one host
    and was rescheduled, the RequestSpec.retry is updated
    for that failed host and mistakenly persisted, which
    can affect later move operations, like if an admin
    targets one of those previously failed hosts for a
    live migration or evacuate operation.

    This change fixes the problem by not ever persisting
    the RequestSpec.retry field to the database, since
    retries are per-request/operation and not something
    that needs to be persisted.

    Alternative to this, we could reset the retry field
    in the RequestSpec.reset_forced_destinations method
    but that would be slightly overloading the meaning
    of that method, and the approach taken in this patch
    is arguably cleaner since retries shouldn't ever be
    persisted. It should be noted, however, that one
    advantage to resetting the 'retry' field in the
    RequestSpec.reset_forced_destinations method would
    be to avoid this issue for any existing DB entries
    that have this problem.

    The related functional regression test is updated
    to show the bug is now fixed.

    NOTE(mriedem): This backport also includes change
    I61f745667f4c003d7e3ca6f2f9a99194930ac892 squashed
    into it in order to not re-introduce that bug. On
    Pike it must be adjusted slightly to pass a string
    rather than list to _wait_for_migration_status
    since I752617066bb2167b49239ab9d17b0c89754a3e12 is
    not in Pike.

    Change-Id: Iadbf8ec935565a6d4ccf6f36ef630ab6bf1bea5d
    Closes-Bug: #1718512
    (cherry picked from commit 6647f11dc1aba89f9b0e2703f236a43f31d88079)
    (cherry picked from commit 757dbd17cf37aecea005dfdc954bf50bbddedd95)

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

Reviewed: https://review.openstack.org/560162
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0d111f17c4e411a42beb6c80bc27e6455a71c48c
Submitter: Zuul
Branch: stable/ocata

commit 0d111f17c4e411a42beb6c80bc27e6455a71c48c
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 20 14:24:44 2017 -0400

    Add regression test for persisted RequestSpec.retry from failed resize

    Commit 74ab427d4796d8a386f84a15cc49188c2a60f8f1 in Newton added
    code to persist changes to the RequestSpec during a resize since
    the flavor changes.

    That change inadvertantly also persisted any failed hosts during
    the resize that are stored in the RequestSpec.retry field during
    a reschedule.

    The problem is that later those persisted failed hosts are rejected
    by the RetryFilter, which can be confusing if an admin is trying
    to live migrate or evacate the instance to one of those specific
    hosts.

    This adds a functional regression test to show the failure, which
    will be fixed in a separate change that then modifies the assertions.

    NOTE(mriedem): There are two changes in this backport:

    1. The functional test needed to change slightly to disable the
       DiskFilter since 2fe96819c24eff5a9493a6559f3e8d5b4624a8c9 is
       not in Ocata.

    2. The test needs to use 'api_post' directly on the API client for
       the confirmResize call since the check_response_status kwarg
       wasn't in post_server_action until 8dd11ca1b34e1ed58b4 in Pike.

    Change-Id: Ib8a23db838b0bbf2cfb8123cf6aaa39d00ff0640
    Related-Bug: #1718512
    (cherry picked from commit 89448bea577b30c40ce39185d14fe14f9c61a0c2)
    (cherry picked from commit c2dc902e39eb345ebf674ad47422f1e72ec170e6)
    (cherry picked from commit 004e9acf99964ac78f85d3efbd0a04404bd9a3ef)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/560167
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c61ec9d98b7ca22fb6b68a80a9b382251a537a83
Submitter: Zuul
Branch: stable/ocata

commit c61ec9d98b7ca22fb6b68a80a9b382251a537a83
Author: Matt Riedemann <email address hidden>
Date: Fri Apr 6 20:28:53 2018 -0400

    Don't persist RequestSpec.retry

    During a resize, the RequestSpec.flavor is updated
    to the new flavor. If the resize failed on one host
    and was rescheduled, the RequestSpec.retry is updated
    for that failed host and mistakenly persisted, which
    can affect later move operations, like if an admin
    targets one of those previously failed hosts for a
    live migration or evacuate operation.

    This change fixes the problem by not ever persisting
    the RequestSpec.retry field to the database, since
    retries are per-request/operation and not something
    that needs to be persisted.

    Alternative to this, we could reset the retry field
    in the RequestSpec.reset_forced_destinations method
    but that would be slightly overloading the meaning
    of that method, and the approach taken in this patch
    is arguably cleaner since retries shouldn't ever be
    persisted. It should be noted, however, that one
    advantage to resetting the 'retry' field in the
    RequestSpec.reset_forced_destinations method would
    be to avoid this issue for any existing DB entries
    that have this problem.

    The related functional regression test is updated
    to show the bug is now fixed.

    NOTE(mriedem): This backport also includes change
    I61f745667f4c003d7e3ca6f2f9a99194930ac892 squashed
    into it in order to not re-introduce that bug. On
    Ocata it must be adjusted slightly to pass a string
    rather than list to _wait_for_migration_status
    since I752617066bb2167b49239ab9d17b0c89754a3e12 is
    not in Ocata.

    NOTE(mriedem): This patch has to pull some changes
    from two other patches to make live migration work
    in the fake virt driver: ce893e37f and b97c433f7.

    Change-Id: Iadbf8ec935565a6d4ccf6f36ef630ab6bf1bea5d
    Closes-Bug: #1718512
    (cherry picked from commit 6647f11dc1aba89f9b0e2703f236a43f31d88079)
    (cherry picked from commit 757dbd17cf37aecea005dfdc954bf50bbddedd95)
    (cherry picked from commit 878e99d1f82fbeec840b2dbab8e40c27127d88ba)

This issue was fixed in the openstack/nova 16.1.2 release.

This issue was fixed in the openstack/nova 17.0.3 release.

This issue was fixed in the openstack/nova 15.1.1 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers