Instance resize intermittently fails when rescheduling

Bug #1741125 reported by Ed Leafe
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Ed Leafe

Bug Description

When resizing an instance, the migration task calls "replace_allocation_with_migration()", which first verifies that the instance has allocations against the source compute node [0]. It then replaces the instance's allocation with one for the migration, using the migration's UUID as the consumer instead of the instance's. However, if the first migration attempt fails and the resize is rescheduled, "replace_allocation_with_migration()" is called again. By then the allocations have the migration UUID as the consumer, so the check that the instance has allocations on the source compute node fails, and the migration is put into an ERROR state.

[0] https://github.com/openstack/nova/blob/f95f165b49fbc0efe29450b0e858a3ccadecedea/nova/conductor/tasks/migrate.py#L47-L48
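
To make the failure mode concrete, here is a minimal sketch that models allocations as a plain dict instead of talking to Placement. Everything in it (the dict layout, the AllocationError class, the "instance-1" / "migration-1" / "source-node" names, and the function signature) is an assumption made for illustration; it is not nova's actual code, only a toy with the same check-then-swap shape.

```python
# Toy stand-in for Placement: consumer UUID -> {resource provider: resources}.
# All names and the signature below are hypothetical, not nova's real API.
class AllocationError(Exception):
    pass


def replace_allocation_with_migration(allocations, instance_uuid,
                                      migration_uuid, source_node):
    """Move the instance's source-node allocation to the migration consumer."""
    instance_allocs = allocations.get(instance_uuid, {})
    if source_node not in instance_allocs:
        # This is the check that fails when the resize is rescheduled.
        raise AllocationError(
            "instance %s has no allocations on %s"
            % (instance_uuid, source_node))
    # Swap the consumer: the migration now holds the source-node resources.
    allocations[migration_uuid] = {
        source_node: instance_allocs.pop(source_node)}
    if not instance_allocs:
        allocations.pop(instance_uuid, None)


allocations = {"instance-1": {"source-node": {"VCPU": 2, "MEMORY_MB": 2048}}}

# First attempt: the swap succeeds.
replace_allocation_with_migration(
    allocations, "instance-1", "migration-1", "source-node")
first_attempt_state = dict(allocations)

# Second call without the swap having been reverted: the check cannot
# find the instance's allocation any more.
try:
    replace_allocation_with_migration(
        allocations, "instance-1", "migration-1", "source-node")
    second_attempt_failed = False
except AllocationError:
    second_attempt_failed = True
```

After the first call only "migration-1" is a consumer, so the second call raises instead of swapping again.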

Revision history for this message
Matt Riedemann (mriedem) wrote :

I suspect this is also broken for people using the CachingScheduler since the scheduler doesn't create allocations against Placement.

Changed in nova:
status: New → Confirmed
tags: added: queens-rc-potential
tags: added: placement resize
Revision history for this message
Matt Riedemann (mriedem) wrote :

When rescheduling during a resize, we should know that we're doing a reschedule because the retry list in the filter_properties will have entries in it for hosts that have been tried (and failed). So we can change the logic in the reschedule case to only move the instance's allocation from the failed node to the next alternate node that we'll try. The migration record consumer will continue to have its allocation held against the original source node.
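
A rough sketch of the idea proposed in this comment, again using a plain dict in place of Placement. The exact layout of filter_properties (a "retry" dict with a "hosts" list), the helper names, and the "dest-a" / "dest-b" hosts are assumptions for illustration, and this shows the proposal as described here, not necessarily the fix that was eventually merged.

```python
def is_reschedule(filter_properties):
    """Return True if earlier hosts were already tried for this request."""
    # Assumed layout: {"retry": {"hosts": [[host, node], ...]}}
    retry = filter_properties.get("retry") or {}
    return bool(retry.get("hosts"))


def move_instance_allocation(allocations, instance_uuid,
                             failed_node, next_node):
    """Move only the instance's allocation to the next alternate node."""
    instance_allocs = allocations.setdefault(instance_uuid, {})
    resources = instance_allocs.pop(failed_node, None)
    if resources is not None:
        instance_allocs[next_node] = resources


# State after a failed first attempt on "dest-a": the migration consumer
# still holds the source node, the instance holds the failed destination.
allocations = {
    "migration-1": {"source-node": {"VCPU": 2}},
    "instance-1": {"dest-a": {"VCPU": 2}},
}
filter_properties = {"retry": {"hosts": [["dest-a", "dest-a"]]}}

if is_reschedule(filter_properties):
    # Skip the consumer swap; only retarget the instance's allocation.
    move_instance_allocation(allocations, "instance-1", "dest-a", "dest-b")
```

The migration's allocation against "source-node" is deliberately left untouched, matching the last sentence of the comment.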

Revision history for this message
Matt Riedemann (mriedem) wrote :

As for the CachingScheduler case, we'll just have to be graceful about the fact that the instance might not have any allocations against the source node, both in conductor and compute.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Functional test with the recreate is here:

https://review.openstack.org/#/c/531022/

Revision history for this message
Matt Riedemann (mriedem) wrote :

I've created a separate bug for the resize + CachingScheduler issue:

https://bugs.launchpad.net/nova/+bug/1741307

Revision history for this message
Matt Riedemann (mriedem) wrote :

Ed figured out that the failure is due to a race, similar to a race we had to fix in the server create reschedule flow with alternate hosts.

When a resize fails, the compute casts to conductor to reschedule and then reverts the allocation swap done in conductor (that is, it moves the instance allocations back to the source node and deletes the migration record allocations).

The problem is that by the time compute reverts the allocation swap, conductor might already be looking for the instance allocation on the source node, not find it, and fail, which is what Ed was hitting.
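
The race can be illustrated by replaying the two possible orderings deterministically. This is only a sketch under assumed names (the helper functions, the step labels "revert" and "check", and the dict-based allocations are all hypothetical); real nova does this across services over RPC, which is why the ordering is not guaranteed.

```python
def swap_to_migration(allocs, instance, migration, source):
    """Conductor: move the source-node allocation to the migration."""
    allocs[migration] = {source: allocs[instance].pop(source)}


def revert_swap(allocs, instance, migration, source):
    """Compute: give the source-node allocation back to the instance."""
    allocs.setdefault(instance, {})[source] = allocs.pop(migration)[source]


def conductor_check(allocs, instance, source):
    """Conductor: does the instance hold an allocation on the source?"""
    return source in allocs.get(instance, {})


def run(order):
    """Replay a failed first attempt followed by the given step order."""
    allocs = {"instance-1": {"source-node": {"VCPU": 2}}}
    swap_to_migration(allocs, "instance-1", "migration-1", "source-node")
    check_result = None
    for step in order:
        if step == "revert":
            revert_swap(allocs, "instance-1", "migration-1", "source-node")
        elif step == "check":
            check_result = conductor_check(
                allocs, "instance-1", "source-node")
    return check_result


# Compute finishes the revert before conductor looks: the retry can proceed.
revert_first = run(["revert", "check"])
# Conductor looks before the revert lands: the check fails.
check_first = run(["check", "revert"])
```

Because the cast to conductor is asynchronous, either ordering can occur, which would explain why the failure is intermittent rather than constant.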

summary: - Instance resize always fails when rescheduling
+ Instance resize intermittently fails when rescheduling
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/531405

Changed in nova:
status: Confirmed → In Progress
Changed in nova:
assignee: Ed Leafe (ed-leafe) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/531405
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ce8428f162679a52846b65ba1574090283922f2
Submitter: Zuul
Branch: master

commit 1ce8428f162679a52846b65ba1574090283922f2
Author: Ed Leafe <email address hidden>
Date: Thu Jan 4 20:43:23 2018 +0000

    Add regression test for resize failing during retries

    During testing in the alternate host series, it was found that retries
    when resizing an instance would always fail. This turned out to be true
    even before alternate hosts for resize was introduced. Further
    investigation showed that there was a race in call to retry the resize
    and the revert of the original attempt.

    This adds a functional regression test to show the failure. A follow up
    patch with the fix will modify the test to show it passing again.

    Related-Bug: 1741125

    Change-Id: I0760f3e3bc447c39a6443afd7857046a08b8f01d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/531022
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=db711bc06fcc234488b107dd22dd2a0272939174
Submitter: Zuul
Branch: master

commit db711bc06fcc234488b107dd22dd2a0272939174
Author: Ed Leafe <email address hidden>
Date: Wed Jan 3 21:38:43 2018 +0000

    Fix race condition in retrying migrations

    During testing in the alternate host series, it was found that retries
    when resizing an instance would always fail. This turned out to be true
    even before alternate hosts for resize was introduced. Further
    investigation showed that there was a race in call to retry the resize
    and the revert of the original attempt.

    This patch fixes that race condition, and updates the regression test to
    reflect the correct results.

    Blueprint: return-alternate-hosts

    Closes-Bug #1741125

    Change-Id: I12b4e1d9e50fefd81128df85d75bb34564f9da2c

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Ed Leafe (ed-leafe)