OpenStack Compute (nova)

Instance resize intermittently fails when rescheduling

Bug #1741125 reported by Ed Leafe on 2018-01-03

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
OpenStack Compute (nova)	Fix Released	High	Ed Leafe	

Bug Description

When resizing an instance, the migration task calls "replace_allocation_with_migration()", which verifies that the instance has allocations against the source compute node [0]. It then replaces the instance's allocation with the migration's, using the migration's UUID as the consumer instead of the instance's. However, if the first migration attempt fails and the resize is rescheduled, "replace_allocation_with_migration()" is called again. By then the allocations have the migration UUID as the consumer, so the check that the instance has allocations on the source compute node fails, and the migration is put into an ERROR state.

[0] https://github.com/openstack/nova/blob/f95f165b49fbc0efe29450b0e858a3ccadecedea/nova/conductor/tasks/migrate.py#L47-L48
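
To make the failure mode concrete, here is a minimal sketch of the swap described above. It is not nova's code: the allocations are modeled as a plain dict keyed by consumer UUID, and the function signature, the AllocationNotFound exception, and the "inst"/"mig"/"src" names are all assumptions made for illustration.

```python
class AllocationNotFound(Exception):
    """Hypothetical error: the instance has no allocation on the source node."""


def replace_allocation_with_migration(allocations, instance_uuid,
                                      migration_uuid, source_node):
    """Move the instance's source-node allocation to the migration consumer.

    `allocations` is a simplified stand-in for Placement:
    {consumer_uuid: {node_name: resources}}.
    """
    held = allocations.get(instance_uuid, {})
    # The check from the bug description: the instance must hold
    # allocations against the source compute node.
    if source_node not in held:
        raise AllocationNotFound(
            "%s has no allocation on %s" % (instance_uuid, source_node))
    # Swap the consumer: the migration UUID now holds the source allocation.
    allocations[migration_uuid] = {source_node: held.pop(source_node)}
    if not held:
        allocations.pop(instance_uuid, None)


if __name__ == "__main__":
    allocs = {"inst": {"src": {"VCPU": 1}}}
    replace_allocation_with_migration(allocs, "inst", "mig", "src")
    try:
        # A reschedule calls the swap again without a revert in between.
        replace_allocation_with_migration(allocs, "inst", "mig", "src")
    except AllocationNotFound as exc:
        print("second call failed:", exc)
```

In this model the second call fails because the source-node allocation is already held by the migration consumer, which matches the behavior reported here.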

Matt Riedemann (mriedem) wrote on 2018-01-03:

I suspect this is also broken for people using the CachingScheduler, since that scheduler doesn't create allocations against Placement.

Changed in nova:
status:	New → Confirmed
tags:	added: queens-rc-potential
tags:	added: placement resize

Matt Riedemann (mriedem) wrote on 2018-01-03:

When rescheduling during a resize, we should know that we're doing a reschedule because the retry list in the filter_properties will have entries for hosts that have been tried (and failed). So we can change the logic a bit in the case of a reschedule to only update the allocation for the instance from the failed node to the next alternate node that we'll try. The migration record consumer will continue to have its allocation held against the original source node.
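
A rough sketch of the reschedule check suggested in this comment follows. The helper names are hypothetical, and the shape of the retry entry (a "retry" key holding a "hosts" list of [host, node] pairs) is an assumption about filter_properties rather than something stated in this report.

```python
def tried_hosts(filter_properties):
    """Return the (host, node) pairs already attempted, if any.

    Assumes filter_properties looks like:
    {"retry": {"num_attempts": 2, "hosts": [["host1", "node1"]]}}
    """
    retry = filter_properties.get("retry") or {}
    return [tuple(entry) for entry in retry.get("hosts", [])]


def is_reschedule(filter_properties):
    """A non-empty retry host list means an earlier attempt failed."""
    return bool(tried_hosts(filter_properties))
```

With a check like this, the conductor could skip the source-node allocation swap on a reschedule and only move the instance's allocation from the failed node to the next alternate.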

Matt Riedemann (mriedem) wrote on 2018-01-03:

As for the CachingScheduler case, we'll just have to be graceful about the fact that the instance might not have any allocations against the source node, both in conductor and compute.

Matt Riedemann (mriedem) wrote on 2018-01-04:

Functional test with the recreate is here:

https://review.openstack.org/#/c/531022/

Matt Riedemann (mriedem) wrote on 2018-01-04:

I've created a separate bug for the resize + CachingScheduler issue:

https://bugs.launchpad.net/nova/+bug/1741307

Matt Riedemann (mriedem) wrote on 2018-01-05:

Ed figured out that the failure is due to a race, similar to a race we had to fix in the server create reschedule flow with alternate hosts.

When a resize fails, the compute casts to conductor to reschedule and then reverts the allocation swap done in conductor (that is, it moves the instance allocations back to the source node and deletes the migration record allocations).

The problem is that by the time compute reverts the allocation swap, conductor might already be looking for the instance allocation on the source node, fail to find it, and error out, which is what Ed was hitting.
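
The two possible orderings can be sketched with a deterministic simulation. This is an illustration only, not nova's code: the dict-based allocation model, the function names, and the "inst"/"mig"/"src"/"dest1" identifiers are assumptions, and the real race happens between separate services rather than inside one function.

```python
def revert_allocation_swap(allocations, instance_uuid, migration_uuid,
                           source_node):
    """Compute side: give the source allocation back to the instance and
    delete the migration consumer's allocation."""
    migration_alloc = allocations.pop(migration_uuid)
    allocations[instance_uuid] = {source_node: migration_alloc[source_node]}


def instance_has_source_allocation(allocations, instance_uuid, source_node):
    """Conductor side: the check that fails when it runs too early."""
    return source_node in allocations.get(instance_uuid, {})


def reschedule(order):
    """Run the two steps in the given order and return the check result."""
    # State after a failed attempt: the migration holds the source
    # allocation and the instance holds one on the failed destination.
    allocations = {"mig": {"src": {"VCPU": 1}},
                   "inst": {"dest1": {"VCPU": 1}}}
    outcome = None
    for step in order:
        if step == "revert":
            revert_allocation_swap(allocations, "inst", "mig", "src")
        elif step == "check":
            outcome = instance_has_source_allocation(
                allocations, "inst", "src")
    return outcome


if __name__ == "__main__":
    print("revert first:", reschedule(["revert", "check"]))
    print("check first:", reschedule(["check", "revert"]))
```

When the revert runs first, the check passes; when the conductor check wins the race, it finds no instance allocation on the source node, which would explain why the failure is intermittent.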

summary:

- Instance resize always fails when rescheduling
+ Instance resize intermittently fails when rescheduling

OpenStack Infra (hudson-openstack) wrote on 2018-01-05: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/531405

OpenStack Infra (hudson-openstack) on 2018-01-05

Changed in nova:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) on 2018-01-08

Changed in nova:
assignee:	Ed Leafe (ed-leafe) → Matt Riedemann (mriedem)

OpenStack Infra (hudson-openstack) wrote on 2018-01-09: Related fix merged to nova (master)

Reviewed: https://review.openstack.org/531405
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ce8428f162679a52846b65ba1574090283922f2
Submitter: Zuul
Branch: master

commit 1ce8428f162679a52846b65ba1574090283922f2
Author: Ed Leafe <email address hidden>
Date: Thu Jan 4 20:43:23 2018 +0000

Add regression test for resize failing during retries

During testing in the alternate host series, it was found that retries
when resizing an instance would always fail. This turned out to be true
even before alternate hosts for resize was introduced. Further
investigation showed that there was a race in call to retry the resize
and the revert of the original attempt.

This adds a functional regression test to show the failure. A follow up
patch with the fix will modify the test to show it passing again.

Related-Bug: 1741125

Change-Id: I0760f3e3bc447c39a6443afd7857046a08b8f01d

OpenStack Infra (hudson-openstack) wrote on 2018-01-09: Fix merged to nova (master)

Reviewed: https://review.openstack.org/531022
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=db711bc06fcc234488b107dd22dd2a0272939174
Submitter: Zuul
Branch: master

commit db711bc06fcc234488b107dd22dd2a0272939174
Author: Ed Leafe <email address hidden>
Date: Wed Jan 3 21:38:43 2018 +0000

Fix race condition in retrying migrations

This patch fixes that race condition, and updates the regression test to
reflect the correct results.

Blueprint: return-alternate-hosts

Closes-Bug #1741125

Change-Id: I12b4e1d9e50fefd81128df85d75bb34564f9da2c

Changed in nova:
status:	In Progress → Fix Released

Matt Riedemann (mriedem) on 2018-01-24

Changed in nova:
assignee:	Matt Riedemann (mriedem) → Ed Leafe (ed-leafe)
