OpenStack Compute (nova)

_update_usage_from_migrations() can end up processing stale migrations

Bug #1600304 reported by Chris Friesen on 2016-07-08

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Fix Released	Medium	Chris Friesen

Bug Description

I recently found a bug in Mitaka, and it appears to be still present in master.

I was testing a separate patch by doing resizes, and bugs in my code had resulted in a number of incomplete resizes involving compute-1. I then did a resize from compute-0 to compute-0, and saw compute-1's resource usage go up when it ran the resource audit.

This got me curious, so I went digging and discovered a gap in the current resource audit logic. The problem arises if:

1) You have one or more stale migrations which didn't complete
properly that involve the current compute node.

2) The instance from the uncompleted migration is currently doing a
resize/migration that does not involve the current compute node.

When this happens, _update_usage_from_migrations() will be passed in the stale migration, and since the instance is in fact in a resize state, the current compute node will erroneously account for the instance. (Even though the instance isn't doing anything involving the current compute node.)

The fix is to check that the instance migration ID matches the ID of the migration being analyzed. This will work because in the case of the stale migration we will have hit the error case in _pair_instances_to_migrations(), and so the instance will be lazy-loaded from the DB, ensuring that its migration ID is up-to-date.

Tags:

OpenStack Infra (hudson-openstack) on 2016-07-08

Changed in nova:
assignee:	nobody → Chris Friesen (cbf123)
status:	New → In Progress

Revision history for this message

Chris Friesen (cbf123) wrote on 2016-07-08:

Not sure why it didn't link, but https://review.openstack.org/#/c/339715/ is the proposed fix.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-10-27: Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/339715
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Jay Pipes (jaypipes) on 2016-11-07

Changed in nova:
importance:	Undecided → Medium

Revision history for this message

Matt Riedemann (mriedem) wrote on 2016-11-08:

How does one go about recreating this w/o buggy local patches? I have a hard time trusting a fix for bugs in local code out of tree.

Revision history for this message

Chris Friesen (cbf123) wrote on 2016-11-09:

I believe you could get into this case as follows:

1) There is a migration in progress and the source or dest compute node dies.
2) The migrating instance goes into an ERROR state, so you evacuate it to another compute node.
3) You recover the compute node that died.
4) You migrate the instance between two other compute nodes.
5) While the migration is in progress, the resource audit runs on the recovered compute node.

Revision history for this message

Matt Riedemann (mriedem) wrote on 2016-11-09:

I'd think that rather than the resource tracker dealing with failed migrations, we'd clean those up when the failed compute node is recovered, something like what we have here:

https://github.com/openstack/nova/blob/569e463f02b0a631d341b06849217657a22c4569/nova/compute/manager.py#L636

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-10-16: Fix merged to nova (master)

Reviewed: https://review.openstack.org/339715
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0bf9c91bb7a98d0ba8a0565d555560936262e635
Submitter: Zuul
Branch: master

commit 0bf9c91bb7a98d0ba8a0565d555560936262e635
Author: Chris Friesen <email address hidden>
Date: Fri Oct 28 03:57:37 2016 -0600

Filter out stale migrations in resource audit

When doing the resource audit there is a subtle bug in the current
code. The problem arises if:

1) You have one or more stale migrations which didn't complete
properly that involve the current compute node.

2) The instance from the uncompleted migration is currently doing a
resize/migration that does not involve the current compute node.

    When this happens, _update_usage_from_migrations() will be passed in
    the stale migration, and the instance is in fact in a resize state,
    so the current compute node will erroneously account for the instance.

    The fix is to check that the instance migration ID matches the ID
    of the migration being analyzed. This will work because in the case
    of the stale migration we will have hit the error case in
    _pair_instances_to_migrations(), and so the instance will be
    lazy-loaded from the DB, ensuring that its migration ID is up-to-date.

    If the IDs don't match, we'll set the migration status to "error" (to
    prevent retrieving that migration the next time) and skip updating
    the usage from the stale migration.

Closes-Bug: #1600304
Change-Id: I6f5ad01cb1392db3e2b71e322c5be353de9071a2

Changed in nova:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-10-24: Fix included in openstack/nova 17.0.0.0b1

This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.