Inconsistencies with resource tracking in the case of resize operation.
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Compute (nova) | Confirmed | Medium | Unassigned | |
Bug Description
All of these are reported from code inspection. I have yet to confirm any of them, since they are edge cases and subtle race conditions:
* We update the instance.host field to the value of the destination_node in resize_migration which runs on the source host. (https:/
* There is a very similar race in the revert_resize path, as described in the following comment (https:/
* The drop_move_claim method makes sense only when called on the source node, so its name should be changed to reflect that. It's really an optimization where we free the resources sooner than the next RT pass, which will not see the migration as in progress. This should be documented better.
* drop_move_claim looks up the new_flavor to compare it with the flavor that was used to track the migration, but on the source node that is certain to be the old_flavor. Thus, as it stands now, drop_move_claim (only run on source nodes) doesn't do anything. Not a big deal, but we should probably fix it.
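The last item can be illustrated with a toy model. This is only a sketch of the comparison described above, not nova's code: `ToyTracker`, `track_migration`, `drop_claim` and the `Flavor` fields are hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    id: int
    vcpus: int


class ToyTracker:
    """Hypothetical stand-in for a per-node resource tracker."""

    def __init__(self, total_vcpus):
        self.free_vcpus = total_vcpus
        self.tracked = {}  # instance uuid -> Flavor used to track the move

    def track_migration(self, uuid, flavor):
        # Record which flavor the usage was accounted with.
        self.tracked[uuid] = flavor
        self.free_vcpus -= flavor.vcpus

    def drop_claim(self, uuid, flavor):
        # Free usage only if the caller's flavor matches the tracked one.
        tracked = self.tracked.get(uuid)
        if tracked is None or tracked.id != flavor.id:
            return False  # mismatch: silently does nothing
        del self.tracked[uuid]
        self.free_vcpus += tracked.vcpus
        return True


old_flavor = Flavor(id=1, vcpus=2)
new_flavor = Flavor(id=2, vcpus=4)

source = ToyTracker(total_vcpus=8)
source.track_migration("inst-1", old_flavor)  # source node tracks old_flavor

dropped_with_new = source.drop_claim("inst-1", new_flavor)  # no-op
free_after_noop = source.free_vcpus
dropped_with_old = source.drop_claim("inst-1", old_flavor)  # frees usage
```

In this model the source node only releases the usage when the drop is made with the flavor it actually tracked, which is the old one.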
tags: | added: compute resize resource-tracker |
Changed in nova: | |
importance: | Undecided → Medium |
status: | New → Confirmed |
Changed in nova: | |
assignee: | nobody → Chris Martin (cm876n) |
Just thought I'd mention that I just finished investigating an issue that turned out to be the first item above, so it's a practical problem rather than theoretical.
We had a race (in kilo, but with very similar code to what is in liberty) between instances being migrated that are in the RESIZE_MIGRATED state (so the host/node have been updated but the numa_topology is stale) and the resource audit running on the destination.
The audit sees the instance and processes it in _update_usage_from_instances(), but using the stale instance.numa_topology, thus possibly accounting for the wrong host CPUs.
We've just submitted a local workaround that modifies _update_usage_from_instances() to ignore instances with a task_state of RESIZE_MIGRATED, so that they get handled by _update_usage_from_migrations(). So far it seems to help.
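For anyone wanting the idea without our tree, here is a rough, self-contained sketch of the filtering. It is not the actual patch: the `Instance` class, the `pinned_cpus` field, and both helper functions are hypothetical, and the string value of `RESIZE_MIGRATED` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed string value of the task state; check it against your tree.
RESIZE_MIGRATED = "resize_migrated"


@dataclass
class Instance:
    uuid: str
    task_state: Optional[str]
    pinned_cpus: frozenset  # stand-in for the CPUs named in numa_topology


def instances_for_usage_pass(instances):
    """Skip instances whose host/node is updated but whose topology is stale."""
    return [i for i in instances if i.task_state != RESIZE_MIGRATED]


def pinned_cpus_in_use(instances):
    """Accumulate host CPUs from the instances the usage pass still handles."""
    used = set()
    for inst in instances_for_usage_pass(instances):
        used |= inst.pinned_cpus
    return used


instances = [
    Instance("a", None, frozenset({0, 1})),
    Instance("b", RESIZE_MIGRATED, frozenset({4, 5})),  # stale topology
]
used = pinned_cpus_in_use(instances)
```

The skipped instance is not lost in this scheme; the point is that it is left for the migration-based accounting instead of being counted with stale data.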