Inconsistencies in resource tracking during resize operations

Bug #1498126 reported by Nikola Đipanov
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

All of these were found by code inspection - I have yet to confirm them, as they are edge cases and subtle race conditions:

* We update the instance.host field to the value of the destination_node in resize_migration, which runs on the source host. (https://github.com/openstack/nova/blob/1df8248b6ad7982174c417abf80070107eac8909/nova/compute/manager.py#L3750) This means that between that DB write and the changing of the flavor and applying of the migration context (which happens in finish_resize, run on the destination host), all resource-tracking runs on the destination host will be wrong: they will use the instance record and thus the wrong flavor.

* There is very similar raciness in the revert_resize path, as described in the following comment (https://github.com/openstack/nova/blob/1df8248b6ad7982174c417abf80070107eac8909/nova/compute/manager.py#L3448) - we should fix that too.

* The drop_move_claim method makes sense only when called on the source node, so its name should reflect that. It is really an optimization that frees the resources sooner than the next RT pass, which will no longer see the migration as in progress. This should be documented better.

* drop_move_claim looks up the new_flavor to compare it with the flavor that was used to track the migration, but on the source node it is certain to be the old_flavor. Thus, as it stands now, drop_move_claim (only run on source nodes) doesn't do anything. Not a big deal, but we should probably fix it.
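The window described in the first bullet can be reduced to a toy sketch. This is a minimal model with hypothetical names (Instance, tracked_usage), not nova's actual resource tracker:

```python
# Toy model of the race: resize_migration (on the source host) flips
# instance.host to the destination BEFORE finish_resize (on the destination
# host) applies the new flavor. Any resource-tracking pass on the
# destination in that window accounts the instance with the old flavor.

class Instance:
    def __init__(self, host, flavor):
        self.host = host
        self.flavor = flavor  # dict of resource amounts

def tracked_usage(instances, host):
    """Sum memory for instances on `host`, as a naive tracker would."""
    return sum(i.flavor["memory_mb"] for i in instances if i.host == host)

old = {"memory_mb": 512}
new = {"memory_mb": 1024}
inst = Instance(host="source", flavor=old)

# resize_migration on the source: the host field flips first...
inst.host = "dest"
# ...a periodic tracking pass on "dest" now runs before finish_resize:
assert tracked_usage([inst], "dest") == 512   # wrong: should be 1024

# finish_resize on the destination finally applies the new flavor:
inst.flavor = new
assert tracked_usage([inst], "dest") == 1024  # correct only from here on
```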

tags: added: compute resize resource-tracker
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Chris Friesen (cbf123) wrote :

Just thought I'd mention that I just finished investigating an issue that turned out to be the first item above, so it's a practical problem rather than a theoretical one.

We had a race (in kilo, but with very similar code to what is in liberty) between instances being migrated that are in the RESIZE_MIGRATED state (so the host/node have been updated but the numa_topology is stale) and the resource audit running on the destination.

The audit sees the instance and processes it in _update_usage_from_instances() but using the stale instance.numa_topology, thus possibly accounting for the wrong host CPUs.

We've just submitted a local workaround that modifies _update_usage_from_instances() to ignore instances with a task_state of RESIZE_MIGRATED (so that they get handled by _update_usage_from_migrations()). So far it seems to help.

Chris Martin (cm876n)
Changed in nova:
assignee: nobody → Chris Martin (cm876n)
Revision history for this message
Anusha Unnam (anusha-unnam) wrote :

As the assignee has not submitted a patch for a long time, removing the assignee.

Changed in nova:
assignee: Chris Martin (cm876n) → nobody
Revision history for this message
Matt Riedemann (mriedem) wrote :

I think the fourth item in the description of the bug no longer applies, see bug 1818914.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/641521

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/641521
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=54877e06f13b13134f2030557cd779f947a43c24
Submitter: Zuul
Branch: master

commit 54877e06f13b13134f2030557cd779f947a43c24
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 6 18:46:22 2019 -0500

    Add functional recreate test for bug 1818914

    The confirm resize flow in the compute manager
    runs on the source host. It calls RT.drop_move_claim
    to drop resource usage from the source host for the
    old flavor. The problem with drop_move_claim is it
    only decrements the old flavor from the reported usage
    if the instance is in RT.tracked_migrations, which will
    only be there on the source host if the update_available_resource
    periodic task runs before the resize is confirmed, otherwise
    the instance is still just tracked in RT.tracked_instances on
    the source host. This leaves the source compute incorrectly
    reporting resource usage for the old flavor until the next
    periodic runs, which could be a large window if resizes are
    configured to automatically confirm, e.g. resize_confirm_window=1,
    and the periodic interval is big, e.g. update_resources_interval=600.

    This change adds a functional recreate test for the bug which will
    be updated in the change that fixes the bug.

    Change-Id: I4aac187283c2f341b5c2712be85f722156e14f63
    Related-Bug: #1818914
    Related-Bug: #1498126
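The failure mode the commit message describes can be reduced to a small sketch. These are hypothetical structures standing in for nova's real tracker state, not its actual code:

```python
# Why drop_move_claim could be a no-op: it only subtracts usage when the
# instance appears in tracked_migrations, which on the source host requires
# the update_available_resource periodic task to have run between the
# resize and its confirmation.

def drop_move_claim(tracked_migrations, tracked_instances, usage,
                    uuid, old_mb):
    """Pre-fix behavior: decrement usage only via tracked_migrations."""
    if uuid in tracked_migrations:
        return usage - old_mb
    # Instance only in tracked_instances: usage stays stale until the
    # next periodic task runs.
    return usage

usage = 512
# Resize confirmed before any periodic ran: the migration was never tracked.
usage = drop_move_claim(tracked_migrations=set(),
                        tracked_instances={"i-1"},
                        usage=usage, uuid="i-1", old_mb=512)
assert usage == 512  # source still reports the old flavor's usage
```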

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/641806
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ad9f37350ad1f4e598a9a5df559b9160db1a11c1
Submitter: Zuul
Branch: master

commit ad9f37350ad1f4e598a9a5df559b9160db1a11c1
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 7 16:07:18 2019 -0500

    Update usage in RT.drop_move_claim during confirm resize

    The confirm resize flow in the compute manager
    runs on the source host. It calls RT.drop_move_claim
    to drop resource usage from the source host for the
    old flavor. The problem with drop_move_claim is it
    only decrements the old flavor from the reported usage
    if the instance is in RT.tracked_migrations, which will
    only be there on the source host if the update_available_resource
    periodic task runs before the resize is confirmed, otherwise
    the instance is still just tracked in RT.tracked_instances on
    the source host. This leaves the source compute incorrectly
    reporting resource usage for the old flavor until the next
    periodic runs, which could be a large window if resizes are
    configured to automatically confirm, e.g. resize_confirm_window=1,
    and the periodic interval is big, e.g. update_resources_interval=600.

    This fixes the issue by also updating usage in drop_move_claim
    when the instance is not in tracked_migrations but is in
    tracked_instances.

    Because of the tight coupling with the instance.migration_context
    we need to ensure the migration_context still exists before
    drop_move_claim is called during confirm_resize, so a test wrinkle
    is added to enforce that.

    test_drop_move_claim_on_revert also needed some updating for
    reality because of how drop_move_claim is called during
    revert_resize.

    And finally, the functional recreate test is updated to show the
    bug is fixed.

    Change-Id: Ia6d8a7909081b0b856bd7e290e234af7e42a2b38
    Closes-Bug: #1818914
    Related-Bug: #1641750
    Related-Bug: #1498126
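The essence of the fix can be sketched like this. Names and structures are hypothetical; the real change is in the review linked above:

```python
# Sketch of the fix: drop_move_claim also drops usage when the instance is
# only in tracked_instances, instead of silently doing nothing, so the
# source host's usage is corrected without waiting for the next periodic.

def drop_move_claim(tracked_migrations, tracked_instances, usage,
                    uuid, old_mb):
    if uuid in tracked_migrations:
        tracked_migrations.discard(uuid)
        return usage - old_mb
    if uuid in tracked_instances:
        # New branch: the periodic never saw the migration, but we can
        # still free the old flavor's resources immediately.
        tracked_instances.discard(uuid)
        return usage - old_mb
    return usage

usage = drop_move_claim(tracked_migrations=set(),
                        tracked_instances={"i-1"},
                        usage=512, uuid="i-1", old_mb=512)
assert usage == 0  # corrected immediately on confirm
```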

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/665138

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/665253
