Comment 17 for bug 1879878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/747745
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a57800d3825ef2fb833d8024c6f7e5c550f55e2f
Submitter: Zuul
Branch: master

commit a57800d3825ef2fb833d8024c6f7e5c550f55e2f
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 16:54:16 2020 +0100

    Move confirm resize under semaphore

    The 'ResourceTracker.update_available_resource' periodic task builds
    usage information for the current host by inspecting instances and
    in-progress migrations, combining the two. Specifically, it finds all
    instances that are not in the 'DELETED' or 'SHELVED_OFFLOADED' state,
    calculates the usage from these, then finds all in-progress migrations
    for the host that don't have an associated instance (to prevent double
    accounting) and includes the usage for these.

    In addition to the periodic task, the 'ResourceTracker' class has a
    number of helper functions to make or drop claims for the inventory
    generated by the 'update_available_resource' periodic task as part of
    the various instance operations. These helpers naturally assume that
    when making a claim for a particular instance or migration, there
    shouldn't already be resources allocated for same. Conversely, when
    dropping claims, the resources should currently be allocated. However,
    the check for *active* instances and *in-progress* migrations in the
    periodic task means we have to be careful in how we make changes to a
    given instance or migration record. Running the periodic task between
    such an operation and an attempt to make or drop a claim can result in
    TOCTOU-like races.

    This generally isn't an issue: we use the 'COMPUTE_RESOURCE_SEMAPHORE'
    semaphore to prevent the periodic task running while we're claiming
    resources in helpers like 'ResourceTracker.instance_claim' and we make
    our changes to the instances and migrations within this context. There
    is one exception though: the 'drop_move_claim' helper. This function is
    used when dropping a claim for either a cold migration, a resize or a
    live migration, and will drop usage from either the source host (based
    on the "old" flavor) for a resize confirm or the destination host (based
    on the "new" flavor) for a resize revert or live migration rollback.
    Unfortunately, while the function itself is wrapped in the semaphore, no
    changes to the state or the instance or migration in question are
    protected by it.

    Consider the confirm resize case, which we're addressing here. If we
    mark the migration as 'confirmed' before running 'drop_move_claim', then
    the periodic task running between these steps will not account for the
    usage on the source since the migration is allegedly 'confirmed'. The
    call to 'drop_move_claim' will then result in the tracker dropping usage
    that we're no longer accounting for. This "set migration status before
    dropping usage" is the current behaviour for both same-cell and
    cross-cell resize, via the 'ComputeManager.confirm_resize' and
    'ComputeManager.confirm_snapshot_based_resize_at_source' functions,
    respectively. We could reverse those calls and run 'drop_move_claim'
    before marking the migration as 'confirmed', but while our usage will be
    momentarily correct, the periodic task running between these steps will
    re-add the usage we just dropped since the migration isn't yet
    'confirmed'. The correct solution is to close this gap between setting
    the migration status and dropping the move claim to zero. We do this by
    putting both operations behind the 'COMPUTE_RESOURCE_SEMAPHORE', just
    like the claim operations.

    Change-Id: I26b050c402f5721fc490126e9becb643af9279b4
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878