VM becomes ERROR after confirming resize, with error CPUUnpinningInvalid on source node

Bug #1879878 reported by kevinzhao
This bug affects 4 people
Affects: OpenStack Compute (nova)
  Importance: Medium, Assigned to: Stephen Finucane
Series: Train
  Importance: Undecided, Assigned to: Stephen Finucane
Series: Ussuri
  Importance: Undecided, Assigned to: Stephen Finucane

Bug Description

Description
===========

In my environment, it takes some time to clean up the VM on the source node when confirming a resize.
While the resize is being confirmed, the periodic task update_available_resource may update resource usage at the same time,
which can produce an error like:
CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []
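The exception comes from a subset check in the NUMA cell object's unpin operation (nova/objects/numa.py): the CPUs being unpinned must all currently be pinned. A minimal standalone sketch of that invariant (a toy class for illustration, not nova's actual object):

```python
class CPUUnpinningInvalid(Exception):
    pass


class ToyNUMACell:
    """Toy illustration of the pin/unpin invariant; not nova's real class."""

    def __init__(self):
        self.pinned_cpus = set()

    def pin_cpus(self, cpus):
        self.pinned_cpus |= set(cpus)

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus.issubset(self.pinned_cpus):
            # This is the condition behind the reported error.
            raise CPUUnpinningInvalid(
                "CPU set to unpin %s must be a subset of pinned CPU set %s"
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus


cell = ToyNUMACell()
cell.pin_cpus([1, 2, 17, 18])
cell.unpin_cpus([1, 2, 17, 18])      # first drop succeeds; the set is now empty
try:
    cell.unpin_cpus([1, 2, 17, 18])  # a second drop reproduces the error
except CPUUnpinningInvalid as exc:
    print(exc)
```

Dropping the same usage twice, which is effectively what the race described above causes, trips this check.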

Steps to reproduce
==================
* On the compute nodes, set "update_resources_interval" in /etc/nova/nova.conf to a small value, say 30 seconds. This increases the probability of hitting the error.
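For example (a config fragment; the 30-second value is only illustrative):

```ini
[DEFAULT]
update_resources_interval = 30
```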

* Create a "dedicated" VM; the flavor can be:
+----------------------------+--------------------------------------+
| Property | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 80 |
| extra_specs | {"hw:cpu_policy": "dedicated"} |
| id | 2be0f830-c215-4018-a96a-bee3e60b5eb1 |
| name | 4vcpu.4mem.80ssd.0eph.numa |
| os-flavor-access:is_public | True |
| ram | 4096 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 4 |
+----------------------------+--------------------------------------+

* Resize the VM with a new flavor to another node.

* Confirm the resize.
Make sure it takes some time to undefine the VM on the source node; taking 30 seconds will make the failure inevitable.

* You will then see the ERROR notice on the dashboard, and the VM becomes ERROR.

Expected result
===============
The VM is resized successfully and its state is ACTIVE.

Actual result
=============

* The VM becomes ERROR

* On the dashboard you can see this notice:
Please try again later [Error: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []].

Environment
===========
1. Exact version of OpenStack you are running.

  Newton, with patch https://review.opendev.org/#/c/641806/21 applied.
  I am sure it will also happen on newer versions that include https://review.opendev.org/#/c/641806/21,
  such as Train and Ussuri.

2. Which hypervisor did you use?
   Libvirt + KVM

3. Which storage type did you use?
   local disk

4. Which networking type did you use?
   Neutron with Open vSwitch

Logs & Configs
==============

ERROR log on source node
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [req-364606bb-9fa6-41db-a20e-6df9ff779934 b0887a73f3c1441686bf78944ee284d0 95262f1f45f14170b91cd8054bb36512 - - -] [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Setting instance vm_state to ERROR
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Traceback (most recent call last):
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6661, in _error_out_instance_on_exception
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] yield
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3444, in _confirm_resize
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] prefix='old_')
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] return f(*args, **kwargs)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 379, in drop_move_claim
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] self._update_usage(usage, sign=-1)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 724, in _update_usage
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] self.compute_node, usage, free)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1542, in get_host_numa_usage_from_instance
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] host_numa_topology, instance_numa_topology, free=free))
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1409, in numa_usage_from_instances
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] newcell.unpin_cpus(pinned_cpus)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 95, in unpin_cpus
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] pinned=list(self.pinned_cpus))
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]

kevinzhao (kego)
Changed in nova:
assignee: nobody → kevinzhao (kego)
kevinzhao (kego)
tags: added: resize
Revision history for this message
sean mooney (sean-k-mooney) wrote :

This looks like the resize just hit a race condition.
When using CPU pinning, CPUs are claimed in the compute host NUMA topology blob rather than in placement,
so it is perfectly feasible for the claim on the compute node to fail even if the scheduler and placement
claims passed.

Marking this as Incomplete, as we don't know which release of nova this is being reported against.
If it's pre-Train, this is invalid, as we expect this to race. If it's Train or later, then we should see
why we are not doing the claim as part of the resize confirm step; we should be, in which case this may be valid.

In general, however, we cannot fully resolve this until we have NUMA in placement, as there will always be a race between the scheduler and the compute node resource tracker. There should not be one between resize confirm and the periodic task even today, though,
unless the VM raced with another VM for the resources on the host.

Changed in nova:
status: New → Incomplete
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Also, can you confirm whether you are using vcpu_pin_set or cpu_dedicated_set to specify the CPUs to use for pinning?

If you are on Train or later and using cpu_dedicated_set, placement should not allow this edge case to happen. If you are using vcpu_pin_set with the fallback mechanism enabled, then it is still possible
to hit this race on Train.

tags: added: numa
Revision history for this message
sean mooney (sean-k-mooney) wrote :

http://paste.openstack.org/show/795679/
I was able to reproduce this once on master, but not reliably yet,
so I'm moving this to Confirmed.

We also have a downstream report of this on Train:
https://bugzilla.redhat.com/show_bug.cgi?id=1850400
I'll add that to the affected versions.

I am setting the importance to Medium, as this seems quite hard to trigger: all but 1 of the 10-12 attempts I made failed to reproduce it, so I think it will be hit rarely.

When this happens, the VM is left running on the target host. Stopping the VM and starting it again restores it to an ACTIVE state.

Changed in nova:
importance: Undecided → Medium
status: Incomplete → Confirmed
Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

Thanks for the great reproduction steps. I am able to reproduce this reliably on master by slowing down the resize confirm step in the virt driver via a simple sleep, with a timeout > the default periodic task interval (currently 60 seconds). We should be able to reproduce in a functional test by manually running periodic tasks between the API request to confirm the resize and the driver cleaning things up.
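The interleaving behind this reproduction can be modelled deterministically without nova at all: clear the stashed flavor, run the periodic accounting pass while the "virt driver" is busy, then drop the claim. All names below are illustrative stand-ins, not nova's real API:

```python
class ToyTracker:
    """Toy model of the resource tracker race; not nova's real code."""

    def __init__(self):
        # The migration record stashes the old flavor's pinned CPUs, and
        # the tracker accounts for them on the source host.
        self.migration_flavor = {"pinned": {1, 2, 17, 18}}
        self.tracked = {1, 2, 17, 18}

    def periodic_update(self):
        # update_available_resource rebuilds usage from scratch; a
        # migration with no stashed flavor is ignored, so its CPUs
        # silently vanish from the tracked set.
        if self.migration_flavor is None:
            self.tracked = set()

    def drop_move_claim(self, cpus):
        # Unpinning CPUs that are no longer tracked is the crash.
        if not cpus.issubset(self.tracked):
            raise RuntimeError(
                "CPU set to unpin %s must be a subset of pinned CPU set %s"
                % (sorted(cpus), sorted(self.tracked)))
        self.tracked -= cpus


tracker = ToyTracker()
cpus = set(tracker.migration_flavor["pinned"])
tracker.migration_flavor = None  # manager clears the stashed flavor first
tracker.periodic_update()        # periodic task fires during the slow virt call
try:
    tracker.drop_move_claim(cpus)
except RuntimeError as exc:
    print(exc)
```

Running the periodic pass before the flavor is cleared, or after the claim has been dropped, succeeds; only this interleaving fails, which is why the bug is timing-dependent.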

Changed in nova:
assignee: kevinzhao (kego) → Stephen Finucane (stephenfinucane)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/744950

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744958

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/744950
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=10f0a42de162c90c701f70c9c28dc31bfada87db
Submitter: Zuul
Branch: master

commit 10f0a42de162c90c701f70c9c28dc31bfada87db
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:11:59 2020 +0100

    tests: Add reproducer for bug #1879878

    When one resizes a pinned instance, the instance claims host CPUs for
    pinning purposes on the destination. However, the host CPUs on the
    source are not immediately relinquished. Rather, they are held by the
    migration record, to handle the event that the resize is reverted. It is
    only when one confirms this resize that the old cores are finally
    relinquished.

    It appears there is a potential race between the resource tracker's
    periodic task and the freeing of these resources, resulting in attempts
    to unpin host cores that have already been unpinned. This test
    highlights that bug pending a fix.

    Change-Id: Ie092628ac71eb87c9dfa7220255a2953ada9e04d
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1879878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/747745

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/747746

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/741995
Reason: Abandoning this in favor of https://review.opendev.org/#/c/744958/8 which is the same code change but with a better commit message that provides context on a bug the change also addresses.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/744958
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=44376d2e212e0f9405a58dc7fc4d5b38d70ac42e
Submitter: Zuul
Branch: master

commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/749713

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/750186

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/750187

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/750186
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=27b37ed5c85464fd7d0a9aa43bf1d491496c7915
Submitter: Zuul
Branch: master

commit 27b37ed5c85464fd7d0a9aa43bf1d491496c7915
Author: Stephen Finucane <email address hidden>
Date: Tue Sep 8 09:54:38 2020 +0100

    Expand generic reproducer for bug #1879878

    This shows that this bug doesn't affect cross-cell resize in the same
    way, but does highlight some gaps we need to close.

    Change-Id: Iee2c5521f41a2d061d21d4556cfbf7a898ad96c9
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1879878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/750187
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b2fbaa87679629dfb0be7a76ee7dc57980e21dcb
Submitter: Zuul
Branch: master

commit b2fbaa87679629dfb0be7a76ee7dc57980e21dcb
Author: Stephen Finucane <email address hidden>
Date: Mon Sep 7 11:27:30 2020 +0100

    Set 'old_flavor', 'new_flavor' on source before resize

    Cross-cell resize is confusing. We need to set this information ahead of
    time.

    Change-Id: I5a403c072b9f03074882b552e1925f22cb5b15b6
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/747745
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a57800d3825ef2fb833d8024c6f7e5c550f55e2f
Submitter: Zuul
Branch: master

commit a57800d3825ef2fb833d8024c6f7e5c550f55e2f
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 16:54:16 2020 +0100

    Move confirm resize under semaphore

    The 'ResourceTracker.update_available_resource' periodic task builds
    usage information for the current host by inspecting instances and
    in-progress migrations, combining the two. Specifically, it finds all
    instances that are not in the 'DELETED' or 'SHELVED_OFFLOADED' state,
    calculates the usage from these, then finds all in-progress migrations
    for the host that don't have an associated instance (to prevent double
    accounting) and includes the usage for these.

    In addition to the periodic task, the 'ResourceTracker' class has a
    number of helper functions to make or drop claims for the inventory
    generated by the 'update_available_resource' periodic task as part of
    the various instance operations. These helpers naturally assume that
    when making a claim for a particular instance or migration, there
    shouldn't already be resources allocated for same. Conversely, when
    dropping claims, the resources should currently be allocated. However,
    the check for *active* instances and *in-progress* migrations in the
    periodic task means we have to be careful in how we make changes to a
    given instance or migration record. Running the periodic task between
    such an operation and an attempt to make or drop a claim can result in
    TOCTOU-like races.

    This generally isn't an issue: we use the 'COMPUTE_RESOURCE_SEMAPHORE'
    semaphore to prevent the periodic task running while we're claiming
    resources in helpers like 'ResourceTracker.instance_claim' and we make
    our changes to the instances and migrations within this context. There
    is one exception though: the 'drop_move_claim' helper. This function is
    used when dropping a claim for either a cold migration, a resize or a
    live migration, and will drop usage from either the source host (based
    on the "old" flavor) for a resize confirm or the destination host (based
    on the "new" flavor) for a resize revert or live migration rollback.
    Unfortunately, while the function itself is wrapped in the semaphore, no
    changes to the state or the instance or migration in question are
    protected by it.

    Consider the confirm resize case, which we're addressing here. If we
    mark the migration as 'confirmed' before running 'drop_move_claim', then
    the periodic task running between these steps will not account for the
    usage on the source since the migration is allegedly 'confirmed'. The
    call to 'drop_move_claim' will then result in the tracker dropping usage
    that we're no longer accounting for. This "set migration status before
    dropping usage" is the current behaviour for both same-cell and
    cross-cell resize, via the 'ComputeManager.confirm_resize' and
    'ComputeManager.confirm_snapshot_based_resize_at_source' function...

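The locking pattern described in the commit message above can be sketched with a plain threading lock standing in for nova's COMPUTE_RESOURCE_SEMAPHORE: the migration-status change and the usage drop become one atomic unit that the periodic task cannot interleave with. Names here are illustrative, not nova's real code:

```python
import threading

LOCK = threading.Lock()  # stand-in for nova's COMPUTE_RESOURCE_SEMAPHORE


class State:
    def __init__(self):
        self.migration_status = "confirming"
        self.tracked = {1, 2, 17, 18}


def periodic_update(state):
    # The periodic task takes the same lock before rebuilding usage, so
    # it sees either the pre-confirm or the post-drop state, never the
    # half-finished one.
    with LOCK:
        if state.migration_status == "confirmed":
            state.tracked = set()


def confirm_resize(state, cpus):
    # Fixed ordering: the status change and the claim drop happen as one
    # atomic unit under the lock.
    with LOCK:
        state.migration_status = "confirmed"
        assert cpus.issubset(state.tracked)
        state.tracked -= cpus


state = State()
confirm_resize(state, {1, 2, 17, 18})
periodic_update(state)
print(state.tracked)  # set()
```

The same idea is applied to the revert resize operation in the follow-up change described later in this report.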

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/747746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc9c7a5ebf11253f86127238d33dff7401465155
Submitter: Zuul
Branch: master

commit dc9c7a5ebf11253f86127238d33dff7401465155
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 17:43:36 2020 +0100

    Move revert resize under semaphore

    As discussed in change I26b050c402f5721fc490126e9becb643af9279b4, the
    resource tracker's periodic task is reliant on the status of migrations
    to determine whether to include usage from these migrations in the
    total, and races between setting the migration status and decrementing
    resource usage via 'drop_move_claim' can result in incorrect usage.
    That change tackled the confirm resize operation. This one changes the
    revert resize operation, and is a little trickier due to kinks in how
    both the same-cell and cross-cell resize revert operations work.

    For same-cell resize revert, the 'ComputeManager.revert_resize'
    function, running on the destination host, sets the migration status to
    'reverted' before dropping the move claim. This exposes the same race
    that we previously saw with the confirm resize operation. It then calls
    back to 'ComputeManager.finish_revert_resize' on the source host to boot
    up the instance itself. This is kind of weird, because, even ignoring
    the race, we're marking the migration as 'reverted' before we've done
    any of the necessary work on the source host.

    The cross-cell resize revert splits dropping of the move claim and
    setting of the migration status between the source and destination host
    tasks. Specifically, we do cleanup on the destination and drop the move
    claim first, via 'ComputeManager.revert_snapshot_based_resize_at_dest'
    before resuming the instance and setting the migration status on the
    source via
    'ComputeManager.finish_revert_snapshot_based_resize_at_source'. This
    would appear to avoid the weird quirk of same-cell migration, however,
    in typical weird cross-cell fashion, these are actually different
    instances and different migration records.

    The solution is once again to move the setting of the migration status
    and the dropping of the claim under 'COMPUTE_RESOURCE_SEMAPHORE'. This
    introduces the weird setting of migration status before completion to
    the cross-cell resize case and perpetuates it in the same-cell case, but
    this seems like a suitable compromise to avoid attempts to do things
    like unplugging already unplugged PCI devices or unpinning already
    unpinned CPUs. From an end-user perspective, instance state changes are
    what really matter and once a revert is completed on the destination
    host and the instance has been marked as having returned to the source
    host, hard reboots can help us resolve any remaining issues.

    Change-Id: I29d6f4a78c0206385a550967ce244794e71cef6d
    Signed-off-by: Stephen Finucane <email address hidden>
    Closes-Bug: #1879878

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751349

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751352

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751353

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751354

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/751365

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/751367

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/train
Review: https://review.opendev.org/751368

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/train
Review: https://review.opendev.org/751369

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/751349
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8ffaac493288c73badfa4f1ec6021ecb4f3137b7
Submitter: Zuul
Branch: stable/ussuri

commit 8ffaac493288c73badfa4f1ec6021ecb4f3137b7
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:11:59 2020 +0100

    tests: Add reproducer for bug #1879878

    When one resizes a pinned instance, the instance claims host CPUs for
    pinning purposes on the destination. However, the host CPUs on the
    source are not immediately relinquished. Rather, they are held by the
    migration record, to handle the event that the resize is reverted. It is
    only when one confirms this resize that the old cores are finally
    relinquished.

    It appears there is a potential race between the resource tracker's
    periodic task and the freeing of these resources, resulting in attempts
    to unpin host cores that have already been unpinned. This test
    highlights that bug pending a fix.

    Change-Id: Ie092628ac71eb87c9dfa7220255a2953ada9e04d
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1879878
    (cherry picked from commit 10f0a42de162c90c701f70c9c28dc31bfada87db)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/751352
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ce95af2caf69cb1b650459718fd4fa5f00ff28f5
Submitter: Zuul
Branch: stable/ussuri

commit ce95af2caf69cb1b650459718fd4fa5f00ff28f5
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914
    (cherry picked from commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/751353
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4fcada57d6c569a3b9c8094cdfb37b77ec2a8cb5
Submitter: Zuul
Branch: stable/ussuri

commit 4fcada57d6c569a3b9c8094cdfb37b77ec2a8cb5
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 16:54:16 2020 +0100

    Move confirm resize under semaphore

    The 'ResourceTracker.update_available_resource' periodic task builds
    usage information for the current host by inspecting instances and
    in-progress migrations, combining the two. Specifically, it finds all
    instances that are not in the 'DELETED' or 'SHELVED_OFFLOADED' state,
    calculates the usage from these, then finds all in-progress migrations
    for the host that don't have an associated instance (to prevent double
    accounting) and includes the usage for these.

    In addition to the periodic task, the 'ResourceTracker' class has a
    number of helper functions to make or drop claims for the inventory
    generated by the 'update_available_resource' periodic task as part of
    the various instance operations. These helpers naturally assume that
    when making a claim for a particular instance or migration, there
    shouldn't already be resources allocated for same. Conversely, when
    dropping claims, the resources should currently be allocated. However,
    the check for *active* instances and *in-progress* migrations in the
    periodic task means we have to be careful in how we make changes to a
    given instance or migration record. Running the periodic task between
    such an operation and an attempt to make or drop a claim can result in
    TOCTOU-like races.

    This generally isn't an issue: we use the 'COMPUTE_RESOURCE_SEMAPHORE'
    semaphore to prevent the periodic task running while we're claiming
    resources in helpers like 'ResourceTracker.instance_claim' and we make
    our changes to the instances and migrations within this context. There
    is one exception though: the 'drop_move_claim' helper. This function is
    used when dropping a claim for either a cold migration, a resize or a
    live migration, and will drop usage from either the source host (based
    on the "old" flavor) for a resize confirm or the destination host (based
    on the "new" flavor) for a resize revert or live migration rollback.
    Unfortunately, while the function itself is wrapped in the semaphore, no
    changes to the state or the instance or migration in question are
    protected by it.

    Consider the confirm resize case, which we're addressing here. If we
    mark the migration as 'confirmed' before running 'drop_move_claim', then
    the periodic task running between these steps will not account for the
    usage on the source since the migration is allegedly 'confirmed'. The
    call to 'drop_move_claim' will then result in the tracker dropping usage
    that we're no longer accounting for. This "set migration status before
    dropping usage" is the current behaviour for both same-cell and
    cross-cell resize, via the 'ComputeManager.confirm_resize' and
    'ComputeManager.confirm_snapshot_based_resize_at_source' f...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/751354
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=79a467e1cfa05b1c281e592e69549716c11892e5
Submitter: Zuul
Branch: stable/ussuri

commit 79a467e1cfa05b1c281e592e69549716c11892e5
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 17:43:36 2020 +0100

    Move revert resize under semaphore

    As discussed in change I26b050c402f5721fc490126e9becb643af9279b4, the
    resource tracker's periodic task is reliant on the status of migrations
    to determine whether to include usage from these migrations in the
    total, and races between setting the migration status and decrementing
    resource usage via 'drop_move_claim' can result in incorrect usage.
    That change tackled the confirm resize operation. This one changes the
    revert resize operation, and is a little trickier due to kinks in how
    both the same-cell and cross-cell resize revert operations work.

    For same-cell resize revert, the 'ComputeManager.revert_resize'
    function, running on the destination host, sets the migration status to
    'reverted' before dropping the move claim. This exposes the same race
    that we previously saw with the confirm resize operation. It then calls
    back to 'ComputeManager.finish_revert_resize' on the source host to boot
    up the instance itself. This is kind of weird, because, even ignoring
    the race, we're marking the migration as 'reverted' before we've done
    any of the necessary work on the source host.

    The cross-cell resize revert splits dropping of the move claim and
    setting of the migration status between the source and destination host
    tasks. Specifically, we do cleanup on the destination and drop the move
    claim first, via 'ComputeManager.revert_snapshot_based_resize_at_dest'
    before resuming the instance and setting the migration status on the
    source via
    'ComputeManager.finish_revert_snapshot_based_resize_at_source'. This
    would appear to avoid the weird quirk of same-cell migration, however,
    in typical weird cross-cell fashion, these are actually different
    instances and different migration records.

    The solution is once again to move the setting of the migration status
    and the dropping of the claim under 'COMPUTE_RESOURCE_SEMAPHORE'. This
    introduces the weird setting of migration status before completion to
    the cross-cell resize case and perpetuates it in the same-cell case, but
    this seems like a suitable compromise to avoid attempts to do things
    like unplugging already unplugged PCI devices or unpinning already
    unpinned CPUs. From an end-user perspective, instance state changes are
    what really matter and once a revert is completed on the destination
    host and the instance has been marked as having returned to the source
    host, hard reboots can help us resolve any remaining issues.

    Change-Id: I29d6f4a78c0206385a550967ce244794e71cef6d
    Signed-off-by: Stephen Finucane <email address hidden>
    Closes-Bug: #1879878
    (cherry picked from commit dc9c7a5ebf11253f86127238d33dff7401465155)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/749713
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=99850a04812fe1530b5555f08bd9afba41cbddce
Submitter: Zuul
Branch: master

commit 99850a04812fe1530b5555f08bd9afba41cbddce
Author: Stephen Finucane <email address hidden>
Date: Thu Sep 3 12:56:41 2020 +0100

    Add reproducer for bug #1894095

    As with the resize/cold migration confirm/resize flow, the rolling back
    of live migrations can be racy due to us doing various things without
    the benefit of locks. Add a reproducer demonstrating this.

    Change-Id: I276d2bbce4a0390a6cfd597bb51481cc53eca918
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1894095
    Related-Bug: #1879878
