VM becomes ERROR after confirming resize, with error CPUUnpinningInvalid on source node

Bug #1879878 reported by kevinzhao
This bug affects 4 people
Affects: OpenStack Compute (nova)
  Importance: Medium, Assigned to: Stephen Finucane
Series: Train
  Importance: Undecided, Assigned to: Stephen Finucane
Series: Ussuri
  Importance: Undecided, Assigned to: Stephen Finucane

Bug Description

Description
===========

In my environment, it takes some time to clean up the VM on the source node when confirming a resize.
While the resize is being confirmed, the periodic task update_available_resource may update resource usage at the same time,
which can produce an error like:
CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []
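The exception comes from a subset check in the NUMA cell object's unpin operation (nova/objects/numa.py): the CPUs being unpinned must all currently be pinned. A minimal standalone sketch of that invariant (a toy class for illustration, not nova's actual object):

```python
class CPUUnpinningInvalid(Exception):
    pass


class ToyNUMACell:
    """Toy illustration of the pin/unpin invariant; not nova's real class."""

    def __init__(self):
        self.pinned_cpus = set()

    def pin_cpus(self, cpus):
        self.pinned_cpus |= set(cpus)

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus.issubset(self.pinned_cpus):
            # This is the condition behind the reported error.
            raise CPUUnpinningInvalid(
                "CPU set to unpin %s must be a subset of pinned CPU set %s"
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus


cell = ToyNUMACell()
cell.pin_cpus([1, 2, 17, 18])
cell.unpin_cpus([1, 2, 17, 18])      # first drop succeeds; the set is now empty
try:
    cell.unpin_cpus([1, 2, 17, 18])  # a second drop reproduces the error
except CPUUnpinningInvalid as exc:
    print(exc)
```

Dropping the same usage twice, which is effectively what the race described above causes, trips this check.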

Steps to reproduce
==================
* On the compute nodes, set "update_resources_interval" in /etc/nova/nova.conf to a small value, say 30 seconds. This increases the probability of hitting the error.
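For example (a config fragment; the 30-second value is only illustrative):

```ini
[DEFAULT]
update_resources_interval = 30
```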

* Create a "dedicated" VM; the flavor can be:
+----------------------------+--------------------------------------+
| Property | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 80 |
| extra_specs | {"hw:cpu_policy": "dedicated"} |
| id | 2be0f830-c215-4018-a96a-bee3e60b5eb1 |
| name | 4vcpu.4mem.80ssd.0eph.numa |
| os-flavor-access:is_public | True |
| ram | 4096 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 4 |
+----------------------------+--------------------------------------+

* Resize the VM with a new flavor to another node.

* Confirm the resize.
Make sure it takes some time to undefine the VM on the source node; taking 30 seconds will make the failure inevitable.

* You will then see the ERROR notice on the dashboard, and the VM becomes ERROR.

Expected result
===============
The VM is resized successfully and its state is ACTIVE.

Actual result
=============

* The VM becomes ERROR

* On the dashboard you can see this notice:
Please try again later [Error: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []].

Environment
===========
1. Exact version of OpenStack you are running.

  Newton, with patch https://review.opendev.org/#/c/641806/21 applied.
  I am sure it will also happen on newer versions that include https://review.opendev.org/#/c/641806/21,
  such as Train and Ussuri.

2. Which hypervisor did you use?
   Libvirt + KVM

3. Which storage type did you use?
   local disk

4. Which networking type did you use?
   Neutron with Open vSwitch

Logs & Configs
==============

ERROR log on source node
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [req-364606bb-9fa6-41db-a20e-6df9ff779934 b0887a73f3c1441686bf78944ee284d0 95262f1f45f14170b91cd8054bb36512 - - -] [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Setting instance vm_state to ERROR
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Traceback (most recent call last):
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6661, in _error_out_instance_on_exception
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] yield
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3444, in _confirm_resize
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] prefix='old_')
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] return f(*args, **kwargs)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 379, in drop_move_claim
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] self._update_usage(usage, sign=-1)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 724, in _update_usage
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] self.compute_node, usage, free)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1542, in get_host_numa_usage_from_instance
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] host_numa_topology, instance_numa_topology, free=free))
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1409, in numa_usage_from_instances
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] newcell.unpin_cpus(pinned_cpus)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 95, in unpin_cpus
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] pinned=list(self.pinned_cpus))
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]

kevinzhao (kego)
Changed in nova:
assignee: nobody → kevinzhao (kego)
kevinzhao (kego)
tags: added: resize
Revision history for this message
sean mooney (sean-k-mooney) wrote :

This looks like the resize just hit a race condition.
When using CPU pinning, CPUs are claimed in the compute host NUMA topology blob rather than in placement,
so it is perfectly feasible for the claim on the compute node to fail even if the scheduler and placement
claims passed.

Marking this as Incomplete, as we don't know which release of nova this is being reported against.
If it's pre-Train, this is invalid, as we expect this to race. If it's Train or later, then we should see
why we are not doing the claim as part of the resize confirm step; we should be, in which case this may be valid.

In general, however, we cannot fully resolve this until we have NUMA in placement, as there will always be a race between the scheduler and the compute node resource tracker. There should not be one between resize confirm and the periodic task even today, though,
unless the VM raced with another VM for the resources on the host.

Changed in nova:
status: New → Incomplete
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Also, can you confirm whether you are using vcpu_pin_set or cpu_dedicated_set to specify the CPUs to use for pinning?

If you are on Train or later and using cpu_dedicated_set, placement should not allow this edge case to happen. If you are using vcpu_pin_set with the fallback mechanism enabled, then it is still possible
to hit this race on Train.

tags: added: numa
Revision history for this message
sean mooney (sean-k-mooney) wrote :

http://paste.openstack.org/show/795679/
I was able to reproduce this once on master, but not reliably yet,
so I'm moving this to Confirmed.

We also have a downstream report of this on Train:
https://bugzilla.redhat.com/show_bug.cgi?id=1850400
I'll add that to the affected versions.

I am setting the importance to Medium, as this seems quite hard to trigger: all but 1 of the 10-12 attempts I made failed to reproduce it, so I think it will be hit rarely.

When this happens, the VM is left running on the target host. Stopping the VM and starting it again restores it to an ACTIVE state.

Changed in nova:
importance: Undecided → Medium
status: Incomplete → Confirmed
Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

Thanks for the great reproduction steps. I am able to reproduce this reliably on master by slowing down the resize confirm step in the virt driver via a simple sleep, with a timeout > the default periodic task interval (currently 60 seconds). We should be able to reproduce in a functional test by manually running periodic tasks between the API request to confirm the resize and the driver cleaning things up.
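The interleaving behind this reproduction can be modelled deterministically without nova at all: clear the stashed flavor, run the periodic accounting pass while the "virt driver" is busy, then drop the claim. All names below are illustrative stand-ins, not nova's real API:

```python
class ToyTracker:
    """Toy model of the resource tracker race; not nova's real code."""

    def __init__(self):
        # The migration record stashes the old flavor's pinned CPUs, and
        # the tracker accounts for them on the source host.
        self.migration_flavor = {"pinned": {1, 2, 17, 18}}
        self.tracked = {1, 2, 17, 18}

    def periodic_update(self):
        # update_available_resource rebuilds usage from scratch; a
        # migration with no stashed flavor is ignored, so its CPUs
        # silently vanish from the tracked set.
        if self.migration_flavor is None:
            self.tracked = set()

    def drop_move_claim(self, cpus):
        # Unpinning CPUs that are no longer tracked is the crash.
        if not cpus.issubset(self.tracked):
            raise RuntimeError(
                "CPU set to unpin %s must be a subset of pinned CPU set %s"
                % (sorted(cpus), sorted(self.tracked)))
        self.tracked -= cpus


tracker = ToyTracker()
cpus = set(tracker.migration_flavor["pinned"])
tracker.migration_flavor = None  # manager clears the stashed flavor first
tracker.periodic_update()        # periodic task fires during the slow virt call
try:
    tracker.drop_move_claim(cpus)
except RuntimeError as exc:
    print(exc)
```

Running the periodic pass before the flavor is cleared, or after the claim has been dropped, succeeds; only this interleaving fails, which is why the bug is timing-dependent.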

Changed in nova:
assignee: kevinzhao (kego) → Stephen Finucane (stephenfinucane)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/744950

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744958

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/744950
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=10f0a42de162c90c701f70c9c28dc31bfada87db
Submitter: Zuul
Branch: master

commit 10f0a42de162c90c701f70c9c28dc31bfada87db
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:11:59 2020 +0100

    tests: Add reproducer for bug #1879878

    When one resizes a pinned instance, the instance claims host CPUs for
    pinning purposes on the destination. However, the host CPUs on the
    source are not immediately relinquished. Rather, they are held by the
    migration record, to handle the event that the resize is reverted. It is
    only when one confirms this resize that the old cores are finally
    relinquished.

    It appears there is a potential race between the resource tracker's
    periodic task and the freeing of these resources, resulting in attempts
    to unpin host cores that have already been unpinned. This test
    highlights that bug pending a fix.

    Change-Id: Ie092628ac71eb87c9dfa7220255a2953ada9e04d
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1879878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/747745

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/747746

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/741995
Reason: Abandoning this in favor of https://review.opendev.org/#/c/744958/8 which is the same code change but with a better commit message that provides context on a bug the change also addresses.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/744958
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=44376d2e212e0f9405a58dc7fc4d5b38d70ac42e
Submitter: Zuul
Branch: master

commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/749713

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/750186

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/750187

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/750186
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=27b37ed5c85464fd7d0a9aa43bf1d491496c7915
Submitter: Zuul
Branch: master

commit 27b37ed5c85464fd7d0a9aa43bf1d491496c7915
Author: Stephen Finucane <email address hidden>
Date: Tue Sep 8 09:54:38 2020 +0100

    Expand generic reproducer for bug #1879878

    This shows that this bug doesn't affect cross-cell resize in the same
    way, but does highlight some gaps we need to close.

    Change-Id: Iee2c5521f41a2d061d21d4556cfbf7a898ad96c9
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1879878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/750187
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b2fbaa87679629dfb0be7a76ee7dc57980e21dcb
Submitter: Zuul
Branch: master

commit b2fbaa87679629dfb0be7a76ee7dc57980e21dcb
Author: Stephen Finucane <email address hidden>
Date: Mon Sep 7 11:27:30 2020 +0100

    Set 'old_flavor', 'new_flavor' on source before resize

    Cross-cell resize is confusing. We need to set this information ahead of
    time.

    Change-Id: I5a403c072b9f03074882b552e1925f22cb5b15b6
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/747745
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a57800d3825ef2fb833d8024c6f7e5c550f55e2f
Submitter: Zuul
Branch: master

commit a57800d3825ef2fb833d8024c6f7e5c550f55e2f
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 16:54:16 2020 +0100

    Move confirm resize under semaphore

    The 'ResourceTracker.update_available_resource' periodic task builds
    usage information for the current host by inspecting instances and
    in-progress migrations, combining the two. Specifically, it finds all
    instances that are not in the 'DELETED' or 'SHELVED_OFFLOADED' state,
    calculates the usage from these, then finds all in-progress migrations
    for the host that don't have an associated instance (to prevent double
    accounting) and includes the usage for these.

    In addition to the periodic task, the 'ResourceTracker' class has a
    number of helper functions to make or drop claims for the inventory
    generated by the 'update_available_resource' periodic task as part of
    the various instance operations. These helpers naturally assume that
    when making a claim for a particular instance or migration, there
    shouldn't already be resources allocated for same. Conversely, when
    dropping claims, the resources should currently be allocated. However,
    the check for *active* instances and *in-progress* migrations in the
    periodic task means we have to be careful in how we make changes to a
    given instance or migration record. Running the periodic task between
    such an operation and an attempt to make or drop a claim can result in
    TOCTOU-like races.

    This generally isn't an issue: we use the 'COMPUTE_RESOURCE_SEMAPHORE'
    semaphore to prevent the periodic task running while we're claiming
    resources in helpers like 'ResourceTracker.instance_claim' and we make
    our changes to the instances and migrations within this context. There
    is one exception though: the 'drop_move_claim' helper. This function is
    used when dropping a claim for either a cold migration, a resize or a
    live migration, and will drop usage from either the source host (based
    on the "old" flavor) for a resize confirm or the destination host (based
    on the "new" flavor) for a resize revert or live migration rollback.
    Unfortunately, while the function itself is wrapped in the semaphore, no
    changes to the state or the instance or migration in question are
    protected by it.

    Consider the confirm resize case, which we're addressing here. If we
    mark the migration as 'confirmed' before running 'drop_move_claim', then
    the periodic task running between these steps will not account for the
    usage on the source since the migration is allegedly 'confirmed'. The
    call to 'drop_move_claim' will then result in the tracker dropping usage
    that we're no longer accounting for. This "set migration status before
    dropping usage" is the current behaviour for both same-cell and
    cross-cell resize, via the 'ComputeManager.confirm_resize' and
    'ComputeManager.confirm_snapshot_based_resize_at_source' function...

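The locking pattern described in the commit message above can be sketched with a plain threading lock standing in for nova's COMPUTE_RESOURCE_SEMAPHORE: the migration-status change and the usage drop become one atomic unit that the periodic task cannot interleave with. Names here are illustrative, not nova's real code:

```python
import threading

LOCK = threading.Lock()  # stand-in for nova's COMPUTE_RESOURCE_SEMAPHORE


class State:
    def __init__(self):
        self.migration_status = "confirming"
        self.tracked = {1, 2, 17, 18}


def periodic_update(state):
    # The periodic task takes the same lock before rebuilding usage, so
    # it sees either the pre-confirm or the post-drop state, never the
    # half-finished one.
    with LOCK:
        if state.migration_status == "confirmed":
            state.tracked = set()


def confirm_resize(state, cpus):
    # Fixed ordering: the status change and the claim drop happen as one
    # atomic unit under the lock.
    with LOCK:
        state.migration_status = "confirmed"
        assert cpus.issubset(state.tracked)
        state.tracked -= cpus


state = State()
confirm_resize(state, {1, 2, 17, 18})
periodic_update(state)
print(state.tracked)  # set()
```

The same idea is applied to the revert resize operation in the follow-up change described later in this report.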

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/747746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc9c7a5ebf11253f86127238d33dff7401465155
Submitter: Zuul
Branch: master

commit dc9c7a5ebf11253f86127238d33dff7401465155
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 17:43:36 2020 +0100

    Move revert resize under semaphore

    As discussed in change I26b050c402f5721fc490126e9becb643af9279b4, the
    resource tracker's periodic task is reliant on the status of migrations
    to determine whether to include usage from these migrations in the
    total, and races between setting the migration status and decrementing
    resource usage via 'drop_move_claim' can result in incorrect usage.
    That change tackled the confirm resize operation. This one changes the
    revert resize operation, and is a little trickier due to kinks in how
    both the same-cell and cross-cell resize revert operations work.

    For same-cell resize revert, the 'ComputeManager.revert_resize'
    function, running on the destination host, sets the migration status to
    'reverted' before dropping the move claim. This exposes the same race
    that we previously saw with the confirm resize operation. It then calls
    back to 'ComputeManager.finish_revert_resize' on the source host to boot
    up the instance itself. This is kind of weird, because, even ignoring
    the race, we're marking the migration as 'reverted' before we've done
    any of the necessary work on the source host.

    The cross-cell resize revert splits dropping of the move claim and
    setting of the migration status between the source and destination host
    tasks. Specifically, we do cleanup on the destination and drop the move
    claim first, via 'ComputeManager.revert_snapshot_based_resize_at_dest'
    before resuming the instance and setting the migration status on the
    source via
    'ComputeManager.finish_revert_snapshot_based_resize_at_source'. This
    would appear to avoid the weird quirk of same-cell migration, however,
    in typical weird cross-cell fashion, these are actually different
    instances and different migration records.

    The solution is once again to move the setting of the migration status
    and the dropping of the claim under 'COMPUTE_RESOURCE_SEMAPHORE'. This
    introduces the weird setting of migration status before completion to
    the cross-cell resize case and perpetuates it in the same-cell case, but
    this seems like a suitable compromise to avoid attempts to do things
    like unplugging already unplugged PCI devices or unpinning already
    unpinned CPUs. From an end-user perspective, instance state changes are
    what really matter and once a revert is completed on the destination
    host and the instance has been marked as having returned to the source
    host, hard reboots can help us resolve any remaining issues.

    Change-Id: I29d6f4a78c0206385a550967ce244794e71cef6d
    Signed-off-by: Stephen Finucane <email address hidden>
    Closes-Bug: #1879878

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751349

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751352

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751353

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751354

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/751365

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/751367

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/train
Review: https://review.opendev.org/751368

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/train
Review: https://review.opendev.org/751369

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/751349
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8ffaac493288c73badfa4f1ec6021ecb4f3137b7
Submitter: Zuul
Branch: stable/ussuri

commit 8ffaac493288c73badfa4f1ec6021ecb4f3137b7
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:11:59 2020 +0100

    tests: Add reproducer for bug #1879878

    When one resizes a pinned instance, the instance claims host CPUs for
    pinning purposes on the destination. However, the host CPUs on the
    source are not immediately relinquished. Rather, they are held by the
    migration record, to handle the event that the resize is reverted. It is
    only when one confirms this resize that the old cores are finally
    relinquished.

    It appears there is a potential race between the resource tracker's
    periodic task and the freeing of these resources, resulting in attempts
    to unpin host cores that have already been unpinned. This test
    highlights that bug pending a fix.

    Change-Id: Ie092628ac71eb87c9dfa7220255a2953ada9e04d
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1879878
    (cherry picked from commit 10f0a42de162c90c701f70c9c28dc31bfada87db)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/751352
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ce95af2caf69cb1b650459718fd4fa5f00ff28f5
Submitter: Zuul
Branch: stable/ussuri

commit ce95af2caf69cb1b650459718fd4fa5f00ff28f5
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914
    (cherry picked from commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/751353
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4fcada57d6c569a3b9c8094cdfb37b77ec2a8cb5
Submitter: Zuul
Branch: stable/ussuri

commit 4fcada57d6c569a3b9c8094cdfb37b77ec2a8cb5
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 16:54:16 2020 +0100

    Move confirm resize under semaphore

    The 'ResourceTracker.update_available_resource' periodic task builds
    usage information for the current host by inspecting instances and
    in-progress migrations, combining the two. Specifically, it finds all
    instances that are not in the 'DELETED' or 'SHELVED_OFFLOADED' state,
    calculates the usage from these, then finds all in-progress migrations
    for the host that don't have an associated instance (to prevent double
    accounting) and includes the usage for these.

    In addition to the periodic task, the 'ResourceTracker' class has a
    number of helper functions to make or drop claims for the inventory
    generated by the 'update_available_resource' periodic task as part of
    the various instance operations. These helpers naturally assume that
    when making a claim for a particular instance or migration, there
    shouldn't already be resources allocated for same. Conversely, when
    dropping claims, the resources should currently be allocated. However,
    the check for *active* instances and *in-progress* migrations in the
    periodic task means we have to be careful in how we make changes to a
    given instance or migration record. Running the periodic task between
    such an operation and an attempt to make or drop a claim can result in
    TOCTOU-like races.

    This generally isn't an issue: we use the 'COMPUTE_RESOURCE_SEMAPHORE'
    semaphore to prevent the periodic task running while we're claiming
    resources in helpers like 'ResourceTracker.instance_claim' and we make
    our changes to the instances and migrations within this context. There
    is one exception though: the 'drop_move_claim' helper. This function is
    used when dropping a claim for either a cold migration, a resize or a
    live migration, and will drop usage from either the source host (based
    on the "old" flavor) for a resize confirm or the destination host (based
    on the "new" flavor) for a resize revert or live migration rollback.
    Unfortunately, while the function itself is wrapped in the semaphore, no
    changes to the state or the instance or migration in question are
    protected by it.

    Consider the confirm resize case, which we're addressing here. If we
    mark the migration as 'confirmed' before running 'drop_move_claim', then
    the periodic task running between these steps will not account for the
    usage on the source since the migration is allegedly 'confirmed'. The
    call to 'drop_move_claim' will then result in the tracker dropping usage
    that we're no longer accounting for. This "set migration status before
    dropping usage" is the current behaviour for both same-cell and
    cross-cell resize, via the 'ComputeManager.confirm_resize' and
    'ComputeManager.confirm_snapshot_based_resize_at_source' f...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/751354
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=79a467e1cfa05b1c281e592e69549716c11892e5
Submitter: Zuul
Branch: stable/ussuri

commit 79a467e1cfa05b1c281e592e69549716c11892e5
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 17:43:36 2020 +0100

    Move revert resize under semaphore

    As discussed in change I26b050c402f5721fc490126e9becb643af9279b4, the
    resource tracker's periodic task is reliant on the status of migrations
    to determine whether to include usage from these migrations in the
    total, and races between setting the migration status and decrementing
    resource usage via 'drop_move_claim' can result in incorrect usage.
    That change tackled the confirm resize operation. This one changes the
    revert resize operation, and is a little trickier due to kinks in how
    both the same-cell and cross-cell resize revert operations work.

    For same-cell resize revert, the 'ComputeManager.revert_resize'
    function, running on the destination host, sets the migration status to
    'reverted' before dropping the move claim. This exposes the same race
    that we previously saw with the confirm resize operation. It then calls
    back to 'ComputeManager.finish_revert_resize' on the source host to boot
    up the instance itself. This is kind of weird, because, even ignoring
    the race, we're marking the migration as 'reverted' before we've done
    any of the necessary work on the source host.

    The cross-cell resize revert splits dropping of the move claim and
    setting of the migration status between the source and destination host
    tasks. Specifically, we do cleanup on the destination and drop the move
    claim first, via 'ComputeManager.revert_snapshot_based_resize_at_dest'
    before resuming the instance and setting the migration status on the
    source via
    'ComputeManager.finish_revert_snapshot_based_resize_at_source'. This
    would appear to avoid the weird quirk of same-cell migration, however,
    in typical weird cross-cell fashion, these are actually different
    instances and different migration records.

    The solution is once again to move the setting of the migration status
    and the dropping of the claim under 'COMPUTE_RESOURCE_SEMAPHORE'. This
    introduces the weird setting of migration status before completion to
    the cross-cell resize case and perpetuates it in the same-cell case, but
    this seems like a suitable compromise to avoid attempts to do things
    like unplugging already unplugged PCI devices or unpinning already
    unpinned CPUs. From an end-user perspective, instance state changes are
    what really matter and once a revert is completed on the destination
    host and the instance has been marked as having returned to the source
    host, hard reboots can help us resolve any remaining issues.

    Change-Id: I29d6f4a78c0206385a550967ce244794e71cef6d
    Signed-off-by: Stephen Finucane <email address hidden>
    Closes-Bug: #1879878
    (cherry picked from commit dc9c7a5ebf11253f86127238d33dff7401465155)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/749713
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=99850a04812fe1530b5555f08bd9afba41cbddce
Submitter: Zuul
Branch: master

commit 99850a04812fe1530b5555f08bd9afba41cbddce
Author: Stephen Finucane <email address hidden>
Date: Thu Sep 3 12:56:41 2020 +0100

    Add reproducer for bug #1894095

    As with the resize/cold migration confirm/resize flow, the rolling back
    of live migrations can be racy due to us doing various things without
    the benefit of locks. Add a reproducer demonstrating this.

    Change-Id: I276d2bbce4a0390a6cfd597bb51481cc53eca918
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1894095
    Related-Bug: #1879878
