simultaneous boot of multiple instances leads to cpu pinning overlap

Bug #1454451 reported by Chris Friesen
This bug affects 3 people
Affects                     Status         Importance   Assigned to       Milestone
OpenStack Compute (nova)    Fix Released   High         Chris Friesen
  Nominated for Juno by Nikola Đipanov
Kilo                        Fix Released   Undecided    Nikola Đipanov

Bug Description

I'm running into an issue with kilo-3 that I think is present in current trunk. Basically it results in multiple instances (with dedicated cpus) being pinned to the same physical cpus.

I think there is a race between the claimed CPUs of an instance being persisted to the DB, and the resource audit scanning the DB for instances and subtracting pinned CPUs from the list of available CPUs.

The problem only shows up when the following sequence happens:
1) instance A (with dedicated cpus) boots on a compute node
2) resource audit runs on that compute node
3) instance B (with dedicated cpus) boots on the same compute node

So to hit this you need to be booting many instances, limiting the valid compute nodes (host aggregates or server groups), or running on a small cluster.

The nitty-gritty view looks like this:

When booting up an instance we hold the COMPUTE_RESOURCE_SEMAPHORE in compute.resource_tracker.ResourceTracker.instance_claim() and that covers updating the resource usage on the compute node. But we don't persist the instance numa topology to the database until after instance_claim() returns, in compute.manager.ComputeManager._build_instance(). Note that this is done *after* we've given up the semaphore, so there is no longer any sort of ordering guarantee.
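
As a rough sketch of this ordering (illustrative only; the helper names claim_resources(), update_usage() and save_numa_topology(), and the tracker/db objects, are assumptions rather than the real nova code):

    # Minimal sketch of the problematic boot-path ordering described above.
    import threading

    COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()

    def instance_claim(instance, tracker):
        # The semaphore is held only while the in-memory usage is updated.
        with COMPUTE_RESOURCE_SEMAPHORE:
            claim = tracker.claim_resources(instance)  # picks the pCPUs to pin
            tracker.update_usage(claim)
        return claim

    def build_instance(instance, tracker, db):
        claim = instance_claim(instance, tracker)
        # The pinned-CPU (NUMA topology) data only reaches the database here,
        # after the semaphore has been released, so there is a window in which
        # the DB does not yet reflect the claim.
        db.save_numa_topology(instance, claim.numa_topology)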

compute.resource_tracker.ResourceTracker.update_available_resource() then acquires COMPUTE_RESOURCE_SEMAPHORE, queries the database for a list of instances and uses that to calculate a new view of what resources are available. If the numa topology of the most recent instance hasn't been persisted yet, then the new view of resources won't include any pCPUs pinned by that instance.
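
The audit side, in the same illustrative terms (get_instances_on_host(), all_pcpus and pinned_cpus are likewise assumed names, and COMPUTE_RESOURCE_SEMAPHORE is the lock from the sketch above):

    def update_available_resource(tracker, db, host):
        # Periodic resource audit, heavily simplified.
        with COMPUTE_RESOURCE_SEMAPHORE:
            instances = db.get_instances_on_host(host)
            available_pcpus = set(tracker.all_pcpus)
            for inst in instances:
                # If an instance's numa_topology has not been persisted yet
                # (the window shown above), its pinned CPUs are not subtracted.
                if inst.numa_topology:
                    available_pcpus -= set(inst.numa_topology.pinned_cpus)
            tracker.available_pcpus = available_pcpus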

compute.manager.ComputeManager._build_instance() runs for the next instance and based on the new view of available resources it allocates the same pCPU(s) used by the earlier instance. Boom, overlapping pinned pCPUs.

Lastly, the same bug applies to the compute.manager.ComputeManager.rebuild_instance() case. It uses the same pattern of doing the claim and then updating the instance numa topology after releasing the semaphore.

Tags: compute
Chris Friesen (cbf123)
Changed in nova:
assignee: nobody → Chris Friesen (cbf123)
description: updated
Chris Friesen (cbf123)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/182766

Changed in nova:
status: New → In Progress
Changed in nova:
assignee: Chris Friesen (cbf123) → Dan Smith (danms)
Changed in nova:
assignee: Dan Smith (danms) → Chris Friesen (cbf123)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/182766
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2427d288bc017a5b91430ffe16419d47703d2060
Submitter: Jenkins
Branch: master

commit 2427d288bc017a5b91430ffe16419d47703d2060
Author: Chris Friesen <email address hidden>
Date: Wed May 13 11:15:25 2015 -0600

    Fix race between resource audit and cpu pinning

    This fixes a race between the claimed CPUs of an instance being
    persisted to the DB, and the resource audit scanning the DB for
    instances and subtracting pinned CPUs from the list of available CPUs.

    The problem only shows up when the following sequence happens:
    1) instance A (with dedicated cpus) boots on a compute node
    2) resource audit runs on that compute node
    3) instance B (with dedicated cpus) boots on the same compute node

    The bug is that the claimed numa topology isn't updated until
    after we release COMPUTE_RESOURCES_SEMAPHORE, so when the resource
    audit retrieves the list of instances the numa_topology hasn't
    been updated yet for the most recent one.

    The fix is to persist the claimed numa topology before releasing
    COMPUTE_RESOURCES_SEMAPHORE.

    Closes-Bug: #1454451
    Co-Authored-By: Dan Smith <email address hidden>
    Change-Id: I553f2e43a68577c83d890c3671380af68f9e725a
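
In the same illustrative terms as the sketches in the bug description (not the actual patch), the fix amounts to moving the persist inside the locked section:

    def instance_claim(instance, tracker, db):
        # Fixed ordering: the claimed NUMA topology is written to the database
        # while COMPUTE_RESOURCE_SEMAPHORE is still held, so the resource audit
        # (which takes the same semaphore) can never see an instance whose
        # claim has not yet been persisted.
        with COMPUTE_RESOURCE_SEMAPHORE:
            claim = tracker.claim_resources(instance)
            tracker.update_usage(claim)
            db.save_numa_topology(instance, claim.numa_topology)
        return claim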

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/185591

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/185647

Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/185591
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b13726bcf6b9e6a006ec9bfcde051331741ded5a
Submitter: Jenkins
Branch: stable/kilo

commit b13726bcf6b9e6a006ec9bfcde051331741ded5a
Author: Chris Friesen <email address hidden>
Date: Wed May 13 11:15:25 2015 -0600

    Fix race between resource audit and cpu pinning

    This fixes a race between the claimed CPUs of an instance being
    persisted to the DB, and the resource audit scanning the DB for
    instances and subtracting pinned CPUs from the list of available CPUs.

    The problem only shows up when the following sequence happens:
    1) instance A (with dedicated cpus) boots on a compute node
    2) resource audit runs on that compute node
    3) instance B (with dedicated cpus) boots on the same compute node

    The bug is that the claimed numa topology isn't updated until
    after we release COMPUTE_RESOURCES_SEMAPHORE, so when the resource
    audit retrieves the list of instances the numa_topology hasn't
    been updated yet for the most recent one.

    The fix is to persist the claimed numa topology before releasing
    COMPUTE_RESOURCES_SEMAPHORE.

    Closes-Bug: #1454451
    Co-Authored-By: Dan Smith <email address hidden>
    (cherry picked from commit 2427d288bc017a5b91430ffe16419d47703d2060)

    Conflicts:
     nova/compute/manager.py
     nova/tests/unit/compute/test_resource_tracker.py
     nova/tests/unit/compute/test_tracker.py

    Change-Id: I553f2e43a68577c83d890c3671380af68f9e725a

Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-1 → 12.0.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/juno)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/185647
Reason: Juno is EOL soon.
