performance degradation in placement with large number of resource providers

Bug #1786055 reported by Chris Dent
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Chris Dent

Bug Description

Using today's master, there is a large performance degradation in GET /allocation_candidates when there is a large number of resource providers (in my tests, 1000, each with the same inventory as described in [1]). The request takes 17s when querying all three resource classes with http://127.0.0.1:8081/allocation_candidates?resources=VCPU:1,MEMORY_MB:256,DISK_GB:10

Using a limit does not make any difference; the cost is in generating the original data.

I did some LOG.debug-based benchmarking to identify three problem spots, and may even have fixed the worst one; see the diff below. The two main culprits are ResourceProvider.get_by_uuid calls made inside loops over the full set of providers. These can be replaced either by reusing data we already have from earlier queries, or by restructuring the code to make a single batched query.

In the diff I've already changed one of them (the second chunk) to use the data that _build_provider_summaries is already getting. (Functional tests still pass with this change.)

The third chunk is there because we have a big loop, but I suspect there is some duplication that can be avoided. I have not investigated that closely (yet).
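The per-provider lookup pattern described above is a classic N+1 query problem. A minimal sketch (not nova code; `fetch_provider` and `fetch_providers_by_uuids` are hypothetical stand-ins that just count round trips) shows why batching matters at 1000 providers:

```python
# Hypothetical sketch of the N+1 query pattern vs. a batched fetch.
# The two fetch functions stand in for the real per-UUID and batched
# database lookups; here they only count simulated round trips.

queries = {"count": 0}

def fetch_provider(uuid):
    queries["count"] += 1          # one round trip per provider
    return {"uuid": uuid}

def fetch_providers_by_uuids(uuids):
    queries["count"] += 1          # one round trip for the whole set
    return {u: {"uuid": u} for u in uuids}

uuids = ["rp-%d" % i for i in range(1000)]

# Slow path: a query inside the loop, as in the pre-fix code.
slow = [fetch_provider(u) for u in uuids]
slow_queries = queries["count"]    # 1000

queries["count"] = 0

# Fast path: one batched query, then dictionary lookups in the loop.
by_uuid = fetch_providers_by_uuids(uuids)
fast = [by_uuid[u] for u in uuids]
fast_queries = queries["count"]    # 1

print(slow_queries, fast_queries)
```

With 1000 providers the loop issues 1000 round trips versus 1 for the batched form, which is roughly the shape of the regression measured here.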

-=-=-
diff --git a/nova/api/openstack/placement/objects/resource_provider.py b/nova/api/openstack/placement/objects/resource_provider.py
index 851f9719e4..e6c894b8fe 100644
--- a/nova/api/openstack/placement/objects/resource_provider.py
+++ b/nova/api/openstack/placement/objects/resource_provider.py
@@ -3233,6 +3233,8 @@ def _build_provider_summaries(context, usages, prov_traits):
         if not summary:
             summary = ProviderSummary(
                 context,
+                # This is _expensive_ when there are a large number of rps.
+                # Building the objects differently may be better.
                 resource_provider=ResourceProvider.get_by_uuid(context,
                                                                uuid=rp_uuid),
                 resources=[],
@@ -3519,8 +3521,7 @@ def _alloc_candidates_multiple_providers(ctx, requested_resources,
         rp_uuid = rp_summary.resource_provider.uuid
         tree_dict[root_id][rc_id].append(
             AllocationRequestResource(
-                ctx, resource_provider=ResourceProvider.get_by_uuid(ctx,
-                                                                    rp_uuid),
+                ctx, resource_provider=rp_summary.resource_provider,
                 resource_class=_RC_CACHE.string_from_id(rc_id),
                 amount=requested_resources[rc_id]))

@@ -3535,6 +3536,8 @@ def _alloc_candidates_multiple_providers(ctx, requested_resources,
     alloc_prov_ids = []

     # Let's look into each tree
+    # With many resource providers this takes a long time, but each trip
+    # through the loop is not too bad.
     for root_id, alloc_dict in tree_dict.items():
         # Get request_groups, which is a list of lists of
         # AllocationRequestResource(ARR) per requested resource class(rc).
-=-=-

[1] https://github.com/cdent/placeload/blob/master/placeload/__init__.py#L23

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/589941

Changed in nova:
assignee: nobody → Chris Dent (cdent)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/589945

melanie witt (melwitt)
tags: added: rocky-rc-potential
Revision history for this message
melanie witt (melwitt) wrote :

Just adding an informational note here from discussion in #openstack-placement: the changes that regressed performance are these patches, which added calls to ResourceProvider.get_by_uuid:

https://review.openstack.org/567113
https://review.openstack.org/564351

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/589945
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f0cc6e68a9b6e6ca052866ba6583a4914ab6cb4e
Submitter: Zuul
Branch: master

commit f0cc6e68a9b6e6ca052866ba6583a4914ab6cb4e
Author: Jay Pipes <email address hidden>
Date: Wed Aug 8 11:36:51 2018 -0400

    get provider IDs once when building summaries

    We were calling ResourceProvider.get_by_uuid() inside
    _build_provider_summaries() main loop over all providers involved in the
    resulting allocation candidates. This results in a query per provider
    involved, which is quite obviously not going to perform well. This patch
    modifies the _build_provider_summaries() function to make a single call
    to a new _provider_ids_from_rp_ids() function instead of multiple calls
    to ResourceProvider.get_by_uuid().

    Change-Id: I0e0a44e833afece0775ec712fbdf9fcf4eae7a93
    Related-bug: #1786055
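The batching approach this commit describes, replacing a per-provider lookup with one query keyed on the whole ID set, can be illustrated with a small sqlite sketch. The table layout and the `provider_ids_from_rp_ids` helper are illustrative only, not nova's actual schema or function:

```python
import sqlite3

# Build a toy resource_providers table in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource_providers "
    "(id INTEGER PRIMARY KEY, uuid TEXT, name TEXT)")
rows = [(i, "uuid-%d" % i, "rp-%d" % i) for i in range(100)]
conn.executemany("INSERT INTO resource_providers VALUES (?, ?, ?)", rows)

def provider_ids_from_rp_ids(conn, rp_ids):
    """Fetch all requested providers in one query via an IN clause,
    instead of one SELECT per provider inside the caller's loop."""
    placeholders = ",".join("?" * len(rp_ids))
    cur = conn.execute(
        "SELECT id, uuid, name FROM resource_providers "
        "WHERE id IN (%s)" % placeholders, list(rp_ids))
    return {r[0]: {"uuid": r[1], "name": r[2]} for r in cur}

ids = [3, 17, 42]
providers = provider_ids_from_rp_ids(conn, ids)
print(sorted(providers))   # [3, 17, 42]
```

The caller then does cheap dictionary lookups inside its loop rather than issuing a query per iteration.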

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/589941
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3c96ce9158924476d88c815d0889d6d52ff62c76
Submitter: Zuul
Branch: master

commit 3c96ce9158924476d88c815d0889d6d52ff62c76
Author: Chris Dent <email address hidden>
Date: Wed Aug 8 16:22:53 2018 +0100

    [placement] Avoid rp.get_by_uuid in allocation_candidates

    We already have a fully loaded resource provider object in the loop, so
    we don't need to create another one. Doing so has a very large
    performance impact, especially when there are many resource providers
    in the collection of summaries (which will be true in a large and
    sparsely used cloud).

    The code which creates the summaries used here as a data source has the
    same expensive use of get_by_uuid in a loop. That will be fixed in a
    separate patch.

    Existing functional tests cover this code.

    Change-Id: I6068db78240c33a1dcefedc0c94e76740fd8d6e2
    Partial-Bug: #1786055

Revision history for this message
melanie witt (melwitt) wrote :

I'm going to close this bug to help keep our backlog down since the two patches to fix the major factor in the performance degradation have landed.

"<cdent> melwitt: hmmm. There is more that can be done, but not likely that more will be done _now_, so I would guess closed is probably a reasonable state. The major factor has been addressed. Fixing the rest will involve considerable refactoring"

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/590388
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa
Submitter: Zuul
Branch: master

commit 9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa
Author: Jay Pipes <email address hidden>
Date: Thu Aug 9 10:46:20 2018 -0400

    placement: use simple code paths when possible

    Somewhere in the past release, we started using extremely complex code
    paths involving sharing providers, anchor providers, and nested resource
    provider calculations when we absolutely don't need to do so.

    There was a _has_provider_trees() function in the
    nova/api/openstack/placement/objects/resource_provider.py file that used
    to be used for top-level switching between a faster, simpler approach to
    finding allocation candidates for a simple search of resources and
    traits when no sharing providers and no nesting was used. That was
    removed at some point and all code paths -- even for simple "get me
    these amounts of these resources" when no trees or sharing providers are
    present (which is the vast majority of OpenStack deployments) -- were
    going through the complex tree-search-and-match queries and algorithms.

    This patch changes that so that when there's a request for some
    resources and there's no trees or sharing providers, we do the simple
    code path. Hopefully this gets our performance for the simple, common
    cases back to where we were pre-Rocky.

    This change is a prerequisite for the following change which adds
    debugging output to help diagnose which resource classes are running
    out of inventory when GET /allocation_candidates returns 0 results.
    That code is not possible without the changes here as they only
    work if we can identify when a "simpler approach" is possible and
    call that simpler code.

    Related-Bug: #1786055
    Partial-Bug: #1786519
    Change-Id: I1fdbcdb7a1dd51e738924c8a30238237d7ac74e1
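The top-level dispatch this commit restores can be sketched as follows. All of the function bodies below are illustrative stubs (the real checks run SQL against the placement database); only the shape of the `_has_provider_trees()`-style switch is taken from the commit message:

```python
# Illustrative stubs: the real predicates query the placement DB for
# nested and sharing providers; here a dict-based "context" fakes them.

def _has_provider_trees(ctx):
    return ctx.get("has_nested", False)

def _has_sharing_providers(ctx):
    return ctx.get("has_sharing", False)

def _alloc_candidates_single_provider(ctx, resources):
    return "simple"       # fast path: plain inventory/usage query

def _alloc_candidates_multiple_providers(ctx, resources):
    return "complex"      # tree-search-and-match path

def get_allocation_candidates(ctx, resources):
    # Top-level dispatch mirroring the restored switch: take the complex
    # tree-aware path only when the deployment actually uses nesting or
    # sharing providers; otherwise use the simple code path.
    if _has_provider_trees(ctx) or _has_sharing_providers(ctx):
        return _alloc_candidates_multiple_providers(ctx, resources)
    return _alloc_candidates_single_provider(ctx, resources)

print(get_allocation_candidates({}, {"VCPU": 1}))                    # simple
print(get_allocation_candidates({"has_nested": True}, {"VCPU": 1}))  # complex
```

Since most deployments have neither trees nor sharing providers, the common case never pays for the complex algorithm.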

Matt Riedemann (mriedem)
tags: removed: rocky-rc-potential
tags: added: performance
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/600447

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/600447
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1cae9b8d4392ef597d93f2934ce18ead8828da98
Submitter: Zuul
Branch: stable/rocky

commit 1cae9b8d4392ef597d93f2934ce18ead8828da98
Author: Jay Pipes <email address hidden>
Date: Thu Aug 9 10:46:20 2018 -0400

    placement: use simple code paths when possible

    (commit message identical to the master commit 9ea340eb above)

    Related-Bug: #1786055
    Partial-Bug: #1786519
    Change-Id: I1fdbcdb7a1dd51e738924c8a30238237d7ac74e1
    (cherry picked from commit 9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa)

tags: added: in-stable-rocky