Compute unnecessarily gets resource provider aggregates during every update_available_resource run

Bug #1742467 reported by Matt Riedemann
Affects                    Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)   Fix Released  Medium      Eric Fried   -
Ocata                      New           Undecided   Unassigned   -
Pike                       New           Undecided   Unassigned   -

Bug Description

This was provided by Kris Lindgren from GoDaddy on his Pike deployment that is now running with Placement.

He noted that on every update_available_resource periodic task run, these are the calls made to Placement even when inventory didn't change:

https://paste.ubuntu.com/26356656/

So there are 5 GET requests in there.

In this run, there isn't a call to get the resource provider itself because the SchedulerReportClient has it cached in the _resource_providers dict.

But it still gets aggregates for the provider twice because it always wants to be up to date.

The aggregates are put in the _provider_aggregate_map, but nothing uses that code yet: nova doesn't support resource provider aggregates, which are only needed for shared resource providers, like a shared storage pool.

Until nova supports shared providers, we should likely just comment out the _provider_aggregate_map code, since nothing is using it, to avoid the extra HTTP requests to Placement every minute (the default periodic task interval).
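
Roughly, the pattern being described looks like the sketch below. The class and method bodies are illustrative assumptions, not the actual SchedulerReportClient code: the provider record is cached in _resource_providers, but aggregates are unconditionally re-fetched into _provider_aggregate_map on every run.

import requests

class ReportClientSketch(object):
    """Illustrative only; not nova's SchedulerReportClient."""

    def __init__(self, placement_url):
        self._url = placement_url
        self._resource_providers = {}      # uuid -> provider record (cached)
        self._provider_aggregate_map = {}  # uuid -> set of aggregate uuids

    def _get(self, path):
        resp = requests.get(self._url + path)
        resp.raise_for_status()
        return resp.json()

    def _ensure_resource_provider(self, rp_uuid):
        # The provider itself is cached, so no GET once it is known...
        if rp_uuid not in self._resource_providers:
            self._resource_providers[rp_uuid] = self._get(
                '/resource_providers/%s' % rp_uuid)
        # ...but aggregates are refreshed every time, which is the extra
        # GET /resource_providers/{uuid}/aggregates on each periodic run.
        aggs = self._get('/resource_providers/%s/aggregates' % rp_uuid)
        self._provider_aggregate_map[rp_uuid] = set(aggs['aggregates'])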

Revision history for this message
Matt Riedemann (mriedem) wrote :

Nice write-up from cdent on the mailing list a while ago about this same issue:

http://lists.openstack.org/pipermail/openstack-dev/2017-January/110953.html

Revision history for this message
Matt Riedemann (mriedem) wrote :

As for the two GET calls to /inventories every periodic task run:

(9:44:36 AM) mriedem: _update_available_resource is the call from the compute periodic task,
(9:44:41 AM) mriedem: which calls _init_compute_node
(9:44:51 AM) mriedem: when we already have the compute node, it calls _update
(9:45:00 AM) mriedem: which eventually does the update_inventory_attempt stuff in the report client
(9:45:10 AM) mriedem: then at the end of _update_available_resource,
(9:45:12 AM) mriedem: we call _update again
(9:45:21 AM) mriedem: so that's your 2 inventory updates
(9:45:34 AM) mriedem: which, johnthetubaguy changed in queens
(9:45:44 AM) mriedem: or wait, no
(9:46:05 AM) mriedem: https://review.openstack.org/#/c/520024/
(9:46:12 AM) mriedem: that would fix the double GET inventories
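
The flow from the transcript, reduced to a rough outline. The method names mirror the real ResourceTracker but the bodies are stubs, not nova's code; the report client here is a stand-in for any object with an update_inventory method.

class ResourceTrackerSketch(object):
    """Illustrative only; shows why _update runs twice per periodic."""

    def __init__(self, reportclient):
        self.reportclient = reportclient
        self.compute_node = None

    def update_available_resource(self, context, nodename):
        # Periodic task entry point (default interval: 60 seconds).
        resources = {'hypervisor_hostname': nodename}  # stand-in for driver data
        self._update_available_resource(context, resources)

    def _update_available_resource(self, context, resources):
        self._init_compute_node(context, resources)
        # ... instance claims and usages are reconciled here ...
        # Second _update at the end of the run: the second round of
        # inventory calls to placement in the same periodic cycle.
        self._update(context, self.compute_node)

    def _init_compute_node(self, context, resources):
        if self.compute_node is None:
            self.compute_node = resources
        # When the compute node record already exists, _update is called
        # here as well -- the first of the two calls per cycle.
        self._update(context, self.compute_node)

    def _update(self, context, compute_node):
        # Pushes inventory to placement via the report client; this is
        # where the GET (and possibly PUT) inventory requests happen.
        self.reportclient.update_inventory(compute_node)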

Revision history for this message
Matt Riedemann (mriedem) wrote :

The call to "GET /resource_providers/37bef394-ca42-4926-8899-68e986f66abf/allocations" should only happen, I think, if there are any Ocata computes in the deployment or if this is running on a compute host with the ironic driver.

Matt Riedemann (mriedem)
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
tags: added: performance
Revision history for this message
Chris Dent (cdent) wrote :

"The call to "GET /resource_providers/37bef394-ca42-4926-8899-68e986f66abf/allocations" should only happen"

I want us to preserve a way, somehow, of validating and confirming allocations, so that the compute node itself can take on the role of the "source of truth". Maybe just after a HUP or startup, but it needs to be in there somewhere. See the last bullet point on https://anticdent.org/placement-extraction.html

Revision history for this message
Chris Dent (cdent) wrote :

We/I probably need to redo the analysis of the requests that the resource tracker is making these days. I forget whether it will have changed or not.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Pulling aggregates in the compute, starting in Rocky, might be necessary solely so the provider tree can say whether there are shared providers in the tree, e.g. for shared storage. But I don't think that's all teased out yet, i.e. we haven't really tried testing with shared storage modeled like that in placement in a multi-node CI environment such as a ceph job. There was also something the other day I was commenting on that won't work with shared storage, having to do with move operations; it might have been these types of TODOs:

https://github.com/openstack/nova/blob/c8b93fa2493dce82ef4c0b1e7a503ba9b81c2e86/nova/conductor/tasks/migrate.py#L56

As for comment 4 and allocations, we have the "nova-manage placement heal_allocations" CLI now to fix up allocations for instances if necessary. That compat code in the compute service can likely be deleted at this point.

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/615606/ and the series of changes above it are probably relevant to this bug, since I think that with that refresh disabled the compute will now stop pulling aggregates on every periodic.
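
For reference, the knob being referred to is [compute]resource_provider_association_refresh; a value of zero disables the periodic refresh. A rough sketch of how a time-gated refresh like this behaves (the option name is real, the code below is an illustrative assumption, not nova's implementation):

import time

ASSOCIATION_REFRESH = 300   # seconds; 0 means "never refresh on the periodic"

_last_refreshed = {}  # provider uuid -> timestamp of last association refresh

def _associations_stale(rp_uuid):
    if ASSOCIATION_REFRESH <= 0:
        return False
    last = _last_refreshed.get(rp_uuid, 0)
    return (time.time() - last) > ASSOCIATION_REFRESH

def maybe_refresh_associations(client, rp_uuid):
    # Only hit placement for aggregates/traits when the cached data is
    # older than the configured interval.
    if _associations_stale(rp_uuid):
        client.get_aggregates(rp_uuid)   # hypothetical client helper
        client.get_traits(rp_uuid)       # hypothetical client helper
        _last_refreshed[rp_uuid] = time.time()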

Changed in nova:
assignee: nobody → Eric Fried (efried)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/615677
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=deef31729bd54f3747b7adba4132f148559c2242
Submitter: Zuul
Branch: master

commit deef31729bd54f3747b7adba4132f148559c2242
Author: Eric Fried <email address hidden>
Date: Mon Nov 5 16:04:10 2018 -0600

    Reduce calls to placement from _ensure

    Prior to this patch, the report client's update_from_provider_tree
    method would, upon failure of any placement API call, invalidate the
    cache *just* for the failing provider (and any descendants) and attempt
    to continue operating on any other providers in the tree.

    With this patch, we instead invalidate the tree around the failing
    provider and fail right away.

    In real life, since we don't yet have any implementations of nested,
    this would have been effectively a null change.

    Except: this allows us to resolve a TODO whereby we would *always*
    _ensure_resource_provider (including a call to GET
    /resource_providers?in_tree=$compute_rp) on every periodic. Now we can
    optimize that out.

    This should reduce the number of calls to placement per RT periodic to
    zero in steady state when [compute]resource_provider_association_refresh
    is zero.

    Closes-Bug: #1742467

    Change-Id: Ieeaad9783e0ff93377fbc6c7932618d2fac8946a
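
The invalidation change described in the commit message, as a rough sketch (the cache structure and helper names are assumptions, not the actual update_from_provider_tree code): on a failed placement call, the cached state for the whole tree containing the failing provider is dropped and the error is raised immediately, instead of pruning just that provider and continuing with the rest.

class ProviderCacheSketch(object):
    """Illustrative only; not nova's provider tree cache."""

    def __init__(self):
        # uuid -> parent uuid (None for a root provider)
        self._parents = {}

    def _root_of(self, rp_uuid):
        while self._parents.get(rp_uuid) is not None:
            rp_uuid = self._parents[rp_uuid]
        return rp_uuid

    def _invalidate_tree(self, rp_uuid):
        # Drop every cached provider sharing a root with rp_uuid.
        root = self._root_of(rp_uuid)
        doomed = [u for u in self._parents if self._root_of(u) == root]
        for u in doomed:
            del self._parents[u]

    def update_from_provider_tree(self, provider_uuids, push_to_placement):
        for rp_uuid in provider_uuids:
            try:
                push_to_placement(rp_uuid)
            except Exception:
                # New behavior: invalidate the whole tree around the
                # failing provider and fail fast, rather than continuing
                # to operate on the remaining providers.
                self._invalidate_tree(rp_uuid)
                raise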

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
