SchedulerReporClient init slows down nova-api startup

Bug #1807219 reported by Matt Riedemann
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Matt Riedemann

Bug Description

This is split out from bug 1807044 where nova-api sometimes takes more than 60 seconds to start on slow CI nodes which causes devstack to timeout and fail.

Specifically, every nova.compute.api.API constructs a SchedulerReportClient, which grabs an in-memory lock per API worker during init:

Dec 05 20:14:27.694593 ubuntu-xenial-ovh-bhs1-0000959981 <email address hidden>[23459]: DEBUG oslo_concurrency.lockutils [None req-dfdfad07-2ff4-43ed-9f67-2acd59687e0c None None] Lock "placement_client" released by "nova.scheduler.client.report._create_client" :: held 0.006s {{(pid=23462) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:339}}

We could probably be smarter about either making that a singleton in the API or only init on first access since most of the API extensions aren't going to even use that SchedulerReportClient.

There are at least 30 extensions in nova-api that create an API class, and there are 2 workers in devstack jobs, and each API class constructs that report client which has a lock during init, so it will have a snowball effect on startup.

Furthermore, with this change:

https://review.openstack.org/#/c/615641/

The NOTE in the API is no longer true:

https://github.com/openstack/nova/blob/c9dca64fa64005e5bea327f06a7a3f4821ab72b1/nova/compute/api.py#L256

So the API likely just needs to add it's own lazy-load behavior for that client.

Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/623246

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/624770

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/623246
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=66e44c64297e77395195d77104017fb6fcea58d2
Submitter: Zuul
Branch: master

commit 66e44c64297e77395195d77104017fb6fcea58d2
Author: Matt Riedemann <email address hidden>
Date: Thu Dec 6 11:17:15 2018 -0500

    Only construct SchedulerReportClient on first access from API

    With commit 5d1a50018510b2b281ad33895ae2d9555f5d5b05 each
    API/AggregateAPI class instance constructs a SchedulerReportClient
    which holds an in-memory lock during its initialization.

    With at least 34 API extensions constructing at least
    one of those two API classes, the accumulated affect of the
    SchedulerReportClient construction can slow down nova-api startup
    times, especially when running with multiple API workers, like
    in our tempest-full CI job (there are 2 workers, so 68 inits).

    This change simply defers constructing the SchedulerReportClient
    until it is used, which is only in a few spots in the API code,
    which should help with nova-api start times.

    The AggregateAPI also has to construct the SchedulerQueryClient
    separately because SchedulerClient creates both the query and
    report clients.

    Long-term we could consider making it a singleton in nova.compute.api
    if that is safe (the aggregate code might be relying on some caching
    aspects in the SchedulerReportClient).

    Change-Id: Idf6e548d725db0181629a451f46b6a3a5850d186
    Closes-Bug: #1807219

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/624770
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=901fb4542154ff25c7be6f62bacabbdbd3f57b11
Submitter: Zuul
Branch: master

commit 901fb4542154ff25c7be6f62bacabbdbd3f57b11
Author: Matt Riedemann <email address hidden>
Date: Wed Dec 12 12:32:23 2018 -0500

    Remove lock on SchedulerReportClient._create_client

    This is a follow up to change Idf6e548d725db0181629a451f46b6a3a5850d186
    where the compute API started lazy-loading the SchedulerReportClient
    because of slow startup times, presumably because of the locking
    when creating SchedulerReportClient.

    The lock on SchedulerReportClient._create_client is removed
    since there is no clear indication that it is necessary. It was
    added in change I02ac615dc26a4a0d1fd28a638f622777e41d14e4 as a
    precaution:

      "I've made the _client method synchronized so that in the unlikely event
       that the resource tracker is trying to do its update job while some
       other thing is happening, we won't waste the client. This may not be
       necessary, but probably doesn't harm anything."

    Long-term we could consider making SchedulerReportClient a global
    singleton so a TODO is left for that since it would require inspecting
    cache-sensitive methods to see if a lock is necessary.

    Change-Id: I127e8b78c0dfd5c42a1e1f509dc4fca1af3d5f89
    Related-Bug: #1807219
    Related-Bug: #1806912

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/649197

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/649197
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ae659668b5679cf7223474193d3b9a584dd3f016
Submitter: Zuul
Branch: master

commit ae659668b5679cf7223474193d3b9a584dd3f016
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 1 17:41:15 2019 -0400

    Make nova.compute.rpcapi.ComputeAPI.router a singleton

    When starting nova-api before any nova-computes are started
    and registered in the cell DBs, and with
    [upgrade_levels]/compute=auto, the compute RPC API client
    construction will iterate all cells looking for a minimum
    nova-compute service version, not find one, and thus not
    cache the result in the LAST_VERSION global.

    There are 30+ API controller classes that construct an
    instance of nova.compute.api.API which itself constructs
    a nova.compute.rpcapi.ComputeAPI object which determines
    the version cap as described above, and that is per API
    worker. Each cell DB call goes through RequestContext.set_target_cell
    which has a lock in it, so in this scenario on start of
    nova-api there can be a lot of locking log messages for
    get_or_set_cached_cell_and_set_connections.

    The RPC API ClientRouter can be a singleton and just constructed
    on first access to avoid the redundant database queries which
    is what this change does.

    To preserve the LAST_VERSION re-calculation that was in
    ComputeManager.reset(), we have to also reset the _ROUTER global
    so ComputeManager.reset() now resets all of the compute RPC API
    globals.

    Change-Id: I48109d5e32a2e9635c240da1c77f7f6cc7e3c76d
    Related-Bug: #1807219
    Related-Bug: #1815697

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/684405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/684405
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d54f91e9f5782413807b284a77b3f34f5551fa0
Submitter: Zuul
Branch: stable/stein

commit 7d54f91e9f5782413807b284a77b3f34f5551fa0
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 1 17:41:15 2019 -0400

    Make nova.compute.rpcapi.ComputeAPI.router a singleton

    When starting nova-api before any nova-computes are started
    and registered in the cell DBs, and with
    [upgrade_levels]/compute=auto, the compute RPC API client
    construction will iterate all cells looking for a minimum
    nova-compute service version, not find one, and thus not
    cache the result in the LAST_VERSION global.

    There are 30+ API controller classes that construct an
    instance of nova.compute.api.API which itself constructs
    a nova.compute.rpcapi.ComputeAPI object which determines
    the version cap as described above, and that is per API
    worker. Each cell DB call goes through RequestContext.set_target_cell
    which has a lock in it, so in this scenario on start of
    nova-api there can be a lot of locking log messages for
    get_or_set_cached_cell_and_set_connections.

    The RPC API ClientRouter can be a singleton and just constructed
    on first access to avoid the redundant database queries which
    is what this change does.

    To preserve the LAST_VERSION re-calculation that was in
    ComputeManager.reset(), we have to also reset the _ROUTER global
    so ComputeManager.reset() now resets all of the compute RPC API
    globals.

    Change-Id: I48109d5e32a2e9635c240da1c77f7f6cc7e3c76d
    Related-Bug: #1807219
    Related-Bug: #1815697
    (cherry picked from commit ae659668b5679cf7223474193d3b9a584dd3f016)

tags: added: in-stable-stein
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.