[upgrade_levels]compute=auto grinds the API response times when a cell is down

Bug #1815697 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

A lot of my notes are in https://review.openstack.org/#/c/591657/ where I was testing a down cell on a devstack deployment.

To simulate a down cell, I changed the database_connection value for the cell1 cell to be an invalid IP (192.0.0.1) and then restarted <email address hidden>.

With the default configs in devstack, the service was hanging while trying to respond to a simple GET / request to list versions. The problem appears to be that each nova.compute.api.API object, one of which is created per route handler (and per API worker, of which I have 2), tries to get the minimum nova-compute service version across all cells:

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/api.py#L261

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/rpcapi.py#L373

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/rpcapi.py#L395

This is a snip of the API log while waiting for the GET / response:

http://paste.openstack.org/show/744983/

As a result, I got this unhelpful client-side error:

http://paste.openstack.org/show/744984/

I know that's where the failure was because I was also getting this:

Feb 13 00:09:57 downcell <email address hidden>[14623]: DEBUG nova.compute.rpcapi [None req-53ebccae-d210-4b14-af5c-02775f3d36e8 None None] Not caching compute RPC version_cap, because min service_version is 0. Please ensure a nova-compute service has been started. Defaulting to current version. {{(pid=14625) _determine_version_cap /opt/stack/nova/nova/compute/rpcapi.py:410}}

In any case, the minimum nova-compute service version isn't getting cached in nova-api when running under uwsgi, for which I reported bug 1815692.

I worked around the issue by setting [upgrade_levels]/compute=rocky, but that's probably not something we want to rely on. Ideally we could set it to 'auto' and have the code calculate the value for us, but as shown here, 'auto' can hang the API workers.

Also note that the default database max_retries and retry_interval are both 10, which means each API object that hits this takes 100 seconds to time out, per route handler per API worker. I count 31 route handlers that create an API object, so by default that is 3100 seconds, or about 52 minutes, per worker on startup.
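
The arithmetic behind that estimate can be checked with a few lines of Python. This is only a worked calculation using the numbers quoted above (10 retries, 10 seconds between retries, 31 route handlers); the variable names are illustrative and not taken from nova.

```python
# Values quoted in the report; the names are illustrative only.
max_retries = 10       # default number of database connection retries
retry_interval = 10    # default seconds to wait between retries
route_handlers = 31    # handlers counted as creating a compute API object

# Time one API object spends retrying the unreachable cell database.
per_handler_seconds = max_retries * retry_interval

# Time one API worker spends if every handler repeats the same retries.
per_worker_seconds = per_handler_seconds * route_handlers
per_worker_minutes = per_worker_seconds / 60

print(per_handler_seconds, per_worker_seconds, round(per_worker_minutes, 1))
# prints: 100 3100 51.7
```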

Surya Seetharaman (tssurya) wrote :

Thanks for reporting this. I change my max_retries value in devstack during testing to avoid long waits.

Matt Riedemann (mriedem) wrote :

I was hoping that maybe Dan's change here would help with the API startup issue:

https://review.openstack.org/#/c/623283/

but that change won't cache the minimum compute version until it gets a non-zero minimum, which I don't think will happen while a cell is down.

I wonder if we could enhance that code to filter out disabled cells. If you have a down cell, you would disable it, and we would then exclude it from the cached results, because the scheduler shouldn't pick a disabled cell for anything (new server creates or move operations).

Another option I was thinking about is making the _determine_version_cap result a global or singleton, so that each of the 31 nova.compute.api.API initializations doesn't go through the same cell timeout loop just to find out there is a down cell, hanging the API in the process.
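
To make the disabled-cell idea concrete, here is a minimal sketch, not nova code. The Cell class, the lookup callable and the module-level cache are all hypothetical stand-ins: it skips disabled cells entirely, so a down cell that has been disabled is never queried, and it only caches a non-zero minimum, mirroring the caching rule described above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Cell:
    """Hypothetical stand-in for a cell mapping; not nova's CellMapping."""
    name: str
    disabled: bool = False


# Hypothetical process-wide cache; None means nothing has been cached yet.
_cached_minimum: Optional[int] = None


def minimum_compute_version(cells: Iterable[Cell],
                            lookup: Callable[[Cell], int]) -> int:
    """Return the minimum service version across enabled cells.

    `lookup` is an assumed callable returning the minimum nova-compute
    service version for one cell (0 if no compute service is registered).
    """
    global _cached_minimum
    if _cached_minimum is not None:
        return _cached_minimum
    # Disabled cells are skipped, so their databases are never contacted.
    versions = [lookup(cell) for cell in cells if not cell.disabled]
    minimum = min(versions, default=0)
    # Only a non-zero result is cached; 0 means "not known yet".
    if minimum > 0:
        _cached_minimum = minimum
    return minimum
```

Whether excluding disabled cells is safe for every caller is an open question in this sketch; it only shows the filtering and caching shape.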

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/649197
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ae659668b5679cf7223474193d3b9a584dd3f016
Submitter: Zuul
Branch: master

commit ae659668b5679cf7223474193d3b9a584dd3f016
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 1 17:41:15 2019 -0400

    Make nova.compute.rpcapi.ComputeAPI.router a singleton

    When starting nova-api before any nova-computes are started
    and registered in the cell DBs, and with
    [upgrade_levels]/compute=auto, the compute RPC API client
    construction will iterate all cells looking for a minimum
    nova-compute service version, not find one, and thus not
    cache the result in the LAST_VERSION global.

    There are 30+ API controller classes that construct an
    instance of nova.compute.api.API which itself constructs
    a nova.compute.rpcapi.ComputeAPI object which determines
    the version cap as described above, and that is per API
    worker. Each cell DB call goes through RequestContext.set_target_cell
    which has a lock in it, so in this scenario on start of
    nova-api there can be a lot of locking log messages for
    get_or_set_cached_cell_and_set_connections.

    The RPC API ClientRouter can be a singleton and just constructed
    on first access to avoid the redundant database queries which
    is what this change does.

    To preserve the LAST_VERSION re-calculation that was in
    ComputeManager.reset(), we have to also reset the _ROUTER global
    so ComputeManager.reset() now resets all of the compute RPC API
    globals.

    Change-Id: I48109d5e32a2e9635c240da1c77f7f6cc7e3c76d
    Related-Bug: #1807219
    Related-Bug: #1815697

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/684405

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/684405
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d54f91e9f5782413807b284a77b3f34f5551fa0
Submitter: Zuul
Branch: stable/stein

commit 7d54f91e9f5782413807b284a77b3f34f5551fa0
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 1 17:41:15 2019 -0400

    Make nova.compute.rpcapi.ComputeAPI.router a singleton

    When starting nova-api before any nova-computes are started
    and registered in the cell DBs, and with
    [upgrade_levels]/compute=auto, the compute RPC API client
    construction will iterate all cells looking for a minimum
    nova-compute service version, not find one, and thus not
    cache the result in the LAST_VERSION global.

    There are 30+ API controller classes that construct an
    instance of nova.compute.api.API which itself constructs
    a nova.compute.rpcapi.ComputeAPI object which determines
    the version cap as described above, and that is per API
    worker. Each cell DB call goes through RequestContext.set_target_cell
    which has a lock in it, so in this scenario on start of
    nova-api there can be a lot of locking log messages for
    get_or_set_cached_cell_and_set_connections.

    The RPC API ClientRouter can be a singleton and just constructed
    on first access to avoid the redundant database queries which
    is what this change does.

    To preserve the LAST_VERSION re-calculation that was in
    ComputeManager.reset(), we have to also reset the _ROUTER global
    so ComputeManager.reset() now resets all of the compute RPC API
    globals.

    Change-Id: I48109d5e32a2e9635c240da1c77f7f6cc7e3c76d
    Related-Bug: #1807219
    Related-Bug: #1815697
    (cherry picked from commit ae659668b5679cf7223474193d3b9a584dd3f016)

tags: added: in-stable-stein