nova-api startup does not scan cells looking for minimum nova-compute service version

Bug #1807044 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Dan Smith
Pike
Won't Fix
Medium
Unassigned
Queens
Won't Fix
Medium
Unassigned
Rocky
Fix Committed
Medium
Matt Riedemann

Bug Description

This CI job failed devstack setup because nova-api took longer than 60 seconds to start (it took 64 seconds):

http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/screen-n-api.txt.gz

Looking at what could be taking time in there, it was noticed that this is logged a lot:

Dec 05 20:14:00.919520 ubuntu-xenial-ovh-bhs1-0000959981 <email address hidden>[23459]: DEBUG nova.compute.rpcapi [None req-dfdfad07-2ff4-43ed-9f67-2acd59687e0c None None] Not caching compute RPC version_cap, because min service_version is 0. Please ensure a nova-compute service has been started. Defaulting to current version. {{(pid=23462) _determine_version_cap /opt/stack/nova/nova/compute/rpcapi.py:397}}

That's coming from here:

https://github.com/openstack/nova/blob/aed3e240138ad84352688611246363339414c98f/nova/compute/rpcapi.py#L396

Which is when the compute rpcapi client is initialized, which happens when nova.compute.api.API is initialized:

https://github.com/openstack/nova/blob/aed3e240138ad84352688611246363339414c98f/nova/compute/api.py#L266

Which happens for most of the API extensions, e.g.:

https://github.com/openstack/nova/blob/aed3e240138ad84352688611246363339414c98f/nova/api/openstack/compute/servers.py#L78

So that init and DB query happens num_workers * num_extensions times (we have 2 workers in this case and it looks like there are at least 29 instantiations of the compute API code in the extensions).

The bigger problem is that in this case, nova-api is configured to hit cell0 for the [database]/connection:

http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/etc/nova/nova_conf.txt.gz

[database]
connection = mysql+pymysql://root:secretdatabase@127.0.0.1/nova_cell0?charset=utf8

And there will not be nova-compute services in the cell0 database (if configured properly).

So this query is always going to return 0, at least for devstack:

https://github.com/openstack/nova/blob/aed3e240138ad84352688611246363339414c98f/nova/compute/rpcapi.py#L387

We should really be scanning the cells to get the minimum nova-compute version using this:

https://github.com/openstack/nova/blob/aed3e240138ad84352688611246363339414c98f/nova/objects/service.py#L481

But even on first startup of nova-api before any computes are started and registered with a cell, that initial query will return 0 which means we won't cache the result and continue to do that query and log that message for extensions * workers.

So there are really kind of two issues here:

1. We're not iterating cells properly for that version check. This is the more important issue.

2. We're needlessly doing this query on initial startup which slows down the startup (and contributes to timeouts the devstack jobs on slow nodes) and lots of excessive logging.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Another thing that is probably contributing to the slow nova-api start time is that every nova.compute.api.API constructs a SchedulerReportClient, which grabs an in-memory lock per API worker during init:

Dec 05 20:14:27.694593 ubuntu-xenial-ovh-bhs1-0000959981 <email address hidden>[23459]: DEBUG oslo_concurrency.lockutils [None req-dfdfad07-2ff4-43ed-9f67-2acd59687e0c None None] Lock "placement_client" released by "nova.scheduler.client.report._create_client" :: held 0.006s {{(pid=23462) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:339}}

We could probably be smarter about either making that a singleton in the API or only init on first access since most of the API extensions aren't going to even use that SchedulerReportClient.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I've created bug 1807219 for the issue described in comment 1.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/623282

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/623283

Changed in nova:
assignee: nobody → Dan Smith (danms)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/623284

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/623282
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=183f6238c1d98b04fdcee0fb70212f01f4012ae4
Submitter: Zuul
Branch: master

commit 183f6238c1d98b04fdcee0fb70212f01f4012ae4
Author: Dan Smith <email address hidden>
Date: Thu Dec 6 07:56:04 2018 -0800

    Only warn about not having computes nodes once in rpcapi

    When we instantiate the compute rpcapi, with upgrade_levels=auto,
    we determine the minimum service version of nova-compute. If we
    find no computes, we log a message. This may be known, if we are
    starting at zero day and no computes have started, but may also
    indicate a more serious problem. In either case, logging this once
    per process (especially in the case of multi-worker, multi-module
    services) is more than enough.

    Change-Id: I3350026c872a548839b857c53529c1dbbb88091c
    Related-Bug: #1807044

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/624677

Revision history for this message
Matt Riedemann (mriedem) wrote :

It should also be noted that you only hit this if you configure nova with [upgrade_levels]compute=auto.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/623284
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d779b33d72d6ef41651ce7b93fe982f121bae2d7
Submitter: Zuul
Branch: master

commit d779b33d72d6ef41651ce7b93fe982f121bae2d7
Author: Dan Smith <email address hidden>
Date: Thu Dec 6 08:07:37 2018 -0800

    Make compute rpcapi version calculation check all cells

    For top-level services, the compute rpcapi module needs to check for the
    minimum version of nova-compute in all cells in order to function
    properly during an upgrade. This moves that module to use the all-cells
    variant of this routine in the case where [api_database]/connection is
    set.

    Change-Id: Icddbe4760eaff30e4e13c1e8d3d5d3f489dac3c4
    Closes-Bug: #1807044

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/624677
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=58d3ac01a041326a831d6799b217255881d825a7
Submitter: Zuul
Branch: stable/rocky

commit 58d3ac01a041326a831d6799b217255881d825a7
Author: Dan Smith <email address hidden>
Date: Thu Dec 6 07:56:04 2018 -0800

    Only warn about not having computes nodes once in rpcapi

    When we instantiate the compute rpcapi, with upgrade_levels=auto,
    we determine the minimum service version of nova-compute. If we
    find no computes, we log a message. This may be known, if we are
    starting at zero day and no computes have started, but may also
    indicate a more serious problem. In either case, logging this once
    per process (especially in the case of multi-worker, multi-module
    services) is more than enough.

    Change-Id: I3350026c872a548839b857c53529c1dbbb88091c
    Related-Bug: #1807044
    (cherry picked from commit 183f6238c1d98b04fdcee0fb70212f01f4012ae4)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/624982

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/625051

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/625060

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/624982
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=db6fc29ad2b817652d6bceea6f5bd0bdd1cba3bd
Submitter: Zuul
Branch: stable/rocky

commit db6fc29ad2b817652d6bceea6f5bd0bdd1cba3bd
Author: Dan Smith <email address hidden>
Date: Thu Dec 6 08:07:37 2018 -0800

    Make compute rpcapi version calculation check all cells

    For top-level services, the compute rpcapi module needs to check for the
    minimum version of nova-compute in all cells in order to function
    properly during an upgrade. This moves that module to use the all-cells
    variant of this routine in the case where [api_database]/connection is
    set.

    Depends-On: https://review.openstack.org/625039/

    Change-Id: Icddbe4760eaff30e4e13c1e8d3d5d3f489dac3c4
    Closes-Bug: #1807044
    (cherry picked from commit d779b33d72d6ef41651ce7b93fe982f121bae2d7)

Revision history for this message
Matt Riedemann (mriedem) wrote :

Note this resulted in a side effect bug 1815697 when there is at least one down cell in the deployment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.0

This issue was fixed in the openstack/nova 18.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/625051
Reason: Not worth the complexity at this point for queens.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/625060
Reason: Not worth the complexity at this point for queens.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/675148

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/675148
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d7d58509d5e60ec19c6310931dc62eeff033595
Submitter: Zuul
Branch: master

commit 7d7d58509d5e60ec19c6310931dc62eeff033595
Author: Matt Riedemann <email address hidden>
Date: Wed Aug 7 12:23:15 2019 -0400

    Add useful error log when _determine_version_cap raises DBNotAllowed

    Change Icddbe4760eaff30e4e13c1e8d3d5d3f489dac3c4 was intended for the
    API service to check all cells for the minimum nova-compute service
    version when [upgrade_levels]/compute=auto.

    That worked in the gate with devstack because we don't configure
    nova-compute with access to the database and run nova-compute with
    a separate nova-cpu.conf so even if nova-compute is on the same
    host as the nova-api service, they aren't using the same config
    file (nova-api runs with nova.conf which has access to the API DB
    obviously).

    The problem is when nova-compute is configured with
    [upgrade_levels]/compute=auto and an [api_database]/connection,
    there are flows that can try to hit the API database directly
    because of the _determine_version_cap method. For example, the
    _sync_power_states periodic task trying to stop an instance,
    or even simple inter-compute communication over RPC like during
    a resize.

    This change simply catches the DBNotAllowed exception, logs a more
    useful error message, and re-raises the exception. In addition,
    the config help for the [api_database] group and "configuration"
    option specifically are updated to mention they should not be set
    on the nova-compute service.

    Change-Id: Iac2911a7a305a9d14bc6dadb364998f3ecb9ce42
    Related-Bug: #1807044
    Closes-Bug: #1839360

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/675714

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/675714
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bd03723a0c1a16d67433658fb486a84bb1bddf02
Submitter: Zuul
Branch: stable/stein

commit bd03723a0c1a16d67433658fb486a84bb1bddf02
Author: Matt Riedemann <email address hidden>
Date: Wed Aug 7 12:23:15 2019 -0400

    Add useful error log when _determine_version_cap raises DBNotAllowed

    Change Icddbe4760eaff30e4e13c1e8d3d5d3f489dac3c4 was intended for the
    API service to check all cells for the minimum nova-compute service
    version when [upgrade_levels]/compute=auto.

    That worked in the gate with devstack because we don't configure
    nova-compute with access to the database and run nova-compute with
    a separate nova-cpu.conf so even if nova-compute is on the same
    host as the nova-api service, they aren't using the same config
    file (nova-api runs with nova.conf which has access to the API DB
    obviously).

    The problem is when nova-compute is configured with
    [upgrade_levels]/compute=auto and an [api_database]/connection,
    there are flows that can try to hit the API database directly
    because of the _determine_version_cap method. For example, the
    _sync_power_states periodic task trying to stop an instance,
    or even simple inter-compute communication over RPC like during
    a resize.

    This change simply catches the DBNotAllowed exception, logs a more
    useful error message, and re-raises the exception. In addition,
    the config help for the [api_database] group and "configuration"
    option specifically are updated to mention they should not be set
    on the nova-compute service.

    NOTE(mriedem): The test was modified to set the LAST_VERSION
    global to None since change I48109d5e32a2e9635c240da1c77f7f6cc7e3c76d
    is not in Stein.

    Change-Id: Iac2911a7a305a9d14bc6dadb364998f3ecb9ce42
    Related-Bug: #1807044
    Closes-Bug: #1839360
    (cherry picked from commit 7d7d58509d5e60ec19c6310931dc62eeff033595)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/679449

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/679449
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7732f0e6f3691025a06f2abfc5a57c71d83e5e72
Submitter: Zuul
Branch: stable/rocky

commit 7732f0e6f3691025a06f2abfc5a57c71d83e5e72
Author: Matt Riedemann <email address hidden>
Date: Wed Aug 7 12:23:15 2019 -0400

    Add useful error log when _determine_version_cap raises DBNotAllowed

    Change Icddbe4760eaff30e4e13c1e8d3d5d3f489dac3c4 was intended for the
    API service to check all cells for the minimum nova-compute service
    version when [upgrade_levels]/compute=auto.

    That worked in the gate with devstack because we don't configure
    nova-compute with access to the database and run nova-compute with
    a separate nova-cpu.conf so even if nova-compute is on the same
    host as the nova-api service, they aren't using the same config
    file (nova-api runs with nova.conf which has access to the API DB
    obviously).

    The problem is when nova-compute is configured with
    [upgrade_levels]/compute=auto and an [api_database]/connection,
    there are flows that can try to hit the API database directly
    because of the _determine_version_cap method. For example, the
    _sync_power_states periodic task trying to stop an instance,
    or even simple inter-compute communication over RPC like during
    a resize.

    This change simply catches the DBNotAllowed exception, logs a more
    useful error message, and re-raises the exception. In addition,
    the config help for the [api_database] group and "configuration"
    option specifically are updated to mention they should not be set
    on the nova-compute service.

    Change-Id: Iac2911a7a305a9d14bc6dadb364998f3ecb9ce42
    Related-Bug: #1807044
    Closes-Bug: #1839360
    (cherry picked from commit 7d7d58509d5e60ec19c6310931dc62eeff033595)
    (cherry picked from commit bd03723a0c1a16d67433658fb486a84bb1bddf02)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.