n-cpu fails init on timeout calling n-cond-cell1

Bug #1789484 reported by Sean McGinnis on 2018-08-28
Affects / Importance / Assigned to:
- OpenStack Compute (nova): Medium, assigned to Dan Smith
- Queens: Medium, assigned to Matt Riedemann
- Rocky: Medium, assigned to Matt Riedemann

Bug Description

We've now seen several cases where n-cpu fails to start in gate runs. During service init it calls out to the cell1 conductor. Logs show n-cond-cell1 started about 1 second before this, but that may have been too soon for it to be accepting RPC calls. The traceback in the n-cpu log includes the message:

"MessagingTimeout: Timed out waiting for a reply to message ID"

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-08-28.log.html#t2018-08-28T19:28:52

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22_determine_version_cap%5C%22%20AND%20tags:%5C%22screen-n-cpu.txt%5C%22&from=7d
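
The failure mode is a bounded wait on an RPC reply: n-cpu probes conductor, and if no reply arrives before the messaging timeout it raises MessagingTimeout. As a rough illustration of the readiness problem (a hedged sketch, not nova code: `wait_for_service`, `ping`, and the timeout values are all hypothetical), a service that must not proceed until a dependency answers could poll like this:

```python
import time


def wait_for_service(ping, timeout=60.0, interval=1.0):
    """Poll ``ping()`` until it succeeds or ``timeout`` seconds elapse.

    ``ping`` is any callable that raises while the dependency is still
    coming up (in nova's case, an RPC probe to conductor) and returns
    normally once it is ready.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            ping()
            return
        except Exception:
            # Dependency not ready yet: give up at the deadline,
            # otherwise back off briefly and retry.
            if time.monotonic() >= deadline:
                raise TimeoutError("service did not become ready in time")
            time.sleep(interval)
```

Nova's actual fix (below) does not add such a helper; conductor already exposes a `wait_until_ready()` check, and the bug was only about *when* it ran relative to manager init.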

Matt Riedemann (mriedem) on 2018-08-28
tags: added: compute gate-failure
Matt Riedemann (mriedem) wrote :

(2:38:34 PM) mriedem: 2018-08-27 14:30:31.159 | + functions-common:_run_under_systemd:1482 : sudo systemctl start <email address hidden>
(2:38:54 PM) mriedem: 2018-08-27 14:30:32.414 | + functions-common:_run_under_systemd:1482 : sudo systemctl start <email address hidden>

http://logs.openstack.org/93/596493/1/check/cinder-tempest-dsvm-lvm-lio-barbican/d716acf/logs/screen-n-cpu.txt.gz#_Aug_27_14_31_35_678991

Aug 27 14:31:35.678991 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: CRITICAL nova [None req-90f5253f-cd43-47e6-b247-24f3404706e8 None None] Unhandled error: MessagingTimeout: Timed out waiting for a reply to message ID e7cfc4318ed54ad2adad8b98ae5211c9
Aug 27 14:31:35.679306 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova Traceback (most recent call last):
Aug 27 14:31:35.679462 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/bin/nova-compute", line 10, in <module>
Aug 27 14:31:35.679584 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova sys.exit(main())
Aug 27 14:31:35.679704 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/cmd/compute.py", line 57, in main
Aug 27 14:31:35.679836 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova topic=compute_rpcapi.RPC_TOPIC)
Aug 27 14:31:35.679960 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/service.py", line 259, in create
Aug 27 14:31:35.680117 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova periodic_interval_max=periodic_interval_max)
Aug 27 14:31:35.680244 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/service.py", line 128, in __init__
Aug 27 14:31:35.680362 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova self.manager = manager_class(host=self.host, *args, **kwargs)
Aug 27 14:31:35.680475 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/compute/manager.py", line 505, in __init__
Aug 27 14:31:35.680600 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova self.compute_api = compute.API()
Aug 27 14:31:35.680724 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/compute/__init__.py", line 39, in API
Aug 27 14:31:35.680846 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova return importutils.import_object(class_name, *args, **kwargs)
Aug 27 14:31:35.680960 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_utils/importutils.py", line 44, in import_object
Aug 27 14:31:35.681076 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova return import_class(import_str)(*args, **kwargs)
Aug 27 14:31:35.681188 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/compute/api.py", line 266, in __init__
Aug 27 14:31:35.681304 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova self.compute_rpcapi = compute_rpcapi.ComputeAPI()
Aug 27 14:31:35.68143...


Matt Riedemann (mriedem) wrote :

Correction to comment 2: we call post_start_hook before joining the service group:

https://github.com/openstack/nova/blob/843bffca47be71995d9120b9bebcc65689a58ad2/nova/service.py#L200

Changed in nova:
status: New → Triaged
importance: Undecided → Medium

Fix proposed to branch: master
Review: https://review.openstack.org/597330

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/598353

Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/597330
Reason: https://review.openstack.org/#/c/598353/ is what we want, I just didn't know we had that conductor wait_until_ready thing.

Reviewed: https://review.openstack.org/598353
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=df058720434cdca2726eb8c16f323a98bfc73ce3
Submitter: Zuul
Branch: master

commit df058720434cdca2726eb8c16f323a98bfc73ce3
Author: Dan Smith <email address hidden>
Date: Thu Aug 30 14:18:16 2018 -0700

    Move conductor wait_until_ready() delay before manager init

    The Service base class already waits for conductor to be responsive
    before allowing service startup to continue. However, it does this
    after the manager __init__, which could do some things expecting
    conductor to be up (and does now in compute). This moves the check
    and wait loop above the manager init so we don't start anything in
    the manager until we know conductor is alive.

    Change-Id: Ib56fdb38839434d504fadccbeda65ab2e267b514
    Closes-Bug: #1789484
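
The ordering change the commit describes can be sketched as follows. This is an illustrative stand-in, not actual nova code: `ConductorAPI`, `ComputeManager`, and the `start_service_*` functions are hypothetical stand-ins for `nova/service.py` and the compute manager.

```python
class ConductorAPI:
    """Stand-in for the conductor API; in nova, wait_until_ready()
    pings conductor over RPC until it responds."""

    def __init__(self):
        self.ready = False

    def wait_until_ready(self):
        self.ready = True


class ComputeManager:
    """Stand-in manager whose __init__ talks to conductor, as the
    compute manager now does (e.g. via _determine_version_cap)."""

    def __init__(self, conductor):
        if not conductor.ready:
            # Models the MessagingTimeout seen in the n-cpu traceback.
            raise RuntimeError("MessagingTimeout: conductor not up yet")
        self.conductor = conductor


def start_service_buggy(conductor):
    # Before the fix: the manager was constructed first, so anything in
    # its __init__ raced conductor startup and could time out.
    manager = ComputeManager(conductor)  # may raise
    conductor.wait_until_ready()
    return manager


def start_service_fixed(conductor):
    # After the fix: block on conductor readiness *before* building the
    # manager, so nothing in manager __init__ runs until conductor is up.
    conductor.wait_until_ready()
    return ComputeManager(conductor)
```

The fix is purely a reordering: the readiness wait that the Service base class already performed is hoisted above manager construction.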

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/599064
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3987f597823c8760e953717d0a7d9ab64a913435
Submitter: Zuul
Branch: stable/rocky

commit 3987f597823c8760e953717d0a7d9ab64a913435
Author: Dan Smith <email address hidden>
Date: Thu Aug 30 14:18:16 2018 -0700

    Move conductor wait_until_ready() delay before manager init

    The Service base class already waits for conductor to be responsive
    before allowing service startup to continue. However, it does this
    after the manager __init__, which could do some things expecting
    conductor to be up (and does now in compute). This moves the check
    and wait loop above the manager init so we don't start anything in
    the manager until we know conductor is alive.

    Change-Id: Ib56fdb38839434d504fadccbeda65ab2e267b514
    Closes-Bug: #1789484
    (cherry picked from commit df058720434cdca2726eb8c16f323a98bfc73ce3)

Reviewed: https://review.openstack.org/599200
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8ca3b075644109795d2c2f277f0bf857c1bd6b88
Submitter: Zuul
Branch: stable/queens

commit 8ca3b075644109795d2c2f277f0bf857c1bd6b88
Author: Dan Smith <email address hidden>
Date: Thu Aug 30 14:18:16 2018 -0700

    Move conductor wait_until_ready() delay before manager init

    The Service base class already waits for conductor to be responsive
    before allowing service startup to continue. However, it does this
    after the manager __init__, which could do some things expecting
    conductor to be up (and does now in compute). This moves the check
    and wait loop above the manager init so we don't start anything in
    the manager until we know conductor is alive.

    Change-Id: Ib56fdb38839434d504fadccbeda65ab2e267b514
    Closes-Bug: #1789484
    (cherry picked from commit df058720434cdca2726eb8c16f323a98bfc73ce3)
    (cherry picked from commit 3987f597823c8760e953717d0a7d9ab64a913435)

This issue was fixed in the openstack/nova 17.0.6 release.

This issue was fixed in the openstack/nova 18.0.1 release.

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
