grenade jobs failing due to "Timed out waiting for response from cell" in scheduler

Bug #1844929 reported by Matt Riedemann on 2019-09-22
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

Seen here:

https://zuul.opendev.org/t/openstack/build/d53346210978403f888b85b82b2fe0c7/log/logs/screen-n-sch.txt.gz?severity=3#2368

Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90

Looks like something is causing timeouts reaching cell1 during grenade runs. The only errors I see in the rabbit logs are these for the uwsgi (API) servers:

=ERROR REPORT==== 22-Sep-2019::00:35:30 ===

closing AMQP connection <0.1511.0> (217.182.141.188:48492 -> 217.182.141.188:5672 - uwsgi:19453:72e08501-61ca-4ade-865e-f0605979ed7d):

missed heartbeats from client, timeout: 60s

--

It looks like we don't have mysql logs in this grenade run; maybe we need a fix like this somewhere for grenade:

https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc

logstash shows 1101 hits in the last 7 days, since Sept 17 actually:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timed%20out%20waiting%20for%20response%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d
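
For readability, the URL-encoded query in that link decodes to:

    message:\"Timed out waiting for response from cell\" AND tags:\"screen-n-sch.txt\"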

The hits are in the check and gate queues, and all are failures. It also appears to only show up on fortnebula and OVH nodes, primarily fortnebula. I wonder if there is a performance/timing issue where those nodes are slower and we aren't waiting for something during the grenade upgrade before proceeding.

Matt Riedemann (mriedem) wrote :

I don't see any changes since Sept 17 in nova, grenade or devstack that appear in any way related to this, so I'm guessing it's something being tickled by slower infra nodes.

Matt Riedemann (mriedem) on 2019-09-22
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

Here is a fix to devstack-gate for collecting the mysql logs during grenade runs:

https://review.opendev.org/#/c/684042/

Matt Riedemann (mriedem) wrote :

The confusing thing is it looks like grenade restarts the scheduler:

Sep 22 00:37:32.126839 ubuntu-bionic-ovh-gra1-0011664420 systemd[1]: Stopped Devstack <email address hidden>.

Sep 22 00:45:55.359862 ubuntu-bionic-ovh-gra1-0011664420 systemd[1]: Started Devstack <email address hidden>.
...
Sep 22 00:45:59.927792 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO oslo_service.service [None req-d45bc5b0-cd18-49d3-afe4-a9a40f37aefd None None] Starting 2 workers
...
Sep 22 00:46:00.068988 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [-] Starting scheduler node (version 19.1.0)
...
Sep 22 00:46:00.230765 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [-] Starting scheduler node (version 19.1.0)
...
Sep 22 00:46:00.568873 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] Updating service version for nova-scheduler on ubuntu-bionic-ovh-gra1-0011664420 from 37 to 41
...
Sep 22 00:46:00.599841 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] Creating RPC server for service scheduler {{(pid=19219) start /opt/stack/new/nova/nova/service.py:183}}
...
Sep 22 00:46:00.656407 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] Join ServiceGroup membership for this service scheduler {{(pid=19220)
Sep 22 00:46:00.657550 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.servicegroup.drivers.db [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] DB_Driver: join new ServiceGroup member ubuntu-bionic-ovh-gra1-0011664420 to the scheduler group, service = <Service: host=ubuntu-bionic-ovh-gra1-0011664420, binary=nova-scheduler, manager_class_name=nova.scheduler.manager.SchedulerManager> {{(pid=19220) join /opt/stack/new/nova/nova/servicegroup/drivers/db.py:47}}
...
Sep 22 00:46:00.694504 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] Join ServiceGroup membership for this service scheduler {{(pid=19219) start /opt/stack/new/nova/nova/service.py:201}}
Sep 22 00:46:00.695624 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.servicegroup.drivers.db [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] DB_Driver: join new ServiceGroup member ubuntu-bionic-ovh-gra1-0011664420 to the scheduler group, service = <Service: host=ubuntu-bionic-ovh-gra1-0011664420, binary=nova-scheduler, manager_class_name=nova.scheduler.manager.SchedulerManager> {{(pid=19219) join /opt/stack/new/nova/nova/servicegroup/drivers/db.py:47}}

And then I see this run, which is connecting to the cell1 database to pull compute nodes and instances:

Sep 22 00:46:00.734968 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.scheduler.host_manager [None req-ba4a446d-5a73-4994-8077-366f115fd4ba None None] Total number of compute nodes: 1 {{(pid=18043) _async_init_instance_info /opt/stack/new/nova/nova/scheduler/host_manager.py:428}}

Sep 22 00...


Matt Riedemann (mriedem) wrote :

Looking at logstash again, this goes back to at least Sept 14 (or older; logstash only saves up to 10 days of logs).

Note that as of Sept 13 upper-constraints is using oslo.service 1.40.2:

https://github.com/openstack/requirements/commit/4d3c335b5cd37dee768927b5360debfe4db7f696

Which is important because it has restart changes in it for a long-standing bug with SIGHUP:

https://review.opendev.org/#/c/641907/ (actually that was released in 1.40.1)

We've been using oslo.service 1.40.1 since Sept 3 in upper-constraints:

https://github.com/openstack/requirements/commit/d09bde76d6aed2a5e26c2018acdfa6b4d43f5456

So that might just be a red herring.

Matt Riedemann (mriedem) wrote :

I think comment 4 can be ignored; we're not doing a SIGHUP:

Sep 22 00:37:27.786606 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[25563]: INFO oslo_service.service [None req-91e88f0d-9b5c-4cb7-a5e9-e7309f922832 None None] Caught SIGTERM, stopping children

Matt Riedemann (mriedem) wrote :

Another data point: after the scheduler restart and before that timeout, the verify_instance command, which connects to the cell1 database, runs successfully:

2019-09-22 00:47:24.934 | + /opt/stack/new/grenade/projects/60_nova/resources.sh:verify:175 : nova-manage cell_v2 verify_instance --uuid 44d7efdc-c048-4dca-8b4b-3d518321eddd

2019-09-22 00:47:29.331 | Instance 44d7efdc-c048-4dca-8b4b-3d518321eddd is in cell: cell1 (8acfb79b-2e40-4e1c-bc3d-d404dac6db90)

Matt Riedemann (mriedem) on 2019-09-23
tags: added: gate-failure
Matt Riedemann (mriedem) wrote :

I got mysqld logs published in the grenade jobs and it's pretty clear that we're starting mysqld three times in the grenade run. Two of those I can understand, because the old and new devstack don't know they are in a grenade context and will each (re)start mysqld, but I'm not sure where the third start is coming from. If we're using bionic nodes on both the old and new side I don't think we'd be upgrading mysqld packages, but I need to confirm.

In one recent failure this is where things start to go south in the mysqld logs and that's around the time we hit the cell timeout failures:

https://zuul.opendev.org/t/openstack/build/4085120e390f4f1e971c6ff61304a596/log/logs/mysql/error.txt.gz#213

....

OK it looks like it's the same mysqld package version on all 3 starts:

mysqld (mysqld 5.7.27-0ubuntu0.18.04.1) starting as process

So we're not upgrading the package at all, but we are restarting it.

Matt Riedemann (mriedem) wrote :

The 3 restarts are probably:

1. initial package install on the old side
2. re-config for stack user on the old side and restart
3. re-config (same data) for the stack user on the new side and restart

Reviewed: https://review.opendev.org/684118
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0436a95f37df086ddc99017376cb9a312e40517a
Submitter: Zuul
Branch: master

commit 0436a95f37df086ddc99017376cb9a312e40517a
Author: Matt Riedemann <email address hidden>
Date: Mon Sep 23 14:57:44 2019 -0400

    Log CellTimeout traceback in scatter_gather_cells

    When a call to a cell in scatter_gather_cells times out
    we log a warning and set the did_not_respond_sentinel for
    that cell but it would be useful if we logged the traceback
    with the warning for debugging where the call is happening.

    Change-Id: I8f4069474a3955eea6c967d3090f2960e739224c
    Related-Bug: #1844929
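
For context, here is a simplified sketch of the pattern that commit describes, i.e. how scatter_gather_cells turns a slow cell into a warning plus a did_not_respond_sentinel result. This is not the actual nova code (the real implementation lives in nova/context.py and targets each cell's database); the structure below only illustrates the mechanism, with exc_info=True standing in for the traceback logging the commit adds:

    # Simplified illustration only, not nova's implementation. Assumes
    # eventlet is installed and (as in nova) monkey patching is in effect
    # so that blocking calls inside fn yield to other greenthreads.
    import logging

    import eventlet
    from eventlet import queue

    LOG = logging.getLogger(__name__)

    did_not_respond_sentinel = object()


    class CellTimeout(Exception):
        """Raised when a cell does not respond within the timeout."""


    def scatter_gather_cells(cells, timeout, fn, *args):
        """Call fn for every cell and collect results, with a timeout."""
        results = {}
        q = queue.LightQueue()

        def gather_result(cell):
            # In nova this runs with the request context targeted to the
            # cell's database before calling fn.
            q.put((cell, fn(cell, *args)))

        for cell in cells:
            eventlet.spawn(gather_result, cell)

        try:
            with eventlet.Timeout(timeout, CellTimeout):
                while len(results) < len(cells):
                    cell, result = q.get()
                    results[cell] = result
        except CellTimeout:
            for cell in cells:
                if cell not in results:
                    # This is the warning seen throughout this bug; the
                    # commit above adds the traceback (exc_info=True here).
                    LOG.warning('Timed out waiting for response from '
                                'cell %s', cell, exc_info=True)
                    results[cell] = did_not_respond_sentinel
        return results

The did_not_respond_sentinel result is what callers then check for, which is presumably where the later "Cell ... is not responding and hence instance info is not available" API errors in this bug come from.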

Matt Riedemann (mriedem) wrote :

With the traceback logging patch applied, the traceback is not really useful:

Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: WARNING nova.context [None req-0db41537-a738-4425-bf10-0abccf51ac05 demo demo] Timed out waiting for response from cell: CellTimeout: Timeout waiting for response from cell
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context Traceback (most recent call last):
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/opt/stack/new/nova/nova/context.py", line 443, in scatter_gather_cells
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context cell_uuid, result = queue.get()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/queue.py", line 322, in get
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return waiter.wait()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/queue.py", line 141, in wait
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return get_hub().switch()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 298, in switch
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return self.greenlet.switch()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context CellTimeout: Timeout waiting for response from cell
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context
Oct 08 17:02:27.231340 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: WARNING nova.scheduler.host_manager [None req-0db41537-a738-4425-bf10-0abccf51ac05 demo demo] Timeout getting computes for cell 0d45ce2b-0fcc-4d71-9578-84b44fb2fbf0
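
In hindsight the unhelpful traceback makes sense: the CellTimeout is thrown by the eventlet hub into the greenthread that is parked in queue.get(), so the stack can only ever show the wait site, never the per-cell database call that was actually slow. A minimal standalone reproduction of that shape (assumes eventlet is installed; CellTimeout here is just a stand-in for nova's exception):

    import traceback

    import eventlet
    from eventlet import queue


    class CellTimeout(Exception):
        pass


    q = queue.LightQueue()
    try:
        # Nothing ever puts an item on the queue, so the one-second timer
        # fires and raises CellTimeout at the blocked q.get() call, giving
        # the same queue.get/waiter.wait/hub.switch stack as the log above.
        with eventlet.Timeout(1, CellTimeout):
            q.get()
    except CellTimeout:
        traceback.print_exc()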

Matt Riedemann (mriedem) wrote :

Also seeing this in a ceph job in n-api:

https://d494348350733031166c-4e71828f84900af50a9a26357b84a827.ssl.cf1.rackcdn.com/689842/5/check/devstack-plugin-ceph-tempest/962455b/controller/logs/screen-n-api.txt.gz

Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi [None req-a94464d7-d36c-405f-a42f-344e543cb205 tempest-TestShelveInstance-198061268 tempest-TestShelveInstance-198061268] Unexpected exception in API method: NovaException: Cell 901092b8-6de2-4aad-b21a-e1c21691eb30 is not responding and hence instance info is not available.
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi Traceback (most recent call last):
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/wsgi.py", line 671, in wrapped
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/compute/servers.py", line 471, in show
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/compute/servers.py", line 374, in _get_server
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/common.py", line 472, in get_instance
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2605, in get
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi expected_attrs, cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2552, in _get_instance
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi expected_attrs, cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2545, in _get_instance_from_cell
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.opensta...


Reviewed: https://review.opendev.org/690417
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9377d00ccf7a73071b4fb75d66ce5ad7bd321174
Submitter: Zuul
Branch: master

commit 9377d00ccf7a73071b4fb75d66ce5ad7bd321174
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 22 17:10:19 2019 -0400

    Revert "Log CellTimeout traceback in scatter_gather_cells"

    This reverts commit 0436a95f37df086ddc99017376cb9a312e40517a.

    This was meant to get us more debug details when hitting the
    failure but the results are not helpful [1] so revert this
    and the fix for the resulting regression [2].

    [1] http://paste.openstack.org/show/782116/
    [2] I7f9edc9a4b4930f4dce98df271888fa8082a1701

    Change-Id: Iab8029f081a654278ea7dbbec79a766aea6764ae
    Related-Bug: #1844929

Matt Riedemann (mriedem) wrote :

I just noticed this in a nova functional job as well:

https://zuul.opendev.org/t/openstack/build/63001bbd58c244cea70c995f1ebf61fb/log/job-output.txt#3092

In this case it looks like a result of a slow node:

2019-11-05 23:12:57.389624 | ubuntu-bionic | 2019-11-05 23:10:12,022 INFO [nova.scheduler.host_manager] Received an update from an unknown host 'host2'. Re-created its InstanceList.

2019-11-05 23:12:57.389814 | ubuntu-bionic | 2019-11-05 23:10:12,569 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572/servers/9ceb2e4d-9bb3-425b-9454-b171221d6c93" status: 200 len: 2019 microversion: 2.15 time: 0.048796

2019-11-05 23:12:57.389957 | ubuntu-bionic | 2019-11-05 23:11:23,400 WARNING [oslo.service.loopingcall] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 59.77 sec

2019-11-05 23:12:57.390092 | ubuntu-bionic | 2019-11-05 23:11:23,414 WARNING [nova.context] Timed out waiting for response from cell 3d90ac16-9173-4df8-8ef9-1d713ccd8a98

2019-11-05 23:12:57.390189 | ubuntu-bionic | 2019-11-05 23:11:23,415 ERROR [nova.api.openstack.wsgi] Unexpected exception in API method

2019-11-05 23:12:57.390236 | ubuntu-bionic | Traceback (most recent call last):

2019-11-05 23:12:57.390299 | ubuntu-bionic | File "nova/api/openstack/wsgi.py", line 671, in wrapped

2019-11-05 23:12:57.390340 | ubuntu-bionic | return f(*args, **kwargs)

2019-11-05 23:12:57.390406 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390450 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390515 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390559 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390625 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390668 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390734 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390778 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390850 | ubuntu-bionic | File "nova/api/openstack/compute/evacuate.py", line 85, in _evacuate

2019-11-05 23:12:57.390918 | ubuntu-bionic | instance = common.get_instance(self.compute_api, context, id)

2019-11-05 23:12:57.390986 | ubuntu-bionic | File "nova/api/openstack/common.py", line 472, in get_instance

2019-11-05 23:12:57.391035 | ubuntu-bionic | cell_down_support=cell_down_support)

2019-11-05 23:12:57.391104 | ubuntu-bionic | File "nova/compute/api.py", line 2722, in get

2019-11-05 23:12:57.391173 | ubuntu-bionic | expected_attrs, cell_down_support=cell_down_support)

2019-11-05 23:12:57.391236 | ubuntu-bionic | File "nova/compute/api.py", line 2669, in _get_instance

2019-11-05 23:12:57.391284 | ubuntu-bionic | expected_attrs, cell_down_support)

2019-11-05 23:12:57.391354 | ubuntu-bionic | File "nova/compute/api.py", line 2662, in _get_instance_from...

