grenade jobs failing due to "Timed out waiting for response from cell" in scheduler

Bug #1844929 reported by Matt Riedemann on 2019-09-22
Affects                    Importance  Assigned to
OpenStack Compute (nova)   High        melanie witt
Queens                     Undecided   Elod Illes
Rocky                      Undecided   melanie witt
Stein                      Undecided   melanie witt
Train                      High        melanie witt
grenade                    Undecided   Unassigned

Bug Description

Seen here:

https://zuul.opendev.org/t/openstack/build/d53346210978403f888b85b82b2fe0c7/log/logs/screen-n-sch.txt.gz?severity=3#2368

Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90

Looks like something is causing timeouts reaching cell1 during grenade runs. The only errors I see in the rabbit logs are these for the uwsgi (API) servers:

=ERROR REPORT==== 22-Sep-2019::00:35:30 ===

closing AMQP connection <0.1511.0> (217.182.141.188:48492 -> 217.182.141.188:5672 - uwsgi:19453:72e08501-61ca-4ade-865e-f0605979ed7d):

missed heartbeats from client, timeout: 60s

--

It looks like we don't have mysql logs in this grenade run, maybe we need a fix like this somewhere for grenade:

https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc

logstash shows 1101 hits in the last 7 days, since Sept 17 actually:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timed%20out%20waiting%20for%20response%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d

check and gate queues, all failures. It also appears to only show up on fortnebula and OVH nodes, primarily fortnebula. I wonder if there is a performance/timing issue if those nodes are slower and we aren't waiting for something during the grenade upgrade before proceeding.

Matt Riedemann (mriedem) wrote :

I don't see any changes since Sept 17 in nova, grenade or devstack that appear in any way related to this, so I'm guessing it's something being tickled by slower infra nodes.

Matt Riedemann (mriedem) on 2019-09-22
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

Here is a fix to devstack-gate for collecting the mysql logs during grenade runs:

https://review.opendev.org/#/c/684042/

Matt Riedemann (mriedem) wrote :

The confusing thing is it looks like grenade restarts the scheduler:

Sep 22 00:37:32.126839 ubuntu-bionic-ovh-gra1-0011664420 systemd[1]: Stopped Devstack <email address hidden>.

Sep 22 00:45:55.359862 ubuntu-bionic-ovh-gra1-0011664420 systemd[1]: Started Devstack <email address hidden>.
...
Sep 22 00:45:59.927792 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO oslo_service.service [None req-d45bc5b0-cd18-49d3-afe4-a9a40f37aefd None None] Starting 2 workers
...
Sep 22 00:46:00.068988 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [-] Starting scheduler node (version 19.1.0)
...
Sep 22 00:46:00.230765 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [-] Starting scheduler node (version 19.1.0)
...
Sep 22 00:46:00.568873 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] Updating service version for nova-scheduler on ubuntu-bionic-ovh-gra1-0011664420 from 37 to 41
...
Sep 22 00:46:00.599841 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] Creating RPC server for service scheduler {{(pid=19219) start /opt/stack/new/nova/nova/service.py:183}}
...
Sep 22 00:46:00.656407 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] Join ServiceGroup membership for this service scheduler {{(pid=19220)
Sep 22 00:46:00.657550 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.servicegroup.drivers.db [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] DB_Driver: join new ServiceGroup member ubuntu-bionic-ovh-gra1-0011664420 to the scheduler group, service = <Service: host=ubuntu-bionic-ovh-gra1-0011664420, binary=nova-scheduler, manager_class_name=nova.scheduler.manager.SchedulerManager> {{(pid=19220) join /opt/stack/new/nova/nova/servicegroup/drivers/db.py:47}}
...
Sep 22 00:46:00.694504 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] Join ServiceGroup membership for this service scheduler {{(pid=19219) start /opt/stack/new/nova/nova/service.py:201}}
Sep 22 00:46:00.695624 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.servicegroup.drivers.db [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] DB_Driver: join new ServiceGroup member ubuntu-bionic-ovh-gra1-0011664420 to the scheduler group, service = <Service: host=ubuntu-bionic-ovh-gra1-0011664420, binary=nova-scheduler, manager_class_name=nova.scheduler.manager.SchedulerManager> {{(pid=19219) join /opt/stack/new/nova/nova/servicegroup/drivers/db.py:47}}

And then I see this run which is connecting to the cell1 database to pull compute nodes and instances:

Sep 22 00:46:00.734968 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.scheduler.host_manager [None req-ba4a446d-5a73-4994-8077-366f115fd4ba None None] Total number of compute nodes: 1 {{(pid=18043) _async_init_instance_info /opt/stack/new/nova/nova/scheduler/host_manager.py:428}}

Sep 22 00...


Matt Riedemann (mriedem) wrote :

Looking at logstash again this goes back to at least Sept 14 (or older, logstash only saves up to 10 days of logs).

Note that as of Sept 13 upper-constraints is using oslo.service 1.40.2:

https://github.com/openstack/requirements/commit/4d3c335b5cd37dee768927b5360debfe4db7f696

Which is important because it has restart changes in it for a long-standing bug with SIGHUP:

https://review.opendev.org/#/c/641907/ (actually that was released in 1.40.1)

We've been using oslo.service 1.40.1 since Sept 3 in upper-constraints:

https://github.com/openstack/requirements/commit/d09bde76d6aed2a5e26c2018acdfa6b4d43f5456

So that might just be a red herring.

Matt Riedemann (mriedem) wrote :

I think comment 4 can be ignored, we're not doing a SIGHUP:

Sep 22 00:37:27.786606 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[25563]: INFO oslo_service.service [None req-91e88f0d-9b5c-4cb7-a5e9-e7309f922832 None None] Caught SIGTERM, stopping children

Matt Riedemann (mriedem) wrote :

Another data point: post-restart of the scheduler and before that timeout, the verify_instance command, which connects to the cell1 database, runs successfully:

2019-09-22 00:47:24.934 | + /opt/stack/new/grenade/projects/60_nova/resources.sh:verify:175 : nova-manage cell_v2 verify_instance --uuid 44d7efdc-c048-4dca-8b4b-3d518321eddd

2019-09-22 00:47:29.331 | Instance 44d7efdc-c048-4dca-8b4b-3d518321eddd is in cell: cell1 (8acfb79b-2e40-4e1c-bc3d-d404dac6db90)

Matt Riedemann (mriedem) on 2019-09-23
tags: added: gate-failure
Matt Riedemann (mriedem) wrote :

I got mysqld logs published in the grenade jobs and it's pretty clear that we're starting mysqld 3 times in the grenade run. Two of those I can understand, because the old and new devstack don't know they are in a grenade context and will each (re)start mysqld, but I'm not sure where the 3rd start is coming from. If we're using bionic nodes on both the old and new side I don't think we'd be upgrading mysqld packages, but I need to confirm.

In one recent failure this is where things start to go south in the mysqld logs and that's around the time we hit the cell timeout failures:

https://zuul.opendev.org/t/openstack/build/4085120e390f4f1e971c6ff61304a596/log/logs/mysql/error.txt.gz#213

....

OK it looks like it's the same mysqld package version on all 3 starts:

mysqld (mysqld 5.7.27-0ubuntu0.18.04.1) starting as process

So we're not upgrading the package at all, but we are restarting it.

Matt Riedemann (mriedem) wrote :

The 3 restarts are probably:

1. initial package install on the old side
2. re-config for stack user on the old side and restart
3. re-config (same data) for the stack user on the new side and restart

Reviewed: https://review.opendev.org/684118
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0436a95f37df086ddc99017376cb9a312e40517a
Submitter: Zuul
Branch: master

commit 0436a95f37df086ddc99017376cb9a312e40517a
Author: Matt Riedemann <email address hidden>
Date: Mon Sep 23 14:57:44 2019 -0400

    Log CellTimeout traceback in scatter_gather_cells

    When a call to a cell in scatter_gather_cells times out
    we log a warning and set the did_not_respond_sentinel for
    that cell but it would be useful if we logged the traceback
    with the warning for debugging where the call is happening.

    Change-Id: I8f4069474a3955eea6c967d3090f2960e739224c
    Related-Bug: #1844929
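The warn-and-set-sentinel behavior that this commit message describes can be sketched outside of nova. Nova's real implementation lives in nova/context.py and uses eventlet greenthreads; the following is a plain-threading analogue, and the names scatter_gather, DID_NOT_RESPOND, and CELL_TIMEOUT are illustrative only:

```python
# Plain-threading analogue of the scatter/gather-with-sentinel pattern
# (illustrative names; nova's actual code uses eventlet greenthreads).
import queue
import threading
import time

CELL_TIMEOUT = 2            # seconds; nova's value is configurable
DID_NOT_RESPOND = object()  # sentinel recorded for cells that time out


def scatter_gather(cells, fn, timeout=CELL_TIMEOUT):
    """Run fn(cell) concurrently for every cell; return {cell: result}.

    Cells that do not answer within the timeout get the
    did-not-respond sentinel instead of a result.
    """
    results_q = queue.Queue()

    def worker(cell):
        results_q.put((cell, fn(cell)))

    for cell in cells:
        threading.Thread(target=worker, args=(cell,), daemon=True).start()

    results = {}
    deadline = time.monotonic() + timeout
    for _ in cells:
        remaining = deadline - time.monotonic()
        try:
            cell, result = results_q.get(timeout=max(remaining, 0))
            results[cell] = result
        except queue.Empty:
            break  # timed out; remaining cells get the sentinel
    for cell in cells:
        results.setdefault(cell, DID_NOT_RESPOND)
    return results
```

A caller then checks each cell's result against the sentinel and logs the "Timed out waiting for response from cell" warning for any cell that got it.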

Matt Riedemann (mriedem) wrote :

With the traceback logging patch applied it's not really useful:

Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: WARNING nova.context [None req-0db41537-a738-4425-bf10-0abccf51ac05 demo demo] Timed out waiting for response from cell: CellTimeout: Timeout waiting for response from cell
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context Traceback (most recent call last):
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/opt/stack/new/nova/nova/context.py", line 443, in scatter_gather_cells
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context cell_uuid, result = queue.get()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/queue.py", line 322, in get
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return waiter.wait()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/queue.py", line 141, in wait
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return get_hub().switch()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 298, in switch
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return self.greenlet.switch()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context CellTimeout: Timeout waiting for response from cell
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context
Oct 08 17:02:27.231340 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: WARNING nova.scheduler.host_manager [None req-0db41537-a738-4425-bf10-0abccf51ac05 demo demo] Timeout getting computes for cell 0d45ce2b-0fcc-4d71-9578-84b44fb2fbf0

Matt Riedemann (mriedem) wrote :

Also seeing this in a ceph job in n-api:

https://d494348350733031166c-4e71828f84900af50a9a26357b84a827.ssl.cf1.rackcdn.com/689842/5/check/devstack-plugin-ceph-tempest/962455b/controller/logs/screen-n-api.txt.gz

Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi [None req-a94464d7-d36c-405f-a42f-344e543cb205 tempest-TestShelveInstance-198061268 tempest-TestShelveInstance-198061268] Unexpected exception in API method: NovaException: Cell 901092b8-6de2-4aad-b21a-e1c21691eb30 is not responding and hence instance info is not available.
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi Traceback (most recent call last):
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/wsgi.py", line 671, in wrapped
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/compute/servers.py", line 471, in show
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/compute/servers.py", line 374, in _get_server
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/common.py", line 472, in get_instance
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2605, in get
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi expected_attrs, cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2552, in _get_instance
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi expected_attrs, cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2545, in _get_instance_from_cell
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.opensta...


Reviewed: https://review.opendev.org/690417
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9377d00ccf7a73071b4fb75d66ce5ad7bd321174
Submitter: Zuul
Branch: master

commit 9377d00ccf7a73071b4fb75d66ce5ad7bd321174
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 22 17:10:19 2019 -0400

    Revert "Log CellTimeout traceback in scatter_gather_cells"

    This reverts commit 0436a95f37df086ddc99017376cb9a312e40517a.

    This was meant to get us more debug details when hitting the
    failure but the results are not helpful [1] so revert this
    and the fix for the resulting regression [2].

    [1] http://paste.openstack.org/show/782116/
    [2] I7f9edc9a4b4930f4dce98df271888fa8082a1701

    Change-Id: Iab8029f081a654278ea7dbbec79a766aea6764ae
    Related-Bug: #1844929

Matt Riedemann (mriedem) wrote :

I just noticed this in a nova functional job as well:

https://zuul.opendev.org/t/openstack/build/63001bbd58c244cea70c995f1ebf61fb/log/job-output.txt#3092

In this case it looks like a result of a slow node:

2019-11-05 23:12:57.389624 | ubuntu-bionic | 2019-11-05 23:10:12,022 INFO [nova.scheduler.host_manager] Received an update from an unknown host 'host2'. Re-created its InstanceList.

2019-11-05 23:12:57.389814 | ubuntu-bionic | 2019-11-05 23:10:12,569 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572/servers/9ceb2e4d-9bb3-425b-9454-b171221d6c93" status: 200 len: 2019 microversion: 2.15 time: 0.048796

2019-11-05 23:12:57.389957 | ubuntu-bionic | 2019-11-05 23:11:23,400 WARNING [oslo.service.loopingcall] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 59.77 sec

2019-11-05 23:12:57.390092 | ubuntu-bionic | 2019-11-05 23:11:23,414 WARNING [nova.context] Timed out waiting for response from cell 3d90ac16-9173-4df8-8ef9-1d713ccd8a98

2019-11-05 23:12:57.390189 | ubuntu-bionic | 2019-11-05 23:11:23,415 ERROR [nova.api.openstack.wsgi] Unexpected exception in API method

2019-11-05 23:12:57.390236 | ubuntu-bionic | Traceback (most recent call last):

2019-11-05 23:12:57.390299 | ubuntu-bionic | File "nova/api/openstack/wsgi.py", line 671, in wrapped

2019-11-05 23:12:57.390340 | ubuntu-bionic | return f(*args, **kwargs)

2019-11-05 23:12:57.390406 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390450 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390515 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390559 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390625 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390668 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390734 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390778 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390850 | ubuntu-bionic | File "nova/api/openstack/compute/evacuate.py", line 85, in _evacuate

2019-11-05 23:12:57.390918 | ubuntu-bionic | instance = common.get_instance(self.compute_api, context, id)

2019-11-05 23:12:57.390986 | ubuntu-bionic | File "nova/api/openstack/common.py", line 472, in get_instance

2019-11-05 23:12:57.391035 | ubuntu-bionic | cell_down_support=cell_down_support)

2019-11-05 23:12:57.391104 | ubuntu-bionic | File "nova/compute/api.py", line 2722, in get

2019-11-05 23:12:57.391173 | ubuntu-bionic | expected_attrs, cell_down_support=cell_down_support)

2019-11-05 23:12:57.391236 | ubuntu-bionic | File "nova/compute/api.py", line 2669, in _get_instance

2019-11-05 23:12:57.391284 | ubuntu-bionic | expected_attrs, cell_down_support)

2019-11-05 23:12:57.391354 | ubuntu-bionic | File "nova/compute/api.py", line 2662, in _get_instance_from...


Matt Riedemann (mriedem) wrote :

(11:04:52 AM) mriedem: unrelated, but would like to do something about http://status.openstack.org/elastic-recheck/#1844929 since we have no fixes in sight
(11:05:04 AM) mriedem: looking at logstash, that's primarily hitting on ovh nodes,
(11:05:17 AM) mriedem: i wonder if there is a way to exclude certain node providers from grenade jobs?
(11:05:29 AM) mriedem: i know you can specify a label to say a job should run on a given provider, but is there a NOT version of that?
(11:05:36 AM) clarkb: mriedem: we would have to make new nodepool flavors. I think there is a really good chance that swapping is what causes those problems
(11:05:43 AM) clarkb: particularly in ovh where we get fewer iops
(11:06:50 AM) mriedem: hmm, ok - i'm kind of looking for any workaround atm because that bug has been crushing us for months now

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/700214
Reason: Need to do this via the feature support test matrix in devstack-gate:

https://review.opendev.org/#/c/700233/

melanie witt (melwitt) wrote :

I dug into this more about a month ago and unfortunately came up empty-handed. Going to dump some info I gathered at the time on a DNM patch [1] here for the record.

"Looking at the mysql error log:

https://zuul.opendev.org/t/openstack/build/833a46b05c9641b9b22b3ee7f394e80b/log/logs/mysql/error.txt.gz

I see lots of errors [2] that I think must be why we get the cell timeouts:

...

2020-01-09T00:11:53.330894Z 157 [Note] Aborted connection 157 to db: 'keystone' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:12:19.047684Z 158 [Note] Aborted connection 158 to db: 'placement' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:13:09.703587Z 161 [Note] Aborted connection 161 to db: 'glance' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:13:09.703775Z 162 [Note] Aborted connection 162 to db: 'glance' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:41.215651Z 171 [Note] Aborted connection 171 to db: 'nova_api' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.841855Z 173 [Note] Aborted connection 173 to db: 'nova_cell0' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.842886Z 174 [Note] Aborted connection 174 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.958218Z 172 [Note] Aborted connection 172 to db: 'nova_api' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:48.749196Z 178 [Note] Aborted connection 178 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:48.749216Z 179 [Note] Aborted connection 179 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)

...

I note that all services are getting the same error, not just nova. Includes cinder, neutron, placement, glance, keystone, and nova."

"For this failure, trying to correlate the cell timeout with interesting things from mysql and dstat logging:

screen-n-sch.txt [3]:

Jan 16 18:12:50.807350 ubuntu-bionic-ovh-bhs1-0013905596 nova-scheduler[17316]: WARNING nova.context [None req-d3eeda8c-74d5-451c-8262-e80fa1e45c0f tempest-ServersTestManualDisk-965972951 tempest-ServersTestManualDisk-965972951] Timed out waiting for response from cell 948fa69c-ada4-488c-9808-83dd645d7069

mysql error.txt [4]:

2020-01-16T18:12:41.934112Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4279ms. The settings might not be optimal. (flushed=4 and evicted=0, during the time.)

screen-dstat.txt [5]:

Jan 16 18:12:40.399103 ubuntu-bionic-ovh-bhs1-0013905596 dstat.sh[1006]: 16-01 18:12:40| 8 1 66 25 0|4785M 168M 438M 2024M| 46B 46B| 0 8008k| 0 144 |2371 4119 |2.27 2.00 2.20| 0 2.0 2.0| 0 0 |uwsgi 164902.5%3585B 62k|python2 1036 179k 457B 0%|mysqld 517M|4096k 8188M| 29 334 0 739 0"

Latest query shows all hits showing up on OVH nodes.

[1] https://review.opendev.org/701...


melanie witt (melwitt) wrote :

I started looking at this again after lyarwood mentioned it in the nova meeting today.

Looking at the logs/mysql/error.txt of some successful grenade runs, there are a lot of messages like this regarding aborted connections:

2020-03-12T19:07:34.435762Z 4 [Note] Aborted connection 4 to db: 'keystone' user: 'root' host: 'localhost' (Got an error reading communication packets)

so that looks likely to be a red herring.

The mysql logs aren't indexed by our logstash, so looking at a few by hand, there seems to be a consistent pattern in the failed jobs that is not present in the successful jobs:

2020-03-11T12:09:58.384142Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18540ms. The settings might not be optimal. (flushed=200 and evicted=0, during the time.)

2020-03-11T11:40:53.524707Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4382ms. The settings might not be optimal. (flushed=3 and evicted=0, during the time.)

2020-03-11T11:41:05.482158Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8976ms. The settings might not be optimal. (flushed=44 and evicted=0, during the time.)

2020-03-11T11:41:41.406597Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4915ms. The settings might not be optimal. (flushed=200 and evicted=0, during the time.)

2020-03-11T10:37:04.469735Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5434ms. The settings might not be optimal. (flushed=5 and evicted=0, during the time.)

I googled about this and learned it's a periodic task in mysql server that flushes dirty pages every second [1]. From the stackoverflow answer, they say:

"Once per second, the page cleaner scans the buffer pool for dirty pages to flush from the buffer pool to disk. The warning you saw shows that it has lots of dirty pages to flush, and it takes over 4 seconds to flush a batch of them to disk, when it should complete that work in under 1 second. In other words, it's biting off more than it can chew."

They go on to say that this issue can be exacerbated if it's happening on a machine with slow disks as that would also cause the page cleaning to fall behind.

The person who asked the question solved their issue by setting innodb_lru_scan_depth=256 to make the page cleaner process smaller chunks at a time (default is 1024). The person who answered the question noted that this will only work if page cleaner can keep up with the average rate of creating new dirty pages. If it cannot, the flushing rate will be automatically increased once innodb_max_dirty_page_pct is exceeded and may result in page cleaner warnings all over again.

They say:

"Another solution would be to put MySQL on a server with faster disks. You need an I/O system that can handle the throughput demanded by your page flushing.

If you see this warning all the time under average traffic, you might be trying to do too many write queries on this MySQL server. It might be time to scale out, and split the writes over multiple MySQL instances, each with their own disk system."

This again seems to point back at slow nodes.

I'm trying out a DNM devstack patch [2] to set innodb_lru_scan_depth=256 and keep rechecking the DNM nova cha...

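For reference, devstack applies mysql settings through its local.conf post-config mechanism; a hypothetical fragment along the lines of the DNM patch (the actual patch content is in the review linked above) might look like:

```ini
[[post-config|$MYSQL_CONF]]
[mysqld]
# have the page cleaner process smaller batches per pass (default 1024)
innodb_lru_scan_depth = 256
```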

melanie witt (melwitt) wrote :

Here's a link to a related ML thread about gate failures on slow nodes that mentions the innodb page cleaner warning. Several posts in the thread show investigation that is relevant to this bug:

http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010504.html

melanie witt (melwitt) wrote :

> I'm trying out a DNM devstack patch [2] to set innodb_lru_scan_depth=256 and keep rechecking the DNM nova change [3] to see if it has any effect on the failures.

Well, this seems to have the opposite of the desired effect [4]:

2020-03-12T22:59:10.304511Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 7728ms. The settings might not be optimal. (flushed=30 and evicted=0, during the time.)
2020-03-12T22:59:33.848105Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 11540ms. The settings might not be optimal. (flushed=1 and evicted=0, during the time.)
2020-03-12T23:00:23.206041Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 11356ms. The settings might not be optimal. (flushed=1 and evicted=0, during the time.)
2020-03-12T23:00:44.698049Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8507ms. The settings might not be optimal. (flushed=7 and evicted=0, during the time.)
2020-03-12T23:01:21.877705Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6176ms. The settings might not be optimal. (flushed=17 and evicted=0, during the time.)

but at least I can see the setting change did something.

The job run also happened not to fail this time. I'm going to try out some other settings related to the buffer pool next.

[1] https://stackoverflow.com/questions/41134785/how-to-solve-mysql-warning-innodb-page-cleaner-1000ms-intended-loop-took-xxx
[2] https://review.opendev.org/712805
[3] https://review.opendev.org/701478
[4] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_18a/701478/5/check/grenade-py3/18a9555/controller/logs/mysql/error_log.txt

melanie witt (melwitt) wrote :

I tried some settings in devstack:

[mysqld]
innodb_lru_scan_depth 256
innodb_buffer_pool_size 1G
innodb_buffer_pool_instances 2

but still got a nova-grenade-multinode failure [1] after some rechecks. This time, there is no warning message about page cleaner being behind. So that proves that the page cleaner message is not a fingerprint for this bug.

Recall that I am running tests on a patch that has enabled db connection debug logging via:

[database]
use_db_reconnect "True"
connection_debug "100"

With that in mind, looking at the screen-n-sch.txt log on the failed nova-grenade-multinode run [1], I see one query result for compute_nodes [2], but after that a compute_nodes query result is never logged again.

Contrast with a passing grenade-py3 run [3], where there are a total of 22 query results for compute_nodes logged in screen-n-sch.txt.

This is showing that we are consistently not receiving rows back from the database in the fail case.

[1] https://zuul.opendev.org/t/openstack/build/8c91fd21815148d9894ac2bf60893a9e
[2] http://paste.openstack.org/show/790679
[3] https://zuul.opendev.org/t/openstack/build/9de69db4ffba4ceeba73b959653618e2/log/controller/logs/screen-n-sch.txt

melanie witt (melwitt) wrote :

Another data point: I tried a new PS of my DNM patch that reverts the commit which added use of the scatter gather utility (eventlet greenthreads) to the scheduler host manager [1][2], with a Depends-On revert on stable/train too (since grenade upgrades N-1 => N) [3]. I did this to test whether we would still fail to retrieve compute_nodes rows from the database when _not_ using scatter gather, or whether the problem would disappear and indicate a bug or problem in the scatter gather utility itself.

We just got a failing result back [4] and it shows the same symptoms as with scatter gather in place: we succeed at receiving rows back from the compute_nodes table once and then never again.

This failure happened on an OVH node.

[1] https://github.com/openstack/nova/commit/fdea8b723ba5a25ea9dc0917401fbb1401e05ee3
[2] https://review.opendev.org/#/c/701478/7
[3] https://review.opendev.org/#/c/713116/1
[4] https://zuul.opendev.org/t/openstack/build/1e61d0fca1f04b8583ef7463091ec7be/log/logs/screen-n-sch.txt#1041

melanie witt (melwitt) wrote :

After more digging than I'd care to admit, I think I have finally gotten to the bottom of what's happening with this bug.

Through DNM patch debug logging in oslo.db [1], I found that during a grenade run, after a nova-scheduler service stop and start, child processes of the nova-scheduler (workers) occasionally started off with already-locked internal oslo.db locks. This can happen if requests are flowing in to the service while it is in the middle of forking child process workers: the first database request fires and takes the lock, and then child processes are forked while the lock is held.

When this happened, database accesses for the particular cell-cached database transaction context manager object could never acquire the lock and would just get stuck, eventually failing with a CellTimeout error.

Here's aggregated snippets of the DNM patch debug logging showing the inherited held locks:

http://paste.openstack.org/show/791646

This behavior of not "resetting" or sanitizing standard library locks at fork is a known issue in python [2] that's currently being worked on.

In the meantime, I think we can handle this on our side by clearing our cell cache that holds oslo.db database transaction context manager objects during service start(). This way, we get fresh oslo.db locks that are in an unlocked state when a child process begins.

[1] https://review.opendev.org/#/c/714802/6
[2] https://bugs.python.org/issue40089
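The proposed workaround can be sketched roughly as follows; the names (Service, cell_cache) are illustrative stand-ins, not nova's actual classes:

```python
# Hypothetical sketch of clearing the cell cache on service start.
# In nova the cache maps cell uuids to oslo.db transaction context
# manager objects; a plain object() stands in for those here.

class Service:
    def __init__(self):
        # Cache possibly populated in the parent before workers fork.
        self.cell_cache = {'cell1': object()}

    def start(self):
        # Drop everything cached before workers forked: each worker
        # then lazily rebuilds its cache, creating fresh context
        # managers whose internal oslo.db locks start out unlocked.
        self.cell_cache.clear()
```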

Fix proposed to branch: master
Review: https://review.opendev.org/717662

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: Confirmed → In Progress
melanie witt (melwitt) wrote :

> This behavior of not "resetting" or sanitizing standard library locks at fork is a known issue in python [2] that's currently being worked on.

Correction: this issue [3] is not considered a bug in python and the recommended way of handling the problem is to use the os.register_at_fork() function (new in python 3.7) in oslo.db to reinitialize its lock [4].

For now, we still support python 3.6, so we have to handle it outside of oslo.db in some way. But we can also add use of os.register_at_fork() in oslo.db for those running python >= 3.7, to handle the problem more generally.

[3] https://bugs.python.org/issue6721
[4] https://bugs.python.org/issue40091#msg365959
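For python >= 3.7, the os.register_at_fork() approach looks roughly like this; a sketch only, with an illustrative lock name standing in for oslo.db's internal lock:

```python
import os
import threading

# Illustrative stand-in for oslo.db's internal lock.
_db_lock = threading.Lock()

def _reinit_lock():
    global _db_lock
    _db_lock = threading.Lock()  # fresh, unlocked lock for the child

# Run _reinit_lock in the child immediately after every fork
# (available since python 3.7).
os.register_at_fork(after_in_child=_reinit_lock)

def child_acquires_fresh_lock():
    _db_lock.acquire()  # parent holds the lock while forking
    try:
        pid = os.fork()
        if pid == 0:
            # Child: the after_in_child hook replaced _db_lock, so
            # this acquire succeeds instead of waiting forever.
            ok = _db_lock.acquire(timeout=1)
            os._exit(0 if ok else 1)
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status) == 0
    finally:
        _db_lock.release()
```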

Reviewed: https://review.opendev.org/717662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=941559042f609ee43ff3160c0f0d0c45187be17f
Submitter: Zuul
Branch: master

commit 941559042f609ee43ff3160c0f0d0c45187be17f
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    Change-Id: Id233f673a57461cc312e304873a41442d732c051

Changed in nova:
status: In Progress → Fix Released
melanie witt (melwitt) wrote :

Through my testing, I found this not to be a problem in grenade.

Changed in grenade:
status: New → Invalid

It seems the problem is still happening even after the fix merged to master.

Here is a recent appearance https://7ad29d1b700c1da60ae0-1bae5319fe4594ade335a46ad1c3bcc9.ssl.cf2.rackcdn.com/717083/5/check/neutron-grenade-multinode/2be9497/logs/screen-n-sch.txt

Changed in nova:
status: Fix Released → Confirmed

As Melanie correctly stated, we need to merge the Train backport of the bug fix for the problem to disappear from grenade jobs, as those run a train -> master upgrade.

Changed in nova:
status: Confirmed → Fix Released

Reviewed: https://review.opendev.org/718934
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=88205a4e911268dae7120a6a43ff9042d1534251
Submitter: Zuul
Branch: stable/train

commit 88205a4e911268dae7120a6a43ff9042d1534251
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    NOTE(melwitt): The differences from the original change in
    nova/tests/functional/test_service.py is because the following changes:

      I91fa2f73185fef48e9aae9b7f61389c374e06676
      I8c96b337f32148f8f5899c9b87af331b1fa41424

    are not in Train.

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
    (cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)

melanie witt (melwitt) wrote :

Since the Train change has merged, we should expect to see the problem disappear from changes proposed to the master branch. We will still be able to see failures on stable branches though, until we merge changes to Stein, Rocky, Queens.

Let's keep an eye on the chart and logstash query results and ensure we no longer see hits on master changes:

http://status.openstack.org/elastic-recheck/#1844929

And I'll go ahead and propose a change for Stein while we watch the master branch.

melanie witt (melwitt) wrote :

I went ahead and uploaded changes for Stein, Rocky, and Queens so that they are ready, after finding on logstash a change [1] on stable/train that failed due to this bug.

[1] https://review.opendev.org/715748

Reviewed: https://review.opendev.org/720587
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4de766006d9432a7ccbcf6a4d4232db472b2f0e1
Submitter: Zuul
Branch: stable/stein

commit 4de766006d9432a7ccbcf6a4d4232db472b2f0e1
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
    (cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)
    (cherry picked from commit 88205a4e911268dae7120a6a43ff9042d1534251)

Reviewed: https://review.opendev.org/720592
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a86ebc75eb886bd293dca42439762ecdd69ca0d7
Submitter: Zuul
Branch: stable/rocky

commit a86ebc75eb886bd293dca42439762ecdd69ca0d7
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    NOTE(melwitt): The difference from the Stein change in
    nova/tests/functional/test_service.py is because change
    Idaed39629095f86d24a54334c699a26c218c6593 is not in Rocky.

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
    (cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)
    (cherry picked from commit 88205a4e911268dae7120a6a43ff9042d1534251)
    (cherry picked from commit 4de766006d9432a7ccbcf6a4d4232db472b2f0e1)

Reviewed: https://review.opendev.org/720596
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5aa78acbce9fa1fa18ac963385e82a6367ff445e
Submitter: Zuul
Branch: stable/queens

commit 5aa78acbce9fa1fa18ac963385e82a6367ff445e
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    NOTE(melwitt): This backport differs slightly in that the test setup
    calls set_stub_network_methods because change
    I1dbccc2be6ba79bf267edac9208c80e187e6256a is not in Queens.

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
    (cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)
    (cherry picked from commit 88205a4e911268dae7120a6a43ff9042d1534251)
    (cherry picked from commit 4de766006d9432a7ccbcf6a4d4232db472b2f0e1)
    (cherry picked from commit a86ebc75eb886bd293dca42439762ecdd69ca0d7)
