grenade jobs failing due to "Timed out waiting for response from cell" in scheduler

Bug #1844929 reported by Matt Riedemann
This bug affects 2 people
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  High        melanie witt
Queens                    Fix Released  Undecided   Elod Illes
Rocky                     Fix Released  Undecided   melanie witt
Stein                     Fix Released  Undecided   melanie witt
Train                     Fix Released  High        melanie witt
grenade                   Invalid       Undecided   Unassigned

Bug Description

Seen here:

https://zuul.opendev.org/t/openstack/build/d53346210978403f888b85b82b2fe0c7/log/logs/screen-n-sch.txt.gz?severity=3#2368

Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90

Looks like something is causing timeouts reaching cell1 during grenade runs. The only errors I see in the rabbit logs are these for the uwsgi (API) servers:

=ERROR REPORT==== 22-Sep-2019::00:35:30 ===

closing AMQP connection <0.1511.0> (217.182.141.188:48492 -> 217.182.141.188:5672 - uwsgi:19453:72e08501-61ca-4ade-865e-f0605979ed7d):

missed heartbeats from client, timeout: 60s

--

It looks like we don't have mysql logs in this grenade run; maybe we need a fix like this somewhere for grenade:

https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc

logstash shows 1101 hits in the last 7 days, since Sept 17 actually:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timed%20out%20waiting%20for%20response%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d

The hits are in both the check and gate queues, all failures. It also appears to show up only on fortnebula and OVH nodes, primarily fortnebula. I wonder if there is a performance/timing issue where those nodes are slower and we aren't waiting for something during the grenade upgrade before proceeding.

Tags: gate-failure
Revision history for this message
Matt Riedemann (mriedem) wrote :

I don't see any changes since Sept 17 in nova, grenade or devstack that appear in any way related to this, so I'm guessing it's something being tickled by slower infra nodes.

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a fix to devstack-gate for collecting the mysql logs during grenade runs:

https://review.opendev.org/#/c/684042/

Revision history for this message
Matt Riedemann (mriedem) wrote :

The confusing thing is it looks like grenade restarts the scheduler:

Sep 22 00:37:32.126839 ubuntu-bionic-ovh-gra1-0011664420 systemd[1]: Stopped Devstack <email address hidden>.

Sep 22 00:45:55.359862 ubuntu-bionic-ovh-gra1-0011664420 systemd[1]: Started Devstack <email address hidden>.
...
Sep 22 00:45:59.927792 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO oslo_service.service [None req-d45bc5b0-cd18-49d3-afe4-a9a40f37aefd None None] Starting 2 workers
...
Sep 22 00:46:00.068988 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [-] Starting scheduler node (version 19.1.0)
...
Sep 22 00:46:00.230765 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [-] Starting scheduler node (version 19.1.0)
...
Sep 22 00:46:00.568873 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] Updating service version for nova-scheduler on ubuntu-bionic-ovh-gra1-0011664420 from 37 to 41
...
Sep 22 00:46:00.599841 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] Creating RPC server for service scheduler {{(pid=19219) start /opt/stack/new/nova/nova/service.py:183}}
...
Sep 22 00:46:00.656407 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] Join ServiceGroup membership for this service scheduler {{(pid=19220)
Sep 22 00:46:00.657550 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.servicegroup.drivers.db [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] DB_Driver: join new ServiceGroup member ubuntu-bionic-ovh-gra1-0011664420 to the scheduler group, service = <Service: host=ubuntu-bionic-ovh-gra1-0011664420, binary=nova-scheduler, manager_class_name=nova.scheduler.manager.SchedulerManager> {{(pid=19220) join /opt/stack/new/nova/nova/servicegroup/drivers/db.py:47}}
...
Sep 22 00:46:00.694504 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] Join ServiceGroup membership for this service scheduler {{(pid=19219) start /opt/stack/new/nova/nova/service.py:201}}
Sep 22 00:46:00.695624 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.servicegroup.drivers.db [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] DB_Driver: join new ServiceGroup member ubuntu-bionic-ovh-gra1-0011664420 to the scheduler group, service = <Service: host=ubuntu-bionic-ovh-gra1-0011664420, binary=nova-scheduler, manager_class_name=nova.scheduler.manager.SchedulerManager> {{(pid=19219) join /opt/stack/new/nova/nova/servicegroup/drivers/db.py:47}}

And then I see this run which is connecting to the cell1 database to pull compute nodes and instances:

Sep 22 00:46:00.734968 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.scheduler.host_manager [None req-ba4a446d-5a73-4994-8077-366f115fd4ba None None] Total number of compute nodes: 1 {{(pid=18043) _async_init_instance_info /opt/stack/new/nova/nova/scheduler/host_manager.py:428}}

Sep 22 00...


Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking at logstash again, this goes back to at least Sept 14 (or older; logstash only keeps up to 10 days of logs).

Note that as of Sept 13 upper-constraints is using oslo.service 1.40.2:

https://github.com/openstack/requirements/commit/4d3c335b5cd37dee768927b5360debfe4db7f696

Which is important because it has restart changes in it for a long-standing bug with SIGHUP:

https://review.opendev.org/#/c/641907/ (actually that was released in 1.40.1)

We've been using oslo.service 1.40.1 since Sept 3 in upper-constraints:

https://github.com/openstack/requirements/commit/d09bde76d6aed2a5e26c2018acdfa6b4d43f5456

So that might just be a red herring.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I think comment 4 can be ignored, we're not doing a SIGHUP:

Sep 22 00:37:27.786606 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[25563]: INFO oslo_service.service [None req-91e88f0d-9b5c-4cb7-a5e9-e7309f922832 None None] Caught SIGTERM, stopping children

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/684118

Revision history for this message
Matt Riedemann (mriedem) wrote :

Another data point: after the restart of the scheduler and before that timeout, the verify_instance command, which connects to the cell1 database, runs successfully:

2019-09-22 00:47:24.934 | + /opt/stack/new/grenade/projects/60_nova/resources.sh:verify:175 : nova-manage cell_v2 verify_instance --uuid 44d7efdc-c048-4dca-8b4b-3d518321eddd

2019-09-22 00:47:29.331 | Instance 44d7efdc-c048-4dca-8b4b-3d518321eddd is in cell: cell1 (8acfb79b-2e40-4e1c-bc3d-d404dac6db90)

Matt Riedemann (mriedem)
tags: added: gate-failure
Revision history for this message
Matt Riedemann (mriedem) wrote :

I got mysqld logs published in the grenade jobs and it's pretty clear that we're starting mysqld 3 times in the grenade run. Two of those I can understand, because the old and new devstack don't know they are in a grenade context and will (re)start mysqld at least twice, but I'm not sure where the 3rd start is coming from. If we're using bionic nodes on both the old and new side I don't think we'd be upgrading mysqld packages, but I need to confirm.

In one recent failure this is where things start to go south in the mysqld logs and that's around the time we hit the cell timeout failures:

https://zuul.opendev.org/t/openstack/build/4085120e390f4f1e971c6ff61304a596/log/logs/mysql/error.txt.gz#213

....

OK it looks like it's the same mysqld package version on all 3 starts:

mysqld (mysqld 5.7.27-0ubuntu0.18.04.1) starting as process

So we're not upgrading the package at all, but we are restarting it.

Revision history for this message
Matt Riedemann (mriedem) wrote :

The 3 restarts are probably:

1. initial package install on the old side
2. re-config for stack user on the old side and restart
3. re-config (same data) for the stack user on the new side and restart

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/684118
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0436a95f37df086ddc99017376cb9a312e40517a
Submitter: Zuul
Branch: master

commit 0436a95f37df086ddc99017376cb9a312e40517a
Author: Matt Riedemann <email address hidden>
Date: Mon Sep 23 14:57:44 2019 -0400

    Log CellTimeout traceback in scatter_gather_cells

    When a call to a cell in scatter_gather_cells times out
    we log a warning and set the did_not_respond_sentinel for
    that cell but it would be useful if we logged the traceback
    with the warning for debugging where the call is happening.

    Change-Id: I8f4069474a3955eea6c967d3090f2960e739224c
    Related-Bug: #1844929

Revision history for this message
Matt Riedemann (mriedem) wrote :

With the traceback logging patch applied, the output isn't really useful:

Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: WARNING nova.context [None req-0db41537-a738-4425-bf10-0abccf51ac05 demo demo] Timed out waiting for response from cell: CellTimeout: Timeout waiting for response from cell
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context Traceback (most recent call last):
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/opt/stack/new/nova/nova/context.py", line 443, in scatter_gather_cells
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context cell_uuid, result = queue.get()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/queue.py", line 322, in get
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return waiter.wait()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/queue.py", line 141, in wait
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return get_hub().switch()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 298, in switch
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context return self.greenlet.switch()
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context CellTimeout: Timeout waiting for response from cell
Oct 08 17:02:27.227697 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: ERROR nova.context
Oct 08 17:02:27.231340 ubuntu-bionic-ovh-gra1-0012213333 nova-scheduler[30384]: WARNING nova.scheduler.host_manager [None req-0db41537-a738-4425-bf10-0abccf51ac05 demo demo] Timeout getting computes for cell 0d45ce2b-0fcc-4d71-9578-84b44fb2fbf0
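
For context on where this traceback comes from: the scheduler's scatter-gather helper spawns one worker per cell, each worker runs its database query and puts its result on a shared queue, and the caller drains the queue under a timeout, recording a did-not-respond sentinel for any cell that never answers. Below is a minimal sketch of that pattern, using plain threads and a per-get timeout rather than nova's actual eventlet-based implementation.

import queue
import threading

CELL_TIMEOUT = 60  # seconds to wait per cell (assumed value for this sketch)
DID_NOT_RESPOND = object()  # sentinel recorded for a cell that never answers


def scatter_gather(cells, fn):
    """Run fn(cell) for every cell and collect results, tolerating stuck cells."""
    results = {}
    work_queue = queue.Queue()

    def worker(cell):
        # Each worker runs its per-cell database call and reports back.
        work_queue.put((cell, fn(cell)))

    for cell in cells:
        threading.Thread(target=worker, args=(cell,), daemon=True).start()

    for _ in cells:
        try:
            cell, result = work_queue.get(timeout=CELL_TIMEOUT)
            results[cell] = result
        except queue.Empty:
            # This is the shape of "Timed out waiting for response from cell":
            # a worker is stuck (e.g. blocked on a lock or a dead database
            # connection) and never puts anything on the queue.
            break

    for cell in cells:
        results.setdefault(cell, DID_NOT_RESPOND)
    return results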

Revision history for this message
Matt Riedemann (mriedem) wrote :

Also seeing this in a ceph job in n-api:

https://d494348350733031166c-4e71828f84900af50a9a26357b84a827.ssl.cf1.rackcdn.com/689842/5/check/devstack-plugin-ceph-tempest/962455b/controller/logs/screen-n-api.txt.gz

Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi [None req-a94464d7-d36c-405f-a42f-344e543cb205 tempest-TestShelveInstance-198061268 tempest-TestShelveInstance-198061268] Unexpected exception in API method: NovaException: Cell 901092b8-6de2-4aad-b21a-e1c21691eb30 is not responding and hence instance info is not available.
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi Traceback (most recent call last):
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/wsgi.py", line 671, in wrapped
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/compute/servers.py", line 471, in show
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/compute/servers.py", line 374, in _get_server
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/api/openstack/common.py", line 472, in get_instance
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2605, in get
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi expected_attrs, cell_down_support=cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2552, in _get_instance
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi expected_attrs, cell_down_support)
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.openstack.wsgi File "/opt/stack/nova/nova/compute/api.py", line 2545, in _get_instance_from_cell
Oct 21 22:41:11.686000 ubuntu-bionic-ovh-gra1-0012414214 <email address hidden>[20043]: ERROR nova.api.opensta...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/690417

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/690417
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9377d00ccf7a73071b4fb75d66ce5ad7bd321174
Submitter: Zuul
Branch: master

commit 9377d00ccf7a73071b4fb75d66ce5ad7bd321174
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 22 17:10:19 2019 -0400

    Revert "Log CellTimeout traceback in scatter_gather_cells"

    This reverts commit 0436a95f37df086ddc99017376cb9a312e40517a.

    This was meant to get us more debug details when hitting the
    failure but the results are not helpful [1] so revert this
    and the fix for the resulting regression [2].

    [1] http://paste.openstack.org/show/782116/
    [2] I7f9edc9a4b4930f4dce98df271888fa8082a1701

    Change-Id: Iab8029f081a654278ea7dbbec79a766aea6764ae
    Related-Bug: #1844929

Revision history for this message
Slawek Kaplonski (slaweq) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

I just noticed this in a nova functional job as well:

https://zuul.opendev.org/t/openstack/build/63001bbd58c244cea70c995f1ebf61fb/log/job-output.txt#3092

In this case it looks like a result of a slow node:

2019-11-05 23:12:57.389624 | ubuntu-bionic | 2019-11-05 23:10:12,022 INFO [nova.scheduler.host_manager] Received an update from an unknown host 'host2'. Re-created its InstanceList.

2019-11-05 23:12:57.389814 | ubuntu-bionic | 2019-11-05 23:10:12,569 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572/servers/9ceb2e4d-9bb3-425b-9454-b171221d6c93" status: 200 len: 2019 microversion: 2.15 time: 0.048796

2019-11-05 23:12:57.389957 | ubuntu-bionic | 2019-11-05 23:11:23,400 WARNING [oslo.service.loopingcall] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 59.77 sec

2019-11-05 23:12:57.390092 | ubuntu-bionic | 2019-11-05 23:11:23,414 WARNING [nova.context] Timed out waiting for response from cell 3d90ac16-9173-4df8-8ef9-1d713ccd8a98

2019-11-05 23:12:57.390189 | ubuntu-bionic | 2019-11-05 23:11:23,415 ERROR [nova.api.openstack.wsgi] Unexpected exception in API method

2019-11-05 23:12:57.390236 | ubuntu-bionic | Traceback (most recent call last):

2019-11-05 23:12:57.390299 | ubuntu-bionic | File "nova/api/openstack/wsgi.py", line 671, in wrapped

2019-11-05 23:12:57.390340 | ubuntu-bionic | return f(*args, **kwargs)

2019-11-05 23:12:57.390406 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390450 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390515 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390559 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390625 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390668 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390734 | ubuntu-bionic | File "nova/api/validation/__init__.py", line 110, in wrapper

2019-11-05 23:12:57.390778 | ubuntu-bionic | return func(*args, **kwargs)

2019-11-05 23:12:57.390850 | ubuntu-bionic | File "nova/api/openstack/compute/evacuate.py", line 85, in _evacuate

2019-11-05 23:12:57.390918 | ubuntu-bionic | instance = common.get_instance(self.compute_api, context, id)

2019-11-05 23:12:57.390986 | ubuntu-bionic | File "nova/api/openstack/common.py", line 472, in get_instance

2019-11-05 23:12:57.391035 | ubuntu-bionic | cell_down_support=cell_down_support)

2019-11-05 23:12:57.391104 | ubuntu-bionic | File "nova/compute/api.py", line 2722, in get

2019-11-05 23:12:57.391173 | ubuntu-bionic | expected_attrs, cell_down_support=cell_down_support)

2019-11-05 23:12:57.391236 | ubuntu-bionic | File "nova/compute/api.py", line 2669, in _get_instance

2019-11-05 23:12:57.391284 | ubuntu-bionic | expected_attrs, cell_down_support)

2019-11-05 23:12:57.391354 | ubuntu-bionic | File "nova/compute/api.py", line 2662, in _get_instance_from...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/699735

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/699735
Reason: Doesn't help:

https://zuul.opendev.org/t/openstack/build/4eebf46e53194e7eb4d5c346ef7959ac/log/logs/screen-n-sch.txt.gz?severity=3

Revision history for this message
Matt Riedemann (mriedem) wrote :

(11:04:52 AM) mriedem: unrelated, but would like to do something about http://status.openstack.org/elastic-recheck/#1844929 since we have no fixes in sight
(11:05:04 AM) mriedem: looking at logstash, that's primarily hitting on ovh nodes,
(11:05:17 AM) mriedem: i wonder if there is a way to exclude certain node providers from grenade jobs?
(11:05:29 AM) mriedem: i know you can specify a label to say a job should run on a given provider, but is there a NOT version of that?
(11:05:36 AM) clarkb: mriedem: we would have to make new nodepool flavors. I think there is a really good chance that swapping is what causes those problems
(11:05:43 AM) clarkb: particularly in ovh where we get fewer iops
(11:06:50 AM) mriedem: hmm, ok - i'm kind of looking for any workaround atm because that bug has been crushing us for months now

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to grenade (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/700214

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on grenade (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/700214
Reason: Need to do this via the feature support test matrix in devstack-gate:

https://review.opendev.org/#/c/700233/

Revision history for this message
melanie witt (melwitt) wrote :

I dug into this more about a month ago and unfortunately came up empty-handed. Going to dump some info I gathered at the time on a DNM patch [1] here for the record.

"Looking at the mysql error log:

https://zuul.opendev.org/t/openstack/build/833a46b05c9641b9b22b3ee7f394e80b/log/logs/mysql/error.txt.gz

I see lots of errors [2] that I think must be why we get the cell timeouts:

...

2020-01-09T00:11:53.330894Z 157 [Note] Aborted connection 157 to db: 'keystone' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:12:19.047684Z 158 [Note] Aborted connection 158 to db: 'placement' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:13:09.703587Z 161 [Note] Aborted connection 161 to db: 'glance' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:13:09.703775Z 162 [Note] Aborted connection 162 to db: 'glance' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:41.215651Z 171 [Note] Aborted connection 171 to db: 'nova_api' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.841855Z 173 [Note] Aborted connection 173 to db: 'nova_cell0' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.842886Z 174 [Note] Aborted connection 174 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.958218Z 172 [Note] Aborted connection 172 to db: 'nova_api' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:48.749196Z 178 [Note] Aborted connection 178 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:48.749216Z 179 [Note] Aborted connection 179 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)

...

I note that all services are getting the same error, not just nova. Includes cinder, neutron, placement, glance, keystone, and nova."

"For this failure, trying to correlate the cell timeout with interesting things from mysql and dstat logging:

screen-n-sch.txt [3]:

Jan 16 18:12:50.807350 ubuntu-bionic-ovh-bhs1-0013905596 nova-scheduler[17316]: WARNING nova.context [None req-d3eeda8c-74d5-451c-8262-e80fa1e45c0f tempest-ServersTestManualDisk-965972951 tempest-ServersTestManualDisk-965972951] Timed out waiting for response from cell 948fa69c-ada4-488c-9808-83dd645d7069

mysql error.txt [4]:

2020-01-16T18:12:41.934112Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4279ms. The settings might not be optimal. (flushed=4 and evicted=0, during the time.)

screen-dstat.txt [5]:

Jan 16 18:12:40.399103 ubuntu-bionic-ovh-bhs1-0013905596 dstat.sh[1006]: 16-01 18:12:40| 8 1 66 25 0|4785M 168M 438M 2024M| 46B 46B| 0 8008k| 0 144 |2371 4119 |2.27 2.00 2.20| 0 2.0 2.0| 0 0 |uwsgi 164902.5%3585B 62k|python2 1036 179k 457B 0%|mysqld 517M|4096k 8188M| 29 334 0 739 0"

Latest query shows all hits showing up on OVH nodes.

[1] https://review.opendev.org/701...


Revision history for this message
melanie witt (melwitt) wrote :

I started looking at this again after lyarwood mentioned it in the nova meeting today.

Looking at the logs/mysql/error.txt of some successful grenade runs, there are a lot of messages like this regarding aborted connections:

2020-03-12T19:07:34.435762Z 4 [Note] Aborted connection 4 to db: 'keystone' user: 'root' host: 'localhost' (Got an error reading communication packets)

so that looks likely to be a red herring.

The mysql logs aren't indexed by our logstash, so looking at a few by hand, there seems to be a consistent pattern in the failed jobs that is not present in the successful jobs:

2020-03-11T12:09:58.384142Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18540ms. The settings might not be optimal. (flushed=200 and evicted=0, during the time.)

2020-03-11T11:40:53.524707Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4382ms. The settings might not be optimal. (flushed=3 and evicted=0, during the time.)

2020-03-11T11:41:05.482158Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8976ms. The settings might not be optimal. (flushed=44 and evicted=0, during the time.)

2020-03-11T11:41:41.406597Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4915ms. The settings might not be optimal. (flushed=200 and evicted=0, during the time.)

2020-03-11T10:37:04.469735Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5434ms. The settings might not be optimal. (flushed=5 and evicted=0, during the time.)

I googled about this and learned it's a periodic task in mysql server that flushes dirty pages every second [1]. From the stackoverflow answer, they say:

"Once per second, the page cleaner scans the buffer pool for dirty pages to flush from the buffer pool to disk. The warning you saw shows that it has lots of dirty pages to flush, and it takes over 4 seconds to flush a batch of them to disk, when it should complete that work in under 1 second. In other words, it's biting off more than it can chew."

They go on to say that this issue can be exacerbated if it's happening on a machine with slow disks as that would also cause the page cleaning to fall behind.

The person who asked the question solved their issue by setting innodb_lru_scan_depth=256 to make the page cleaner process smaller chunks at a time (default is 1024). The person who answered the question noted that this will only work if page cleaner can keep up with the average rate of creating new dirty pages. If it cannot, the flushing rate will be automatically increased once innodb_max_dirty_page_pct is exceeded and may result in page cleaner warnings all over again.

They say:

"Another solution would be to put MySQL on a server with faster disks. You need an I/O system that can handle the throughput demanded by your page flushing.

If you see this warning all the time under average traffic, you might be trying to do too many write queries on this MySQL server. It might be time to scale out, and split the writes over multiple MySQL instances, each with their own disk system."

This again seems to point back at slow nodes.
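
For reference, the tuning discussed above would be a mysqld option along these lines (a sketch of the setting only; 256 is the value suggested in the stackoverflow answer, and how devstack actually writes its mysql configuration may differ):

[mysqld]
# Scan/flush dirty pages in smaller batches so the once-per-second
# page_cleaner loop can keep up on slow disks (default is 1024).
innodb_lru_scan_depth = 256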

I'm trying out a DNM devstack patch [2] to set innodb_lru_scan_depth=256 and keep rechecking the DNM nova cha...


Revision history for this message
melanie witt (melwitt) wrote :

Here's a link to a related ML thread having to do with gate failures on slow nodes and mentions the innodb page cleaner warning. Several of the posts in the thread show investigation that has happened and is relevant to this bug:

http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010504.html

Revision history for this message
melanie witt (melwitt) wrote :

> I'm trying out a DNM devstack patch [2] to set innodb_lru_scan_depth=256 and keep rechecking the DNM nova change [3] to see if it has any effect on the failures.

Well, this seems to have the opposite of the desired effect [4]:

2020-03-12T22:59:10.304511Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 7728ms. The settings might not be optimal. (flushed=30 and evicted=0, during the time.)
2020-03-12T22:59:33.848105Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 11540ms. The settings might not be optimal. (flushed=1 and evicted=0, during the time.)
2020-03-12T23:00:23.206041Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 11356ms. The settings might not be optimal. (flushed=1 and evicted=0, during the time.)
2020-03-12T23:00:44.698049Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8507ms. The settings might not be optimal. (flushed=7 and evicted=0, during the time.)
2020-03-12T23:01:21.877705Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6176ms. The settings might not be optimal. (flushed=17 and evicted=0, during the time.)

but at least I can see the setting change did something.

That run also happened not to fail. I'm going to try out some other settings related to the buffer pool next.

[1] https://stackoverflow.com/questions/41134785/how-to-solve-mysql-warning-innodb-page-cleaner-1000ms-intended-loop-took-xxx
[2] https://review.opendev.org/712805
[3] https://review.opendev.org/701478
[4] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_18a/701478/5/check/grenade-py3/18a9555/controller/logs/mysql/error_log.txt

Revision history for this message
melanie witt (melwitt) wrote :

I tried some settings in devstack:

[mysqld]
innodb_lru_scan_depth 256
innodb_buffer_pool_size 1G
innodb_buffer_pool_instances 2

but still got a nova-grenade-multinode failure [1] after some rechecks. This time, there is no warning message about page cleaner being behind. So that proves that the page cleaner message is not a fingerprint for this bug.

Recall that I am running tests on a patch that has enabled db connection debug logging via:

[database]
use_db_reconnect "True"
connection_debug "100"

With that in mind, looking at the screen-n-sch.txt log on the failed nova-grenade-multinode run [1], I see one query result for compute_nodes [2], but after that no further compute_nodes query results are ever logged.

Contrast with a passing grenade-py3 run [3], where there are a total of 22 query results for compute_nodes logged in screen-n-sch.txt.

This shows that we consistently stop receiving rows back from the database in the failure case.

[1] https://zuul.opendev.org/t/openstack/build/8c91fd21815148d9894ac2bf60893a9e
[2] http://paste.openstack.org/show/790679
[3] https://zuul.opendev.org/t/openstack/build/9de69db4ffba4ceeba73b959653618e2/log/controller/logs/screen-n-sch.txt

Revision history for this message
melanie witt (melwitt) wrote :

Another data point: I tried a new PS of my DNM patch that reverts the commit which added use of the scatter-gather utility (via eventlet greenthreads) to the scheduler host manager [1][2], with a Depends-On revert on stable/train too (since grenade upgrades N-1 => N) [3]. I did this to test whether we would still see a failure to retrieve compute_nodes rows from the database if we did _not_ use scatter-gather, or whether the problem would disappear and indicate that there is instead a bug or problem with the scatter-gather utility.

We just got a failed result back [4] and it shows the same behavior as with scatter-gather in place: we succeed at receiving rows back from the compute_nodes table only once and then never again, i.e. we stop getting any compute_nodes back from the database.

This failure happened on an OVH node.

[1] https://github.com/openstack/nova/commit/fdea8b723ba5a25ea9dc0917401fbb1401e05ee3
[2] https://review.opendev.org/#/c/701478/7
[3] https://review.opendev.org/#/c/713116/1
[4] https://zuul.opendev.org/t/openstack/build/1e61d0fca1f04b8583ef7463091ec7be/log/logs/screen-n-sch.txt#1041

Revision history for this message
melanie witt (melwitt) wrote :

After more digging than I'd care to admit, I think I have finally gotten to the bottom of what's happening with this bug.

Through DNM patch debug logging in oslo.db [1], I found that during a grenade run, after a nova-scheduler service stop and start, child processes of the nova-scheduler (workers) occasionally start off with already-locked internal oslo.db locks. This can happen if requests are flowing into the service while it is in the middle of forking its child process workers: the first database request fires and takes the lock, and then child processes are forked while the lock is held.

When this happened, database accesses for the particular cell-cached database transaction context manager object could never acquire the lock and would just get stuck, eventually failing with a CellTimeout error.

Here's aggregated snippets of the DNM patch debug logging showing the inherited held locks:

http://paste.openstack.org/show/791646

This behavior of not "resetting" or sanitizing standard library locks at fork is a known issue in python [2] that's currently being worked on.

In the meantime, I think we can handle this on our side by clearing our cell cache that holds oslo.db database transaction context manager objects during service start(). This way, we get fresh oslo.db locks that are in an unlocked state when a child process begins.

[1] https://review.opendev.org/#/c/714802/6
[2] https://bugs.python.org/issue40089
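
To make the failure mode concrete, here is a minimal standalone reproduction of the general problem in plain Python (not nova or oslo.db code): if a process forks while one of its locks is held, the child inherits the lock in the locked state and anything in the child that tries to take it will block indefinitely.

import os
import threading

lock = threading.Lock()

# The parent takes the lock (in nova's case, an in-flight database request
# held an internal oslo.db lock while the service was forking its workers).
lock.acquire()

pid = os.fork()
if pid == 0:
    # Child: the lock was copied in the acquired state, but the code that
    # "owns" it is not running here, so it will never be released. This
    # acquire would block forever; a timeout is used so the demo exits.
    acquired = lock.acquire(timeout=5)
    print("child acquired lock:", acquired)  # prints False
    os._exit(0)
else:
    os.waitpid(pid, 0)
    lock.release()  # the parent's copy can still be released normally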

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/717662

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: Confirmed → In Progress
Revision history for this message
melanie witt (melwitt) wrote :

> This behavior of not "resetting" or sanitizing standard library locks at fork is a known issue in python [2] that's currently being worked on.

Correction: this issue [3] is not considered a bug in python and the recommended way of handling the problem is to use the os.register_at_fork() method (new in python 3.7) in oslo.db to reinitialize its lock [4].

For now, we still support python 3.6, so we have to handle it outside of oslo.db in some way. But we can also add use of os.register_at_fork() in oslo.db, for those who are running >= python 3.7, to handle the problem more generally.

[3] https://bugs.python.org/issue6721
[4] https://bugs.python.org/issue40091#msg365959
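
A sketch of the python >= 3.7 approach described above, using hypothetical module-level names (this is not oslo.db's actual code): register an after-fork hook so each child process starts with a fresh, unlocked lock.

import os
import threading

# Stand-in for oslo.db's internal lock around its transaction context
# manager machinery.
_lock = threading.Lock()


def _reinit_lock_after_fork():
    # Runs in the child immediately after fork(); whatever state the
    # parent's lock was in, the child gets a fresh unlocked one.
    global _lock
    _lock = threading.Lock()


# Available on Python >= 3.7 only, which is why nova also needed a
# workaround outside oslo.db while python 3.6 was still supported.
os.register_at_fork(after_in_child=_reinit_lock_after_fork)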

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/717662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=941559042f609ee43ff3160c0f0d0c45187be17f
Submitter: Zuul
Branch: master

commit 941559042f609ee43ff3160c0f0d0c45187be17f
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
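
Schematically, the change described above does something like the following in the service lifecycle hooks (hypothetical names; a sketch of the approach rather than nova's exact code):

# Per-process cache of per-cell database transaction context managers;
# its entries indirectly hold oslo.db's internal locks.
CELL_CACHE = {}


class Service(object):

    def start(self):
        # Drop anything inherited from the parent process so the next
        # request builds fresh context managers with unlocked locks.
        global CELL_CACHE
        CELL_CACHE = {}
        # ... continue normal startup (create RPC server, etc.) ...

    def reset(self):
        # SIGHUP path: same reasoning, start from an empty cache.
        global CELL_CACHE
        CELL_CACHE = {}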

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
melanie witt (melwitt) wrote :

Through my testing, I found this not to be a problem in grenade.

Changed in grenade:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/718934

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

It seems the problem is still happening even after the fix merged to master.

Here is a recent appearance https://7ad29d1b700c1da60ae0-1bae5319fe4594ade335a46ad1c3bcc9.ssl.cf2.rackcdn.com/717083/5/check/neutron-grenade-multinode/2be9497/logs/screen-n-sch.txt

Changed in nova:
status: Fix Released → Confirmed
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

As Melanie correctly stated, we need to merge the Train backport of the bugfix to have the problem disappear from grenade jobs, as those run a train -> master upgrade.

Changed in nova:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/718934
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=88205a4e911268dae7120a6a43ff9042d1534251
Submitter: Zuul
Branch: stable/train

commit 88205a4e911268dae7120a6a43ff9042d1534251
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    NOTE(melwitt): The differences from the original change in
    nova/tests/functional/test_service.py is because the following changes:

      I91fa2f73185fef48e9aae9b7f61389c374e06676
      I8c96b337f32148f8f5899c9b87af331b1fa41424

    are not in Train.

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
    (cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)

Revision history for this message
melanie witt (melwitt) wrote :

Since the Train change has merged, we should expect to see the problem disappear from changes proposed to the master branch. We will still be able to see failures on stable branches though, until we merge changes to Stein, Rocky, Queens.

Let's keep an eye on the chart and logstash query results and ensure we no longer see hits on master changes:

http://status.openstack.org/elastic-recheck/#1844929

And I'll go ahead and propose a change for Stein while we watch the master branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/720587

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/720592

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/720596

Revision history for this message
melanie witt (melwitt) wrote :

I went ahead and uploaded the changes for Stein, Rocky, and Queens so that they are ready, after finding (via logstash) a change [1] on stable/train that failed due to this bug.

[1] https://review.opendev.org/715748

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/720587
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4de766006d9432a7ccbcf6a4d4232db472b2f0e1
Submitter: Zuul
Branch: stable/stein

commit 4de766006d9432a7ccbcf6a4d4232db472b2f0e1
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
    (cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)
    (cherry picked from commit 88205a4e911268dae7120a6a43ff9042d1534251)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/720592
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a86ebc75eb886bd293dca42439762ecdd69ca0d7
Submitter: Zuul
Branch: stable/rocky

commit a86ebc75eb886bd293dca42439762ecdd69ca0d7
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    NOTE(melwitt): The difference from the Stein change in
    nova/tests/functional/test_service.py is because change
    Idaed39629095f86d24a54334c699a26c218c6593 is not in Rocky.

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
    (cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)
    (cherry picked from commit 88205a4e911268dae7120a6a43ff9042d1534251)
    (cherry picked from commit 4de766006d9432a7ccbcf6a4d4232db472b2f0e1)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/720596
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5aa78acbce9fa1fa18ac963385e82a6367ff445e
Submitter: Zuul
Branch: stable/queens

commit 5aa78acbce9fa1fa18ac963385e82a6367ff445e
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000

    Reset the cell cache for database access in Service

    We have had a gate bug for a long time where occasionally the scheduler
    service gets into a state where many requests fail in it with
    CellTimeout errors. Example:

      Timed out waiting for response from cell <cell uuid>

    Through the use of much DNM patch debug logging in oslo.db, it was
    revealed that service child processes (workers) were sometimes starting
    off with already locked internal oslo.db locks. This is a known issue
    in python [1] where if a parent process forks a child process while a
    lock is held, the child will inherit the held lock which can never be
    acquired.

    The python issue is not considered a bug and the recommended way to
    handle it is by making use of the os.register_at_fork() in the oslo.db
    to reinitialize its lock. The method is new in python 3.7, so as long
    as we still support python 3.6, we must handle the situation outside of
    oslo.db.

    We can do this by clearing the cell cache that holds oslo.db database
    transaction context manager objects during service start(). This way,
    we get fresh oslo.db locks that are in an unlocked state when a child
    process begins.

    We can also take this opportunity to resolve part of a TODO to clear
    the same cell cache during service reset() (SIGHUP) since it is another
    case where we intended to clear it. The rest of the TODO related to
    periodic clearing of the cache is removed after discussion on the
    review, as such clearing would be unsynchronized among multiple
    services and for periods of time each service might have a different
    view of cached cells than another.

    Closes-Bug: #1844929

    [1] https://bugs.python.org/issue6721

    NOTE(melwitt): This backport differs slightly in that the test setup
    calls set_stub_network_methods because change
    I1dbccc2be6ba79bf267edac9208c80e187e6256a is not in Queens.

    Change-Id: Id233f673a57461cc312e304873a41442d732c051
    (cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)
    (cherry picked from commit 88205a4e911268dae7120a6a43ff9042d1534251)
    (cherry picked from commit 4de766006d9432a7ccbcf6a4d4232db472b2f0e1)
    (cherry picked from commit a86ebc75eb886bd293dca42439762ecdd69ca0d7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol

This issue was fixed in the openstack/nova rocky-eol release.
