grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | melanie witt |
Queens | Fix Released | Undecided | Elod Illes |
Rocky | Fix Released | Undecided | melanie witt |
Stein | Fix Released | Undecided | melanie witt |
Train | Fix Released | High | melanie witt |
grenade | Invalid | Undecided | Unassigned |
Bug Description
Seen here:
Sep 22 00:50:54.174385 ubuntu-
Looks like something is causing timeouts reaching cell1 during grenade runs. The only errors I see in the rabbit logs are these for the uwsgi (API) servers:
=ERROR REPORT==== 22-Sep-
closing AMQP connection <0.1511.0> (217.182.
missed heartbeats from client, timeout: 60s
--
It looks like we don't have mysql logs in this grenade run; maybe we need a fix like this somewhere for grenade:
https:/
logstash shows 1101 hits in the last 7 days, since Sept 17 actually:
check and gate queues, all failures. It also appears to show up only on fortnebula and OVH nodes, primarily fortnebula. I wonder if there is a performance/timing issue if those nodes are slower and we aren't waiting for something during the grenade upgrade before proceeding.
I don't see any changes since Sept 17 in nova, grenade or devstack that appear in any way related to this, so I'm guessing it's something being tickled by slower infra nodes.
Matt Riedemann (mriedem) wrote : | #1 |
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → High |
Matt Riedemann (mriedem) wrote : | #2 |
Here is a fix to devstack-gate for collecting the mysql logs during grenade runs:
Matt Riedemann (mriedem) wrote : | #3 |
The confusing thing is it looks like grenade restarts the scheduler:
Sep 22 00:37:32.126839 ubuntu-
Sep 22 00:45:55.359862 ubuntu-
...
Sep 22 00:45:59.927792 ubuntu-
...
Sep 22 00:46:00.068988 ubuntu-
...
Sep 22 00:46:00.230765 ubuntu-
...
Sep 22 00:46:00.568873 ubuntu-
...
Sep 22 00:46:00.599841 ubuntu-
...
Sep 22 00:46:00.656407 ubuntu-
Sep 22 00:46:00.657550 ubuntu-
...
Sep 22 00:46:00.694504 ubuntu-
Sep 22 00:46:00.695624 ubuntu-
And then I see this run which is connecting to the cell1 database to pull compute nodes and instances:
Sep 22 00:46:00.734968 ubuntu-
Sep 22 00...
Matt Riedemann (mriedem) wrote : | #4 |
Looking at logstash again, this goes back to at least Sept 14 (or older; logstash only saves up to 10 days of logs).
Note that as of Sept 13 upper-constraints is using oslo.service 1.40.2:
https:/
Which is important because it has restart changes in it for a long-standing bug with SIGHUP:
https:/
We've been using oslo.service 1.40.1 since Sept 3 in upper-constraints:
https:/
So that might just be a red herring.
Matt Riedemann (mriedem) wrote : | #5 |
I think comment 4 can be ignored; we're not doing a SIGHUP:
Sep 22 00:37:27.786606 ubuntu-
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master) | #6 |
Related fix proposed to branch: master
Review: https:/
Matt Riedemann (mriedem) wrote : | #7 |
Another data point: after the scheduler restart and before that timeout, the verify_instance command, which connects to the cell1 database, runs successfully:
2019-09-22 00:47:24.934 | + /opt/stack/
2019-09-22 00:47:29.331 | Instance 44d7efdc-
tags: | added: gate-failure |
Matt Riedemann (mriedem) wrote : | #8 |
I got mysqld logs published in the grenade jobs and it's pretty clear that we're starting mysqld 3 times in the grenade run. Two of those I can understand, because the old and new devstack don't know they are in a grenade context and will (re)start mysqld at least twice, but I'm not sure where the 3rd start is coming from. If we're using bionic nodes on both the old and new side I don't think we'd be upgrading mysqld packages, but I need to confirm.
In one recent failure this is where things start to go south in the mysqld logs and that's around the time we hit the cell timeout failures:
....
OK it looks like it's the same mysqld package version on all 3 starts:
mysqld (mysqld 5.7.27-
So we're not upgrading the package at all, but we are restarting it.
Matt Riedemann (mriedem) wrote : | #9 |
The 3 restarts are probably:
1. initial package install on the old side
2. re-config for stack user on the old side and restart
3. re-config (same data) for the stack user on the new side and restart
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master) | #10 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 0436a95f37df086
Author: Matt Riedemann <email address hidden>
Date: Mon Sep 23 14:57:44 2019 -0400
Log CellTimeout traceback in scatter_
When a call to a cell in scatter_
we log a warning and set the did_not_
that cell but it would be useful if we logged the traceback
with the warning for debugging where the call is happening.
Change-Id: I8f4069474a3955
Related-Bug: #1844929
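For illustration, here is a minimal sketch (not nova's actual scatter_gather code) of the kind of change the commit above describes: attaching the traceback to the existing per-cell timeout warning. The helper name gather_result, the results dict, and the use of TimeoutError are assumptions for the example only.
import logging
LOG = logging.getLogger(__name__)
did_not_respond_sentinel = object()
def gather_result(results, cell_uuid, fn, *args, **kwargs):
    try:
        results[cell_uuid] = fn(*args, **kwargs)
    except TimeoutError:
        # exc_info=True attaches the active traceback to the warning,
        # which is the extra detail the related change was after.
        LOG.warning('Timed out waiting for response from cell %s',
                    cell_uuid, exc_info=True)
        results[cell_uuid] = did_not_respond_sentinel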
Matt Riedemann (mriedem) wrote : | #11 |
With the traceback logging patch applied, the output is not really useful:
Oct 08 17:02:27.227697 ubuntu-
...
Oct 08 17:02:27.231340 ubuntu-
Matt Riedemann (mriedem) wrote : | #12 |
Also seeing this in a ceph job in n-api:
Oct 21 22:41:11.686000 ubuntu-
...
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master) | #13 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master) | #14 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 9377d00ccf7a730
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 22 17:10:19 2019 -0400
Revert "Log CellTimeout traceback in scatter_
This reverts commit 0436a95f37df086
This was meant to get us more debug details when hitting the
failure but the results are not helpful [1] so revert this
and the fix for the resulting regression [2].
[1] http://
[2] I7f9edc9a4b4930
Change-Id: Iab8029f081a654
Related-Bug: #1844929
Slawek Kaplonski (slaweq) wrote : | #15 |
I just saw the same issue in a neutron scenario job: https:/
Matt Riedemann (mriedem) wrote : | #16 |
I just noticed this in a nova functional job as well:
https:/
In this case it looks like a result of a slow node:
2019-11-05 23:12:57.389624 | ubuntu-bionic | 2019-11-05 23:10:12,022 INFO [nova.scheduler
2019-11-05 23:12:57.389814 | ubuntu-bionic | 2019-11-05 23:10:12,569 INFO [nova.api.
2019-11-05 23:12:57.389957 | ubuntu-bionic | 2019-11-05 23:11:23,400 WARNING [oslo.service.
2019-11-05 23:12:57.390092 | ubuntu-bionic | 2019-11-05 23:11:23,414 WARNING [nova.context] Timed out waiting for response from cell 3d90ac16-
2019-11-05 23:12:57.390189 | ubuntu-bionic | 2019-11-05 23:11:23,415 ERROR [nova.api.
2019-11-05 23:12:57.390236 | ubuntu-bionic | Traceback (most recent call last):
2019-11-05 23:12:57.390299 | ubuntu-bionic | File "nova/api/
2019-11-05 23:12:57.390340 | ubuntu-bionic | return f(*args, **kwargs)
2019-11-05 23:12:57.390406 | ubuntu-bionic | File "nova/api/
2019-11-05 23:12:57.390450 | ubuntu-bionic | return func(*args, **kwargs)
2019-11-05 23:12:57.390515 | ubuntu-bionic | File "nova/api/
2019-11-05 23:12:57.390559 | ubuntu-bionic | return func(*args, **kwargs)
2019-11-05 23:12:57.390625 | ubuntu-bionic | File "nova/api/
2019-11-05 23:12:57.390668 | ubuntu-bionic | return func(*args, **kwargs)
2019-11-05 23:12:57.390734 | ubuntu-bionic | File "nova/api/
2019-11-05 23:12:57.390778 | ubuntu-bionic | return func(*args, **kwargs)
2019-11-05 23:12:57.390850 | ubuntu-bionic | File "nova/api/
2019-11-05 23:12:57.390918 | ubuntu-bionic | instance = common.
2019-11-05 23:12:57.390986 | ubuntu-bionic | File "nova/api/
2019-11-05 23:12:57.391035 | ubuntu-bionic | cell_down_
2019-11-05 23:12:57.391104 | ubuntu-bionic | File "nova/compute/
2019-11-05 23:12:57.391173 | ubuntu-bionic | expected_attrs, cell_down_
2019-11-05 23:12:57.391236 | ubuntu-bionic | File "nova/compute/
2019-11-05 23:12:57.391284 | ubuntu-bionic | expected_attrs, cell_down_support)
2019-11-05 23:12:57.391354 | ubuntu-bionic | File "nova/compute/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master) | #17 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master) | #18 |
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
Reason: Doesn't help:
Matt Riedemann (mriedem) wrote : | #19 |
(11:04:52 AM) mriedem: unrelated, but would like to do something about http://
(11:05:04 AM) mriedem: looking at logstash, that's primarily hitting on ovh nodes,
(11:05:17 AM) mriedem: i wonder if there is a way to exclude certain node providers from grenade jobs?
(11:05:29 AM) mriedem: i know you can specify a label to say a job should run on a given provider, but is there a NOT version of that?
(11:05:36 AM) clarkb: mriedem: we would have to make new nodepool flavors. I think there is a really good chance that swapping is what causes those problems
(11:05:43 AM) clarkb: particularly in ovh where we get fewer iops
(11:06:50 AM) mriedem: hmm, ok - i'm kind of looking for any workaround atm because that bug has been crushing us for months now
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to grenade (master) | #20 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on grenade (master) | #21 |
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
Reason: Need to do this via the feature support test matrix in devstack-gate:
melanie witt (melwitt) wrote : | #22 |
I dug into this more about a month ago and unfortunately came up empty-handed. Going to dump some info I gathered at the time on a DNM patch [1] here for the record.
"Looking at the mysql error log:
I see lots of errors [2] that I think must be why we get the cell timeouts:
...
2020-01-
...
...
I note that all services are getting the same error, not just nova. Includes cinder, neutron, placement, glance, keystone, and nova."
"For this failure, trying to correlate the cell timeout with interesting things from mysql and dstat logging:
screen-n-sch.txt [3]:
Jan 16 18:12:50.807350 ubuntu-
mysql error.txt [4]:
2020-01-
screen-dstat.txt [5]:
Jan 16 18:12:40.399103 ubuntu-
Latest query shows all hits showing up on OVH nodes.
melanie witt (melwitt) wrote : | #23 |
I started looking at this again after lyarwood mentioned it in the nova meeting today.
Looking at the logs/mysql/
2020-03-
so that looks likely to be a red herring.
The mysql logs aren't indexed by our logstash, so looking at a few by hand, there seems to be this consistent pattern in the failed jobs that is not present in the succeeded jobs:
2020-03-
...
I googled about this and learned it's a periodic task in mysql server that flushes dirty pages every second [1]. From the stackoverflow answer, they say:
"Once per second, the page cleaner scans the buffer pool for dirty pages to flush from the buffer pool to disk. The warning you saw shows that it has lots of dirty pages to flush, and it takes over 4 seconds to flush a batch of them to disk, when it should complete that work in under 1 second. In other words, it's biting off more than it can chew."
They go on to say that this issue can be exacerbated if it's happening on a machine with slow disks as that would also cause the page cleaning to fall behind.
The person who asked the question solved their issue by setting innodb_
They say:
"Another solution would be to put MySQL on a server with faster disks. You need an I/O system that can handle the throughput demanded by your page flushing.
If you see this warning all the time under average traffic, you might be trying to do too many write queries on this MySQL server. It might be time to scale out, and split the writes over multiple MySQL instances, each with their own disk system."
This again seems to point back at slow nodes.
I'm trying out a DNM devstack patch [2] to set innodb_
melanie witt (melwitt) wrote : | #24 |
Here's a link to a related ML thread having to do with gate failures on slow nodes and mentions the innodb page cleaner warning. Several of the posts in the thread show investigation that has happened and is relevant to this bug:
http://
melanie witt (melwitt) wrote : | #25 |
> I'm trying out a DNM devstack patch [2] to set innodb_
Well, this seems to have the opposite of the desired effect [4]:
2020-03-
...
but at least I can see the setting change did something.
The job run also happened not to fail, though. I'm going to try out some other settings related to the buffer pool next.
[1] https:/
[2] https:/
[3] https:/
[4] https:/
melanie witt (melwitt) wrote : | #26 |
I tried some settings in devstack:
[mysqld]
innodb_
innodb_
innodb_
but still got a nova-grenade-
Recall that I am running tests on a patch that has enabled db connection debug logging via:
[database]
use_db_reconnect = True
connection_debug = 100
With that in mind, looking at the screen-n-sch.txt log on the failed nova-grenade-
Contrast with a passing grenade-py3 run [3], where there are a total of 22 query results for compute_nodes logged in screen-n-sch.txt.
This is showing that we are consistently not receiving rows back from the database in the fail case.
[1] https:/
[2] http://
[3] https:/
melanie witt (melwitt) wrote : | #27 |
Another data point: I tried a new PS of my DNM patch where I reverted the commit that added use of the scatter gather utility (via eventlet greenthreads) to the scheduler host manager [1][2], with a Depends-On revert on stable/train too (since grenade upgrades N-1 => N) [3]. I did this to test whether we would still see a failure to retrieve compute_nodes rows from the database if we did _not_ use the scatter gather, or whether the problem would disappear and indicate that there is instead a bug or problem with the scatter gather utility.
We just got a failed result back [4] and it shows the same behavior as with scatter gather in place: we only succeed at receiving rows back from the compute_nodes table once and then never again. So the symptoms are identical whether or not scatter gather is used; we are not getting any compute_nodes back from the database.
This failure happened on an OVH node.
[1] https:/
[2] https:/
[3] https:/
[4] https:/
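For readers unfamiliar with the utility being reverted here, the following is a rough sketch of the general scatter-gather-with-timeout pattern using eventlet greenthreads. The names (scatter_gather, query_cell, CELL_TIMEOUT) are illustrative assumptions, not nova's actual implementation.
import eventlet
CELL_TIMEOUT = 60
did_not_respond_sentinel = object()
def scatter_gather(cells, query_cell):
    # Spawn one greenthread per cell, then gather results with a timeout.
    results = {}
    greenthreads = [(cell, eventlet.spawn(query_cell, cell)) for cell in cells]
    for cell, gt in greenthreads:
        try:
            with eventlet.Timeout(CELL_TIMEOUT):
                results[cell] = gt.wait()
        except eventlet.Timeout:
            # The cell did not answer in time; record a sentinel so callers
            # can tell "no response" apart from an empty result.
            results[cell] = did_not_respond_sentinel
    return results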
melanie witt (melwitt) wrote : | #28 |
After more digging than I'd care to admit, I think I have finally gotten to the bottom of what's happening with this bug.
Through DNM patch debug logging in oslo.db [1], I found that during a grenade run, after a nova-scheduler service stop and start, child processes of the nova-scheduler (workers) were occasionally starting off with already locked internal oslo.db locks. This can happen if requests are flowing in to the service while it is in the middle of forking child process workers: the first database request fires and takes the lock, and then child processes are forked while the lock is held.
When this happened, database accesses for the particular cell-cached database transaction context manager object could never acquire the lock and would just get stuck, eventually failing with a CellTimeout error.
Here's aggregated snippets of the DNM patch debug logging showing the inherited held locks:
http://
This behavior of not "resetting" or sanitizing standard library locks at fork is a known issue in python [2] that's currently being worked on.
In the meantime, I think we can handle this on our side by clearing our cell cache that holds oslo.db database transaction context manager objects during service start(). This way, we get fresh oslo.db locks that are in an unlocked state when a child process begins.
[1] https:/
[2] https:/
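The failure mode described in this comment can be reproduced outside of nova entirely. A minimal standalone sketch (plain threading and os.fork, no oslo.db involved) of a held lock being inherited by a forked child:
import os
import threading
lock = threading.Lock()
lock.acquire()          # parent holds the lock, e.g. mid database request
pid = os.fork()         # workers get forked while it is held (POSIX only)
if pid == 0:
    # Child: its copy of the lock starts out locked, and nothing in this
    # process will ever release it, so acquisition can only time out.
    print('child acquired lock:', lock.acquire(timeout=1))   # prints False
    os._exit(0)
else:
    lock.release()      # releasing in the parent does not unlock the child's copy
    os.waitpid(pid, 0)
Run on Linux, the child prints False: the inherited lock can never be acquired, which matches the stuck cell-cached transaction context managers described above.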
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master) | #29 |
Fix proposed to branch: master
Review: https:/
Changed in nova: | |
assignee: | nobody → melanie witt (melwitt) |
status: | Confirmed → In Progress |
melanie witt (melwitt) wrote : | #30 |
> This behavior of not "resetting" or sanitizing standard library locks at fork is a known issue in python [2] that's currently being worked on.
Correction: this issue [3] is not considered a bug in python and the recommended way of handling the problem is to use the os.register_
For now, we still support python 3.6, so we have to handle it outside of oslo.db in some way. But we can also add use of os.register_
[3] https:/
[4] https:/
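For reference, the Python 3.7+ hook this comment appears to refer to is os.register_at_fork. A minimal sketch of using it to reinitialize a module-level lock in the child process (illustrative only, not oslo.db's code):
import os
import threading
_lock = threading.Lock()
def _reset_lock_after_fork():
    # Replace the (possibly held) inherited lock with a fresh, unlocked one.
    global _lock
    _lock = threading.Lock()
# after_in_child callbacks run in the child process right after fork().
os.register_at_fork(after_in_child=_reset_lock_after_fork)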
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master) | #31 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 941559042f609ee
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000
Reset the cell cache for database access in Service
We have had a gate bug for a long time where occasionally the scheduler
service gets into a state where many requests fail in it with
CellTimeout errors. Example:
Timed out waiting for response from cell <cell uuid>
Through the use of much DNM patch debug logging in oslo.db, it was
revealed that service child processes (workers) were sometimes starting
off with already locked internal oslo.db locks. This is a known issue
in python [1] where if a parent process forks a child process while a
lock is held, the child will inherit the held lock which can never be
acquired.
The python issue is not considered a bug and the recommended way to
handle it is by making use of the os.register_
to reinitialize its lock. The method is new in python 3.7, so as long
as we still support python 3.6, we must handle the situation outside of
oslo.db.
We can do this by clearing the cell cache that holds oslo.db database
transaction context manager objects during service start(). This way,
we get fresh oslo.db locks that are in an unlocked state when a child
process begins.
We can also take this opportunity to resolve part of a TODO to clear
the same cell cache during service reset() (SIGHUP) since it is another
case where we intended to clear it. The rest of the TODO related to
periodic clearing of the cache is removed after discussion on the
review, as such clearing would be unsynchronized among multiple
services and for periods of time each service might have a different
view of cached cells than another.
Closes-Bug: #1844929
[1] https:/
Change-Id: Id233f673a57461
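As a rough illustration of the shape of that fix (hypothetical names, not nova's actual code), clearing a per-cell cache of database transaction context managers on service start() and reset():
# Hypothetical stand-in for nova's cache of per-cell database transaction
# context managers; the real cache lives elsewhere in nova.
CELL_CACHE = {}
class Service(object):
    def start(self):
        # Drop anything cached before the workers were forked; each worker
        # then lazily rebuilds the cache with objects (and oslo.db locks)
        # it created itself, so none start out already held.
        CELL_CACHE.clear()
        # ... normal service startup continues here ...
    def reset(self):
        # SIGHUP handling clears the cache for the same reason.
        CELL_CACHE.clear()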
Changed in nova: | |
status: | In Progress → Fix Released |
melanie witt (melwitt) wrote : | #32 |
Through my testing, I found this not to be a problem in grenade.
Changed in grenade: | |
status: | New → Invalid |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train) | #33 |
Fix proposed to branch: stable/train
Review: https:/
Balazs Gibizer (balazs-gibizer) wrote : | #34 |
It seems the problem is still happening even after the fix merged to master.
Here is a recent appearance https:/
Changed in nova: | |
status: | Fix Released → Confirmed |
Balazs Gibizer (balazs-gibizer) wrote : | #35 |
As Melanie correctly stated, we need to merge the Train backport of the bugfix to have the problem disappear from grenade jobs, as those run a train -> master upgrade.
Changed in nova: | |
status: | Confirmed → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train) | #36 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/train
commit 88205a4e911268d
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000
Reset the cell cache for database access in Service
We have had a gate bug for a long time where occasionally the scheduler
service gets into a state where many requests fail in it with
CellTimeout errors. Example:
Timed out waiting for response from cell <cell uuid>
Through the use of much DNM patch debug logging in oslo.db, it was
revealed that service child processes (workers) were sometimes starting
off with already locked internal oslo.db locks. This is a known issue
in python [1] where if a parent process forks a child process while a
lock is held, the child will inherit the held lock which can never be
acquired.
The python issue is not considered a bug and the recommended way to
handle it is by making use of the os.register_
to reinitialize its lock. The method is new in python 3.7, so as long
as we still support python 3.6, we must handle the situation outside of
oslo.db.
We can do this by clearing the cell cache that holds oslo.db database
transaction context manager objects during service start(). This way,
we get fresh oslo.db locks that are in an unlocked state when a child
process begins.
We can also take this opportunity to resolve part of a TODO to clear
the same cell cache during service reset() (SIGHUP) since it is another
case where we intended to clear it. The rest of the TODO related to
periodic clearing of the cache is removed after discussion on the
review, as such clearing would be unsynchronized among multiple
services and for periods of time each service might have a different
view of cached cells than another.
Closes-Bug: #1844929
[1] https:/
NOTE(melwitt): The differences from the original change in
nova/
I91fa2f73
I8c96b337
are not in Train.
Change-Id: Id233f673a57461
(cherry picked from commit 941559042f609ee
melanie witt (melwitt) wrote : | #37 |
Since the Train change has merged, we should expect to see the problem disappear from changes proposed to the master branch. We will still be able to see failures on stable branches though, until we merge changes to Stein, Rocky, Queens.
Let's keep an eye on the chart and logstash query results and ensure we no longer see hits on master changes:
http://
And I'll go ahead and propose a change for Stein while we watch the master branch.
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein) | #38 |
Fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky) | #39 |
Fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens) | #40 |
Fix proposed to branch: stable/queens
Review: https:/
melanie witt (melwitt) wrote : | #41 |
I went ahead and uploaded changes for Stein, Rocky, and Queens so that they are ready, after finding on logstash a change [1] on stable/train that failed due to this bug.
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein) | #42 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit 4de766006d9432a
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000
Reset the cell cache for database access in Service
We have had a gate bug for a long time where occasionally the scheduler
service gets into a state where many requests fail in it with
CellTimeout errors. Example:
Timed out waiting for response from cell <cell uuid>
Through the use of much DNM patch debug logging in oslo.db, it was
revealed that service child processes (workers) were sometimes starting
off with already locked internal oslo.db locks. This is a known issue
in python [1] where if a parent process forks a child process while a
lock is held, the child will inherit the held lock which can never be
acquired.
The python issue is not considered a bug and the recommended way to
handle it is by making use of the os.register_
to reinitialize its lock. The method is new in python 3.7, so as long
as we still support python 3.6, we must handle the situation outside of
oslo.db.
We can do this by clearing the cell cache that holds oslo.db database
transaction context manager objects during service start(). This way,
we get fresh oslo.db locks that are in an unlocked state when a child
process begins.
We can also take this opportunity to resolve part of a TODO to clear
the same cell cache during service reset() (SIGHUP) since it is another
case where we intended to clear it. The rest of the TODO related to
periodic clearing of the cache is removed after discussion on the
review, as such clearing would be unsynchronized among multiple
services and for periods of time each service might have a different
view of cached cells than another.
Closes-Bug: #1844929
[1] https:/
Change-Id: Id233f673a57461
(cherry picked from commit 941559042f609ee
(cherry picked from commit 88205a4e911268d
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky) | #43 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit a86ebc75eb886bd
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000
Reset the cell cache for database access in Service
We have had a gate bug for a long time where occasionally the scheduler
service gets into a state where many requests fail in it with
CellTimeout errors. Example:
Timed out waiting for response from cell <cell uuid>
Through the use of much DNM patch debug logging in oslo.db, it was
revealed that service child processes (workers) were sometimes starting
off with already locked internal oslo.db locks. This is a known issue
in python [1] where if a parent process forks a child process while a
lock is held, the child will inherit the held lock which can never be
acquired.
The python issue is not considered a bug and the recommended way to
handle it is by making use of the os.register_
to reinitialize its lock. The method is new in python 3.7, so as long
as we still support python 3.6, we must handle the situation outside of
oslo.db.
We can do this by clearing the cell cache that holds oslo.db database
transaction context manager objects during service start(). This way,
we get fresh oslo.db locks that are in an unlocked state when a child
process begins.
We can also take this opportunity to resolve part of a TODO to clear
the same cell cache during service reset() (SIGHUP) since it is another
case where we intended to clear it. The rest of the TODO related to
periodic clearing of the cache is removed after discussion on the
review, as such clearing would be unsynchronized among multiple
services and for periods of time each service might have a different
view of cached cells than another.
Closes-Bug: #1844929
[1] https:/
NOTE(melwitt): The difference from the Stein change in
nova/
Idaed396290
Change-Id: Id233f673a57461
(cherry picked from commit 941559042f609ee
(cherry picked from commit 88205a4e911268d
(cherry picked from commit 4de766006d9432a
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens) | #44 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 5aa78acbce9fa1f
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000
Reset the cell cache for database access in Service
We have had a gate bug for a long time where occasionally the scheduler
service gets into a state where many requests fail in it with
CellTimeout errors. Example:
Timed out waiting for response from cell <cell uuid>
Through the use of much DNM patch debug logging in oslo.db, it was
revealed that service child processes (workers) were sometimes starting
off with already locked internal oslo.db locks. This is a known issue
in python [1] where if a parent process forks a child process while a
lock is held, the child will inherit the held lock which can never be
acquired.
The python issue is not considered a bug and the recommended way to
handle it is by making use of the os.register_
to reinitialize its lock. The method is new in python 3.7, so as long
as we still support python 3.6, we must handle the situation outside of
oslo.db.
We can do this by clearing the cell cache that holds oslo.db database
transaction context manager objects during service start(). This way,
we get fresh oslo.db locks that are in an unlocked state when a child
process begins.
We can also take this opportunity to resolve part of a TODO to clear
the same cell cache during service reset() (SIGHUP) since it is another
case where we intended to clear it. The rest of the TODO related to
periodic clearing of the cache is removed after discussion on the
review, as such clearing would be unsynchronized among multiple
services and for periods of time each service might have a different
view of cached cells than another.
Closes-Bug: #1844929
[1] https:/
NOTE(melwitt): This backport differs slightly in that the test setup
calls set_stub_
I1dbccc2be6
Change-Id: Id233f673a57461
(cherry picked from commit 941559042f609ee
(cherry picked from commit 88205a4e911268d
(cherry picked from commit 4de766006d9432a
(cherry picked from commit a86ebc75eb886bd
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol | #45 |
This issue was fixed in the openstack/nova queens-eol release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol | #46 |
This issue was fixed in the openstack/nova rocky-eol release.