Reset the cell cache for database access in Service
We have had a gate bug for a long time in which the scheduler
service occasionally gets into a state where many of its requests fail
with CellTimeout errors. Example:
Timed out waiting for response from cell <cell uuid>
Extensive debug logging added to oslo.db via DNM patches revealed
that service child processes (workers) were sometimes starting off
with oslo.db's internal locks already held. This is a known issue in
python [1]: if a parent process forks a child while a lock is held,
the child inherits the held lock, which can then never be acquired.
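The inherited-lock behavior can be reproduced in a few lines, independent of oslo.db (a minimal POSIX-only sketch, not nova code):

```python
import os
import threading

lock = threading.Lock()

lock.acquire()       # parent holds the lock...
pid = os.fork()      # ...then forks while it is held

if pid == 0:
    # Child: the lock's state was copied at fork time, so it appears
    # held here even though no thread in this process holds it.
    # A blocking acquire() would therefore hang forever.
    print("child sees lock held:", lock.locked())
    os._exit(0)
else:
    os.waitpid(pid, 0)
    lock.release()
    print("parent released lock")
```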
The python issue is not considered a bug, and the recommended way to
handle it is to use os.register_at_fork() in oslo.db to reinitialize
its lock. That method is new in python 3.7, so as long as we still
support python 3.6, we must handle the situation outside of oslo.db.
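For illustration, the recommended os.register_at_fork() approach looks roughly like this (a hypothetical stand-in class, not oslo.db's real API):

```python
import os
import threading

class CachedTransactionFactory:
    """Toy stand-in for an object holding an internal lock
    (hypothetical; not oslo.db's actual transaction factory)."""

    def __init__(self):
        self._lock = threading.Lock()
        # Python 3.7+ only: give every forked child a fresh, unlocked
        # lock instead of a copy of whatever state the parent held.
        os.register_at_fork(after_in_child=self._reinit_lock)

    def _reinit_lock(self):
        self._lock = threading.Lock()
```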
We can do this by clearing the cell cache that holds oslo.db database
transaction context manager objects during service start(). This way,
we get fresh oslo.db locks that are in an unlocked state when a child
process begins.
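The approach can be sketched as follows (hypothetical names; nova's actual cell cache and Service class differ):

```python
class Service:
    """Sketch of a service whose workers fork after start()
    (hypothetical; not nova's actual Service class)."""

    # cell uuid -> oslo.db transaction context manager
    # (each context manager holds internal oslo.db locks)
    cell_cache = {}

    def start(self):
        # Discard cached context managers so they are rebuilt lazily
        # in each worker, giving every child process fresh, unlocked
        # oslo.db locks.
        self.cell_cache.clear()
```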
We can also take this opportunity to resolve part of a TODO to clear
the same cell cache during service reset() (SIGHUP) since it is another
case where we intended to clear it. The rest of the TODO related to
periodic clearing of the cache is removed after discussion on the
review, as such clearing would be unsynchronized among multiple
services, and for periods of time each service might have a different
view of the cached cells than the others.
NOTE(melwitt): This backport differs slightly in that the test setup
calls set_stub_network_methods because change
I1dbccc2be6ba79bf267edac9208c80e187e6256a is not in Queens.
Change-Id: Id233f673a57461cc312e304873a41442d732c051
(cherry picked from commit 941559042f609ee43ff3160c0f0d0c45187be17f)
(cherry picked from commit 88205a4e911268dae7120a6a43ff9042d1534251)
(cherry picked from commit 4de766006d9432a7ccbcf6a4d4232db472b2f0e1)
(cherry picked from commit a86ebc75eb886bd293dca42439762ecdd69ca0d7)
Reviewed: https://review.opendev.org/720596
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5aa78acbce9fa1fa18ac963385e82a6367ff445e
Submitter: Zuul
Branch: stable/queens
commit 5aa78acbce9fa1fa18ac963385e82a6367ff445e
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000
Closes-Bug: #1844929
[1] https://bugs.python.org/issue6721