Comment 28 for bug 1844929

melanie witt (melwitt) wrote :

After more digging than I'd care to admit, I think I have finally gotten to the bottom of what's happening with this bug.

Through DNM patch debug logging in oslo.db [1], I found that during a grenade run, after a nova-scheduler service stop and start, child processes of the nova-scheduler (workers) would occasionally start off with internal oslo.db locks already held. This can happen if requests are flowing into the service while it is in the middle of forking its worker processes: the first database request fires and takes the lock, and the child processes are then forked while the lock is still held.
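
To illustrate the mechanism (this is a minimal standalone sketch, not nova code, and it assumes a POSIX platform where os.fork() is available): a lock that is held at fork time is inherited by the child in the locked state, and since the owning thread does not exist in the child, nothing will ever release it.

import os
import threading

lock = threading.Lock()
lock.acquire()  # parent takes the lock, e.g. mid-request

pid = os.fork()
if pid == 0:
    # Child: the inherited lock is already held, but no thread in this
    # process will ever release it, so a blocking acquire would hang forever.
    print("child sees lock locked:", lock.locked())  # True
    acquired = lock.acquire(timeout=1)               # times out
    print("child could acquire:", acquired)          # False
    os._exit(0)
else:
    os.waitpid(pid, 0)
    lock.release()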

When this happened, database accesses going through the affected cell-cached database transaction context manager object could never acquire the lock and would just get stuck, eventually failing with a CellTimeout error.

Here are aggregated snippets of the DNM patch debug logging showing the inherited held locks:

http://paste.openstack.org/show/791646

This behavior of not "resetting" or sanitizing standard library locks at fork is a known issue in Python [2] that is currently being worked on.

In the meantime, I think we can handle this on our side by clearing our cell cache, which holds the oslo.db database transaction context manager objects, during service start(). That way, each child process begins with fresh oslo.db locks in the unlocked state.
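
A rough sketch of the idea (names are illustrative, assuming a module-level cell cache similar to nova.context.CELL_CACHE; this is not the exact patch):

# context.py (illustrative)
CELL_CACHE = {}

# service.py (illustrative)
class Service(object):
    def start(self):
        # Discard any cached transaction context managers inherited from the
        # parent process. The next database access per cell will construct a
        # new context manager with fresh oslo.db locks in the unlocked state,
        # so a lock held at fork time can no longer wedge the worker.
        global CELL_CACHE
        CELL_CACHE = {}
        # ... the rest of normal service start-up continues here ...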

[1] https://review.opendev.org/#/c/714802/6
[2] https://bugs.python.org/issue40089