Reset the cell cache for database access in Service
We have had a long-standing gate bug where the scheduler service
occasionally gets into a state where many requests fail with
CellTimeout errors. Example:
Timed out waiting for response from cell <cell uuid>
Through extensive DNM (do-not-merge) patch debug logging in oslo.db, it
was revealed that service child processes (workers) were sometimes
starting off with oslo.db's internal locks already held. This is a
known issue in python [1]: if a parent process forks a child process
while a lock is held, the child inherits the held lock, which can then
never be acquired.
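A minimal standalone reproduction of the inherited-lock deadlock
(illustrative only, POSIX-only since it uses os.fork(); not nova or
oslo.db code):

    import os
    import threading

    lock = threading.Lock()

    def main():
        # Parent acquires the lock, then forks while holding it.
        lock.acquire()
        pid = os.fork()
        if pid == 0:
            # Child: the lock state was copied as "held", but the owner
            # only exists in the parent, so it is never released here.
            # A plain acquire() would block forever; a timeout shows it.
            got_it = lock.acquire(timeout=2)
            print('child acquired inherited lock: %s' % got_it)  # False
            os._exit(0)
        os.waitpid(pid, 0)
        lock.release()

    if __name__ == '__main__':
        main()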
The python issue is not considered a bug, and the recommended way to
handle it is to use os.register_at_fork() in oslo.db to reinitialize
its lock. That method is new in python 3.7, so as long as we still
support python 3.6, we must handle the situation outside of oslo.db.
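For reference, this is roughly what the python 3.7+ approach would look
like inside oslo.db; _lock and _reinit_lock here are illustrative
names, not oslo.db's actual internals:

    import os
    import threading

    # Stand-in for oslo.db's internal module-level lock.
    _lock = threading.Lock()

    def _reinit_lock():
        # Replace the (possibly held) inherited lock with a fresh one so
        # a forked child never blocks on a lock whose owner died with
        # the parent process.
        global _lock
        _lock = threading.Lock()

    # New in python 3.7: run _reinit_lock in the child right after
    # fork().
    os.register_at_fork(after_in_child=_reinit_lock)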
We can do this by clearing the cell cache that holds oslo.db database
transaction context manager objects during service start(). This way,
we get fresh oslo.db locks that are in an unlocked state when a child
process begins.
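A minimal sketch of the shape of the fix, assuming the cache is the
module-level context.CELL_CACHE dict; the Service class here is
simplified for illustration, not nova's actual implementation:

    from nova import context
    from oslo_service import service

    class Service(service.Service):
        """Simplified stand-in for nova's Service object."""

        def start(self):
            # Clear the cell cache holding database transaction context
            # manager objects, so each worker builds fresh oslo.db locks
            # in an unlocked state after fork() instead of inheriting
            # locks the parent may have held at fork time.
            context.CELL_CACHE = {}
            # ... rest of the normal service startup would go here ...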
We can also take this opportunity to resolve part of a TODO to clear
the same cell cache during service reset() (SIGHUP), since that is
another case where we intended to clear it, as sketched below. The rest
of the TODO, about periodically clearing the cache, is dropped after
discussion on the review: such clearing would be unsynchronized among
multiple services, so for periods of time each service might have a
different view of the cached cells than another.
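Continuing the simplified sketch above, the reset() (SIGHUP) half would
look like:

        def reset(self):
            # SIGHUP handler: drop the cell cache here too, so a reset
            # discards stale cell entries and rebuilds the transaction
            # context managers (and their locks) from scratch.
            context.CELL_CACHE = {}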
Reviewed: https://review.opendev.org/717662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=941559042f609ee43ff3160c0f0d0c45187be17f
Submitter: Zuul
Branch: master
commit 941559042f609ee43ff3160c0f0d0c45187be17f
Author: melanie witt <email address hidden>
Date: Fri Apr 3 21:22:27 2020 +0000
Closes-Bug: #1844929
[1] https://bugs.python.org/issue6721
Change-Id: Id233f673a57461cc312e304873a41442d732c051