Robustify locking system (utils.synchronized)

Bug #872558 reported by Cedric Brandily
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Stanislaw Pitucha
Milestone: 2012.2

Bug Description

As noted by soren, there are problems with utils.synchronized:

Scenario:
T1, T2, T3: three greenthreads executing a function decorated with utils.synchronized("oof")

T1 associates a semaphore S with "oof" in _semaphores,
T1 acquires S,
T2 gets the same semaphore S and waits for S,
T1 releases S and removes "oof"/S from _semaphores,
T2 acquires S, which is no longer in _semaphores

scenario's first possible ending (not really annoying):
T2 releases S and crashes when removing "oof" from _semaphores

scenario's second possible ending (more annoying):
T3 associates a new semaphore S' with "oof" in _semaphores,
T3 acquires S' ... T2 and T3 are doing "oof" things at the same time
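
For illustration, a simplified reconstruction of the pattern this scenario describes (not the exact nova code; it assumes eventlet's semaphore module, as in the snippets later in this thread):

    import functools

    from eventlet import semaphore

    _semaphores = {}

    def synchronized(name):
        def wrap(f):
            @functools.wraps(f)
            def inner(*args, **kwargs):
                # T1 (or later T3) creates the semaphore on demand ...
                if name not in _semaphores:
                    _semaphores[name] = semaphore.Semaphore()
                sem = _semaphores[name]
                with sem:
                    retval = f(*args, **kwargs)
                # ... and removes it after releasing; if another greenthread
                # already removed the entry this raises KeyError (first
                # ending), and a third one may meanwhile have created a new
                # S' under the same name (second ending).
                del _semaphores[name]
                return retval
            return inner
        return wrap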

The attachment contains a test that highlights the first scenario and a proposed implementation that solves the problem.
The proposed implementation of synchronized is built around the class LockByKey.
This class is inspired by the implementation of threading.Semaphore (which I hope is quite robust!).

As far as I have tested, the proposed implementation supports both green threads and native ones.

Revision history for this message
Cedric Brandily (cbrandily) wrote :
Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → Medium
status: New → Triaged
Mike Pittaro (mikeyp-3)
Changed in nova:
assignee: nobody → Mike Pittaro (mikeyp-3)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/4832

Revision history for this message
Cedric Brandily (cbrandily) wrote :

My feedback (now with a better understanding of greenthreads/coroutines, as in eventlet):

- utils.synchronized works well if we only use it with coroutines (greenthreads), because eventlet constrains where switches between greenthreads happen
- utils.synchronized does not work with real threads ...
==> so we either have to never use real threads, or change utils.synchronized

I ran into trouble with the condition in my proposal (a coroutine liveness issue), so I use weak references instead in LockByKey (... the cleaning comes embedded :)). The code is simpler:

import contextlib
import threading
import weakref


class LockByKey(object):
    '''
    usage:
    locks = LockByKey()
    with locks('using __call__ + with'):
        pass
    with locks['using __getitem__ + with']:
        pass
    '''
    def __init__(self, lock_factory=None):
        self._lock_factory = lock_factory or threading.Lock
        self._lock = self._lock_factory()
        self._weaklocks = weakref.WeakValueDictionary()

    def _get_lock(self, lock_key):
        # get or lazily create the per-key lock; the WeakValueDictionary
        # drops an entry automatically once nobody holds its lock any more
        with self._lock:
            lock = self._weaklocks.get(lock_key)
            if lock is None:
                self._weaklocks[lock_key] = lock = self._lock_factory()
        return lock

    @contextlib.contextmanager
    def __call__(self, lock_key):
        with self._get_lock(lock_key):
            yield
    __getitem__ = __call__
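
For context, a hypothetical sketch (not part of the attached patch) of how utils.synchronized could be expressed on top of LockByKey:

    import functools

    _locks = LockByKey()

    def synchronized(name):
        def wrap(f):
            @functools.wraps(f)
            def inner(*args, **kwargs):
                # the per-name lock lives only as long as someone uses it
                with _locks[name]:
                    return f(*args, **kwargs)
            return inner
        return wrap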

Revision history for this message
Cedric Brandily (cbrandily) wrote :

    with _semaphores_lock:                             # part 1
        if name not in _semaphores:                    # part 1
            _semaphores[name] = semaphore.Semaphore()  # part 1
    sem = _semaphores[name]                            # part 3
    ...
    with _semaphores_lock:                             # part 2
        if not sem.balance < 1:                        # part 2
            del _semaphores[name]                      # part 2

The solution you propose does not work in the following scenario:
one thread (T1) does part 1,
another one does part 2,
then T1 (part 3) uses a lock that is no longer in _semaphores.

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

Is the implementation really a problem? Nova doesn't use real threads except for a couple of well-defined areas that don't use utils.synchronized().

I'm not sure I understand when the existing implementation can run into a problem.

Revision history for this message
Mike Pittaro (mikeyp-3) wrote :

The original bug report introduces a greenthread.sleep() to reproduce the problem, but since the logging calls involve I/O, I still think there's a potential problem when we're debug logging.

There's no unit test for this code - I'll add one and see what that finds. Another test won't hurt.

Also, is there any good reason to delete the semaphores anyway? We have fewer than a dozen in the code, and they're not very expensive in terms of resources.

Revision history for this message
Cedric Brandily (cbrandily) wrote :

I never thought about the I/O in logging, but you're right.

If I remember well, some synchronized decorators are used with "variable" lock names (see the nova.virt.libvirt.connection module).
And perhaps there are (or will be) synchronized calls that use a name based on the project/VM ...
So without cleaning, the number of semaphores will grow over time.

One solution is to turn _semaphores into a weakref.WeakValueDictionary (beware, it's not sufficient on its own); you no longer need to do the cleaning yourself because the gc will do it for you.
That is mainly what LockByKey does.
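
A minimal sketch of the cleanup behaviour being described (threading.Semaphore is used here only to keep the snippet dependency-free; nova's locks are eventlet semaphores):

    import threading
    import weakref

    _semaphores = weakref.WeakValueDictionary()

    # the dict entry and the local name both point at the same semaphore
    sem = _semaphores["some-instance"] = threading.Semaphore()
    assert "some-instance" in _semaphores

    del sem  # drop the last strong reference
    # under CPython's reference counting the entry vanishes immediately
    assert "some-instance" not in _semaphores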

Revision history for this message
Mike Pittaro (mikeyp-3) wrote :

The unit tests already exist in test_misc.py (I originally looked in test_utils.py). Those tests are robust, and I repeatedly ran them without reproducing an error. I also tried some variations on the original test case, and I can't reproduce a problem unless I add artificial sleeps.

I don't think this is an Essex issue - I propose revisiting it in Folsom.

Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :

Revisiting this in Folsom.
There seem to be only 3 call sites using synchronized with a variable name, but they can still produce a high number of arbitrary-name locks on a busy host:

- images (nova/virt/libvirt/imagebackend.py, nova/virt/baremetal/driver.py)
- instance_uuid (nova/compute/manager.py)

The last comment from ZZ mentioned that weakref.WeakValueDictionary is not sufficient, but without an example. I think the corner case there is that the semaphore has to be checked and inserted in an "atomic" way, to make sure it doesn't disappear due to GC between creating it and using it.

It seems like the code would have to look like this (with no "del _semaphores[name]" at the end of the function):

    _semaphores = weakref.WeakValueDictionary()

    semaphore = _semaphores.get(name, Semaphore())
    # we have either a new semaphore, or the old one has a reference now
    # if we've got a new one, name is not in the dict anymore, insert it
    if name not in _semaphores:
        _semaphores[name] = semaphore

This will create a new semaphore on every invocation, but that shouldn't be bad - hopefully we have no lock contention, so we would have needed to create one anyway with the old solution.

Is there anything missing from this approach?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/10924

Changed in nova:
assignee: Mike Pittaro (mikeyp-3) → Stanislaw Pitucha (stanislaw-pitucha)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/10924
Committed: http://github.com/openstack/nova/commit/9206ee5a63a65e076342896e3b41bbcbf819af56
Submitter: Jenkins
Branch: master

commit 9206ee5a63a65e076342896e3b41bbcbf819af56
Author: Stanislaw Pitucha <email address hidden>
Date: Mon Aug 6 23:16:07 2012 +0100

    Solve possible race in semaphor creation

    Solve the race condition described in the bug 872558 report which can
    result in:

    - thread crashing trying to remove semaphore from dict
    - two threads getting different semaphores for the same name

    First case is solved automatically by weakref dictionary. No explicit
    deletion takes place.
    The second case is solved by getting existing or new semaphore in one
    step. Once a local reference to the semaphore is obtained, it can be
    safely assigned to the dictionary if it's missing. If it's present, it
    will not be removed anymore because there's at least one strong
    reference present (local variable 'sem').

    This solution is only valid for greenthreads.

    Change-Id: I6bddc3e7abb39fd75e1f03abb2ea0f911b761957
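
For reference, a minimal sketch consistent with the commit description above (the exact merged nova code may differ in details; eventlet's semaphore module is assumed, as elsewhere in this thread):

    import functools
    import weakref

    from eventlet import semaphore

    _semaphores = weakref.WeakValueDictionary()

    def synchronized(name):
        def wrap(f):
            @functools.wraps(f)
            def inner(*args, **kwargs):
                # get the existing semaphore or a fresh one in a single step;
                # 'sem' is a strong reference, so GC cannot drop the entry
                # while anyone is using it, and nothing is deleted explicitly
                sem = _semaphores.get(name, semaphore.Semaphore())
                if name not in _semaphores:
                    _semaphores[name] = sem
                with sem:
                    return f(*args, **kwargs)
            return inner
        return wrap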

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Cedric Brandily (cbrandily) wrote :

Sorry, I forgot to share the weakref example/adaptation, which is quite similar to your solution:

    semaphore = _semaphores.get(name, Semaphore())
    # we have either a new semaphore, or the old one has a reference now
    # if we've got a new one, name is not in the dict anymore, insert it
    if name not in _semaphores:
        _semaphores[name] = semaphore

I would just propose the following improvement:
    semaphore = _semaphores.get(name)
    # we have either a reference to a "old" semaphore or None
    # if we've got None, we need to create a semaphore and add it to the dict
    if semaphore is None:
        semaphore = _semaphores[name] = Semaphore()

It avoids creating a new semaphore every time even when you already have one in the weakref dict! And I find it cleaner and clearer to understand.

Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :

You're right - I guess I was trying to be too careful about atomic semaphore insertion. Since context switches cannot happen between the .get() and the assignment, this affects neither GC nor the atomicity of the dict update. Also, .get(name, default) is not guaranteed to be atomic anyway, so it could be simplified.

Do you want to submit it? It's your modification after all.

Thierry Carrez (ttx)
Changed in nova:
milestone: none → folsom-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-3 → 2012.2