Libvirt's close callback causes deadlocks

Bug #1293641 reported by John Warren
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: John Warren
Milestone: 2014.1

Bug Description

Libvirt's close callback causes deadlocks

Unlike libvirt's lifecycle event callback, which is triggered every time an event occurs, the close callback is only triggered when an attempt is made to use a connection that has already been closed. In that case, the sequence of events is usually as follows:

1. LibvirtDriver._get_connection acquires _wrapped_conn_lock
2. LibvirtDriver._get_connection calls _test_connection
3. LibvirtDriver._test_connection calls libvirt.getLibVersion
4. libvirt.getLibVersion triggers LibvirtDriver._close_callback (because the connection is closed and this method is the registered handler)
5. LibvirtDriver._close_callback attempts to acquire _wrapped_conn_lock

_get_connection cannot release the lock because it is waiting for _close_callback to return and the latter cannot complete until it has acquired the lock.
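
The result is a classic recursive-acquisition deadlock on a non-reentrant lock, which can be reproduced in isolation (a minimal sketch with illustrative names, not the actual driver code):

    import threading

    _wrapped_conn_lock = threading.Lock()  # non-reentrant, like the driver's lock

    def _close_callback(conn, reason, opaque):
        with _wrapped_conn_lock:  # step 5: blocks forever
            pass  # ...disable the host here...

    def _get_connection():
        with _wrapped_conn_lock:  # step 1: the lock is held here
            # Steps 2-4: the connection test ends up invoking
            # _close_callback synchronously on this same thread.
            _close_callback(None, None, None)

    _get_connection()  # never returns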

Making the handling of the close callback asynchronous (like the lifecycle event handling) won't work, because by the time the lock is released, the connection object that was passed into the callback will no longer be equal to LibvirtDriver._wrapped_conn. Even if the connection object is ignored, the instance will have already been disabled via the _get_new_connection method's existing error-handling logic.

The best solution would appear to be to simply not register a close callback. The only case where it might provide some benefit is when a connection is closed after _get_connection has returned a reference to it. The benefit of disabling the instance a little earlier in such marginal cases is arguably outweighed by the complexity of a thread-safe implementation, especially when the difficulty of testing such an implementation (to ensure it is indeed thread safe) is taken into consideration.
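
For context, the callback in question is registered through libvirt-python's registerCloseCallback API, so the proposed fix amounts to dropping a single call (a sketch, not the actual driver code):

    import libvirt

    conn = libvirt.open('qemu:///system')
    # Omitting this registration is the solution proposed above;
    # _close_callback here stands in for LibvirtDriver._close_callback.
    conn.registerCloseCallback(_close_callback, None)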

Note that having _close_callback use _wrapped_conn_lock.acquire(False) instead of "with _wrapped_conn_lock" is not by itself a viable solution: because connections are opened via tpool.proxy_call, the close callback is invoked by a native thread, and a native thread should not perform the various operations (including logging) involved in disabling the instance.
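
For completeness, the rejected non-blocking variant would look roughly like this (sketch): it avoids the recursive deadlock, but the body still runs on a native thread.

    def _close_callback(conn, reason, opaque):
        # acquire(False) returns immediately instead of blocking,
        # so the recursive deadlock is avoided...
        if not _wrapped_conn_lock.acquire(False):
            return
        try:
            # ...but this still executes on a native thread, which
            # must not log or otherwise disable the instance.
            pass
        finally:
            _wrapped_conn_lock.release()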

Tags: libvirt
Revision history for this message
Tracy Jones (tjones-i) wrote :

adding to rc1 for visibility

Changed in nova:
milestone: none → icehouse-rc1
Revision history for this message
Joe Gordon (jogo) wrote :

We are tracking several known libvirt bugs at the moment (https://blueprints.launchpad.net/nova/+spec/support-libvirt-1x). What version of libvirt are you using, and do any of the known bugs sound like they are a duplicate of this?

Changed in nova:
status: New → Incomplete
Revision history for this message
John Warren (jswarren) wrote :

I am using version 1.1.1 and none of the other bugs look like a duplicate of this one.

Revision history for this message
Daniel Berrange (berrange) wrote :

NB: this isn't a deadlock in libvirt code; it is in the Nova libvirt driver itself, namely recursive acquisition of _wrapped_conn_lock.

Revision history for this message
Daniel Berrange (berrange) wrote :

I think we can just move the call to '_test_connection' outside the lock and loop to retry upon failure, e.g.:

    def _get_connection(self):
        wrapped_conn = None
        # Multiple concurrent connections are protected by _wrapped_conn_lock.
        while not wrapped_conn:
            with self._wrapped_conn_lock:
                if not self._wrapped_conn:
                    self._wrapped_conn = self._get_new_connection()
                wrapped_conn = self._wrapped_conn

            # The connection is tested outside the lock, so a close
            # callback fired by the test cannot recursively acquire it.
            if not self._test_connection(wrapped_conn):
                wrapped_conn = None

        return wrapped_conn

Revision history for this message
John Warren (jswarren) wrote :

That solution does not address the fact that _close_callback is called by a native thread: none of the operations involved in disabling the host, including logging, may be performed by a native thread. A proposed solution is already in the works: https://review.openstack.org/#/c/79617/ It also covers a separate bug that describes the native-thread issue.

Changed in nova:
status: Incomplete → Confirmed
importance: Undecided → High
status: Confirmed → In Progress
assignee: nobody → John Warren (jswarren)
tags: added: libvirt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/79617
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dba898226ec26ab49cecd24e9d6c255cf6153cd1
Submitter: Jenkins
Branch: master

commit dba898226ec26ab49cecd24e9d6c255cf6153cd1
Author: John Warren <email address hidden>
Date: Tue Mar 11 14:40:31 2014 +0000

    Change libvirt close callback to use green thread

    The native thread that calls the close callback is being
    used to update a host's status, potentially causing threading
    issues, since it involves operations that should only be
    performed by green threads. This change fixes the issue
    by limiting the native thread's interaction to adding data
    to a queue and leaving the remaining work for
    a green thread.

    Closes-Bug: #1290487
    Closes-Bug: #1293641

    Change-Id: I59c14da15e4655dc4bf21d4de3d509472b146f7a
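
A minimal sketch of the pattern the commit message describes (hypothetical names, not the actual Nova code): the native thread that libvirt calls into only enqueues the event, and a green thread drains the queue and does the logging and host-state work.

    import queue

    import eventlet

    # Filled by libvirt's native threads, drained by a green thread;
    # queue.Queue is safe to call from native threads.
    _event_queue = queue.Queue()

    def _close_callback(conn, reason, opaque):
        # Native thread: do nothing except enqueue the event.
        _event_queue.put((conn, reason))

    def _dispatch_events():
        # Green thread: logging and host-state updates are safe here.
        while True:
            try:
                conn, reason = _event_queue.get(block=False)
                _handle_conn_close(conn, reason)  # hypothetical handler
            except queue.Empty:
                eventlet.sleep(0.5)  # yield to other green threads

    eventlet.spawn(_dispatch_events)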

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-rc1 → 2014.1