Libvirt's close callback causes deadlocks

Bug #1293641 reported by John Warren
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: John Warren
Milestone: 2014.1

Bug Description

Libvirt's close callback causes deadlocks

Unlike libvirt's lifecycle event callback, which is triggered every time an event occurs, the close callback is only triggered when an attempt is made to use a connection that has already been closed. In that case, the sequence of events is usually as follows:

1. LibvirtDriver._get_connection acquires _wrapped_conn_lock
2. LibvirtDriver._get_connection calls _test_connection
3. LibvirtDriver._test_connection calls libvirt.getLibVersion
4. libvirt.getLibVersion triggers LibvirtDriver._close_callback (because the connection is closed and this method is the registered handler)
5. LibvirtDriver._close_callback attempts to acquire _wrapped_conn_lock

_get_connection cannot release the lock because it is waiting for _close_callback to return and the latter cannot complete until it has acquired the lock.
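
The result is a classic recursive-acquisition deadlock on a non-reentrant lock, which can be reproduced in isolation (a minimal sketch with illustrative names, not the actual driver code):

    import threading

    _wrapped_conn_lock = threading.Lock()  # non-reentrant, like the driver's lock

    def _close_callback(conn, reason, opaque):
        with _wrapped_conn_lock:  # step 5: blocks forever
            pass  # ...disable the host here...

    def _get_connection():
        with _wrapped_conn_lock:  # step 1: the lock is held here
            # Steps 2-4: the connection test ends up invoking
            # _close_callback synchronously on this same thread.
            _close_callback(None, None, None)

    _get_connection()  # never returns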

Making the handling of the close callback asynchronous (like the lifecycle event handling) won't work, because by the time the lock is released, the connection object that was passed into the callback will no longer be equal to LibvirtDriver._wrapped_conn. Even if the connection object is ignored, the instance will have already been disabled via the _get_new_connection method's existing error-handling logic.

The best solution would appear to be to simply not register a close callback. The only case where it might provide some benefit is when a connection is closed after _get_connection has returned a reference to it. The benefit of disabling the instance a little earlier in such marginal cases is arguably outweighed by the complexity of a thread-safe implementation, especially when the difficulty of testing such an implementation (to ensure it is indeed thread safe) is taken into consideration.
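
For context, the callback in question is registered through libvirt-python's registerCloseCallback API, so the proposed fix amounts to dropping a single call (a sketch, not the actual driver code):

    import libvirt

    conn = libvirt.open('qemu:///system')
    # Omitting this registration is the solution proposed above;
    # _close_callback here stands in for LibvirtDriver._close_callback.
    conn.registerCloseCallback(_close_callback, None)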

Note that having _close_callback use _wrapped_conn_lock.acquire(False) instead of "with _wrapped_conn_lock" is not by itself a viable solution: because connections are opened via tpool.proxy_call, the close callback is invoked by a native thread, and a native thread should not perform the various operations (including logging) involved in disabling the instance.
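
For completeness, the rejected non-blocking variant would look roughly like this (sketch): it avoids the recursive deadlock, but the body still runs on a native thread.

    def _close_callback(conn, reason, opaque):
        # acquire(False) returns immediately instead of blocking,
        # so the recursive deadlock is avoided...
        if not _wrapped_conn_lock.acquire(False):
            return
        try:
            # ...but this still executes on a native thread, which
            # must not log or otherwise disable the instance.
            pass
        finally:
            _wrapped_conn_lock.release()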

Tags: libvirt
Revision history for this message
Tracy Jones (tjones-i) wrote :

adding to rc1 for visibility

Changed in nova:
milestone: none → icehouse-rc1
Revision history for this message
Joe Gordon (jogo) wrote :

We are tracking several known libvirt bugs at the moment (https://blueprints.launchpad.net/nova/+spec/support-libvirt-1x). What version of libvirt are you using, and do any of the known bugs sound like they are a duplicate of this?

Changed in nova:
status: New → Incomplete
Revision history for this message
John Warren (jswarren) wrote :

I am using version 1.1.1 and none of the other bugs look like a duplicate of this one.

Revision history for this message
Daniel Berrange (berrange) wrote :

NB: this isn't a deadlock in libvirt code; it is in the Nova libvirt driver itself, namely recursive acquisition of _wrapped_conn_lock.

Revision history for this message
Daniel Berrange (berrange) wrote :

I think we can just move the call to '_test_connection' outside the lock and loop to retry upon failure, e.g.:

    def _get_connection(self):
        wrapped_conn = None
        # Multiple concurrent connections are protected by _wrapped_conn_lock.
        while not wrapped_conn:
            with self._wrapped_conn_lock:
                if not self._wrapped_conn:
                    self._wrapped_conn = self._get_new_connection()
                wrapped_conn = self._wrapped_conn

            # The connection is tested outside the lock, so a close
            # callback fired by the test cannot recursively acquire it.
            if not self._test_connection(wrapped_conn):
                wrapped_conn = None

        return wrapped_conn

Revision history for this message
John Warren (jswarren) wrote :

That solution does not address the fact that _close_callback is called by a native thread: none of the operations involved in disabling the host, including logging, may be performed by a native thread. A proposed solution is already in the works: https://review.openstack.org/#/c/79617/ It also covers a separate bug that describes the native-thread issue.

Changed in nova:
status: Incomplete → Confirmed
importance: Undecided → High
status: Confirmed → In Progress
assignee: nobody → John Warren (jswarren)
tags: added: libvirt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/79617
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dba898226ec26ab49cecd24e9d6c255cf6153cd1
Submitter: Jenkins
Branch: master

commit dba898226ec26ab49cecd24e9d6c255cf6153cd1
Author: John Warren <email address hidden>
Date: Tue Mar 11 14:40:31 2014 +0000

    Change libvirt close callback to use green thread

    The native thread that calls the close callback is being
    used to update a host's status, potentially causing threading
    issues, since it involves operations that should only be
    performed by green threads. This change fixes the issue
    by limiting the native thread's interaction to adding data
    to a queue and leaving the remaining work for
    a green thread.

    Closes-Bug: #1290487
    Closes-Bug: #1293641

    Change-Id: I59c14da15e4655dc4bf21d4de3d509472b146f7a
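
A minimal sketch of the pattern the commit message describes (hypothetical names, not the actual Nova code): the native thread that libvirt calls into only enqueues the event, and a green thread drains the queue and does the logging and host-state work.

    import queue

    import eventlet

    # Filled by libvirt's native threads, drained by a green thread;
    # queue.Queue is safe to call from native threads.
    _event_queue = queue.Queue()

    def _close_callback(conn, reason, opaque):
        # Native thread: do nothing except enqueue the event.
        _event_queue.put((conn, reason))

    def _dispatch_events():
        # Green thread: logging and host-state updates are safe here.
        while True:
            try:
                conn, reason = _event_queue.get(block=False)
                _handle_conn_close(conn, reason)  # hypothetical handler
            except queue.Empty:
                eventlet.sleep(0.5)  # yield to other green threads

    eventlet.spawn(_dispatch_events)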

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-rc1 → 2014.1