nova-compute hangs while executing a blocking call to librbd

Bug #1607461 reported by Roman Podoliaka
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Roman Podoliaka

Bug Description

While executing a call to librbd, nova-compute may hang for a while and eventually show up as down in nova service-list output. At least some calls can take a very long time depending on the health of the Ceph cluster, and calls such as rbd.RBD.list (http://docs.ceph.com/docs/master/rbd/librbdpy/#rbd.RBD.list) inherently slow down as the number of entities to be listed grows.

strace shows that the process is stuck acquiring a mutex:

root@node-153:~# strace -p 16675
Process 16675 attached
futex(0x7fff084ce36c, FUTEX_WAIT_PRIVATE, 1, NULL

gdb shows the traceback:

http://paste.openstack.org/show/542534/

This means that calls to librbd (a C library) are not monkey-patched, so they cannot switch the execution context to another green thread in an eventlet-based process.

To avoid blocking the whole nova-compute process on calls to librbd, we should wrap them with tpool.execute() (http://eventlet.net/doc/threading.html#eventlet.tpool.execute).

Roman Podoliaka (rpodolyaka) wrote :
Changed in nova:
assignee: nobody → Roman Podoliaka (rpodolyaka)
tags: added: ceph compute
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/348492

Changed in nova:
status: New → In Progress
melanie witt (melwitt)
Changed in nova:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/348492
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3405a28688eacbca23cf5cac0a611d33fb1a1f2c
Submitter: Jenkins
Branch: master

commit 3405a28688eacbca23cf5cac0a611d33fb1a1f2c
Author: Roman Podoliaka <email address hidden>
Date: Thu Jul 28 20:08:44 2016 +0300

    rbd_utils: wrap blocking calls in tpool.Proxy()

    librbd is a Python binding around a C library, which is not aware of
    eventlet - all the calls to the functions from this library will block
    the whole nova-compute process for duration of a call. To make sure
    nova-compute remains responsive we need to wrap all the calls in
    tpool.Proxy() eventlet helper, that switches the execution context
    back to the event loop, while the call is executed in a native OS
    thread from a pool.

    Prefer tpool.Proxy() to tpool.execute() here as the former allows for
    wrapping objects and automatically executes all the method calls in
    native OS threads, while the latter needs to be applied to each
    method call in the code repeatedly.

    Existing calls are modified for the sake of consistency.

    Closes-Bug: #1607461

    Change-Id: I743ab372332eb656258a476ae91f5e8fd2cbdc99

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.0.0b3

This issue was fixed in the openstack/nova 14.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/365051

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/mitaka)

Reviewed: https://review.openstack.org/365051
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e2b2f6e6249117e5f74bc8088b7f1b9f085e6de2
Submitter: Jenkins
Branch: stable/mitaka

commit e2b2f6e6249117e5f74bc8088b7f1b9f085e6de2
Author: Roman Podoliaka <email address hidden>
Date: Thu Jul 28 20:08:44 2016 +0300

    rbd_utils: wrap blocking calls in tpool.Proxy()

    librbd is a Python binding around a C library, which is not aware of
    eventlet - all the calls to the functions from this library will block
    the whole nova-compute process for duration of a call. To make sure
    nova-compute remains responsive we need to wrap all the calls in
    tpool.Proxy() eventlet helper, that switches the execution context
    back to the event loop, while the call is executed in a native OS
    thread from a pool.

    Prefer tpool.Proxy() to tpool.execute() here as the former allows for
    wrapping objects and automatically executes all the method calls in
    native OS threads, while the latter needs to be applied to each
    method call in the code repeatedly.

    Existing calls are modified for the sake of consistency.

    Closes-Bug: #1607461

    Change-Id: I743ab372332eb656258a476ae91f5e8fd2cbdc99
    (cherry picked from commit 3405a28688eacbca23cf5cac0a611d33fb1a1f2c)

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 13.1.2

This issue was fixed in the openstack/nova 13.1.2 release.
