test linked against nih-dbus-tool-generated library is not thread-safe

Bug #1294200 reported by Serge Hallyn
Affects              Status         Importance   Assigned to   Milestone
cgmanager (Ubuntu)   Fix Released   High         Unassigned
dbus (Ubuntu)        Won't Fix      High         Unassigned
libnih (Ubuntu)      Won't Fix      High         Unassigned
lxc (Ubuntu)         Fix Released   Medium       Unassigned
lxcfs (Ubuntu)       Invalid        Undecided    Unassigned

Bug Description

I've taken the libnih source for trusty, added '--enable-threads' to the three dh_auto_configure lines in debian/rules, rebuilt, and installed the result. Then I took the source for the cgmanager package, rebuilt it, and installed it.

Finally I took github.com/cgmanager/cgmanager, copied the configure.ac, Makefile.am, and tests/cgm-concurrent.c into the cgmanager package source, and built tests/cgm-concurrent.c.

The cgm-concurrent.c only connects to the cgmanager dbus server, sends a ping method, and disconnects - with one connection per thread.
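For readers without the cgmanager tree at hand, the shape of such a test can be sketched roughly as below. This is not the actual cgm-concurrent.c: `fake_connect_ping_disconnect`, `worker`, and `run_concurrent` are hypothetical names, the D-Bus work is replaced by a stub, and it assumes every thread runs the full iteration count.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for one connect/ping/disconnect cycle. In the real
 * test this is where a thread would open its own connection to the cgmanager
 * D-Bus server, call the ping method, and close the connection again. */
static int fake_connect_ping_disconnect(void)
{
    return 0; /* pretend the cycle succeeded */
}

struct worker_args {
    int iterations;        /* cycles each thread performs */
    atomic_int *completed; /* shared counter of successful cycles */
};

static void *worker(void *arg)
{
    struct worker_args *wa = arg;

    for (int i = 0; i < wa->iterations; i++) {
        if (fake_connect_ping_disconnect() == 0)
            atomic_fetch_add(wa->completed, 1);
    }
    return NULL;
}

/* Starts nthreads workers and returns the total number of successful
 * cycles, or -1 if allocation or thread creation failed. */
static int run_concurrent(int nthreads, int iterations)
{
    assert(nthreads > 0 && iterations >= 0);

    atomic_int completed = 0;
    struct worker_args wa = { iterations, &completed };
    pthread_t *tids = calloc((size_t)nthreads, sizeof(*tids));
    int started = 0;

    if (!tids)
        return -1;

    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, worker, &wa) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return started == nthreads ? atomic_load(&completed) : -1;
}
```

Calling `run_concurrent(30, 100)` would correspond loosely to the `-j 30 -i 100` invocation described in this report, with the stub replaced by real connection code.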

When I do cgm-concurrent -i 100 -j 30 -c (meaning use 30 threads and do 100 full iterations, and don't do any extra dbus calls), I get pretty random dumps. Here is one example:

(gdb) where
#0 0x00007ffff71c5f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff71c9388 in __GI_abort () at abort.c:89
#2 0x00007ffff71bee36 in __assert_fail_base (fmt=0x7ffff73104b8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7ffff756793c "mutex->__data.__owner == 0",
    file=file@entry=0x7ffff7567908 "../nptl/pthread_mutex_lock.c", line=line@entry=116, function=function@entry=0x7ffff7567a40 <__PRETTY_FUNCTION__.8500> "__pthread_mutex_lock") at assert.c:92
#3 0x00007ffff71beee2 in __GI___assert_fail (assertion=0x7ffff756793c "mutex->__data.__owner == 0", file=0x7ffff7567908 "../nptl/pthread_mutex_lock.c", line=116,
    function=0x7ffff7567a40 <__PRETTY_FUNCTION__.8500> "__pthread_mutex_lock") at assert.c:101
#4 0x00007ffff755f52f in __GI___pthread_mutex_lock (mutex=0xfefefefefefefe00) at ../nptl/pthread_mutex_lock.c:116
#5 0x00007ffff779fa45 in _dbus_platform_rmutex_lock (mutex=<optimized out>) at ../../dbus/dbus-sysdeps-pthread.c:156
#6 0x00007ffff77943f5 in _dbus_rmutex_lock (mutex=<optimized out>) at ../../dbus/dbus-threads.c:176
#7 0x00007ffff7782b40 in dbus_connection_get_data (connection=0x7fffc0040e70, slot=0) at ../../dbus/dbus-connection.c:5979
#8 0x00007ffff79bb766 in nih_dbus_setup () from /lib/x86_64-linux-gnu/libnih-dbus.so.1
#9 0x0000000000401cbe in cgm_dbus_connect ()
#10 0x0000000000401e5d in do_function ()
#11 0x0000000000401ec4 in concurrent ()
#12 0x00007ffff755d182 in start_thread (arg=0x7fffd8ff9700) at pthread_create.c:312
#13 0x00007ffff728a12d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Sometimes they show up in nih_free, where usually the thing being freed has a ->next which points to a member function; sometimes they show up in the dbus library at various points.

I may well be doing something wrong, but I don't know what. If we can't either fix what I'm doing wrong or fix libnih or libdbus (if those are being buggy), then I guess I'll have to mutex the connection among all threads.
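The fallback of mutexing the connection among all threads could look something like the following. It is only a sketch: the shared counter stands in for connection state that is not thread-safe, and `unsafe_request`, `locked_request`, and `run_locked` are hypothetical names rather than anything from cgmanager, libnih, or lxc.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 64

/* One global lock serialises every use of the shared connection. */
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for connection state that is not safe to touch concurrently. */
static long shared_calls;

/* Hypothetical request on the shared connection; unsafe without the lock. */
static void unsafe_request(void)
{
    shared_calls++;
}

/* Every caller goes through this wrapper instead of unsafe_request(). */
static void locked_request(void)
{
    pthread_mutex_lock(&conn_lock);
    unsafe_request();
    pthread_mutex_unlock(&conn_lock);
}

static void *caller(void *arg)
{
    int n = *(int *)arg;

    for (int i = 0; i < n; i++)
        locked_request();
    return NULL;
}

/* Runs nthreads callers, each making per_thread requests, and returns the
 * number of requests that reached the shared state. */
static long run_locked(int nthreads, int per_thread)
{
    pthread_t tids[MAX_THREADS];
    int started = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    shared_calls = 0;

    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, caller, &per_thread) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return shared_calls;
}
```

The cost of this approach is that all requests are serialised, so threads gain no parallelism while talking to the server.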

Changed in lxc (Ubuntu):
importance: Undecided → Critical
Changed in cgmanager (Ubuntu):
importance: Undecided → Critical
James Hunt (jamesodhunt) wrote :

Hi Serge,

Looking at nih's configure.ac, you actually want to use '--enable-threading' I think. I've never used this option and tbh wasn't even aware it was there.

Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1294200] Re: test linked against nih-dbus-tool-generated library is not thread-safe

D'oh! Thanks. The enable-threads= obviously confused me. I'll re-test
with that option.

If that option *does* work, then as xnox suggested yesterday we may need
to have a separate package with threading-enabled libnih.

Serge Hallyn (serge-hallyn) wrote :

Sadly that did not help; it actually seemed to make things worse (as in, it
crashes faster):

[New Thread 0x7ffff6f86700 (LWP 18964)]
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent[Thread 0x7fffd0ff1700 (LWP 18962) exited]
[New Thread 0x7ffff6785700 (LWP 18965)]
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
*** Error in `/home/serge/cgmanager-0.20/tests/cgm-concurrent': malloc(): smallbin double linked list corrupted: 0x00007fffd81de510 ***
[New Thread 0x7ffff5f84700 (LWP 18966)]
*** Error in `[New Thread 0x7ffff5783700 (LWP 18967)]
*** Error in `[Thread 0x7ffff6f86700 (LWP 18964) exited]
[Thread 0x7fffd17f2700 (LWP 18963) exited]
[Thread 0x7ff...


Serge Hallyn (serge-hallyn) wrote :

Marking medium for lxc as lxc works around this now using a mutex :(

Changed in cgmanager (Ubuntu):
importance: Critical → High
Changed in lxc (Ubuntu):
importance: Critical → Medium
status: New → Confirmed
Changed in cgmanager (Ubuntu):
status: New → Confirmed
Changed in dbus (Ubuntu):
importance: Undecided → High
Changed in libnih (Ubuntu):
importance: Undecided → High
Serge Hallyn (serge-hallyn) wrote :

Some of this is likely due to (acknowledged) basic thread-safety issues in libdbus, and there is discussion on the dbus mailing list about how to address them (in libdbus).

Serge Hallyn (serge-hallyn) wrote :

Turns out (afaics) there are no libdbus threading issues in what we've found so far. Rather,

libnih needs to be built with --enable-threading, otherwise it - especially the error handling - is not thread-safe.

lxcfs needed a few tweaks to become thread-safe when built against libnih.

If current tests pan out (I have candidate patches in ppa:serge-hallyn/lxd), then one last update should be to stop mutexing the use of cgmanager in lxc (making the dbus connection per-thread, as I did in lxcfs).
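One way to picture a per-thread connection is a thread-local handle that each thread creates lazily on first use. This is a sketch of the general idea only, not the lxcfs change itself: `struct fake_conn`, `get_thread_conn`, and `other_thread_gets_own_conn` are hypothetical, and a plain allocation stands in for opening a real D-Bus connection.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical connection handle; a real one would wrap a D-Bus connection. */
struct fake_conn {
    int requests;
};

/* Each thread sees its own copy of this pointer (C11 thread-local storage). */
static _Thread_local struct fake_conn *thread_conn;

/* Returns the calling thread's handle, creating it on first use.
 * Returns NULL if the allocation fails. */
static struct fake_conn *get_thread_conn(void)
{
    if (!thread_conn) {
        thread_conn = malloc(sizeof(*thread_conn));
        if (thread_conn)
            thread_conn->requests = 0;
    }
    return thread_conn;
}

/* Thread body: report which handle this thread was given. */
static void *record_conn(void *arg)
{
    struct fake_conn **out = arg;

    assert(out != NULL);
    *out = get_thread_conn();
    return NULL;
}

/* Returns 1 if a second thread receives a handle distinct from the
 * caller's own handle, 0 otherwise. */
static int other_thread_gets_own_conn(void)
{
    struct fake_conn *mine = get_thread_conn();
    struct fake_conn *theirs = NULL;
    pthread_t tid;
    int distinct;

    if (!mine || pthread_create(&tid, NULL, record_conn, &theirs) != 0)
        return 0;
    pthread_join(tid, NULL);

    distinct = theirs != NULL && theirs != mine;
    free(theirs); /* the other thread has exited; release its handle */
    return distinct;
}
```

Because no handle is ever shared, no lock is needed around its use, but this only helps if the libraries underneath keep no unprotected global state of their own.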

Changed in cgmanager (Ubuntu):
status: Confirmed → Fix Released
Changed in dbus (Ubuntu):
status: New → Invalid
Changed in libnih (Ubuntu):
status: New → Confirmed
Serge Hallyn (serge-hallyn) wrote :

No, I was wrong. Upping the ncpus makes that apparent.

Stéphane Graber (stgraber) wrote :

Should we close this bug now that we're no longer using libnih in lxcfs and are moving away from cgmanager too for this very threading reason?

Serge Hallyn (serge-hallyn) wrote :

I don't know. libnih is still a nice library and it would be nice if it could
be fixed. Certainly the lxcfs bug should be marked invalid since we no longer
use it. Perhaps lxc eventually, but not yet.

Serge Hallyn (serge-hallyn) wrote :

Ok. It was 'fix released' in cgmanager and lxc by working around it (not
enabling threading). It is invalid in lxcfs in xenial because we have
switched to glib and gdbus there. The libnih and dbus bugs are still open,
though in dbus it is wontfix from upstream. Since dbus is wontfix, I think we
can mark the libnih one wontfix as well. I have a vague recollection that there
may have been threading issues in libnih even without dbus (and iirc support
to make threading mostly-safe is still not compiled in anyway).

Changed in lxcfs (Ubuntu):
status: New → Invalid
Changed in dbus (Ubuntu):
status: Invalid → Confirmed
Changed in lxc (Ubuntu):
status: Confirmed → Fix Released
Changed in dbus (Ubuntu):
status: Confirmed → Won't Fix
Changed in libnih (Ubuntu):
status: Confirmed → Won't Fix