1.8 and pthread instability

Bug #954257 reported by boundless
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Panda3D
Fix Released
High
rdb

Bug Description

Install Kubuntu 11.10 amd64
Update, reboot and install panda3d1.8_1.8.0~oneiric_amd64.deb
Run egg2bam

egg2bam plane.egg -o plane.bam

Writing plane.bam
egg2bam: built/include/mutexPosixImpl.I:54: void MutexPosixImpl::acquire(): Assertion `result == 0' failed.
Aborted

This assert is thrown on every egg2bam independant of the egg file. It still makes some bam file, but not sure how correct it is.

Panda3D samples are running successful.

The same behaviour with panda3d1.8_1.8.0+cvs20120313~oneiric154_amd64.deb

Revision history for this message
boundless (military) wrote :
Revision history for this message
boundless (military) wrote :

The same behaviour is observed with 32 bit version of Kubuntu and panda.
However the bug is observed on only 1 of my machines.

rdb (rdb)
Changed in panda3d:
importance: Undecided → High
milestone: none → 1.8.1
status: New → Confirmed
Revision history for this message
ThomasEgi (antitron) wrote :

i'm on ubuntu 12.04 amd46. same error for every time i ran egg2bam. (getting it on 2 different machines)
the outputted bam file seems correct.
ocasionally i ran into another mutex-posix-impl bug outside egg2bam but i can't really reproduce it there.

Revision history for this message
rdb (rdb) wrote :

I get this issue too, all the time, with the pthreads implementation of Panda's mutex lock. Most of the time it happens in acquire:
Assertion failed: (result == 0), function acquire, file built/include/mutexPosixImpl.I, line 54.

However, it can also occur in the destructor. The pthread function seems to return some error value. Unfortunately, this is really -David's area-- I don't have much expertise in threading programming at all, so I would have no idea how to debug this.

summary: - Assert during egg2bam on 1.8 version
+ 1.8 and pthread instability
Revision history for this message
rdb (rdb) wrote :

For the record, the issue occurs for me at shutdown, without any kind of threading-model enabled, with support-threads to #f, in a utility program that uses Panda but does not start up a window or even invoke any kind of run loop.

Revision history for this message
Tureba (tureba) wrote :

Well, cvs head still shows the problem. The backtrace is as follows:

-----------------
#0 0x00007ffff412e235 in raise () from /lib64/libc.so.6
#1 0x00007ffff412f6b8 in abort () from /lib64/libc.so.6
#2 0x00007ffff41272b2 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007ffff4127362 in __assert_fail () from /lib64/libc.so.6
#4 0x00007ffff6b8ec53 in TypedWritable::~TypedWritable() () from built/lib/libpanda.so.1.9
#5 0x00007ffff620a947 in RenderEffects::~RenderEffects() () from built/lib/libpanda.so.1.9
#6 0x00007ffff620a9c9 in RenderEffects::~RenderEffects() () from built/lib/libpanda.so.1.9
#7 0x00007ffff60fef65 in PointerToBase<RenderEffects>::~PointerToBase() () from built/lib/libpanda.so.1.9
#8 0x00007ffff4131157 in __cxa_finalize () from /lib64/libc.so.6
#9 0x00007ffff60b8953 in __do_global_dtors_aux () from built/lib/libpanda.so.1.9
#10 0x00007fffffffded0 in ?? ()
#11 0x00007ffff7deaf2f in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: frame did not save the PC
-----------------

So the problem lies in ~TypedWritable. Looking at it (panda/src/putil/typedWritable.*), I see that the lock (typedWritable.h:86) is a static member used to guard a non-static pvector (typedWritable.h:85), which is a fundamentaly wrong design. The attached patch solves this for me by turning the lock into an object-member. It may also improve parallelism because it doesn't lock all objects in order to access one - it locks one to access one. Global locks are evil.

Revision history for this message
Tureba (tureba) wrote :

Fixes the bug. Touches TypedWritable and BamWriter changing the lock attribute and its accesses.

Revision history for this message
David Rose (droklaunchpad) wrote :

Well, the reason this lock was static is because TypedWritable is the base class of almost every object in Panda, and for low-memory environments it can be important to keep the size of this class as small as possible. The extra word in every object that a lock will require can add up after a while. Also, the list it protects is only rarely used, so there's very little cost in terms of parallelism for using a global lock here.

From the stack trace, it seems that the fundamental cause of the assertion is not that the static lock is protecting a non-static pvector (which is an uncommon design, but not fundamentally wrong), but simply a problem with statically-declared LightMutex objects in general. The problem is that when the application ends and all the static destructors are called, the relative order of the destructors between different modules is undefined; and in this case it appears to be destructing at least one TypedWritable object *after* _bam_writers_lock has already destructed.

I think a better solution is to redefine _bam_writers_lock as a "static LightMutex *_bam_writers_lock", and allocate it the first time it is used. Then it will never attempt to destruct. Some care will need to be taken to ensure there is no race condition when it is allocated, for instance following the example of AttribNodeRegistry::make_global_ptr().

David

Revision history for this message
Tureba (tureba) wrote :

I just recently noticed the same design problem on other parts. It is still not right. If the lock is there to protect a list, then it should be in the list - not in the elements. And all accesses to the list should be through member functions and operators of that list implementation that transparently acquire and release the lock. The lock should be private to the list and it should not be used by anything other than the list itself.

Setting more class members to static is not the way to go - it is the exact opposite of what has to be done here because it only exposes more shared memory (to derived and friend classes, at the very least, if the data is private; to everyone if the data is public) to be protected from data races. There are thread-safe container implementations out there to be used as reference, if needed. I think even C++11 STL improves on this, but also, Boost and TBB are great.

But now that this bug has David's attention, I won't try to hit it with naive patches anymore - someone better is looking at the bug now. :-)

Revision history for this message
rdb (rdb) wrote :

I agree with David, adding a lock to every TypedWritable instance is unnecessary bloat. It may be unusual to protect a non-static member with a non-static lock, but it's obviously the better solution here. "It doesn't feel right" isn't a good excuse for adding a significant memory overhead.

David, do you want to commit a fix for this since you know exactly what to do, or should I just go ahead and copy the mechanism from AttribNodeRegistry?

I'm guessing that this fix (unfortunately) can't go in 1.8.1 for ABI compatibility reasons, so I'm removing the 1.8.1 milestone tag.

Changed in panda3d:
milestone: 1.8.1 → none
status: Confirmed → Triaged
Revision history for this message
rdb (rdb) wrote :

In an effort to please the both of you, I replaced the BamWriters pointer with a singly linked list locked using an atomic tagged pointer. This means it is locked on a per-instance level while not consuming any extra memory. And in the common case of there being only one BamWriter, this still saves about 40 bytes per TypedWritable pointer on my system. Probably overkill, but there ya go.

Diff:
https://github.com/panda3d/incremental/commit/63336dfdda0f361d5e249f2a9f24974ad96ca177

Changed in panda3d:
assignee: nobody → rdb (rdb)
milestone: none → 1.9.0
status: Triaged → Fix Committed
rdb (rdb)
Changed in panda3d:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers