Multithreaded tls_model("global-dynamic") glXMakeContextCurrent ___tls_get_addr () freeze/deadlock

Bug #965798 reported by Xerxes Rånby
This bug affects 33 people
Affects        Status        Importance  Assigned to  Milestone
Linux Mint     New           Undecided   Unassigned
mesa (Debian)  Invalid       Undecided   Unassigned
mesa (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

Ubuntu's mesa package contains the patch 113_fix_tls.diff, which causes a freeze:
+++
 extern __thread void *u_current_user
- __attribute__((tls_model("initial-exec")));
+ __attribute__((tls_model("global-dynamic")));
+++

JogAmp JOGL has triaged this bug and found that it is introduced by this Ubuntu-local patch:
https://jogamp.org/bugzilla/show_bug.cgi?id=566

I am able to reproduce this bug using Ubuntu 11.10 + mesa 7.11
GL_VENDOR: X.Org
GL_RENDERER: Gallium 0.4 on AMD RV770
GL_VERSION: 2.1 Mesa 7.11

Testcase (large, ~150 MB)
wget http://processing.googlecode.com/files/processing-2.0a5-linux.tgz
tar zxvf processing-2.0a5-linux.tgz
cd processing-2.0a5
./processing
#open File->Examples... Pick Topics->Geometry-> double click on RGBCube
#run the sketch ctrl-r
#output below:

xranby@hp ~/processing-2.0a5 $ ./processing
Info: XInitThreads() called for concurrent Thread support
<window opens>
<freeze>

Backtrace of the stalled window thread.

(gdb) bt
#0 0x00118e3f in pthread_mutex_lock ()
   from /lib/i386-linux-gnu/libpthread.so.0
#1 0x0058701b in ?? () from /lib/ld-linux.so.2
#2 0x0059781d in ___tls_get_addr () from /lib/ld-linux.so.2
#3 0x0305e558 in ?? () from /usr/lib/i386-linux-gnu/dri/libgallium.so
#4 0x0218e3be in dri_make_current ()
   from /usr/lib/i386-linux-gnu/dri/r600_dri.so
#5 0x0216b1bf in ?? () from /usr/lib/i386-linux-gnu/dri/r600_dri.so
#6 0x074e4b8d in ?? () from /usr/lib/i386-linux-gnu/mesa/libGL.so.1
#7 0x074bcb97 in glXMakeCurrentReadSGI ()
   from /usr/lib/i386-linux-gnu/mesa/libGL.so.1
#8 0x082f329a in Java_jogamp_opengl_x11_glx_GLX_dispatch_1glXMakeContextCurrent1(__complex __complex __complex __complex __complex) ()
   from /tmp/jogamp.tmp.cache_000000/jln4751330094528893169/jln2305029403340098767/libjogl_desktop.so

Revision history for this message
Sven Gothel (sgothel) wrote :

Allow me to copy my analysis of this bug here
from https://jogamp.org/bugzilla/show_bug.cgi?id=566#c13

This bug affects all Mesa8 w/ the tls_model("global-dynamic") patch enabled!

+++

Analysis of patch 113_fix_tls.diff and its freezing impact:

+++
 extern __thread void *u_current_user
- __attribute__((tls_model("initial-exec")));
+ __attribute__((tls_model("global-dynamic")));
+++

the above changes the TLS behavior of libGL.

Note: u_current_user reflects the current context
changed by glXMakeContextCurrent(..).
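
To illustrate what "reflects the current context" means for a thread-local variable, here is a minimal sketch in C. It is not Mesa code: `current_user`, `set_current_user()`, `get_current_user()` and `tls_is_per_thread()` are hypothetical stand-ins for `u_current_user` and `u_current_set_user()`, and the program assumes GCC or Clang with pthreads on Linux.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for Mesa's u_current_user: one pointer per thread.
 * The tls_model attribute only selects how the variable is addressed. */
static __thread void *current_user
    __attribute__((tls_model("global-dynamic")));

/* Hypothetical analogue of u_current_set_user(): bind a context, or
 * release it by passing NULL. */
static void set_current_user(void *ctx) { current_user = ctx; }
static void *get_current_user(void) { return current_user; }

struct probe {
    void *seen_before;  /* value the worker saw before binding */
    void *seen_after;   /* value the worker saw after binding */
    int   token;        /* dummy object standing in for a context */
};

static void *worker(void *arg)
{
    struct probe *p = arg;
    p->seen_before = get_current_user();  /* a new thread starts with NULL */
    set_current_user(&p->token);          /* "make current" on this thread */
    p->seen_after = get_current_user();
    set_current_user(NULL);               /* "release" on this thread */
    return NULL;
}

/* Returns 1 if the worker's binding stayed private to the worker thread. */
int tls_is_per_thread(void)
{
    int main_ctx = 0;
    struct probe p = { NULL, NULL, 0 };
    pthread_t t;

    set_current_user(&main_ctx);          /* bind on the calling thread */
    if (pthread_create(&t, NULL, worker, &p) != 0)
        return 0;
    pthread_join(t, NULL);

    int ok = p.seen_before == NULL
          && p.seen_after == &p.token
          && get_current_user() == &main_ctx;
    set_current_user(NULL);
    return ok;
}
```

Each thread sees only its own binding, which is why a release such as makeCurrent(dpy, 0, 0, 0) has to write the TLS slot of the calling thread. On older glibc versions the sketch needs to be built with -pthread.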

Phenomenon of AWT-EDT glXMakeContextCurrent() freeze:

[0] Java launches
[0.1] starts new AWT-EDT thread
[0.2] starts main thread and hands over execution of main

[1] JOGL loads and initializes the libGL via 1st call of GLProfile.initSingleton()
incl. lib loading via dlopen() and symbol lookup etc.

[2] Step [1] kicks off the shared resource runner, which creates the very 1st
JOGL GL context while probing all profiles.

[3] At some point, AWT-EDT runs a JOGL Runnable:
[3.1] ctx = createContext()
[3.2] makeCurrent(dpy, read, write, ctx);
[3.3] makeCurrent(dpy, 0, 0, 0); -> release context *** Freeze ***

glx/glxcurrent.c/MakeContextCurrent(): __glXSetCurrentContextNull()
glx/glxcurrent.c/__glXSetCurrentContextNull(): _glapi_set_context(NULL);
./mapi/mapi/mapi_glapi.c/_glapi_set_context(NULL): u_current_set_user(NULL);
./mapi/mapi/u_current.c/u_current_set_user(NULL): u_current_init();
./mapi/mapi/u_current.c/u_current_set_user(NULL): u_current_user = (void *) ptr; *** Freeze ***

Mesa8 release-context implementation attempts to access the TLS u_current_user 'field'.

For some reason, an already existing thread like AWT-EDT [0.1] cannot use libGL's global TLS [3]
when the latter was initialized by a later running thread [0.2] + [1].

Proof that this is the root cause is given by its remedies:

R1: Set java system property 'sun.java2d.opengl' to 'true', which loads and
uses libGL from the start.

R2: Load the libraries [1] from the AWT-EDT

+++

R2 is our workaround in JOGL and works fine w/ Ubuntu 12.04 & Mesa8

+++

Nevertheless, it is unclear whether the above behavior is intrinsic to
tls_model("global-dynamic") and hence a restriction, or whether other special
circumstances lead to erroneous behavior.

Note: tls_model("initial-exec") works fine, i.e. w/o patch 113_fix_tls.diff
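
Remedy R2 above (load the libraries from the thread that already exists) can be sketched in C as follows. This is not JOGL's implementation, only an illustration of the idea under stated assumptions: `struct loader`, `early_thread()` and `load_on_early_thread()` are hypothetical names, and the "early" thread is simply created before the load request is made.

```c
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical single-slot mailbox carrying one load request. */
struct loader {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    const char     *lib_name;  /* request: library to load */
    void           *handle;    /* reply: result of dlopen() */
    int             state;     /* 0 = idle, 1 = requested, 2 = done */
};

/* Body of the long-lived thread (the AWT-EDT in the report). */
static void *early_thread(void *arg)
{
    struct loader *l = arg;

    pthread_mutex_lock(&l->lock);
    while (l->state != 1)                  /* wait for a request */
        pthread_cond_wait(&l->cond, &l->lock);
    /* The library is loaded on this thread, not on the requester. */
    l->handle = dlopen(l->lib_name, RTLD_NOW | RTLD_GLOBAL);
    l->state = 2;
    pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->lock);
    return NULL;
}

/* Called from a later thread: hand the load to the early thread and wait
 * for it. Returns the dlopen() handle, or NULL on failure. */
void *load_on_early_thread(const char *lib_name)
{
    struct loader l;
    pthread_t early;

    pthread_mutex_init(&l.lock, NULL);
    pthread_cond_init(&l.cond, NULL);
    l.lib_name = NULL;
    l.handle = NULL;
    l.state = 0;

    if (pthread_create(&early, NULL, early_thread, &l) != 0) {
        pthread_cond_destroy(&l.cond);
        pthread_mutex_destroy(&l.lock);
        return NULL;
    }

    pthread_mutex_lock(&l.lock);
    l.lib_name = lib_name;                 /* post the request */
    l.state = 1;
    pthread_cond_broadcast(&l.cond);
    while (l.state != 2)                   /* wait for the reply */
        pthread_cond_wait(&l.cond, &l.lock);
    pthread_mutex_unlock(&l.lock);

    pthread_join(early, NULL);
    pthread_cond_destroy(&l.cond);
    pthread_mutex_destroy(&l.lock);
    return l.handle;
}
```

A caller would use something like load_on_early_thread("libGL.so.1") and then resolve symbols with dlsym() on the returned handle. On older glibc versions the sketch needs -pthread and -ldl.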

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mesa (Ubuntu):
status: New → Confirmed
Revision history for this message
Sven Gothel (sgothel) wrote :

Allow me to start a discussion about the motivation of the tls-global patch 113_fix_tls.diff
and pls correct me if I am wrong ... ofc.

[1] tls-global implies that all the tls fields are directly exposed to the other library or executable loading the
tls-global exposed library.

[1.1] This is IMHO a very bad idea and API hooks shall expose and control its access,
which actually is true in this case here (glX .. currentContext ..).

[1.2] AFAIK history of the patch was to fix the use-case of loading libGL by another dynamic library
via dlopen .. etc. This seems to be broken now having a multithreading scenario, like JOGL.

[2] Vanilla Mesa8 does not use tls-global, but uses tls-initial-exec, which works fine w/ JOGL and
other use-cases on ArchLinux and other distributions.

It would be great to understand the original motivation and nature of this patch
and ofc whether the multithreading issue is a tls-global constraint or a bug.
The latter may be generated by gcc .. but this is just a guess ofc.

In case tls-global is indeed not required at all and maybe even a security or stability risk,
I would vote for removing the offending patch.

Xerxes Rånby (xranby)
tags: added: oneiric precise
Revision history for this message
Chris Halse Rogers (raof) wrote :

My understanding is that the TLS model is entirely a performance optimisation; it does not change the symbol's visibility. Now, as per the only non-source documentation of the Linux ELF TLS ABI I can find¹, all the thread-local symbols using the initial-exec model are required to be known at initial execution, as they're allocated in a static area that's only set up at initial load by the dynamic loader. This obviously doesn't work if you try to dlopen a binary containing symbols with the initial-exec TLS model.

Of course, this documentation is out of date, and libc has long had support for loading some initial-exec TLS symbols after initial program load - precisely for libGL, which wants both minimum overhead and a small number of TLS symbols.

We had problems with this in Natty, where Mesa gained a dependency on libstdc++, which also has some TLS symbols, and was causing crashes with anything that tried to dlopen libGL and didn't already have a copy of libstdc++ in the program image. (That was bug 259219).

Now, this was allegedly an underlying libc bug that's been fixed; I was under the impression that we'd dropped the tls patch when updating to mesa 8.0 in Precise. I'm now investigating whether we still need it.

I don't believe that the TLS patch is to blame for JOGL's issues; I think it's more likely that it's exposing a race somewhere - a global-dynamic symbol requires a couple more instructions (and potentially extra memory accesses) to access, which would change the timing.

¹: http://www.akkadia.org/drepper/tls.pdf
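
As a rough illustration of the two models discussed above, the following C sketch declares one thread-local counter per model. The names `counter_ie`, `counter_gd`, `bump_initial_exec()` and `bump_global_dynamic()` are hypothetical; the comments describe what GCC typically emits for a shared object built with -fPIC, which is an assumption that can be checked by inspecting the output of gcc -fPIC -S.

```c
/* Hypothetical per-thread counters, one per TLS model. Both behave the
 * same at the C level; only the addressing sequence differs. */
static __thread int counter_ie __attribute__((tls_model("initial-exec")));
static __thread int counter_gd __attribute__((tls_model("global-dynamic")));

/* initial-exec: the variable's offset from the thread pointer is read
 * from the GOT, so the storage must fit in the static TLS area. */
int bump_initial_exec(void)
{
    return ++counter_ie;
}

/* global-dynamic: in position-independent code the address is typically
 * obtained by calling __tls_get_addr() (___tls_get_addr() on i386, as
 * seen in the backtrace), which may allocate the block lazily. */
int bump_global_dynamic(void)
{
    return ++counter_gd;
}
```

When the same code is linked into a non-PIC executable, the linker may relax both sequences to a direct thread-pointer offset, so the difference is mostly visible in shared libraries such as libGL.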

Revision history for this message
Sven Gothel (sgothel) wrote :

Thank you Chris for responding and describing the history and motivation of the patch.

Regarding visibility, I read in the same tls paper, that global exposes the tls field[s]
to the 'calling' library and executables as well, where the other models don't.
Hence I would be able to access them directly, shortcutting the API.

Since I am very familiar w/ JOGL's multithreading and did a thorough analysis of
the locking as well, I can almost guarantee - even though you never can be 100% certain,
that at least in this case we have no race condition.
I investigated the call trace w/ manual instrumentation (read: inserted fprintf(stderr, ..))
and the call sequence incl. their thread origin is as expected and good.
Besides, we do run lots of concurrent unit tests in our automated unit tests via jenkins
on different platforms and machines, where none leads to a deadlock or starvation.
This includes GL drivers of NV, AMD, IMG, .. and Mesa as well ofc.

In this analyzed case, we only have 3 active threads running.
T1 (Main-Thread), T2 (SharedResource-Thread) and T3 (AWT-EventDispatchThread EDT).
At the time T3 issues the 'freezing' context release, T2 and T1 are on the wait queue
doing nothing and also do not hold a current GL context.
This is described in my comment 1.
I am very confident here - the threads in this case are running sequentially,
impossible to cause a race condition.
BTW we also manage our Java side GLContext object, e.g. have Java side locking management implemented.

Even though I don't understand the root cause of the bug, due to my lack of knowledge
about the actual TLS assembler code, loading libGL on the 'earliest' thread (T3 here) allows
proper execution of GLX on it. Hence it makes a difference on which thread the native library gets initialized.
I will prepare an easy to reproduce test case later.

I appreciate your reevaluation of the tls-global patch and hope it gets 'aligned' with vanilla Mesa,
which doesn't have it included. We may then be able to remove our workaround.

Great stuff, thank you very much.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mesa - 8.0.2-0ubuntu3

---------------
mesa (8.0.2-0ubuntu3) precise; urgency=low

  * Drop 113_fix_tls.diff - this turns out to have been working around a libc
    bug (most likely Debian bug #637239) which is now fixed. It also seems
    to be causing problems itself now. (LP: #965798)
 -- Christopher James Halse Rogers <email address hidden> Tue, 27 Mar 2012 12:51:58 +1100

Changed in mesa (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Xerxes Rånby (xranby) wrote :

Ubuntu 11.10 scilab users are still hitting this bug frequently in combination with mesa 7.11:
https://bugs.launchpad.net/ubuntu/+source/mesa/+bug/877491 (added as a duplicate of this bug)
Would it be possible to backport this fix to oneiric?

Revision history for this message
Xerxes Rånby (xranby) wrote :

scilab
plot(1)
works fine after upgrading mesa (8.0.2-0ubuntu2) to the new mesa (8.0.2-0ubuntu3) in precise.

Revision history for this message
Sven Gothel (sgothel) wrote :

Excellent - thank you so much for accepting our proposal.
Xerxes also confirmed that the fix solves the problem in the processing-2.0a5 build
using Mesa/Ubuntu.

Xerxes Rånby (xranby)
Changed in mesa (Debian):
importance: Undecided → Unknown
status: New → Unknown
Changed in mesa (Debian):
status: Unknown → New
Revision history for this message
Xerxes Rånby (xranby) wrote :

Debian testing/unstable uses mesa 7.11.2-1 with 113_fix_tls.diff included, which triggers this bug.
Debian experimental uses mesa 8.0-2 with 113_fix_tls.diff included, which triggers this bug.
http://packages.qa.debian.org/m/mesa.html

Changed in mesa (Debian):
importance: Unknown → Undecided
Revision history for this message
Xerxes Rånby (xranby) wrote :

Linux Mint flavors that are based on Ubuntu Oneiric or Debian use a mesa with 113_fix_tls.diff included, which triggers this bug.

The easiest testcase to trigger this bug is to install scilab and type
plot
in the scilab window on a 32bit x86 machine. Scilab will then open "Graphic window number 0" and display two sample plots.
When used with a mesa that includes 113_fix_tls.diff, both scilab windows freeze before the two example plots get displayed.

Simply recompiling mesa 7.11 in Linux Mint (based on Ubuntu oneiric or Debian sid) without 113_fix_tls.diff fixes this bug.

Revision history for this message
Robert Hooker (sarvatt) wrote :
Changed in mesa (Debian):
status: New → Invalid
Revision history for this message
Xerxes Rånby (xranby) wrote :

Ok, thank you Robert for verifying that Debian is in fact fine. I must have simply been confused, sorry for the noise.

The mesa 7.11 in Ubuntu Oneiric does still have this bug.

It would be nice to add ubuntu/oneiric/+source/mesa to the Affected list of this bug report.
It would also be nice if this patch could be dropped in a stable release update for mesa in oneiric to fix this bug.

Revision history for this message
ferminfm (ferminfm) wrote :

This is still happening on Ubuntu 13.04 Raring with Scilab 5.4.1 and libgl1-mesa-glx 9.1.3-0. Impossible to use Scilab.
