global static TLS slot limit breaks the x86 emulator

Bug #1375555 reported by Ricardo Salveti on 2014-09-30
Affects              Status        Importance  Assigned to      Milestone
eglibc (Ubuntu)      -             Undecided   Unassigned       -
glibc (Fedora)       Fix Released  Undecided   -                -
glibc (Ubuntu)       Fix Released  Critical    Colin Watson     -
glibc (Ubuntu RTM)   Fix Released  Critical    Ricardo Salveti  -
qtmir (Ubuntu)       -             Undecided   Josh Arenson     -

Bug Description

Just create an emulator image for any image newer than 254 and you'll notice that the wizard will never show up, always showing a black screen.

# Creating the emulator image
sudo ubuntu-emulator create --channel=ubuntu-touch/utopic-proposed --arch=i386 test_x86 --revision=XXX (254 is the last known working one)

From wizard:
====
phablet@ubuntu-phablet:~$ cat ./.cache/upstart/ubuntu-system-settings-wizard.log
unity8 stop/starting
()
qtmir.mir: MirServerConfiguration created
qtmir.mir: PromptSessionListener::PromptSessionListener - this= PromptSessionListener(0x9ff9384)
qtmir.mir: SessionListener::SessionListener - this= SessionListener(0x9ffa1a4)
qtmir.mir: MirPlacementStrategy::MirPlacementStrategy
qtmir.sensor: Screen - nativeOrientation is: 1
qtmir.sensor: Screen - initial currentOrientation is: 1
QtCompositor::setAllWindowsExposed true
file:///usr/share/ubuntu/settings/wizard/qml/main.qml:21:1: plugin cannot be loaded for module "Unity.Application": Cannot load library /usr/lib/i386-linux-gnu/qt5/qml/Unity/Application/libunityapplicationplugin.so: (dlopen: cannot load any more object with static TLS)
     import Unity.Application 0.1
     ^
DisplayWindow::DisplayWindow
Window 0xa0d8ec8: 0xa04a498 0x1

createPlatformOpenGLContext QOpenGLContext(0xa0dad28)
unity8 stop/starting
()
====

If you create an image based on 254 and update only qtmir it's already possible to reproduce the issue.

Description of problem:
Crashes on startup in current F21.

Version-Release number of selected component:
calibre-1.46.0-1.fc21

Additional info:
reporter: libreport-2.2.3
cmdline: python2 /usr/bin/calibre --detach
executable: /usr/bin/calibre
kernel: 3.16.0-0.rc6.git2.2.fc22.x86_64
runlevel: N 5
type: Python
uid: 1001

Truncated backtrace:
#1 <module> in /usr/lib64/calibre/calibre/utils/magick/__init__.py:14
#2 <module> in /usr/lib64/calibre/calibre/db/backend.py:31
#3 <module> in /usr/lib64/calibre/calibre/db/legacy.py:17
#4 <module> in /usr/lib64/calibre/calibre/gui2/ui.py:26
#5 run_gui in /usr/lib64/calibre/calibre/gui2/main.py:320
#6 main in /usr/lib64/calibre/calibre/gui2/main.py:458
#7 <module> in /usr/bin/calibre:20

Created attachment 922690
File: backtrace

Created attachment 922691
File: environ

I suspect this is a glibc change. We're seeing the same error message in an Anaconda environment from trying to dlopen(libgtk.so).

So, let's CC the glibc maintainer (at least, the person to whom glibc bugs appear currently to be assigned).

Created attachment 925164
LD_DEBUG log from crash (requested by carlos)

Please get the list of all shared libraries loaded by python, then run `readelf -a -W` against all of them and dump that to a file. I want to know which of the libraries is using static TLS.

The error you are seeing is not a bug in glibc. The dynamic loader has a fixed amount of "super fast" static TLS reserved for core libraries that are always loaded. Normal libraries should not be using static TLS, and you should not be running out of it; those libraries should use dynamic TLS, which has no such limit (not quite as fast as static TLS, but still fast).

"Please get the list of all shared libraries loaded by python then run `readelf -a -W` against all of them and dump that to a file."

Is there some easy way to get this list without parsing the strace output?

I thought it might be in the abrt data, but I don't see it in there at least in an obvious way. I can do the parsing, I just didn't want to embarrass myself doing it in my monkey way in front of carlos :P when I have some privacy to pore over the 'cut' manpages without anyone seeing, I'll get it done...

(In reply to Adam Williamson (Red Hat) from comment #9)
> I thought it might be in the abrt data, but I don't see it in there at least
> in an obvious way. I can do the parsing, I just didn't want to embarrass
> myself doing it in my monkey way in front of carlos :P when I have some
> privacy to pore over the 'cut' manpages without anyone seeing, I'll get it
> done...

The abrt data should have a loaded library list. For example /proc/self/maps should have been dumped and provided? That will contain the full list of DSOs.
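
A minimal sketch of turning such a maps dump into the DSO list. The function name `list_dsos` is made up for illustration, and it assumes the usual six-column /proc/PID/maps layout with the pathname in the last column (paths containing spaces are not handled):

```shell
# list_dsos: read /proc/<pid>/maps-style text on stdin and print each
# mapped shared object exactly once, in byte (C locale) order.
list_dsos() {
    # Column 6 is the pathname; keep names ending in .so or .so.N[.N...]
    awk '$6 ~ /\.so(\.[0-9]+)*$/ { print $6 }' | LC_ALL=C sort -u
}
```

For a running process this could be used as `list_dsos < /proc/PID/maps`, with PID replaced by the process ID, and the result fed to `readelf -a -W` one file at a time.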

Another user experienced a similar problem:

During the first start I left the default values and clicked the Next button, then the application froze. Every subsequent start of the application ends with a traceback.

reporter: libreport-2.2.3
cmdline: python2 /usr/bin/calibre --detach
executable: /usr/bin/calibre
kernel: 3.16.0-1.fc21.x86_64
package: calibre-1.46.0-3.fc21
reason: __init__.py:14:<module>:RuntimeError: Failed to load ImageMagick: dlopen: cannot load any more object with static TLS
runlevel: N 5
type: Python
uid: 1000

Created attachment 928076
readelf -a -W of libraries that calibre opens

strace -o /tmp/strace -f calibre

fgrep '.so"' /tmp/strace | grep ' open(' | grep -v ENOENT | cut -d'"' -f2 | sort -u > /tmp/strace.so.list.txt

for i in $(cat /tmp/strace.so.list.txt) ; do echo "readelf -a -W $i" >> /tmp/readelf.txt; readelf -a -W $i >> /tmp/readelf.txt; done

Created attachment 928078
readelf -a -W of libraries that calibre opens

Again with missing libraries added (sorry for the spam):

strace -o /tmp/strace -f calibre

fgrep '.so' /tmp/strace | grep ' open(' | grep -v ENOENT | cut -d'"' -f2 | sort -u > /tmp/strace.so.list.txt

for i in $(cat /tmp/strace.so.list.txt) ; do echo "readelf -a -W $i" >> /tmp/readelf.txt; readelf -a -W $i >> /tmp/readelf.txt; done

cat readelf.txt | egrep 'readelf| TLS ' | grep -v ' 0 TLS ' | grep TLS -B1

384 /lib64/libpixman-1.so.0
272 /lib64/libc.so.6
224 /lib64/libselinux.so.1
108 /lib64/libgomp.so.1
78 /lib64/libuuid.so.1
56 /lib64/libsystemd.so.0
48 /lib64/libstdc++.so.6
32 /lib64/libglapi.so.0
25 /lib64/libcom_err.so.2
8 /lib64/libasound.so.2
8 /lib64/libdw.so.1
8 /lib64/libEGL.so.1
8 /lib64/libGL.so.1
8 /lib64/libQtCore.so.4
4 /lib64/libelf.so.1

= 1271 bytes (is there some additional rounding done by linker?)
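
The total can be double-checked by summing the first column of a listing like the one above. This is only a sketch: `sum_tls_sizes` is an illustrative name, and the input is assumed to be "bytes path" pairs, one per line.

```shell
# sum_tls_sizes: read "SIZE PATH" lines on stdin and print the sum of SIZE.
sum_tls_sizes() {
    awk '{ total += $1 } END { print total + 0 }'
}
```

A plain sum like this ignores per-module alignment padding, which is one reason the space actually consumed can be larger than the figure it prints.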

After some googling I found out that we need the -d flag to readelf to find which modules have static TLS. Results are:

cat readelf.txt | egrep 'readelf|STATIC_TLS' | grep TLS -B1 | grep readelf | cut -d/ -f2-
272 /lib64/libc.so.6
108 /lib64/libgomp.so.1
32 /lib64/libglapi.so.0
8 /lib64/libEGL.so.1
8 /lib64/libGL.so.1
0 /lib64/libcrypt.so.1
0 /lib64/libm.so.6
0 /lib64/libnsl.so.1
0 /lib64/libnss_files.so.2
0 /lib64/libpthread.so.0
0 /lib64/libresolv.so.2
0 /lib64/librt.so.1
0 /lib64/libutil.so.1

The .so files with 0 bytes of STATIC_TLS usage only have references to glibc variables, such as the one below, so I did not count them:
0 TLS GLOBAL DEFAULT UND errno@GLIBC_PRIVATE (4)

If I googled right the .so libraries tagged STATIC_TLS fail to load if their TLS section does not fit into the static section. Others prefer static TLS (and thus use it even if not tagged static).

The actual .so load order (for libraries with TLS section) is:
/lib64/libc.so.6
/lib64/libcom_err.so.2
/lib64/libselinux.so.1
/lib64/libstdc++.so.6
/lib64/libQtCore.so.4
/lib64/libuuid.so.1
/lib64/libasound.so.2
/lib64/libGL.so.1
/lib64/libglapi.so.0
/lib64/libsystemd.so.0
/lib64/libdw.so.1
/lib64/libelf.so.1
/lib64/libpixman-1.so.0
/lib64/libEGL.so.1
/lib64/libgomp.so.1

Looking at the above lists my conclusion is that libgomp.so is the library that fails to load (the python tries to load libMagickCore-6.Q16.so.2, which depends on libgomp). But the libselinux and libpixman have already managed to store their large TLS sections into the static block.

Possible solutions:
1) make calibre's python somehow not load libselinux (could such a switch be added to python?)
2) make calibre load libgomp/libMagickCore earlier, before libpixman (most likely libselinux is loaded so early that it cannot be avoided)
3) add a new flag RTLD_NO_AUTOMATIC_STATIC_TLS for the dlopen function to _not_ use static TLS for libraries unless they request it with STATIC_TLS, and then make python load libraries with the new flag
4) recompile Fedora's glibc to have a larger static TLS section (glibc seems to define TLS_STATIC_SURPLUS as 64+DL_NNS*100, where DL_NNS is either 1 or 16)
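
For reference, the expression quoted in option 4 can be evaluated for both DL_NNS values. This sketch takes the formula at face value, exactly as reported in this comment, and the helper name `tls_static_surplus` is made up for illustration:

```shell
# tls_static_surplus: evaluate 64 + DL_NNS * 100 for the DL_NNS value in $1.
tls_static_surplus() {
    echo $(( 64 + $1 * 100 ))
}

tls_static_surplus 1    # prints 164
tls_static_surplus 16   # prints 1664
```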


(In reply to Mikko Tiihonen from comment #14)
> cat readelf.txt | egrep 'readelf|STATIC_TLS' | grep TLS -B1 | grep readelf |
> cut -d/ -f2-
> 272 /lib64/libc.so.6
> 108 /lib64/libgomp.so.1
> 32 /lib64/libglapi.so.0
> 8 /lib64/libEGL.so.1
> 8 /lib64/libGL.so.1
> 0 /lib64/libcrypt.so.1
> 0 /lib64/libm.so.6
> 0 /lib64/libnsl.so.1
> 0 /lib64/libnss_files.so.2
> 0 /lib64/libpthread.so.0
> 0 /lib64/libresolv.so.2
> 0 /lib64/librt.so.1
> 0 /lib64/libutil.so.1

That many libraries should not overflow the reserved static TLS slots in the DTV. That is only 13 slots. We have 14 surplus slots.

The original error is:

"RuntimeError: Failed to load ImageMagick: dlopen: cannot load any more object with static TLS"

Which indicates overflow of the slots, not the size of the allocated static TLS block itself.

For the record DL_NNS should be 16, so we should have 102,400 bytes of static surplus storage.

> The 0 byte STATIC_TLS usage .so files have references to glibc variables
> such as so I did not count them.
> 0 TLS GLOBAL DEFAULT UND errno@GLIBC_PRIVATE (4)

These are references to static TLS variables in other modules, e.g. libc.so.6, and because of those references the entire access type for the module is adjusted to be static TLS. Their size doesn't matter for now.

> If I googled right the .so libraries tagged STATIC_TLS fail to load if their
> TLS section does not fit into the static section. Others prefer static TLS
> (and thus use it even if not tagged static).

Libraries don't prefer static TLS, they must be compiled for it.

No shared libraries should be built with static TLS. I'm going to make it my quest to ban anything but the implementation from using static TLS in libraries because it leads to unmaintainable chaos at the distribution level :-(

Maybe a few key libraries might be allowed...

> The actual .so load order (for libraries with TLS section) is:
> /lib64/libc.so.6
> /lib64/libcom_err.so.2
> /lib64/libselinux.so.1
> /lib64/libstdc++.so.6
> /lib64/libQtCore.so.4
> /lib64/libuuid.so.1
> /lib64/libasound.so.2
> /lib64/libGL.so.1
> /lib64/libglapi.so.0
> /lib64/libsystemd.so.0
> /lib64/libdw.so.1
> /lib64/libelf.so.1
> /lib64/libpixman-1.so.0
> /lib64/libEGL.so.1
> /lib64/libgomp.so.1
>
> Looking at the above lists my conclusion is that libgomp.so is the library
> that fails to load (the python tries to load libMagickCore-6.Q16.so.2, which
> depends on libgomp). But the libselinux and libpixman have already managed
> to store their large TLS sections into the static block.

Neither libselinux nor libpixman have static TLS AFAIK. How did you determine they did?

> Possible solutions:
> 1) make the calibre python somehow not load libselinux (could such switch be
> added to python?)

That's not a solution since libselinux doesn't use static TLS.

> 2) make the calibre load the libgomp/libMagicCore earlier, before libpixman
> (most likely libselinux is loaded so early that it cannot be avoided)

Not an option.

> 3) add a new flag RTLD_NO_AUTOMATIC_STATIC_TLS for dlopen function to _not_
> use static TLS for libraries unless they request it with STATIC_TLS. And
> then make python load libraries with the new flag

You are misunderstanding ...


Your glibc-2.19.90-34 build allows calibre to start.

And thank you for the long explanation on how things really are. My googling seems to have led me really far away from the actual problem :)

(In reply to Mikko Tiihonen from comment #16)
> Your glibc-2.19.90-34 build allows calibre to start.
>
> And thank you for the long explanation on how things really are. My googling
> seems to have lead me really far a way from the actual problem :)

You are very welcome. I'm happy to explain.

However, please don't walk away now.

What I need from you is help. How many STATIC_TLS libraries did Calibre need?

I bumped the surplus slot allowance up to 64, so obviously less than that.

Can you find them for me so I can figure out who I should talk to?

I'll be away from the machine with the problem for a week now. But here is what I had time to test with the new glibc:

Assuming my grepping skills are still ok the list of .so files loaded with static_tls is in order:

lib64/libc.so.6
lib64/libpthread.so.0
lib64/libutil.so.1
lib64/libm.so.6
lib64/libc.so.6
lib64/libresolv.so.2
lib64/librt.so.1
lib64/libGL.so.1
lib64/libglapi.so.0
lib64/libnsl.so.1
lib64/libEGL.so.1
lib64/libcrypt.so.1
lib64/libc.so.6
lib64/libc.so.6
lib64/libgomp.so.1
lib64/libnss_files.so.2

Thus 16 times, but only 13 unique. The libc.so.6 is loaded by different pids (threads?) like this:
23705 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
23705 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
23706 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
23709 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3

Does it still use a new slot every time?
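
The "16 times, but only 13 unique" tally can be reproduced from such a list with a short sketch. `count_loads` is an illustrative name, and the input is assumed to be one library path per line:

```shell
# count_loads: read one path per line on stdin and report how many lines
# were seen in total and how many distinct paths they contain.
count_loads() {
    awk '{ total++; if (!seen[$0]++) unique++ }
         END { printf "%d total, %d unique\n", total, unique }'
}
```

Since the strace log was taken with -f, the repeated libc.so.6 entries come from different PIDs, so counting per PID may be closer to what a single process actually loads.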

I can confirm that the scratch glibc allows calibre to start. I can downgrade and provide further debug info if required.

So, calibre 2.0.0 came out today. It switches over to qt5 and doesn't have this issue anymore, due to loading different libs. ;(

So, we could just close this now... but it might also still be worth finding out what libraries are to blame when we hit them with other applications?

Hi,

I seem to have run into the glibc bug too. I've filed a new bug here:

https://bugzilla.redhat.com/show_bug.cgi?id=1124987

Thanks
Ankur

(In reply to Ankur Sinha (FranciscoD) from comment #21)
> Hi,
>
> I seem to have run into the glibc bug too. I've filed a new bug here:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1124987

For the record it is not a glibc bug. It is a defect in the loaded libraries: they should not require static TLS. Fundamentally it's a defect that we allowed developers to build shared libraries with static TLS in the first place.

I will keep debugging this issue here and we'll get to a resolution.

*** Bug 1133843 has been marked as a duplicate of this bug. ***

Hi Carlos,

Would you have any suggestions on how the vlc error could be fixed? I can get in touch with the package maintainer at rpmfusion and ask them to correct their build flags, for example.

Thanks,
Ankur

(In reply to Ankur Sinha (FranciscoD) from comment #24)
> Hi Carlos,
>
> Would you have any suggestions on how the vlc error could be fixed? I can
> get in touch with the package maintainer at rpmfusion and ask them to
> correct their build flags, for example.

There is nothing vlc can do. I'm going to fix this in glibc to work around the issue.

After an analysis I see ~44 distribution-wide libraries using static TLS, and you can only really have ~30 of them loaded at any given point (some are x11 drivers so you can only load one). Thus increasing the slots in glibc to 32 will fix this.

http://koji.fedoraproject.org/koji/taskinfo?taskID=7538652

This should now be fixed in glibc-2.19.90-36.fc21 which is currently building. In this release I've bumped the surplus DTV slots to 32. An analysis of the distribution shows that's a reasonable minimum given glibc, MESA, and X11 usage.

(In reply to Carlos O'Donell from comment #26)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=7538652
>
> This should now be fixed in glibc-2.19.90-36.fc21 which is currently
> building. In this release I've bumped the surplus DTV slots to 32. An
> analysis of the distribution shows that's a reasonable minimum given glibc,
> MESA, and X11 usage.

Fixes my VLC issue. Thanks, Carlos.

glibc-2.19.90-36.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/glibc-2.19.90-36.fc21

Package glibc-2.19.90-36.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.19.90-36.fc21'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-10275/glibc-2.19.90-36.fc21
then log in and leave karma (feedback).

glibc-2.19.90-36.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.

tags: added: rtm14
Colin Watson (cjwatson) wrote :

This is probably a lack of -fPIC somewhere.

Colin Watson (cjwatson) wrote :

Static TLS is actually a global resource, so this could be triggered by something otherwise entirely innocuous. I'm experimenting with Carlos's glibc patch from https://bugzilla.redhat.com/show_bug.cgi?id=1124987 to raise the global limit.

Colin Watson (cjwatson) on 2014-09-30
affects: qtmir (Ubuntu) → glibc (Ubuntu)
summary: - [regression] qtmir landing for image mako 256 (emulator 255) broke the
- x86 emulator
+ global static TLS slot limit breaks the x86 emulator
Changed in glibc (Ubuntu):
importance: Undecided → Critical
assignee: nobody → Colin Watson (cjwatson)
status: New → In Progress
Colin Watson (cjwatson) on 2014-09-30
Changed in glibc (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glibc - 2.19-10ubuntu2

---------------
glibc (2.19-10ubuntu2) utopic; urgency=medium

  * Add patches/ubuntu/unsubmitted-increase-dtv-surplus.diff from Fedora to
    allow up to 32 dlopened modules to use static TLS (LP: #1375555).
 -- Colin Watson <email address hidden> Tue, 30 Sep 2014 14:33:02 +0100

Changed in glibc (Ubuntu):
status: Fix Committed → Fix Released
Changed in glibc (Ubuntu RTM):
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Ricardo Salveti (rsalveti)
Alexander Sack (asac) wrote :

I added a qtmir bug because it would be great to understand whether qtmir is what is causing this to happen, and whether we intentionally hit this bound (e.g. because we use more threads) or whether there is a bug, such as a thread leak, looming here that might need follow-up.

kgunn: could you check whether there is a good explanation on your side for this?

Changed in qtmir (Ubuntu):
importance: Undecided → Wishlist
importance: Wishlist → Undecided
kevin gunn (kgunn72) on 2014-10-23
Changed in qtmir (Ubuntu):
assignee: nobody → Josh Arenson (josharenson)
Gerry Boland (gerboland) wrote :

QtMir is passing -fPIC as a CXXFLAG to the qml module & QPA plugin.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glibc - 2.19-10ubuntu2

---------------
glibc (2.19-10ubuntu2) utopic; urgency=medium

  * Add patches/ubuntu/unsubmitted-increase-dtv-surplus.diff from Fedora to
    allow up to 32 dlopened modules to use static TLS (LP: #1375555).
 -- Colin Watson <email address hidden> Tue, 30 Sep 2014 14:33:02 +0100

Changed in glibc (Ubuntu RTM):
status: Confirmed → Fix Released

(In reply to Carlos O'Donell from comment #25)

> After an analysis I see ~44 distribution-wide libraries using static TLS,

Could you tell us how to see/check which libraries are affected?

(In reply to Arkadiusz Miskiewicz from comment #31)
> (In reply to Carlos O'Donell from comment #25)
>
> > After an analysis I see ~44 distribution-wide libraries using static TLS,
>
> Could you tell us how to see/check which libraries are affected?

All architectures should set the STATIC_TLS flag in the dynamic section if any variable uses static TLS. Some architectures, AArch64 I believe, fail to set STATIC_TLS (a bug), but in that case you can see the static TLS usage by the relocation.

e.g.

[carlos@koi static-tls]$ readelf -a -W /lib64/libc.so.6 | grep STATIC
 0x000000000000001e (FLAGS) BIND_NOW STATIC_TLS

OK, because libc.so.6 uses static TLS to accelerate the runtime.

[carlos@koi lib64]$ readelf -a -W libGL.so.1.2.0 | grep STATIC
 0x000000000000001e (FLAGS) SYMBOLIC STATIC_TLS

Technically not OK because libGL is not part of the core runtime, but as a distribution we are coordinating the use of static TLS across multiple libraries to accelerate the distribution.

[carlos@koi lib64]$ readelf -a -W libGL.so.1.2.0 | grep TPOFF
0000003f66e91fb8 0000000000000012 R_X86_64_TPOFF64 0
0000003f66e91ff0 000000ab00000012 R_X86_64_TPOFF64 0000000000000000 _glapi_tls_Dispatch + 0

If the STATIC_TLS flag had not been set on x86_64, we could have looked for thread-pointer offset relocations to catch the use of static TLS. Here you can see that the libGL variable `_glapi_tls_Dispatch` is using static TLS and has a TPOFF relocation so that the dynamic loader sets up the variable properly during assembly of the in-memory application execution image.
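
The two checks above can be combined into one filter. This is a rough sketch, not an authoritative test: `uses_static_tls` is a made-up name, it only looks for the strings STATIC_TLS and TPOFF in readelf output, and other architectures may name their thread-pointer relocations differently.

```shell
# uses_static_tls: read `readelf -a -W` output on stdin and succeed (exit 0)
# if it mentions the STATIC_TLS dynamic flag or a TPOFF relocation.
uses_static_tls() {
    grep -Eq 'STATIC_TLS|TPOFF'
}
```

It would be used as `readelf -a -W /lib64/libGL.so.1.2.0 | uses_static_tls && echo "static TLS"`. Because it is a plain text match, a symbol that merely contains one of those strings would produce a false positive.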

Any chance we see this small patch backported to F20? On our robots we use an architecture that loads plugins from shared libraries (dozens on a complex system) which build on a wide variety of existing system libraries. Therefore we frequently face this problem when loading (many) plugins which load a large number of shared libraries. Another problem for example is graphviz, which itself dlopens plugins for rendering.

We have rolled our own glibc (patch attached) and it works just fine. We would appreciate it if that patch could be rolled out in F20. Since it does not change API or ABI, this should be safe.

A scratch build for testing (which we use on our production machines) is available at
http://koji.fedoraproject.org/koji/taskinfo?taskID=8722107

Created attachment 985507
Patch for F-20 package

Martin Ritter (martin-ritter) wrote :

This bug also affects eglibc on Ubuntu 14.04. Maybe not for the x86 emulator, but we do have a software project which suffers from this. Is there a possibility that the fix might be backported?

Changed in glibc (Fedora):
importance: Unknown → Undecided
status: Unknown → Fix Released