Backport upstream bug 13844 - fix for futex issue

Bug #1091186 reported by Linaro Toolchain Builder on 2012-12-17
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
eglibc
Fix Released
Medium
eglibc (Ubuntu)
Undecided
Adam Conrad
Precise
High
Adam Conrad
Quantal
High
Adam Conrad

Bug Description

[Impact / Justification]
There was a bug in glibc where custom lowlevellock implementations weren't being used, but rather the generic one was instead. This was a non-issue on most arches, however hppa, ARM, and Sparc needed their own custom implementation to have sane futexes. The patch from upstream fixes this by swapping a "" include for a <> include, which correctly walks sysdeps paths and grabs the arch-specific implementations.

[Test Case]
I don't personally have a good test for this, but there are several people very keen on having this fix in that will test quite heavily on my behalf, I'm told. I'll make sure they do so (and bonus points if they document their testcase)...

[Regression Potential]
This patch has zero effect on i386, x86_64, and powerpc, and the architecture this does effect (armel/armhf) was quite fundamentally broken, apparently, so the only risk here is that it remains broken, which I'm told by people who've already tested, it won't.

[Original Report]
There is a subtle futex bug in glibc 2.15 which is fixed dealt with by BZ #13844 (http://sourceware.org/bugzilla/show_bug.cgi?id=13844).

This is being hit regularly when testing locks on ARM.

Can this be backported to precise and quantal (which I believe are both 2.15 based)?

The patch is here: http://sourceware.org/git/?p=glibc.git;a=commit;h=7e7fa5f8719c0a497f4b262e6fb5625c13b6c22e

Hi.

 Because of the absence of the sparc32-linux specific version of this file, `nptl/sysdeps/unix/sysv/linux/libc-lowlevellock.c' is fetched when building `libc-lowlevellock.o{,s}'. The latter contains just
. . .
#include "lowlevellock.c"

which leads to `nptl/sysdeps/unix/sysv/linux/lowlevellock.c' being actually compiled instead of `nptl/sysdeps/unix/sysv/linux/sparc/sparc32/lowlevellock.c'.

 As a consequence a futex can be accessed in a non-atomic way when using libc's internal locks, e.g. when performing output to the same stream from different threads.

 For example, when the following test is run on a V8 host, this typically leads to a deadlock when all threads are waiting for a futex to become free and there is no one to release it. This happens particularly soon when it's executed this way at a multi-CPU host:

$ ./test.sparc32 5 > /dev/null

$ cat test.c

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *
handler (void *arg)
{
  for (;;)
    printf ("Thread #%ld\n", (long) arg);

  return NULL;
}

int
main (int argc, char **argv)
{
  size_t i;
  size_t nthreads = (size_t) atoi (argv[1]);
  pthread_t *threads = (pthread_t *) malloc (nthreads * sizeof (pthread_t));

  for (i = 0; i < nthreads; i++)
    pthread_create (&threads[i], NULL, handler, (void *) i);

  for (i = 0; i < nthreads; i++)
    pthread_join (threads[i], NULL);

  free (threads);
  return 0;
}

(1) I've added Dave Miller (Sparc maintainer) to the CC.

(2) Please follow "What to put in a report": http://www.gnu.org/software/libc/bugs.html

(3) No architecture needs to provide libc-lowlevellock.c. The normal sysdep override mechanism should have selected the sparc32 version of lowlevellock.c *unless* your configure options failed to cause the selection of sparc32. Given that you have not provided a full report we don't know why sparc32 was not selected.

Please include the full log of running `./configure ...` since that will contain which sysdep directories were selected and in what override order.

Thanks.

(In reply to comment #1)
. . .
> (3) . . . The normal sysdep
> override mechanism should have selected the sparc32 version of lowlevellock.c
> *unless* your configure options failed to cause the selection of sparc32. . .
. . .
 And it actually did when compiling lowlevellock.o{,s}, however `#include "lowlevellock.c"' in the generic version of libc-lowlevellock.c causes the generic version of lowlevellock.c to be included (because they are in the same directory) rather than the sparc32 one.

 If you still need a full configure log, I'll provide it a bit later.

(In reply to comment #2)
> (In reply to comment #1)
> . . .
> > (3) . . . The normal sysdep
> > override mechanism should have selected the sparc32 version of lowlevellock.c
> > *unless* your configure options failed to cause the selection of sparc32. . .
> . . .
> And it actually did when compiling lowlevellock.o{,s}, however `#include
> "lowlevellock.c"' in the generic version of libc-lowlevellock.c causes the
> generic version of lowlevellock.c to be included (because they are in the same
> directory) rather than the sparc32 one.
>
> If you still need a full configure log, I'll provide it a bit later.

No, that makes more sense to me.

I see HPPA has it's own version of libc-lowlevellock.c probably because of this issue.

I also think ARM might suffer the same problem, it has a lowlevellock.c but no overriding libc-lowlevellock.c like HPPA.

OK, confirmed, on ARM we are also using the *default* lowlevellock.c e.g.
~~~
nptl/libc-lowlevellock.os: file format elf32-littlearm

Disassembly of section .text:

00000000 <__lll_lock_wait_private>:
__lll_lock_wait_private():
/opt/codesourcery/arm-verifone-linux-gnueabi/src/glibc/nptl/../nptl/sysdeps/unix/sysv/linux/lowlevellock.c:29
   0: e92d40f0 push {r4, r5, r6, r7, lr}
~~~

OK, some big-picture questions.

Dave, Why does sparc have a custom version of lowlevellock.c? Why do any architectures? The custom versions all seem vaguely similar to the generic linux version.

If architectures *do* need to override the generic lowlevellock.c without overriding libc-lowlevellock.c then I think the right solution would be to make libc-lowlevellock.c use `#include <lowlevellock.c>` such that the sysdep include mechanism includes the right file.

Il'ya, Would you mind testing that change?

Sparc needs a custom lowlevellock.c on 32-bit sparc because
we have a situation where we don't know if we have the
proper atomic instructions available or not.

For example, if we don't have the v9 compare-and-swap available
we use spinlocks instead. And for futexes we use a special
24-bit atomic, which embeds the spinlock in the remaining
8-bits. That's what all of those atomic_compare_and_exchange_val_24_acq
etc. routines are all about.

I'll just add the necessary sparc32/libc-lowlevellock.c file
to fix this, thanks for the report.

Joseph,

It looks like ARM also needs a libc-lowlevellock.c? Could you please review?

I think we should use #include <> in the generic libc-lowlevellock.c
rather than accumulating copies.

That's what I ended up doing in the end.

Matthias Klose (doko) on 2012-12-17
Changed in eglibc (Ubuntu Precise):
importance: Undecided → High
milestone: none → precise-updates
status: New → Confirmed
Changed in eglibc (Ubuntu Quantal):
importance: Undecided → High
milestone: none → quantal-updates
status: New → Confirmed
Changed in eglibc:
importance: Unknown → Medium
status: Unknown → Fix Released
Adam Conrad (adconrad) on 2013-01-28
Changed in eglibc (Ubuntu):
status: New → Fix Released
Adam Conrad (adconrad) on 2013-01-28
description: updated
Adam Conrad (adconrad) on 2013-01-28
Changed in eglibc (Ubuntu Precise):
assignee: nobody → Adam Conrad (adconrad)
Changed in eglibc (Ubuntu Quantal):
assignee: nobody → Adam Conrad (adconrad)
Changed in eglibc (Ubuntu):
assignee: nobody → Adam Conrad (adconrad)

Hello Linaro, or anyone else affected,

Accepted eglibc into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/eglibc/2.15-0ubuntu10.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in eglibc (Ubuntu Precise):
status: Confirmed → Fix Committed
tags: added: verification-needed
Changed in eglibc (Ubuntu Quantal):
status: Confirmed → Fix Committed
Colin Watson (cjwatson) wrote :

Hello Linaro, or anyone else affected,

Accepted eglibc into quantal-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/eglibc/2.15-0ubuntu20.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Mark Rutland (mark-rutland) wrote :

Originally, when we ran a 32bit mallocstress under a 64bit kernel on an ARM fast model, mallocstress would deadlock (even with relatively small numbers of threads and loops). Will Deacon looked into this, and saw that all threads were waiting for the same futex. We believe the non-atomic eglibc locking implementation issue was to blame for the threads getting into this state.

I've installed the following packages dervived from the proposed eglibc (as described in http://launchpad.net/ubuntu/+source/eglibc/2.15-0ubuntu20.1) onto a quantal core image:
* libc-dev-bin
* libc6-dev
* libc-bin
* libc6
* multiarch-support

With these installed, I no longer experience deadlock in mallocstress.

I note the updated package contains another related bug fix: https://bugs.launchpad.net/ubuntu/quantal/+source/eglibc/+bug/1081734, which is supposed to fix this issue by changing the internal structure of malloc/calloc. Due to this, I'm unable to say which change actually fixes the issue.

I took the liberty to run a larger selection of LTP tests, and have not experienced any regressions with the updated packages, so having both doesn't seem to be harmful, at least.

Adam Conrad (adconrad) wrote :

Marking verification-done based on the above comments, and that no one's been able to reproduce the LTP issues with the new version.

tags: added: verification-done
removed: verification-needed

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eglibc - 2.15-0ubuntu10.4

---------------
eglibc (2.15-0ubuntu10.4) precise; urgency=low

  * Add patch ubuntu/local-disable-nscd-netgroup-caching.diff to
    disable netgroup caching in the default config (LP: #1068889)
  * Backport any/cvs-malloc-deadlock.diff from upstream to prevent
    glibc deadlocking in mallock arena retry paths (LP: #1081734)
  * Fix futex issue (BZ #13844), backport from 2.16 (LP: #1091186)
  * Drop patch any/local-disable-nscd-host-caching.diff, as this
    bug was apparently resolved upstream a while ago (LP: #613662)
  * Add patch any/cvs-ld-self-load.diff to restore ld.so's ability
    to load itself, a behaviour accidentally removed (LP: #1088677)
  * Drop dangling libnss_db.so symlink in libc6-dev (LP: #1088773)
 -- Adam Conrad <email address hidden> Sun, 27 Jan 2013 16:46:30 -0700

Changed in eglibc (Ubuntu Precise):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eglibc - 2.15-0ubuntu20.1

---------------
eglibc (2.15-0ubuntu20.1) quantal; urgency=low

  * Add patch ubuntu/local-disable-nscd-netgroup-caching.diff to
    disable netgroup caching in the default config (LP: #1068889)
  * Backport any/cvs-malloc-deadlock.diff from upstream to prevent
    glibc deadlocking in mallock arena retry paths (LP: #1081734)
  * Fix futex issue (BZ #13844), backport from 2.16 (LP: #1091186)
  * Drop patch any/local-disable-nscd-host-caching.diff, as this
    bug was apparently resolved upstream a while ago (LP: #613662)
  * Add patch any/cvs-ld-self-load.diff to restore ld.so's ability
    to load itself, a behaviour accidentally removed (LP: #1088677)
  * Drop dangling libnss_db.so symlink in libc6-dev (LP: #1088773)
 -- Adam Conrad <email address hidden> Sun, 27 Jan 2013 16:46:30 -0700

Changed in eglibc (Ubuntu Quantal):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.