weird pthread/fork race/deadlock

Bug #838975 reported by desrt on 2011-09-01
This bug affects 2 people
Affects (status, importance, assignee):
eglibc (Ubuntu): importance High, unassigned
Natty: importance High, unassigned
Oneiric: importance High, unassigned
glibc (Fedora): status Fix Released, importance Undecided

Bug Description

There appears to be a strange bug in glibc that causes deadlocks when calling fork() from threads. We had a testcase in GLib failing from time to time because of this.

I've attached a minimal testcase that uses only pure pthreads + libc. Compile it with -pthread and run it. It should fill your screen with dots for a while, then hang when it hits the bug (which happens randomly anywhere between 1 dot and hundreds). I've already received independent verification that this testcase hangs on several people's computers.

I believe this to be an upstream issue since this bug is visible on Fedora 15 and 16, but the glibc website says I should file bugs against distributions first. I also believe the issue to be a regression since Lucid is fine but Oneiric is not. The problem appears to affect both 32-bit and 64-bit systems.

Some notes:

 - compiling the testcase with -static has the side-effect of causing the bug to go away

 - compiling the testcase with -DFORK_DIRECTLY also appears to solve the problem

 - replacing the execv() with a direct exit(0) doesn't solve the problem but causes the frequency to change

The fact that both static linking and making the fork() syscall directly cause the problem to disappear leads me to believe that this is a libc bug rather than a kernel bug (which is the only other possibility). I'm not 100% sure of that, though, since libc actually uses the clone() syscall to implement fork(), so there could be a difference inside the kernel because of that.
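
For readers without access to the attachment, the kind of test described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the attached testcase: the names do_fork, fork_once and run_fork_stress, the MAX_THREADS limit, and the use of _exit(0) in the child (the report's testcase calls execv()) are all assumptions made for illustration. The FORK_DIRECTLY switch mirrors the flag mentioned in the notes and is assumed to select the raw fork syscall where the platform provides one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { MAX_THREADS = 16 };   /* arbitrary limit chosen for this sketch */

/* Fork through libc, or through the raw syscall with -DFORK_DIRECTLY
   (only on platforms that still define SYS_fork). */
static pid_t do_fork(void)
{
#if defined(FORK_DIRECTLY) && defined(SYS_fork)
    return (pid_t)syscall(SYS_fork);
#else
    return fork();
#endif
}

/* Thread body: fork a child that exits at once, then reap it.
   Sets *arg to 1 if the child exited with status 0. */
static void *fork_once(void *arg)
{
    int *ok = arg;
    int status;
    pid_t pid = do_fork();

    if (pid == 0)
        _exit(0);            /* assumption: the real testcase uses execv() */
    if (pid > 0 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        *ok = 1;
    return NULL;
}

/* Each round starts nthreads threads that fork concurrently, so thread
   stacks are set up and torn down while forks are in flight.
   Prints one dot per round and returns the number of clean child exits. */
int run_fork_stress(int nthreads, int rounds)
{
    int total = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    for (int r = 0; r < rounds; r++) {
        pthread_t tids[MAX_THREADS];
        int ok[MAX_THREADS];
        int started = 0;

        for (int i = 0; i < nthreads; i++) {
            ok[i] = 0;
            if (pthread_create(&tids[i], NULL, fork_once, &ok[i]) != 0)
                break;
            started++;
        }
        for (int i = 0; i < started; i++) {
            pthread_join(tids[i], NULL);
            total += ok[i];
        }
        putchar('.');
        fflush(stdout);
    }
    putchar('\n');
    return total;
}
```

Calling run_fork_stress(4, 1000) from a main() and building with -pthread should print one dot per round; on a fixed glibc it is expected to return nthreads * rounds. Whether this sketch hangs on an affected glibc as the attachment does is not guaranteed.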

desrt (desrt) wrote :

Micah Gersten just tested on Natty and discovered that the bug is there too.

desrt (desrt) wrote :

2.6.38-11.49 and 2.13-0ubuntu13 are the Natty versions that have the bug.

Micah also tested Maverick in a VM and was unable to observe the issue there. That's 2.6.35-28.50 and 2.12.1-0ubuntu10.2. That gives an idea of when the regression may have been introduced.

Created attachment 522617
test case demonstrating the issue

There appears to be a strange bug in glibc that causes deadlocks when calling fork() from threads. We had a testcase in GLib failing from time to time because of this.

I've attached a minimal testcase that uses only pure pthreads + libc. Compile it with -pthread and run it. It should fill your screen with dots for a while, then hang when it hits the bug (which happens randomly anywhere between 1 dot and hundreds). I've already received independent verification that this testcase hangs on several people's computers.

I believe this to be an upstream issue since this bug is visible on Ubuntu as well, but the glibc website says I should file bugs against distributions first. I also believe the issue to be a regression since older Fedora and RHEL releases are unaffected. The problem appears to affect both 32-bit and 64-bit systems.
Description of problem:

Some notes:

 - compiling the testcase with -static has the side-effect of causing the
   bug to go away

 - compiling the testcase with -DFORK_DIRECTLY also appears to solve the
   problem

 - replacing the execv() with a direct exit(0) doesn't solve the problem
   but causes the frequency to change

The fact that both static linking and making the fork() syscall directly cause the problem to disappear leads me to believe that this is a libc bug rather than a kernel bug (which is the only other possibility). I'm not 100% sure of that, though, since libc actually uses the clone() syscall to implement fork(), so there could be a difference inside the kernel because of that.

glibc-2.14.90-9 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/glibc-2.14.90-9

Thanks for the awesome turnaround. I installed the update from testing on my F16 system and it appears to fix the problem.

desrt (desrt) wrote :

I got bored of waiting and filed the bug against Fedora instead. They tracked it down and released an updated package. You might want to take a look over there: https://bugzilla.redhat.com/show_bug.cgi?id=737387

Hm, it would be nice to know what was changed in the package to fix it.

Matthias Klose (doko) wrote :

http://pkgs.fedoraproject.org/gitweb/?p=glibc.git doesn't seem to be up to date.
The package upload (?) can be found here: https://admin.fedoraproject.org/updates/glibc-2.14.90-9

Package glibc-2.14.90-9:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.14.90-9'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/glibc-2.14.90-9
then log in and leave karma (feedback).

decoder (decoder-ubuntu) wrote :

The fix is located here: http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=8bd683657e8ab1e6e0e787d6c00e763d8393f5e5

Please fix this, I hit this deadlock a few times per day from OpenJDK and have to kill the locked processes by hand :/

Colin Watson (cjwatson) on 2011-09-26
Changed in eglibc (Ubuntu Oneiric):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Foundations Team (canonical-foundations)
milestone: none → ubuntu-11.10
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eglibc - 2.13-20ubuntu3

---------------
eglibc (2.13-20ubuntu3) oneiric; urgency=low

  * Fix pthread/fork race/deadlock. LP: #838975.
    - Avoid race between {,__de}allocate_stack and __reclaim_stacks during fork.

  * Merge from Debian:

  [ Aurelien Jarno ]
  * Add debian/patches/cvs-dl_close-scope-handling.diff from upstream to
    fix issues with dl_close() when resolving locally-defined symbols.
    Closes: #625250.
  * patches/i386/local-cpuid-level2.diff: fix a typo. Closes: #609389.
 -- Matthias Klose <email address hidden> Mon, 26 Sep 2011 13:50:14 +0200

Changed in eglibc (Ubuntu Oneiric):
status: Triaged → Fix Released
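
The changelog entry above describes a race between updates to a list of thread stacks and the cleanup that runs in the child after fork(). The general hazard, a fork() that snapshots a shared list while another thread is halfway through modifying it, can be illustrated in user code. The sketch below is not the glibc patch and makes no claim about how the fix works internally; it shows the standard pthread_atfork() pattern of holding a lock across fork() so the child inherits a consistent list. All names (struct node, list_add, list_del, churn, run_atfork_demo) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A small circular doubly linked list, standing in for any shared list. */
struct node { struct node *prev, *next; };

static struct node head = { &head, &head };
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t handlers_once = PTHREAD_ONCE_INIT;
static atomic_int stop;

static void list_add(struct node *n)
{
    pthread_mutex_lock(&list_lock);
    n->next = head.next;
    n->prev = &head;
    head.next->prev = n;
    head.next = n;
    pthread_mutex_unlock(&list_lock);
}

static void list_del(struct node *n)
{
    pthread_mutex_lock(&list_lock);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    pthread_mutex_unlock(&list_lock);
}

/* Returns 1 if every node's neighbours point back at it. */
static int list_consistent(void)
{
    for (struct node *n = head.next; ; n = n->next) {
        if (n->prev->next != n || n->next->prev != n)
            return 0;
        if (n == &head)
            return 1;
    }
}

/* fork() handlers: take the lock before forking, drop it on both sides. */
static void prepare(void) { pthread_mutex_lock(&list_lock); }
static void release(void) { pthread_mutex_unlock(&list_lock); }
static void install_handlers(void) { pthread_atfork(prepare, release, release); }

/* Background thread that keeps adding and removing one node. */
static void *churn(void *arg)
{
    struct node n;
    (void)arg;
    while (!atomic_load(&stop)) {
        list_add(&n);
        list_del(&n);
    }
    return NULL;
}

/* Forks `forks` times while the list is being modified; each child checks
   the list it inherited. Returns the number of children that saw an
   inconsistent list or failed, or -1 if the thread could not be started. */
int run_atfork_demo(int forks)
{
    pthread_t t;
    int bad = 0;

    pthread_once(&handlers_once, install_handlers);
    atomic_store(&stop, 0);
    if (pthread_create(&t, NULL, churn, NULL) != 0)
        return -1;
    for (int i = 0; i < forks; i++) {
        int status;
        pid_t pid = fork();

        if (pid == 0)
            _exit(list_consistent() ? 0 : 1);
        if (pid < 0 || waitpid(pid, &status, 0) != pid ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            bad++;
    }
    atomic_store(&stop, 1);
    pthread_join(t, NULL);
    return bad;
}
```

With the handlers installed, the fork can only happen while no list update is in progress, so run_atfork_demo(50) is expected to return 0. A library such as libc cannot always rely on this pattern for its own internal state, which is why a fix inside the library was needed here.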

Package glibc-2.14.90-10:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.14.90-10'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/glibc-2.14.90-10
then log in and leave karma (feedback).

decoder (decoder-ubuntu) wrote :

Could someone please provide replacement packages for natty as well? This is a serious issue for some tasks.

glibc-2.14.90-10 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.

Changed in eglibc (Ubuntu Natty):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Foundations Team (canonical-foundations)
milestone: none → natty-updates
dino99 (9d9) wrote :
Changed in eglibc (Ubuntu Natty):
status: Triaged → Invalid
Changed in eglibc (Ubuntu):
assignee: Canonical Foundations Team (canonical-foundations) → nobody
Changed in eglibc (Ubuntu Natty):
assignee: Canonical Foundations Team (canonical-foundations) → nobody
Changed in eglibc (Ubuntu Oneiric):
assignee: Canonical Foundations Team (canonical-foundations) → nobody
Changed in glibc (Fedora):
importance: Unknown → Undecided
status: Unknown → Fix Released