deadlock in pthread_cond_signal under high contention
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
glibc (Ubuntu) | New | Undecided | Unassigned | | |
Bug Description
Hello!
I'm working on a large C++-based cross-platform project. I noticed that on arm64-based systems some of my processes sporadically become paralyzed by a deadlock that hits all the threads posting to the single boost::
There you may find the test source code, a detailed description, the deadlock call stacks for all threads, and a compact view of them as a call graph.
In short, the test has threads of two types:
(1) producers - Np threads calling pthread_cond_signal after unlocking a mutex at a rate of Rp calls per second;
(2) consumers - Nc threads calling pthread_cond_wait at a rate of Rc calls per second.
Np, Rp and Rc can be specified with command-line parameters; Nc is equal to the number of CPU cores of the system running the test. Once started on an arm64-based multi-core device, the test eventually gets all its threads blocked if Np, Rp and Rc are high enough to keep contention high around pthread_cond_signal calls.
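For readers without access to the linked test, the shape described above can be sketched roughly as follows. This is a minimal illustration, not the original test: every name in it (struct shared, run_contention_test, per_producer) is hypothetical, it runs flat out instead of throttling to Rp and Rc calls per second, and it is bounded so that it terminates on a correctly working system. It assumes np and nc are both at least 1.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* All names below are hypothetical; this is not the reporter's test code. */
struct shared {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    long pending;       /* items posted but not yet consumed */
    long consumed;      /* total items taken by consumers */
    int done;           /* set once every producer has finished */
    long per_producer;  /* items each producer posts */
};

static void *producer(void *arg)
{
    struct shared *s = arg;
    for (long i = 0; i < s->per_producer; i++) {
        pthread_mutex_lock(&s->mu);
        s->pending++;
        pthread_mutex_unlock(&s->mu);
        /* Signal after unlocking: the pattern the report describes. */
        pthread_cond_signal(&s->cv);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    struct shared *s = arg;
    pthread_mutex_lock(&s->mu);
    for (;;) {
        while (s->pending == 0 && !s->done)
            pthread_cond_wait(&s->cv, &s->mu);
        if (s->pending == 0)
            break;  /* done flag is set and nothing is left to take */
        s->pending--;
        s->consumed++;
    }
    pthread_mutex_unlock(&s->mu);
    return NULL;
}

/* Runs np producers against nc consumers; returns the number of items consumed. */
long run_contention_test(int np, int nc, long per_producer)
{
    struct shared s;
    int rc;

    s.pending = 0;
    s.consumed = 0;
    s.done = 0;
    s.per_producer = per_producer;
    rc = pthread_mutex_init(&s.mu, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&s.cv, NULL);
    assert(rc == 0);

    /* threads[0..np-1] are producers, threads[np..np+nc-1] are consumers. */
    pthread_t *threads = malloc(sizeof *threads * (size_t)(np + nc));
    assert(threads != NULL);

    for (int i = 0; i < nc; i++) {
        rc = pthread_create(&threads[np + i], NULL, consumer, &s);
        assert(rc == 0);
    }
    for (int i = 0; i < np; i++) {
        rc = pthread_create(&threads[i], NULL, producer, &s);
        assert(rc == 0);
    }
    for (int i = 0; i < np; i++)
        pthread_join(threads[i], NULL);

    /* Tell the consumers that no more items will arrive. */
    pthread_mutex_lock(&s.mu);
    s.done = 1;
    pthread_mutex_unlock(&s.mu);
    pthread_cond_broadcast(&s.cv);

    for (int i = 0; i < nc; i++)
        pthread_join(threads[np + i], NULL);

    long total = s.consumed;
    (void)rc;
    pthread_cond_destroy(&s.cv);
    pthread_mutex_destroy(&s.mu);
    free(threads);
    return total;
}
```

Calling run_contention_test(np, nc, n) should return np * n when every thread makes progress; on an affected system the call would simply never return.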
The deadlock can be worked around by
* reducing the probability of concurrent pthread_cond_signal calls by tuning Np, Rp and Rc;
* moving the pthread_cond_signal call under the lock.
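The second workaround can be sketched as below. It is only an illustration under assumed names (struct queue_state, queue_state_init, post_signal_under_lock and take_one are hypothetical and do not come from the test or from my project); the single relevant point is that pthread_cond_signal is called while the mutex is still held.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical state guarded by one mutex/condvar pair. */
struct queue_state {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    long pending;  /* items posted but not yet taken */
};

void queue_state_init(struct queue_state *q)
{
    int rc = pthread_mutex_init(&q->mu, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&q->cv, NULL);
    assert(rc == 0);
    (void)rc;
    q->pending = 0;
}

/* Producer side: the signal is sent BEFORE the mutex is released. */
void post_signal_under_lock(struct queue_state *q)
{
    pthread_mutex_lock(&q->mu);
    q->pending++;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

/* Consumer side: waits for an item, takes it, returns how many remain. */
long take_one(struct queue_state *q)
{
    pthread_mutex_lock(&q->mu);
    while (q->pending == 0)
        pthread_cond_wait(&q->cv, &q->mu);
    q->pending--;
    long left = q->pending;
    pthread_mutex_unlock(&q->mu);
    return left;
}
```

With this ordering, signalling threads are serialized by the mutex, so concurrent pthread_cond_signal calls on the same condvar no longer overlap. That matches the observation that the problem does not reproduce in this configuration, but it does not by itself explain the root cause.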
Moreover, the deadlock can be broken by ptrace: attaching a debugger, generating a dump with Google Breakpad, etc. revives the process. Once I was able to wake the process from the deadlock with SIGSTOP/SIGCONT; however, the healing effect was very limited and the process returned to the deadlock state within a few seconds.
I would like to note that a problem with similar-looking symptoms was reported and fixed in the kernel several years ago (see https:/
However, I believe this time the problem is on the NPTL implementation side because:
* 100% of the observed deadlocks, both in our product and in the tests, appear to have the same structure: a single producer blocked in __condvar_
* mutex misbehavior was never observed either in the test or in my project;
* wakeups by ptrace/signal simply mean waiting on a futex got interrupted and on the next iteration (if any) at least one of these call paths made progress after observing changed global state, which can be a side effect of the race in the userland as well as in the kernel;
* if I put the pthread_cond_signal call under the lock, the mutex object becomes more contended than the condvar's internal data used by pthread_cond_signal, and yet I cannot reproduce the problem.
I looked at the nptl source code (https:/
1) All producers (signalling threads) except one are blocked in __condvar_
2) According to the comments generously spread through the code, that "lucky" signalling thread waits for some of the consumers (waiting threads) to leave the G1 group so that it can close the group and make the group switch in __condvar_
3) And all consumers (waiting threads) wait, of course, for the producers to send a signal; see __pthread_
4) And if you look at the code around __pthread_
This could explain how ptrace or a signal is able to break the deadlock.
--
I posted the bug report here because the glibc wiki strongly recommends starting with the distribution bug tracker. All arm64-based devices I tested were running Ubuntu 18.04.
ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: libc6 2.27-3ubuntu1
Uname: Linux 4.9.187-52 aarch64
ApportVersion: 2.20.9-0ubuntu7.15
Architecture: arm64
Date: Fri Jul 24 14:05:57 2020
Dependencies:
gcc-8-base 8.3.0-6ubuntu1~
libc6 2.27-3ubuntu1
libgcc1 1:8.3.0-
ProcEnviron:
TERM=rxvt-
PATH=(custom, no user)
XDG_RUNTIME_
LANG=C.UTF-8
SHELL=/bin/bash
SourcePackage: glibc
UpgradeStatus: No upgrade log present (probably fresh install)
Could you please test the packages from https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4121/+packages ?
This has a glibc upstream snapshot that includes the fix for https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1858203 .