pthread_mutex_lock robust hangs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
glibc (Ubuntu) | New | Undecided | Unassigned |
Bug Description
I'm using an interprocess (process-shared, robust) pthread mutex located in shared memory to synchronize access to a data structure. On several occasions it has hung after the process whose thread held the lock crashed. 99.9% of the time the problem does not occur: if one of my processes goes down, the other receives EOWNERDEAD from the pthread_mutex_lock call as expected, and uses pthread_
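For readers unfamiliar with robust mutexes, the normal (non-hanging) path described above can be sketched as follows. This is a minimal illustration, not the attached test program: the helper names (`shared_robust_mutex_create`, `lock_with_recovery`, `demo_owner_death`) are hypothetical, anonymous shared memory stands in for whatever shared mapping the real application uses, and `pthread_mutex_consistent` is assumed to be the recovery call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map one process-shared, robust mutex in
 * anonymous shared memory. Returns NULL on failure. */
static pthread_mutex_t *shared_robust_mutex_create(void)
{
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc == 0) {
        rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        if (rc == 0)
            rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        if (rc == 0)
            rc = pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    if (rc != 0) {
        munmap(m, sizeof *m);
        return NULL;
    }
    return m;
}

/* Lock the mutex. Returns 0 on a normal acquisition, or EOWNERDEAD if the
 * previous owner died and the mutex was marked consistent again. The lock
 * is held in both cases. Any other value is an error. */
static int lock_with_recovery(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        /* Application-specific repair of the protected data would go here. */
        int crc = pthread_mutex_consistent(m);
        if (crc != 0)
            return crc;
    }
    return rc;
}

/* A child takes the lock and exits without unlocking; the parent then
 * locks the mutex and reports what pthread_mutex_lock returned. */
static int demo_owner_death(void)
{
    pthread_mutex_t *m = shared_robust_mutex_create();
    if (m == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_mutex_lock(m);
        _exit(0);                 /* dies while still holding the lock */
    }
    waitpid(pid, NULL, 0);

    int rc = lock_with_recovery(m);
    if (rc == 0 || rc == EOWNERDEAD)
        pthread_mutex_unlock(m);
    pthread_mutex_destroy(m);
    munmap(m, sizeof *m);
    return rc;
}
```

Calling `demo_owner_death()` from `main` on a system where recovery works should return `EOWNERDEAD`. Only one thread per process touches the mutex here, so this sketch is not expected to trigger the hang.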
After much experimentation, I've managed to create a test case that reproduces the problem more than 90% of the time. Unfortunately, running it under strace apparently changes something about it, so I cannot tell exactly what is going wrong at the syscall level (not sure I would be able to decode that anyway).
My best guess about the conditions necessary is:
Process 1, thread 1 acquires the lock
Process 1, thread 2 attempts to acquire the lock (hence waiting in __lll_robust_
Process 2 attempts to acquire the lock
Process 1 crashes.
Process 2 is left waiting in __lll_robust_
I believe the sequence of locking threads is important to reproducing it.
Once in this state, any other caller attempting to lock the mutex also hangs. The mutex data structure (__owner) still shows process 1, thread 1 as the owning thread.
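The suspected sequence can be sketched roughly as below. This is a hypothetical reconstruction, not the attached tr.c: the function names, the sleep intervals and the use of SIGKILL to simulate the crash are all assumptions. To keep the sketch from hanging, process 2 uses `pthread_mutex_timedlock` with a 3 second deadline instead of `pthread_mutex_lock`, which may itself change the behavior. On timeout it prints the glibc-internal `__data.__owner` field, which is not a portable interface.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t *shm_mutex;   /* lives in anonymous shared memory */

/* Process 1, thread 1: take the lock and hold it until the process dies. */
static void *hold_forever(void *unused)
{
    (void)unused;
    pthread_mutex_lock(shm_mutex);
    pause();
    return NULL;
}

/* Process 1, thread 2: block on the lock held by thread 1. */
static void *wait_for_lock(void *unused)
{
    (void)unused;
    pthread_mutex_lock(shm_mutex);
    return NULL;
}

/* Map and initialize a process-shared, robust mutex. Returns 0 on success. */
static int shm_mutex_init(void)
{
    shm_mutex = mmap(NULL, sizeof *shm_mutex, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shm_mutex == MAP_FAILED)
        return -1;
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(shm_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Returns what the timed lock in process 2 reported: EOWNERDEAD if the
 * dead owner was detected, ETIMEDOUT if process 2 was left waiting. */
static int run_sequence(void)
{
    if (shm_mutex_init() != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                          /* process 1 */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, hold_forever, NULL);
        usleep(100 * 1000);                  /* let thread 1 get the lock */
        pthread_create(&t2, NULL, wait_for_lock, NULL);
        usleep(300 * 1000);                  /* let process 2 start waiting */
        raise(SIGKILL);                      /* simulated crash */
        _exit(1);                            /* not reached */
    }

    usleep(200 * 1000);                      /* process 2: lock after thread 2 */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 3;
    int rc = pthread_mutex_timedlock(shm_mutex, &deadline);

    if (rc == ETIMEDOUT) {
#ifdef __GLIBC__
        /* glibc-internal field, read for diagnosis only. */
        fprintf(stderr, "timed out, __owner = %d\n",
                shm_mutex->__data.__owner);
#endif
    } else if (rc == EOWNERDEAD) {
        pthread_mutex_consistent(shm_mutex);
        pthread_mutex_unlock(shm_mutex);
    } else if (rc == 0) {
        pthread_mutex_unlock(shm_mutex);
    }
    waitpid(pid, NULL, 0);
    return rc;
}
```

A return value of `ETIMEDOUT` from `run_sequence()` would correspond to the stuck state described above. Because the ordering is enforced only by sleeps, the sketch is timing-dependent and may not reproduce the problem on a given machine.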
I don't have the glibc or futex background to go further with debugging.
$ lsb_release -rd
Description: Ubuntu 16.04.2 LTS
Release: 16.04
$ apt-cache policy libc6
libc6:
Installed: 2.23-0ubuntu9
Candidate: 2.23-0ubuntu9
Version table:
*** 2.23-0ubuntu9 500
500 http://
500 http://
100 /var/lib/
2.23-0ubuntu3 500
500 http://
$ uname -a
Linux tirion 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Attaching test program to demonstrate the problem. Compile with
gcc -pthread -g -o tr tr.c
Run with ./tr. The program sets an alarm for 5 seconds. If the problem occurs, the alarm will expire:
$ ./tr
parent 75964 sleeping
c0: cthread 75965 locking
c0: cthread got lock
c1: cthread 75966 locking
parent locking
c0: cthread exiting
Alarm clock
Otherwise the last line will be:
parent unlocking
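The "Alarm clock" line is what the shell typically prints when a process is terminated by an unhandled SIGALRM. The watchdog idea can be sketched as follows. This is a variation for illustration, not the attached program: it runs the work in a forked child so that the caller can inspect the exit status, and the names `run_with_watchdog`, `hangs` and `returns_quickly` are hypothetical.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run fn() in a child process guarded by alarm(seconds).
 * Returns 1 if the child was killed by SIGALRM (it hung),
 * 0 if it ended any other way, -1 on error. */
static int run_with_watchdog(void (*fn)(void), unsigned seconds)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGALRM, SIG_DFL);   /* default action: terminate the process */
        alarm(seconds);
        fn();
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGALRM;
}

/* Stand-ins for a test body that deadlocks and one that completes. */
static void hangs(void)
{
    for (;;)
        pause();
}

static void returns_quickly(void)
{
}
```

With these stand-ins, `run_with_watchdog(hangs, 1)` returns 1 after about one second and `run_with_watchdog(returns_quickly, 1)` returns 0.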
My system is a VM running on Mac OS X, which could certainly affect timing.