pthread_mutex_lock robust hangs

Bug #1706780 reported by Jeff Barber
This bug affects 1 person
Affects: glibc (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I'm using an interprocess (process-shared, robust) pthread_mutex located in shared memory to synchronize access to a data structure. On several occasions it has hung after the process whose thread held the lock crashed. About 99.9% of the time I do not see the problem: if one of my processes goes down, the other receives EOWNERDEAD from pthread_mutex_lock as expected and calls pthread_mutex_consistent to recover the lock. Once in a while, though, when a process crashes, the pthread_mutex_lock call in the other process simply never returns.
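For readers unfamiliar with the pattern, here is a minimal sketch of the usage described above. It is not the reporter's code: the helper names (make_robust_shared_mutex, lock_with_recovery, demo_owner_death) are hypothetical, and it assumes Linux with anonymous shared memory. It only shows the expected, uncontended behaviour where the surviving process gets EOWNERDEAD.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: place a mutex in anonymous shared memory and
 * initialise it as process-shared and robust. Returns NULL on failure. */
static pthread_mutex_t *make_robust_shared_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(m, sizeof *m);
        return NULL;
    }
    int rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0) {
        munmap(m, sizeof *m);
        return NULL;
    }
    return m;
}

/* Hypothetical helper: lock the mutex and mark it consistent if the
 * previous owner died. Returns 0 for a normal lock, EOWNERDEAD if the
 * lock was recovered, or another error number on failure. */
static int lock_with_recovery(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        /* The protected data may be half-updated; repair it here. */
        int crc = pthread_mutex_consistent(m);
        if (crc != 0)
            return crc;
    }
    return rc;
}

/* Hypothetical demo: a child takes the lock and exits without unlocking;
 * the parent then locks and should observe EOWNERDEAD. Returns the
 * parent's lock result, or -1 if setup failed. */
static int demo_owner_death(void)
{
    pthread_mutex_t *m = make_robust_shared_mutex();
    if (m == NULL)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        munmap(m, sizeof *m);
        return -1;
    }
    if (pid == 0) {
        pthread_mutex_lock(m);
        _exit(0);               /* die while holding the lock */
    }
    waitpid(pid, NULL, 0);
    int rc = lock_with_recovery(m);
    if (rc == 0 || rc == EOWNERDEAD)
        pthread_mutex_unlock(m);
    pthread_mutex_destroy(m);
    munmap(m, sizeof *m);
    return rc;
}
```

Calling demo_owner_death() from a main function built with gcc -pthread should return EOWNERDEAD in the normal case. The hang reported here concerns the contended case, which this sketch does not exercise.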

After much experimentation, I've managed to create a test case that reproduces the problem more than 90% of the time. Unfortunately, running it under strace appears to change its behavior, so I cannot tell exactly what is going wrong at the syscall level (and I am not sure I could decode that anyway).

My best guess about the conditions necessary is:
  Process 1, thread 1 acquires the lock
  Process 1, thread 2 attempts to acquire the lock (hence waiting in __lll_robust_lock_wait)
  Process 2 attempts to acquire the lock
  Process 1 crashes.
  Process 2 is left waiting in __lll_robust_lock_wait forever

I believe the sequence of locking threads is important to reproducing it.
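The guessed sequence above can be sketched roughly as follows. This is not the attached tr.c; the names (shared_mutex, waiter_thread, child_body, run_scenario) are hypothetical, the sleeps are a crude way to order the steps, and _exit stands in for a crash. It uses pthread_mutex_timedlock so that a stuck lock shows up as ETIMEDOUT instead of a hang. There is no guarantee that it triggers the problem on any given system.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t *shared_mutex;   /* hypothetical name */

/* Process 1, thread 2: blocks on the lock held by thread 1. */
static void *waiter_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(shared_mutex);
    return NULL;
}

/* Process 1: thread 1 takes the lock, thread 2 waits on it, then the
 * whole process exits without unlocking. */
static void child_body(void)
{
    pthread_t t;
    pthread_mutex_lock(shared_mutex);
    pthread_create(&t, NULL, waiter_thread, NULL);
    sleep(2);       /* let thread 2 and process 2 block on the lock */
    _exit(1);       /* stand-in for a crash */
}

/* Process 2 (the caller): waits for the lock with a timeout.
 * Returns the pthread_mutex_timedlock result, or -1 if setup failed. */
static int run_scenario(unsigned timeout_sec)
{
    pthread_mutexattr_t attr;
    struct timespec deadline;

    shared_mutex = mmap(NULL, sizeof *shared_mutex, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_mutex == MAP_FAILED)
        return -1;
    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    int rc = pthread_mutex_init(shared_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        child_body();           /* never returns */

    sleep(1);                   /* let process 1 take the lock first */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;
    rc = pthread_mutex_timedlock(shared_mutex, &deadline);
    if (rc == EOWNERDEAD)
        pthread_mutex_consistent(shared_mutex);
    if (rc == 0 || rc == EOWNERDEAD)
        pthread_mutex_unlock(shared_mutex);
    waitpid(pid, NULL, 0);
    return rc;
}
```

A result of EOWNERDEAD from run_scenario means the owner's death was reported as expected; ETIMEDOUT means the lock was not handed over within the timeout, which is how the hang described here would appear in this sketch.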

Once in this state, any other caller attempting to lock the mutex also hangs. The mutex data structure (__owner) still shows process 1, thread 1 as the owning thread.

I don't have the glibc or futex background to go further with debugging.

$ lsb_release -rd
Description: Ubuntu 16.04.2 LTS
Release: 16.04

$ apt-cache policy libc6
libc6:
  Installed: 2.23-0ubuntu9
  Candidate: 2.23-0ubuntu9
  Version table:
 *** 2.23-0ubuntu9 500
        500 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages
        100 /var/lib/dpkg/status
     2.23-0ubuntu3 500
        500 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

$ uname -a
Linux tirion 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Jeff Barber (jsbarber60) wrote :

Attaching test program to demonstrate the problem. Compile with

gcc -pthread -g -o tr tr.c

Run with ./tr. The program sets an alarm for 5 seconds. If the problem occurs, the alarm will expire:

$ ./tr
parent 75964 sleeping
c0: cthread 75965 locking
c0: cthread got lock
c1: cthread 75966 locking
parent locking
c0: cthread exiting
Alarm clock

Otherwise the last line will be:
parent unlocking

My system is a VM running on Mac OS X, which could certainly affect timing.

Austin Hendrix (namniart) wrote :

I'm using robust mutexes in a similar way, and I've found that if I set the PTHREAD_PRIO_INHERIT protocol attribute on my mutexes, I can no longer reproduce this bug.

This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1401665
