[eglibc] process shared mutex's fail on armel v7 (thumb)

Bug #604753 reported by David Sugar on 2010-07-12
This bug affects 2 people
Affects                       | Status       | Importance | Assigned to | Milestone
Linaro Toolchain Miscellanies | Fix Released | High       | Unassigned  |
apr (Ubuntu)                  | Triaged      | Wishlist   | Unassigned  |
eglibc (Ubuntu)               | Fix Released | Undecided  | Unassigned  |

Bug Description

As per bug #599874, "process shared" mutexes (that is, mutexes placed in shared memory with the process-shared attribute set) deadlock and fail. This was found in apr through the package's self-test, and we created a patch to disable their use on ARMv7/Thumb. However, the real solution most likely lies in glibc's pthread implementation, perhaps in its use of atomics for locking in process-shared memory.


Loïc Minier (lool) on 2010-07-12
tags: added: armel armv7 thumb
tags: added: toolchain
Matthias Klose (doko) wrote :

For the apr report, see bug #599874.

Changed in linaro-toolchain-wg:
assignee: nobody → Linaro Toolchain Developers (linaro-toolchain-dev)
Matthias Klose (doko) on 2010-07-21
affects: glibc (Ubuntu) → eglibc (Ubuntu)

It's critical that we investigate this and understand it.

Changed in linaro-toolchain-wg:
assignee: Linaro Toolchain Developers (linaro-toolchain-dev) → nobody
Matthias Klose (doko) wrote :

The code should be fixed to use the atomic primitives provided by GCC, and GCC should make sure not to use the expensive routines when compiling in v7 mode.
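A minimal sketch of what relying on GCC's atomic primitives looks like: a spinlock whose acquire step is a single __sync_val_compare_and_swap (the builtin that the later analysis in this report traces eglibc down to) and whose release is __sync_lock_release. The function names are hypothetical, and this is a busy-waiting illustration only; a real pthread mutex sleeps on a futex instead of spinning.

```c
#include <assert.h>

/* Lock word: 0 = unlocked, 1 = locked. */

/* One acquire attempt. __sync_val_compare_and_swap(ptr, old, new)
 * writes `new` only if *ptr == old and returns the value *ptr held
 * before the call, so a return of 0 means this caller took the lock.
 * The builtin acts as a full memory barrier. */
static int cas_trylock(volatile int *lock)
{
    return __sync_val_compare_and_swap(lock, 0, 1) == 0;
}

/* Busy-wait until the lock is acquired (no futex sleep, no backoff). */
static void cas_lock(volatile int *lock)
{
    while (!cas_trylock(lock))
        ;
}

/* Release: __sync_lock_release stores 0 with release-barrier semantics. */
static void cas_unlock(volatile int *lock)
{
    assert(*lock == 1);   /* caller must hold the lock */
    __sync_lock_release(lock);
}
```

The point of the sketch is that the atomicity comes entirely from the compiler builtin, so the compiler, not hand-written swp sequences, chooses the instruction sequence for the target.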

Loïc Minier (lool) on 2010-08-17
affects: linaro-toolchain-wg → linaro-toolchain-misc
Changed in linaro-toolchain-misc:
importance: Undecided → High
tags: added: eglibc
summary: - process shared mutex's fail on armel v7 (thumb)
+ [eglibc] process shared mutex's fail on armel v7 (thumb)

Note that there are upstream patches that improve the GCC __sync_*
primitives. If APR uses those, either directly or via glibc, then this
might be related.

Michael Hope (michaelh1) wrote :

Confirmed on maverick with gcc-4.4 4.4.4-9ubuntu2, eglibc 2.12.1-0ubuntu1, and apr 1.4.2-3ubuntu1

Michael Hope (michaelh1) wrote :

The test halts while running the pthread based APR_LOCK_PROC_PTHREAD variant. All other variants pass.

The test spawns a number of processes, each of which randomly locks a process mutex, writes to a shared variable, waits, and unlocks. When using APR_LOCK_PROC_PTHREAD, what actually happens is that the first process locks and unlocks while all the others are stalled.
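The failure mode can be exercised with a small standalone program. This is a hypothetical sketch, not APR's testprocmutex: it places a PTHREAD_PROCESS_SHARED mutex and a counter in anonymous shared memory, forks several workers that each perform a fixed number (rather than a random number) of locked read/sleep/write cycles, and returns the final count. With a working process-shared mutex the result is nprocs * iters; the behavior described above would instead show up as a hang or a short count. On older glibc, build with -pthread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* State shared by all worker processes (hypothetical layout). */
struct shared_state {
    pthread_mutex_t lock;   /* process-shared mutex */
    long counter;           /* protected by lock */
};

/* Each worker does `iters` locked read / sleep / write cycles.
 * Without mutual exclusion, overlapping cycles would lose updates. */
static void worker(struct shared_state *s, int iters)
{
    for (int i = 0; i < iters; i++) {
        pthread_mutex_lock(&s->lock);
        long v = s->counter;
        usleep(100);              /* wait while holding the lock */
        s->counter = v + 1;
        pthread_mutex_unlock(&s->lock);
    }
}

/* Returns the final counter (nprocs * iters if the mutex works),
 * or -1 on setup or fork failure. */
static long run_procmutex_test(int nprocs, int iters)
{
    struct shared_state *s;
    pthread_mutexattr_t attr;
    int spawned = 0, rc;
    long result;

    assert(nprocs > 0 && iters > 0);
    s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return -1;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    rc = pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0) {
        munmap(s, sizeof *s);
        return -1;
    }
    s->counter = 0;

    for (int p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                /* stop spawning; result will be -1 */
        if (pid == 0) {           /* child: run the worker and exit */
            worker(s, iters);
            _exit(0);
        }
        spawned++;
    }
    for (int p = 0; p < spawned; p++)
        wait(NULL);               /* reap every worker we started */

    result = (spawned == nprocs) ? s->counter : -1;
    pthread_mutex_destroy(&s->lock);
    munmap(s, sizeof *s);
    return result;
}
```

For example, run_procmutex_test(4, 25) should return 100 on a system where process-shared mutexes work.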

LP: #491342 is probably related. According to that report, pthread spinlocks are broken in eglibc on ARM because they still use the old swp/swpb instructions.

The next step would be to replace these with the __sync_* builtins, as done in sysdeps/ia64/bits/atomic.h and similar headers.
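Roughly what such a mapping amounts to, as a sketch under assumptions: the macro names below are hypothetical stand-ins modeled on glibc's internal compare-and-exchange family, not copied from sysdeps/ia64/bits/atomic.h. The main trap is argument order: glibc's internal macros take (mem, newval, oldval), whereas GCC's __sync builtins take (ptr, oldval, newval), and glibc's boolean variant returns zero on success.

```c
#include <assert.h>

/* The "_32" macros below assume a 32-bit int, as on ARM and ia64. */
static_assert(sizeof(int) == 4, "32-bit atomics expect a 4-byte int");

/* Hypothetical names. glibc-style order: (mem, newval, oldval).
 * GCC __sync order:                      (ptr, oldval, newval). */

/* CAS returning the previous value of *mem. */
#define arch_cas_val_32_acq(mem, newval, oldval) \
    __sync_val_compare_and_swap((mem), (oldval), (newval))

/* CAS returning zero on success, nonzero if no exchange happened
 * (the inverse of __sync_bool_compare_and_swap's result). */
#define arch_cas_bool_32_acq(mem, newval, oldval) \
    (!__sync_bool_compare_and_swap((mem), (oldval), (newval)))

/* Atomic exchange returning the old value. GCC documents
 * __sync_lock_test_and_set as an acquire barrier only. */
#define arch_exchange_acq(mem, value) \
    __sync_lock_test_and_set((mem), (value))
```

Swapping the two middle arguments silently turns a correct CAS into one that almost never succeeds, which is why the wrappers reorder them in one place instead of at every call site.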

Michael Hope (michaelh1) on 2010-09-01
Changed in linaro-toolchain-misc:
status: New → Confirmed
Loïc Minier (lool) wrote :

I'm opening an apr task to remember reverting the changes from bug #599874 once this is fixed in eglibc.

Clint Byrum (clint-fewbar) wrote :

Given that the APR task is just for our own development usage, I'm setting the status to Confirmed, and the Importance to Wishlist. When eglibc is fixed, the changes should be reverted, but at the very least, this bug can change to status Triaged at that point.

Changed in apr (Ubuntu):
status: New → Confirmed
importance: Undecided → Wishlist
Steve Langasek (vorlon) on 2011-02-15
tags: added: arm-porting-queue
Ken Werner (kwerner) wrote :

I had a quick look into the apr sources. Their locks/unix/proc_mutex.c can map the proc_mutex functionality to various methods:
 1) POSIX sem implementation (see mutex_posixsem_methods)
 2) SysV sem implementation (see mutex_sysv_methods)
 3) pthread implementation (see mutex_proc_pthread_methods)
 4) fcntl implementation (see mutex_fcntl_methods)
 5) flock implementation (see mutex_flock_methods)

I think only the pthread implementation (3) is of interest here. The creation of the pthread_mutex happens at:
  locks/unix/proc_mutex.c:proc_mutex_proc_pthread_create
that calls pthread_mutex_init with the following attributes:
  PTHREAD_PROCESS_SHARED, PTHREAD_MUTEX_ROBUST_NP, PTHREAD_PRIO_INHERIT
The APR pthread mutex functionality relies on pthread_mutex_lock and pthread_mutex_trylock. The eglibc implementation for such a mutex calls atomic_compare_and_exchange_val_acq (see also lll_lock/lll_trylock in ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.h), which goes down to __arch_compare_and_exchange_val_32_acq. With (e)glibc >= 2.12.1-0ubuntu11 and GCC >= 4.5, this expands to __sync_val_compare_and_swap (ports/sysdeps/unix/sysv/linux/arm/nptl/bits/atomic.h), which should be safe even on an SMP system.
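A sketch of the creation step described above, assuming the POSIX-2008 spellings (pthread_mutexattr_setrobust with PTHREAD_MUTEX_ROBUST) stand in for the PTHREAD_MUTEX_ROBUST_NP variant APR uses. The helper names are hypothetical and this is not APR's proc_mutex.c. The demo exercises the robust attribute: a child locks the mutex and exits without unlocking, and the parent's pthread_mutex_lock then returns EOWNERDEAD, which pthread_mutex_consistent clears. On a kernel without priority-inheritance futex support, pthread_mutex_init is expected to fail and the helper returns NULL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a mutex in anonymous shared memory with the three attributes
 * listed above: process-shared, robust, priority inheritance.
 * Returns NULL on failure. */
static pthread_mutex_t *create_shared_robust_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *m;
    int ok;

    m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(m, sizeof *m);
        return NULL;
    }
    ok = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) == 0
      && pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) == 0
      && pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT) == 0
      && pthread_mutex_init(m, &attr) == 0;
    pthread_mutexattr_destroy(&attr);
    if (!ok) {
        munmap(m, sizeof *m);
        return NULL;
    }
    return m;
}

/* Lock, recovering if the previous owner died while holding the mutex. */
static int lock_robust(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD)
        rc = pthread_mutex_consistent(m);  /* 0 on success; we hold it */
    return rc;
}

/* A child takes the mutex and exits without unlocking; the parent must
 * still be able to acquire it. Returns lock_robust()'s result in the
 * parent (0 on success), or -1 if fork/wait fails. */
static int demo_owner_death(pthread_mutex_t *m)
{
    pid_t pid;
    int rc;

    assert(m != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_mutex_lock(m);
        _exit(0);                 /* die while holding the lock */
    }
    if (waitpid(pid, NULL, 0) != pid)
        return -1;
    rc = lock_robust(m);
    if (rc == 0)
        pthread_mutex_unlock(m);
    return rc;
}
```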

Some infos regarding the atomic memory operations can be found at:
https://wiki.linaro.org/WorkingGroups/ToolChain/AtomicMemoryOperations#implementation%20details

The next step would be to check whether the testcase fails with a sufficiently recent GCC and eglibc (Ubuntu Natty, for example).

Ken Werner (kwerner) wrote :

I ran the testprocmutex test on a PandaBoard using Ubuntu Natty Alpha 2, and it passed:
$ ./testall testprocmutex
testprocmutex : SUCCESS
All tests passed.

$ apt-cache show libc6 gcc|grep Version
Version: 4:4.5.1-1ubuntu3
Version: 2.12.1-0ubuntu16

Ken Werner (kwerner) wrote :

I ran the apr testsuite (svn revision 1071306) several times and noticed that the testlock:test_timeoutcond test sometimes fails because the timer returned too late. Not sure if this is related.

Ken Werner (kwerner) wrote :

I think this issue has been fixed by bug #643171. I no longer see this testcase failing with a recent (e)glibc and GCC. Could one of the original reporters check whether it works for them? Please make sure that you use at least glibc 2.12.1-0ubuntu11 and GCC 4.5.

Jani Monoses (jani) wrote :

apr without the workaround applied passes its tests on current Natty, so glibc is fixed.
We still need to revert the workaround in apr (it is now part of the Debian packaging as well).

Changed in eglibc (Ubuntu):
status: New → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package apr - 1.4.2-7ubuntu1

---------------
apr (1.4.2-7ubuntu1) natty; urgency=low

  * debian/rules: Reenable robust pthread mutexes on armel, now that eglibc
    process shared mutexes were fixed to use gcc atomic sync builtins.
    (LP: #604753)
 -- Jani Monoses <email address hidden> Fri, 18 Mar 2011 18:37:44 +0200

Changed in apr (Ubuntu):
status: Confirmed → Fix Released
Jani Monoses (jani) on 2011-03-18
Changed in linaro-toolchain-misc:
status: Confirmed → Fix Released
Loïc Minier (lool) wrote :

Reverted:

apr (1.4.2-7ubuntu2) natty; urgency=low

  * Revert previous change. I forgot the build servers have an older
    kernel on which the testsuite fails to pass. Reopens LP: #604753

 -- Jani Monoses <email address hidden> Mon, 21 Mar 2011 10:14:43 +0200

Need to reapply once the buildds' kernels are upgraded.

Changed in apr (Ubuntu):
status: Fix Released → Triaged
Jani Monoses (jani) wrote :

Do the builders have newer kernels now? If so, the testsuite in apr may now pass and the workaround could be dropped.
