race condition on rmm for module ldap (ldap cache) - part II

Bug #1767105 reported by Rafael David Tinoco on 2018-04-26
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Apache2 Web Server
Confirmed
Medium
apache2 (Ubuntu)
Medium
Unassigned

Bug Description

[Impact]

 * Apache users using ldap module might face this if using multiple threads and shared memory activated for apr memory allocator (default in Ubuntu).

[Test Case]

 * Configure apache to use ldap module, for authentication e.g., and wait for the race condition to happen.
 * Analysis made out of a dump from a production environment.
 * Bug has been reported multiple times upstream in the past 10 years.

[Regression Potential]

 * ldap module has broken locking mechanism when using apr mem mgmt.
 * ldap would continue to have broken locking mechanism.
 * race conditions could still exist.
 * could could brake ldap module.
 * patch is upstreamed in next version to be released.

[Other Info]

ORIGINAL CASE DESCRIPTION:

This is related to LP: #1752683. The locking locking mechanism for LDAP was fixed, since it is now obtained from server config merge like it was supposed to. Problem is that it has, likely, a race condition on its logic, causing ldap module to still fail in some conditions.

Problem summary:

apr_rmm_init acts as a relocatable memory management initialization

it is used in: mod_auth_digest and util_ldap_cache

From the dump was brought to my knowledge, in the following sequence:

- util_ldap_compare_node_copy()
- util_ald_strdup()
- apr_rmm_calloc()
- find_block_of_size()

Had a "cache->rmm_addr" with no lock at "find_block_of_size()"

cache->rmm_addr->lock { type = apr_anylock_none }

And an invalid "next" offset (out of rmm->base->firstfree).

This rmm_addr was initialized with NULL as a locking mechanism:

From apr-utils:

apr_rmm_init()

    if (!lock) { <-- 2nd argument to apr_rmm_init()
        nulllock.type = apr_anylock_none; <--- found in the dump
        nulllock.lock.pm = NULL;
        lock = &nulllock;
    }

From apache:

# mod_auth_digest

    sts = apr_rmm_init(&client_rmm,
                       NULL, /* no lock, we'll do the locking ourselves */
                       apr_shm_baseaddr_get(client_shm),
                       shmem_size, ctx);

# util_ldap_cache

        result = apr_rmm_init(&st->cache_rmm, NULL,
                              apr_shm_baseaddr_get(st->cache_shm), size,
                              st->pool);

It appears that the ldap module chose to use "rmm" for memory allocation, using
the shared memory approach, but without explicitly definiting a lock to it.
Without it, its up to the caller to guarantee that there are locks for rmm
synchronization (just like mod_auth_digest does, using global mutexes).

Because of that, there was a race condition in "find_block_of_size" and a call
touching "rmm->base->firstfree", possibly "move_block()", in a multi-threaded
apache environment, since there were no lock guarantees inside rmm logic (lock
was "apr_anylock_none" and the locking calls don't do anything).

In find_block_of_size:

    apr_rmm_off_t next = rmm->base->firstfree;

We have:

    rmm->base->firstfree
 Decimal:356400
 Hex:0x57030

But "next" turned into:

Name : next
 Decimal:8320808657351632189
 Hex:0x737973636970653d

Causing:

        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

        if (blk->size == size)

To segfault.

Upstream bugs:

https://bz.apache.org/bugzilla/show_bug.cgi?id=58483
https://bz.apache.org/bugzilla/show_bug.cgi?id=60296
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980#15

Upstream bugs fixed the "outer" lock, which is not obtained from the config merge function in all conditions. Since apr_rmm_init() is called with no lock, the caller should take care of the lockin. Unfortunately the "outer" lock is not working as it seems:

---- This new bug explanation:

LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for subsequent calls to APR with NULL locking (passed to APR_RMM_INIT). After patch for this bug has been applied, https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1752683, end user still complains about seg faults and core dumps show same issue: race condition for rmm->base->firstfree in function "find_block_of_size".

In the dump, in find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;
    ...
    while(next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

blk gets value 0x5772e56b36226557 because "next" was corrupted (value: 0x57726573553d554f). This happens because the lock

    APR_ANYLOCK_LOCK(&rmm->lock);

in apr_rmm_calloc() is apr_anylock_none, like previously reported by me.

For the sake of exercising possibilities, if mod_ldap is calling APR RMM with external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is this:

Thread #19 7092 (Suspended : Container)
 kill() at syscall-template.S:84 0x7ff7e9911767
 <signal handler called>() at 0x7ff7e9cb7390
 find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
 apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
 util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
 util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
 util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
 uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3
 ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937
 apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa
 apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01
 authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca
 ap_run_auth_checker() at request.c:91 0x56127e692f00
 ap_process_request_internal() at request.c:335 0x56127e695d57
 ap_process_async_request() at http_request.c:408 0x56127e6b4690
 ap_process_request() at http_request.c:445 0x56127e6b4850
 ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e
 ap_process_http_connection() at http_core.c:251 0x56127e6b091e
 ap_run_process_connection() at connection.c:41 0x56127e6a6bf0
 ap_process_connection() at connection.c:213 0x56127e6a7000
 process_socket() at worker.c:631 0x7ff7e2f51f8b
 worker_thread() at worker.c:990 0x7ff7e2f51f8b
 start_thread() at pthread_create.c:333 0x7ff7e9cad6ba
 clone() at clone.S:109 0x7ff7e99e341d

Which means uldap_cache_compare() would have synchronized access to APR RMM calls through LDAP_CACHE_LOCK() macro. This doesn't seem to be the case as the lock doesn't seem to be acquired.

LDAP_CACHE_LOCK() translates into:

do {
    if (st->util_ldap_cache_lock)
        apr_global_mutex_lock(st->util_ldap_cache_lock);
} while (0);

After the change proposed for this bug (where "util_ldap_cache_lock" would come from the ldap_merge_config), it seems that st has util_ldap_cache_lock and util_ldap_cache all set:

Name : util_ldap_cache_lock
 Hex:0x7ff7ea75aee0

Name : util_ldap_cache
 Hex:0x7ff7e0e51038

Meaning that it got the ldap_cache and ldap_cache_lock from the merge config function.

From the mutex acquire logic, for the apr_global_mutex_lock() -> apr_proc_mutex_lock():

apr_status_t apr_proc_mutex_lock(apr_proc_mutex_t *mutex)
{
    return mutex->meth->acquire(mutex);
}

And it would translate into:

st->util_ldap_cache_lock->proc_mutex->meth->acquire == proc_mutex_fcntl_acquire()

And from that logic:

static apr_status_t proc_mutex_fcntl_acquire(apr_proc_mutex_t *mutex)
{
    int rc;

    do {
        rc = fcntl(mutex->interproc->filedes, F_SETLKW, &proc_mutex_lock_it);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) {
        return errno;
    }
    mutex->curr_locked=1;
    return APR_SUCCESS;
}

We would guarantee mutex lock through a file descriptor to the file:

"/var/lock/apache2/ldap-cache.1368" (filedes == 15)

And the "mutex->curr_locked" would be 1.

Unfortunately, considering my stack trace, during the cache insertion:

find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3

Name : st->util_ldap_cache_lock
 Details:0x7ff7ea75aee0
 Default:0x7ff7ea75aee0
 Decimal:140702767230688
 Hex:0x7ff7ea75aee0
 Binary:11111111111011111101010011101011010111011100000
 Octal:03777375235327340

Name : proc_mutex
 Details:0x7ff7ea75aef8
 Default:0x7ff7ea75aef8
 Decimal:140702767230712
 Hex:0x7ff7ea75aef8
 Binary:11111111111011111101010011101011010111011111000
 Octal:03777375235327370

Name : curr_locked
 Details:0
 Default:0
 Decimal:0
 Hex:0x0
 Binary:0
 Octal:0

I have curr_locked = 0

(I originally filed this with Debian's apache team because when I searched for related information on this issue I turned up something in their bug tracker. Here is the link for more context: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980)

The anylock structure provided by mod_ldap was set to the type apr_anylock_none, which resulted in multiple threads mutating the shared RMM state at the same time without any concurrency guards. The issue I was seeing was that all the threads on the server were stuck inside the RMM internal function find_block_of_size every 2 or 3 days, requiring the server processes to be killed and restarted.

We are using Apache on Windows.

I made the following patch which passes in a lock to the RMM pool created in mod_ldap when APR_HAS_THREADS is defined. Since doing so we have not encountered any hangs:

diff -Naur httpd-2.4.16\include\util_ldap.h httpd-2.4.16-ea\include\util_ldap.h
--- httpd-2.4.16\include\util_ldap.h Mon Jul 14 05:07:55 2014
+++ httpd-2.4.16-ea\include\util_ldap.h Mon Aug 29 10:20:08 2016
@@ -169,6 +169,10 @@
 #if APR_HAS_SHARED_MEMORY
     apr_shm_t *cache_shm;
     apr_rmm_t *cache_rmm;
+#if APR_HAS_THREADS
+ apr_thread_mutex_t *lock;
+ apr_anylock_t cache_rmm_anylock;
+#endif
 #endif

     /* cache ald */
diff -Naur httpd-2.4.16\modules\ldap\util_ldap_cache.c httpd-2.4.16-ea\modules\ldap\util_ldap_cache.c
--- httpd-2.4.16\modules\ldap\util_ldap_cache.c Mon Aug 19 04:45:19 2013
+++ httpd-2.4.16-ea\modules\ldap\util_ldap_cache.c Mon Aug 29 10:23:04 2016
@@ -410,6 +410,14 @@
         st->cache_shm = NULL;
         return result;
     }
+
+#if APR_HAS_THREADS
+ apr_thread_mutex_destroy(st->lock);
+ st->lock = NULL;
+ st->cache_rmm_anylock.type = apr_anylock_none;
+ st->cache_rmm_anylock.lock.pm = NULL;
+#endif
+
 #endif
     return APR_SUCCESS;
 }
@@ -436,8 +444,18 @@
         /* Determine the usable size of the shm segment. */
         size = apr_shm_size_get(st->cache_shm);

+#if APR_HAS_THREADS
+ apr_thread_mutex_create(&st->lock, APR_THREAD_MUTEX_DEFAULT, st->pool);
+ st->cache_rmm_anylock.type = apr_anylock_threadmutex;
+ st->cache_rmm_anylock.lock.tm = st->lock;
+#else
+ st->lock = NULL;
+ st->cache_rmm_anylock.type = apr_anylock_none;
+ st->cache_rmm_anylock.lock.pm = NULL;
+#endif
+
         /* This will create a rmm "handler" to get into the shared memory area */
- result = apr_rmm_init(&st->cache_rmm, NULL,
+ result = apr_rmm_init(&st->cache_rmm, &st->cache_rmm_anylock,
                               apr_shm_baseaddr_get(st->cache_shm), size,
                               st->pool);
         if (result != APR_SUCCESS) {

Thanks!

Hi Max, it appears to me that mod_ldap is trying to manage a global (cross process + cross thread) lock outside of rmm, which is why it tells RMM to not use a lock.

Windows is a special case where there is only 1 process and N threads. But passing in a thread mutex only is probably wrong outside of windows.

Not having looked too closely, it seems likely that an LDAP_CACHE_LOCK() is missing somewhere where we call into rmm.

Have you been able to verify on a later release?

I have not tried a later release in production, no, only because we have been soaking the change to see if it improved reliability.

I am looking at the mod_ldap changes between our current version (2.4.16) and the latest (2.4.23) and see that the only changes are whitespace, either adding/remove spaces or adding newlines before braces, so I would expect that 2.4.23 is similarly affected.

The LDAP_CACHE_LOCK construct is defined only to the util_ldap.c translation unit, it does not seem to be available, at present, to util_ldap_cache_mgr.c where the RMM blocks are allocated and freed.

(In reply to Max Burke from comment #2)
> I have not tried a later release in production, no, only because we have
> been soaking the change to see if it improved reliability.
>
> I am looking at the mod_ldap changes between our current version (2.4.16)
> and the latest (2.4.23) and see that the only changes are whitespace, either
> adding/remove spaces or adding newlines before braces, so I would expect
> that 2.4.23 is similarly affected.
>
> The LDAP_CACHE_LOCK construct is defined only to the util_ldap.c translation
> unit, it does not seem to be available, at present, to util_ldap_cache_mgr.c
> where the RMM blocks are allocated and freed.

It's not directly available, but that lock is meant to be held by anyone who would ultimately get into the code that does the allocations. That is usually (AFAICT) via util_ald_cache_insert().

What kind of Require ldap-* do you use? I noticed the debian OP didn't list that bit.

(In reply to Eric Covener from comment #3)
> What kind of Require ldap-* do you use? I noticed the debian OP didn't list
> that bit.

I'm unfamiliar with what you mean by "Require ldap-*", but here are the snippets from our httpd.conf for the ldap settings:

LoadModule authnz_ldap_module modules/mod_authnz_ldap.so
LoadModule ldap_module modules/mod_ldap.so

LDAPTrustedMode SSL
LDAPVerifyServerCert off
LDAPSharedCacheSize 20000000

<AuthnProviderAlias ldap eac>
 AuthLDAPURL "[... snip ... ]"
 AuthLDAPBindDN "[ ... snip ... ]"

 # including ldap password from a separate file
 # AuthLDAPBindPassword secret
 Include conf/ldap_bind_pwd.conf
</AuthnProviderAlias>

Fixed status - sorry wrong bug.

*** Bug 58483 has been marked as a duplicate of this bug. ***

Max or Eric,

Are there any changes in status of this bug ?

I can corroborate to what Max said, and proposed, since I'm analysing a dump that was brought to me, exactly in this situation (and also could find another old bug with the same race condition):

https://bz.apache.org/bugzilla/show_bug.cgi?id=58483

These are my notes so far:

Problem summary:

apr_rmm_init acts as a relocatable memory management initialization

it is used in: mod_auth_digest and util_ldap_cache

From the dump was brought to my knowledge, in the following sequence:

- util_ldap_compare_node_copy()
- util_ald_strdup()
- apr_rmm_calloc()
- find_block_of_size()

Had a "cache->rmm_addr" with no lock at "find_block_of_size()"

cache->rmm_addr->lock { type = apr_anylock_none }

And an invalid "next" offset (out of rmm->base->firstfree).

This rmm_addr was initialized with NULL as a locking mechanism:

From apr-utils:

apr_rmm_init()

    if (!lock) { <-- 2nd argument to apr_rmm_init()
        nulllock.type = apr_anylock_none; <--- found in the dump
        nulllock.lock.pm = NULL;
        lock = &nulllock;
    }

From apache:

# mod_auth_digest

    sts = apr_rmm_init(&client_rmm,
                       NULL, /* no lock, we'll do the locking ourselves */
                       apr_shm_baseaddr_get(client_shm),
                       shmem_size, ctx);

# util_ldap_cache

        result = apr_rmm_init(&st->cache_rmm, NULL,
                              apr_shm_baseaddr_get(st->cache_shm), size,
                              st->pool);

It appears that the ldap module chose to use "rmm" for memory allocation, using
the shared memory approach, but without explicitly definiting a lock to it.
Without it, its up to the caller to guarantee that there are locks for rmm
synchronization (just like mod_auth_digest does, using global mutexes).

Because of that, there was a race condition in "find_block_of_size" and a call
touching "rmm->base->firstfree", possibly "move_block()", in a multi-threaded
apache environment, since there were no lock guarantees inside rmm logic (lock
was "apr_anylock_none" and the locking calls don't do anything).

In find_block_of_size:

    apr_rmm_off_t next = rmm->base->firstfree;

We have:

    rmm->base->firstfree
 Decimal:356400
 Hex:0x57030

But "next" turned into:

Name : next
 Decimal:8320808657351632189
 Hex:0x737973636970653d

Causing:

        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

        if (blk->size == size)

To segfault.

I recently re-fixed an issue that causes the lock in mod_ldap to not be used. It will be in 2.4.31. It is a one-line change.

http://svn.apache.org/viewvc?view=revision&revision=1824811

Just for a reference on someone that finds this bug.

Distributions Fixes (Debian & Ubuntu):

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980#15
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1752683

With LP: #1752683 being worked on (about to suggest final fix for all Ubuntu versions), it is likely I'll suggest same fix for Bug: #814980, both based in upstream commit: 39ae6cd64268 brought to me by Eric.

Download full text (5.9 KiB)

Eric,

Even after your fix...

LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for subsequent calls to APR with NULL locking (passed to APR_RMM_INIT). After patch for this bug has been applied, https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1752683 and the new bug https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1767105, end user still complains about seg faults and core dumps show same issue: race condition for rmm->base->firstfree in function "find_block_of_size".

In the dump, in find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;
    ...
    while(next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

blk gets value 0x5772e56b36226557 because "next" was corrupted (value: 0x57726573553d554f). This happens because the lock

    APR_ANYLOCK_LOCK(&rmm->lock);

in apr_rmm_calloc() is apr_anylock_none, like previously reported by me.

For the sake of exercising possibilities, if mod_ldap is calling APR RMM with external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is this:

Thread #19 7092 (Suspended : Container)
 kill() at syscall-template.S:84 0x7ff7e9911767
 <signal handler called>() at 0x7ff7e9cb7390
 find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
 apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
 util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
 util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
 util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
 uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3
 ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937
 apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa
 apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01
 authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca
 ap_run_auth_checker() at request.c:91 0x56127e692f00
 ap_process_request_internal() at request.c:335 0x56127e695d57
 ap_process_async_request() at http_request.c:408 0x56127e6b4690
 ap_process_request() at http_request.c:445 0x56127e6b4850
 ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e
 ap_process_http_connection() at http_core.c:251 0x56127e6b091e
 ap_run_process_connection() at connection.c:41 0x56127e6a6bf0
 ap_process_connection() at connection.c:213 0x56127e6a7000
 process_socket() at worker.c:631 0x7ff7e2f51f8b
 worker_thread() at worker.c:990 0x7ff7e2f51f8b
 start_thread() at pthread_create.c:333 0x7ff7e9cad6ba
 clone() at clone.S:109 0x7ff7e99e341d

Which means uldap_cache_compare() would have synchronized access to APR RMM calls through LDAP_CACHE_LOCK() macro. This doesn't seem to be the case as the lock doesn't seem to be acquired.

LDAP_CACHE_LOCK() translates into:

do {
    if (st->util_ldap_cache_lock)
        apr_global_mutex_lock(st->util_ldap_cache_lock);
} while (0);

After the change proposed for this bug (where "util_ldap_cache_lock" would come from the ldap_merge_config), it seems that st has util_ldap_cache_lock and util_ldap_cache all set:

Name : util_ldap_cache_lock
 Hex:0x7ff7ea75aee0

Name : util_ldap_cache
 Hex:0x7ff7e0e51038

Meaning that it got the ldap...

Read more...

Rafael David Tinoco (inaddy) wrote :

For this particular bug, the following upstream comment was done:

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296#c10

and the bug in question is this:

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

The same one for LP: #1752683

Is fcntl being forced here somehow, or is it the default (apachectl -V|grep APR_USE) ? I wonder if an affected user could try a different mechanism, via e.g. https://httpd.apache.org/docs/2.4/mod/core.html#mutex with "ldap-cache" as the parm (or if they already are somehow?)

Changed in apache2 (Ubuntu):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)
importance: Undecided → Medium
Download full text (11.6 KiB)

Thread #19 7092 (Suspended : Container)
 kill() at syscall-template.S:84 0x7ff7e9911767
 <signal handler called>() at 0x7ff7e9cb7390
 find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
 apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
 util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
 util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
 util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
 uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3
 ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937
 apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa
 apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01
 authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca
 ap_run_auth_checker() at request.c:91 0x56127e692f00
 ap_process_request_internal() at request.c:335 0x56127e695d57
 ap_process_async_request() at http_request.c:408 0x56127e6b4690
 ap_process_request() at http_request.c:445 0x56127e6b4850
 ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e
 ap_process_http_connection() at http_core.c:251 0x56127e6b091e
 ap_run_process_connection() at connection.c:41 0x56127e6a6bf0
 ap_process_connection() at connection.c:213 0x56127e6a7000
 process_socket() at worker.c:631 0x7ff7e2f51f8b
 worker_thread() at worker.c:990 0x7ff7e2f51f8b
 start_thread() at pthread_create.c:333 0x7ff7e9cad6ba
 clone() at clone.S:109 0x7ff7e99e341d

--------------------------------------------------------------------------------

static apr_rmm_off_t find_block_of_size(apr_rmm_t *rmm, apr_size_t size)
{
    apr_rmm_off_t next = rmm->base->firstfree;
    apr_rmm_off_t best = 0;
    apr_rmm_off_t bestsize = 0;

    while (next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

--------------------------------------------------------------------------------

The following code touches "rmm->base->firstfree" (capable of racing):

apr_rmm_destroy()
 - ...

apr_rmm_init()
 - ...

find_block_of_size()
 - apr_rmm_calloc()
 - apr_rmm_malloc()

move_block()
 - apr_rmm_calloc()
 - apr_rmm_free()
 - apr_rmm_malloc()

--------------------------------------------------------------------------------

All APR calls to RMM have to have: APR_ANYLOCK_LOCK(&rmm->lock) and lock can't
be apr_anylock_none ORELSE the lock has to be guaranteed by the caller, in our
case, uldap_cache_compare() using the LDAP_CACHE_LOCK() directive.

LDAP_CACHE_LOCK() directive is:

 do {
  if (st->util_ldap_cache_lock)
  apr_global_mutex_lock(st->util_ldap_cache_lock);
 } while (0);

Where st is a ldap state struct, got by ap_get_module_config() on the ldap mod.
"util_ldap_cache_lock" is the global apr mutex for this module and it is set by:

    st->util_ldap_cache_lock = base->util_ldap_cache_lock;

in "util_ldap_merge_config" AND (re)-initialized in util_ldap_post_config if,
still after ldap merge, it does NOT contain "cache_shm" and APR_HAS_SHARED_MEMOR
was defined (cache settings are inherited in the virtual host, all server use
the same shared memory cache).

If re-initialized by util_ldap_post_config, util_ldap_cache_lock is a mutex
created with ap_global_mutex_create using a...

Rafael David Tinoco (inaddy) wrote :

The following upstream bug is being worked on:

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

And I just did the following comment:

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296#c12

Which I'm reproducing in the next comment for documenting purposes.

Rafael David Tinoco (inaddy) wrote :
Download full text (11.5 KiB)

Thread #19 7092 (Suspended : Container)
 kill() at syscall-template.S:84 0x7ff7e9911767
 <signal handler called>() at 0x7ff7e9cb7390
 find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
 apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
 util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
 util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
 util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
 uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3
 ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937
 apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa
 apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01
 authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca
 ap_run_auth_checker() at request.c:91 0x56127e692f00
 ap_process_request_internal() at request.c:335 0x56127e695d57
 ap_process_async_request() at http_request.c:408 0x56127e6b4690
 ap_process_request() at http_request.c:445 0x56127e6b4850
 ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e
 ap_process_http_connection() at http_core.c:251 0x56127e6b091e
 ap_run_process_connection() at connection.c:41 0x56127e6a6bf0
 ap_process_connection() at connection.c:213 0x56127e6a7000
 process_socket() at worker.c:631 0x7ff7e2f51f8b
 worker_thread() at worker.c:990 0x7ff7e2f51f8b
 start_thread() at pthread_create.c:333 0x7ff7e9cad6ba
 clone() at clone.S:109 0x7ff7e99e341d

--------------------------------------------------------------------------------

static apr_rmm_off_t find_block_of_size(apr_rmm_t *rmm, apr_size_t size)
{
    apr_rmm_off_t next = rmm->base->firstfree;
    apr_rmm_off_t best = 0;
    apr_rmm_off_t bestsize = 0;

    while (next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

--------------------------------------------------------------------------------

The following code touches "rmm->base->firstfree" (capable of racing):

apr_rmm_destroy()
 - ...

apr_rmm_init()
 - ...

find_block_of_size()
 - apr_rmm_calloc()
 - apr_rmm_malloc()

move_block()
 - apr_rmm_calloc()
 - apr_rmm_free()
 - apr_rmm_malloc()

--------------------------------------------------------------------------------

All APR calls to RMM have to have: APR_ANYLOCK_LOCK(&rmm->lock) and lock can't
be apr_anylock_none ORELSE the lock has to be guaranteed by the caller, in our
case, uldap_cache_compare() using the LDAP_CACHE_LOCK() directive.

LDAP_CACHE_LOCK() directive is:

 do {
  if (st->util_ldap_cache_lock)
  apr_global_mutex_lock(st->util_ldap_cache_lock);
 } while (0);

Where st is a ldap state struct, got by ap_get_module_config() on the ldap mod.
"util_ldap_cache_lock" is the global apr mutex for this module and it is set by:

    st->util_ldap_cache_lock = base->util_ldap_cache_lock;

in "util_ldap_merge_config" AND (re)-initialized in util_ldap_post_config if,
still after ldap merge, it does NOT contain "cache_shm" and APR_HAS_SHARED_MEMOR
was defined (cache settings are inherited in the virtual host, all server use
the same shared memory cache).

If re-initialized by util_ldap_post_config, util_ldap_cache_lock is a mutex
created with ap_global_mutex_create using a...

> Another question that raises is, why to use fcntl as a backing mechanism for
> the
> LDAP locking ? If the lock was supposed to be guaranteed among different
> nodes,
> then backing the lock in a file, IF the name was based also on the instance
> id,
> which in this case is not, it could make sense. But in a shared-threaded only
> environment, why ?

the locking is only meant to be among the N children of a single instance / parent process. You should find the PID here is the parent pid. So I think this aspect is OK.

I have seen fcntl related errors on other OS'es, where APR might use it by default, related to EDEADLOCK being returned when two unrelated threads use two unrelated fcntl mutexes.

In a local test I set: `Mutex fcntl:/tmp/run default`

globally and added an assert to the two lock/unlock macros and got several aborts. The asserts with sysvsem never fired.

I will be removing the macros and adding inline funcitons that log and abort on failure.

[Tue May 08 08:26:42.418299 2018] [ldap:crit] [pid 68897:tid 700008b10000] (11)Resource deadlock avoided: [client 127.0.0.1:60404] LDAP cache lock failed

Changed in apache2 (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
Changed in apache2 (Ubuntu):
status: In Progress → Triaged

Hi Rafael -- out of curiosity did you ever find how fcntl() came to be used here?

Changed in apache2:
importance: Unknown → Medium
status: Unknown → Incomplete

Hello Eric,

Sorry for taking so long in providing you feedback, I moved from one company to another and there is another engineer dealing with this case on the previous company (and I lost emails from this thread, only noticed it because debian bug merged this into its bug and I got my name highlighted).

Nevertheless, yes, I discovered - through the dump and execution logic - final user was enforcing it by configuration file without any particular reason for it.

I didn't have the chance to ask final user to remove that and try again (which I think subsequent communication here might indicate as next steps), but I'm almost sure that this is indeed the issue.

Tks for providing feedback so fast to all my investigation!! The only thing here as a *TODO* for upstream work, if I might say that, would be to understand better why fcntl is the cause for the race (not guaranteeing atomicity).

Likely a sync and/or I/O barrier issue ? Exactly how you described in your tests!

To follow up with results of testing non-file locking: yep, the end user we were working with initially had this in their config, as Rafael said:

Mutex file:${APACHE_LOCK_DIR} default

and we asked them to update it to not use file for ldap, by changing to:

Mutex file:${APACHE_LOCK_DIR} default
Mutex sem ldap-cache

After that change (done about 1 month ago) they restarted and have not seen the problem happen again. I don't have any answer to why exactly they set the default to file locking.

Changed in apache2:
status: Incomplete → Confirmed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.