race condition on rmm for module ldap (ldap cache) - part II
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Apache2 Web Server | Confirmed | Medium | | |
| apache2 (Ubuntu) | | Medium | Unassigned | |
Bug Description
[Impact]
* Apache users of the ldap module might face this when running multiple threads with shared memory activated for the APR memory allocator (the default in Ubuntu).
[Test Case]
* Configure apache to use the ldap module (e.g. for authentication) and wait for the race condition to happen.
* The analysis was made from a dump taken in a production environment.
* The bug has been reported multiple times upstream over the past 10 years.
[Regression Potential]
* the ldap module has a broken locking mechanism when using APR memory management.
* without this fix, ldap keeps the broken locking mechanism and race conditions could still occur.
* a bad backport could break the ldap module.
* the patch is upstreamed in the next version to be released.
[Other Info]
ORIGINAL CASE DESCRIPTION:
This is related to LP: #1752683. The locking mechanism for LDAP was fixed there, since the lock is now obtained from the server config merge as it was supposed to be. The problem is that the logic likely still has a race condition, causing the ldap module to keep failing under some conditions.
Problem summary:
apr_rmm_init() initializes APR's relocatable memory management (RMM).
It is used in: mod_auth_digest and util_ldap_cache.
From the dump that was brought to my attention, the crash happened in the following sequence:
- util_ldap_
- util_ald_strdup()
- apr_rmm_calloc()
- find_block_of_size()
The thread had "cache->rmm_addr" with no lock taken inside "find_block_of_size()",
and an invalid "next" offset (pointing outside the rmm->base area).
This rmm_addr was initialized with NULL as a locking mechanism:
From apr-utils:
apr_rmm_init()
if (!lock) { <-- 2nd argument to apr_rmm_init()
lock = &nulllock;
}
From apache:
# mod_auth_digest
sts = apr_rmm_init(..., NULL /* no lock */, ...);
# util_ldap_cache
result = apr_rmm_init(&st->cache_rmm, NULL /* no lock */, ...);
It appears that the ldap module chose to use "rmm" for memory allocation, using
the shared memory approach, but without explicitly defining a lock for it.
Without one, it's up to the caller to guarantee the locking needed for rmm
synchronization (just like mod_auth_digest does, using global mutexes).
Because of that, there was a race condition in "find_block_of_size()",
touching "rmm->base->firstfree", in the multi-threaded apache environment,
since there were no lock guarantees inside the rmm logic (the lock was
"apr_anylock_none", so the locking calls do nothing).
In find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;

The dump shows the valid rmm region limit as 356400 (0x57030), but "next" had
turned into 0x73797363 (printable ASCII bytes, suggesting string data was
written over the free list). That made:

    struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);
    if (blk->size == size)

segfault.
Upstream bugs:
https:/
https:/
https:/
Upstream bugs fixed the "outer" lock, which is now obtained from the config merge function in all conditions. Since apr_rmm_init() is called with no lock, the caller should take care of the locking. Unfortunately, the "outer" lock does not seem to be working:
---- This new bug explanation:
LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for subsequent calls into APR with NULL locking (the NULL passed to apr_rmm_init()). The crash still happens after the patch for this bug has been applied, https:/
In the dump, in find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;
    ...
    while (next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

blk gets the value 0x5772e56b36226557 because "next" was corrupted (value:
0x57726573553d5...). The lock taken by APR_ANYLOCK_LOCK() in apr_rmm_calloc()
is apr_anylock_none, as previously reported by me.
For the sake of exercising possibilities, if mod_ldap is calling APR RMM with external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is this:
Thread #19 7092 (Suspended : Container)
kill() at syscall-
<signal handler called>() at 0x7ff7e9cb7390
find_block_
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_
ldapgroup_
apply_
apply_
authorize_
ap_run_
ap_process_
ap_process_
ap_process_
ap_process_
ap_process_
ap_run_
ap_process_
process_socket() at worker.c:631 0x7ff7e2f51f8b
worker_thread() at worker.c:990 0x7ff7e2f51f8b
start_thread() at pthread_
clone() at clone.S:109 0x7ff7e99e341d
Which means uldap_cache_
LDAP_CACHE_LOCK() translates into:

    do {
        if (st->util_ldap_cache_lock)
            apr_global_mutex_lock(st->util_ldap_cache_lock);
    } while (0);
After the change proposed for this bug (where "util_ldap_cache_lock" is
obtained from the merged server config), the dump shows:

    Name: util_ldap_cache_lock
    Hex: 0x7ff7ea75aee0

    Name: util_ldap_cache
    Hex: 0x7ff7e0e51038

Meaning that it got the ldap_cache and ldap_cache_lock from the merge config function.
From the mutex acquire logic, for the global mutex:

    apr_status_t apr_proc_mutex_lock(apr_proc_mutex_t *mutex)
    {
        return mutex->meth->acquire(mutex);
    }

And it would translate into acquiring:

    st->util_ldap_cache_lock

And, with the fcntl backend, that acquire logic is:

    static apr_status_t proc_mutex_fcntl_acquire(apr_proc_mutex_t *mutex)
    {
        int rc;
        do {
            rc = fcntl(mutex->interproc->filedes, F_SETLKW, &proc_mutex_lock_it);
        } while (rc < 0 && errno == EINTR);
        if (rc < 0) {
            return errno;
        }
        mutex->curr_locked = 1;
        return APR_SUCCESS;
    }
We would guarantee the mutex lock through a file descriptor to the file:
"/var/lock/
And the "mutex->curr_locked" field would be set to 1.
Unfortunately, considering my stack trace, during the cache insertion:
find_block_
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_cache_
    Name: st->util_ldap_cache_lock
    Hex: 0x7ff7ea75aee0

    Name: proc_mutex
    Hex: 0x7ff7ea75aef8

    Name: curr_locked
    Value: 0

I have curr_locked = 0, i.e. the global mutex was not held while this thread was inside the allocator.
|
#5 |
Hi Max, it appears to me that mod_ldap is trying to manage a global (cross process + cross thread) lock outside of rmm, which is why it tells RMM to not use a lock.
Windows is a special case where there is only 1 process and N threads, but passing in only a thread mutex is probably wrong outside of Windows.
Not having looked too closely, it seems likely that an LDAP_CACHE_LOCK() is missing somewhere where we call into rmm.
Have you been able to verify on a later release?
|
#6 |
I have not tried a later release in production, no, only because we have been soaking the change to see if it improved reliability.
I am looking at the mod_ldap changes between our current version (2.4.16) and the latest (2.4.23) and see that the only changes are whitespace, either adding/remove spaces or adding newlines before braces, so I would expect that 2.4.23 is similarly affected.
The LDAP_CACHE_LOCK construct is defined only to the util_ldap.c translation unit, it does not seem to be available, at present, to util_ldap_
|
#7 |
(In reply to Max Burke from comment #2)
> I have not tried a later release in production, no, only because we have
> been soaking the change to see if it improved reliability.
>
> I am looking at the mod_ldap changes between our current version (2.4.16)
> and the latest (2.4.23) and see that the only changes are whitespace, either
> adding/remove spaces or adding newlines before braces, so I would expect
> that 2.4.23 is similarly affected.
>
> The LDAP_CACHE_LOCK construct is defined only to the util_ldap.c translation
> unit, it does not seem to be available, at present, to util_ldap_
> where the RMM blocks are allocated and freed.
It's not directly available, but that lock is meant to be held by anyone who would ultimately get into the code that does the allocations. That is usually (AFAICT) via util_ald_
What kind of Require ldap-* do you use? I noticed the debian OP didn't list that bit.
|
#8 |
(In reply to Eric Covener from comment #3)
> What kind of Require ldap-* do you use? I noticed the debian OP didn't list
> that bit.
I'm unfamiliar with what you mean by "Require ldap-*", but here are the snippets from our httpd.conf for the ldap settings:

LoadModule authnz_ldap_module modules/mod_authnz_ldap.so
LoadModule ldap_module modules/mod_ldap.so
LDAPTrustedMode SSL
LDAPVerifyServerCert [ ... snip ... ]
LDAPSharedCacheSize 20000000
<AuthnProviderAlias ldap eac>
AuthLDAPURL "[... snip ... ]"
AuthLDAPBindDN "[ ... snip ... ]"
# including ldap password from a separate file
# AuthLDAPBindPassword [ ... snip ... ]
Include conf/ldap_
</AuthnProviderAlias>
Fixed status - sorry wrong bug.
*** Bug 58483 has been marked as a duplicate of this bug. ***
Max or Eric,
Are there any changes in status of this bug ?
I can corroborate what Max said, and proposed, since I'm analysing a dump that was brought to me from exactly this situation (and I could also find another old bug with the same race condition):
https:/
These are my notes so far:
Problem summary:
apr_rmm_init() initializes APR's relocatable memory management (RMM).
It is used in: mod_auth_digest and util_ldap_cache.
From the dump that was brought to my attention, the crash happened in the following sequence:
- util_ldap_
- util_ald_strdup()
- apr_rmm_calloc()
- find_block_of_size()
The thread had "cache->rmm_addr" with no lock taken inside "find_block_of_size()",
and an invalid "next" offset (pointing outside the rmm->base area).
This rmm_addr was initialized with NULL as a locking mechanism:
From apr-utils:
apr_rmm_init()
if (!lock) { <-- 2nd argument to apr_rmm_init()
lock = &nulllock;
}
From apache:
# mod_auth_digest
sts = apr_rmm_init(..., NULL /* no lock */, ...);
# util_ldap_cache
result = apr_rmm_init(&st->cache_rmm, NULL /* no lock */, ...);
It appears that the ldap module chose to use "rmm" for memory allocation, using
the shared memory approach, but without explicitly defining a lock for it.
Without one, it's up to the caller to guarantee the locking needed for rmm
synchronization (just like mod_auth_digest does, using global mutexes).
Because of that, there was a race condition in "find_block_of_size()",
touching "rmm->base->firstfree", in the multi-threaded apache environment,
since there were no lock guarantees inside the rmm logic (the lock was
"apr_anylock_none", so the locking calls do nothing).
In find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;

The dump shows the valid rmm region limit as 356400 (0x57030), but "next" had
turned into 0x73797363 (printable ASCII bytes, suggesting string data was
written over the free list). That made:

    struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);
    if (blk->size == size)

segfault.
|
#12 |
I recently re-fixed an issue that causes the lock in mod_ldap to not be used. It will be in 2.4.31. It is a one-line change.
Just for reference for anyone who finds this bug.
Distribution fixes (Debian & Ubuntu):
https:/
https:/
With LP: #1752683 being worked on (about to suggest the final fix for all Ubuntu versions), it is likely I'll suggest the same fix for Bug #814980, both based on upstream commit 39ae6cd64268, brought to me by Eric.
Eric,
Even after your fix...
LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for subsequent calls into APR with NULL locking (the NULL passed to apr_rmm_init()). The crash still happens after the patch for this bug has been applied, https:/
In the dump, in find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;
    ...
    while (next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

blk gets the value 0x5772e56b36226557 because "next" was corrupted (value:
0x57726573553d5...). The lock taken by APR_ANYLOCK_LOCK() in apr_rmm_calloc()
is apr_anylock_none, as previously reported by me.
For the sake of exercising possibilities, if mod_ldap is calling APR RMM with external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is this:
Thread #19 7092 (Suspended : Container)
kill() at syscall-
<signal handler called>() at 0x7ff7e9cb7390
find_block_
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_
ldapgroup_
apply_
apply_
authorize_
ap_run_
ap_process_
ap_process_
ap_process_
ap_process_
ap_process_
ap_run_
ap_process_
process_socket() at worker.c:631 0x7ff7e2f51f8b
worker_thread() at worker.c:990 0x7ff7e2f51f8b
start_thread() at pthread_
clone() at clone.S:109 0x7ff7e99e341d
Which means uldap_cache_
LDAP_CACHE_LOCK() translates into:

    do {
        if (st->util_ldap_cache_lock)
            apr_global_mutex_lock(st->util_ldap_cache_lock);
    } while (0);
After the change proposed for this bug (where "util_ldap_cache_lock" is
obtained from the merged server config), the dump shows:

    Name: util_ldap_cache_lock
    Hex: 0x7ff7ea75aee0

    Name: util_ldap_cache
    Hex: 0x7ff7e0e51038

Meaning that it got the ldap...
For this particular bug, the following upstream comment was made:
https:/
and the bug in question is this:
https:/
The same one for LP: #1752683
|
#15 |
Is fcntl being forced here somehow, or is it the default (apachectl -V|grep APR_USE) ? I wonder if an affected user could try a different mechanism, via e.g. https:/
Changed in apache2 (Ubuntu): | |
status: | New → In Progress |
assignee: | nobody → Rafael David Tinoco (inaddy) |
importance: | Undecided → Medium |
Thread #19 7092 (Suspended : Container)
kill() at syscall-
<signal handler called>() at 0x7ff7e9cb7390
find_block_
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_
ldapgroup_
apply_
apply_
authorize_
ap_run_
ap_process_
ap_process_
ap_process_
ap_process_
ap_process_
ap_run_
ap_process_
process_socket() at worker.c:631 0x7ff7e2f51f8b
worker_thread() at worker.c:990 0x7ff7e2f51f8b
start_thread() at pthread_
clone() at clone.S:109 0x7ff7e99e341d
-------
static apr_rmm_off_t find_block_of_size(apr_rmm_t *rmm, apr_size_t size)
{
    apr_rmm_off_t next = rmm->base->firstfree;
    apr_rmm_off_t best = 0;
    apr_rmm_off_t bestsize = 0;

    while (next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);
-------
The following code touches "rmm->base->firstfree":

apr_rmm_destroy()
- ...
apr_rmm_init()
- ...
find_block_of_size()
- apr_rmm_calloc()
- apr_rmm_malloc()
move_block()
- apr_rmm_calloc()
- apr_rmm_free()
- apr_rmm_malloc()
-------
All APR calls into RMM have to take the APR_ANYLOCK_* lock, OR the anylock can
be apr_anylock_none ONLY IF the lock is guaranteed by the caller, in our
case, uldap_cache_
The LDAP_CACHE_LOCK() directive is:

    do {
        if (st->util_ldap_cache_lock)
            apr_global_mutex_lock(st->util_ldap_cache_lock);
    } while (0);
Where st is the ldap state struct, obtained through ap_get_module_config().
"util_ldap_cache_lock" lives inside st. In the dump, even after the ldap
config merge, st does NOT contain "cache_shm", although APR_HAS_SHARED_MEMORY
was defined (cache settings are inherited in the virtual host, all servers use
the same shared memory cache).
If re-initialized (by util_ldap_ code), the lock is created with ap_global_mutex_create().
The following upstream bug is being worked on:
https:/
And I just did the following comment:
https:/
Which I'm reproducing in the next comment for documenting purposes.
Thread #19 7092 (Suspended : Container)
kill() at syscall-
<signal handler called>() at 0x7ff7e9cb7390
find_block_
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_
ldapgroup_
apply_
apply_
authorize_
ap_run_
ap_process_
ap_process_
ap_process_
ap_process_
ap_process_
ap_run_
ap_process_
process_socket() at worker.c:631 0x7ff7e2f51f8b
worker_thread() at worker.c:990 0x7ff7e2f51f8b
start_thread() at pthread_
clone() at clone.S:109 0x7ff7e99e341d
-------
static apr_rmm_off_t find_block_of_size(apr_rmm_t *rmm, apr_size_t size)
{
    apr_rmm_off_t next = rmm->base->firstfree;
    apr_rmm_off_t best = 0;
    apr_rmm_off_t bestsize = 0;

    while (next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);
-------
The following code touches "rmm->base->firstfree":

apr_rmm_destroy()
- ...
apr_rmm_init()
- ...
find_block_of_size()
- apr_rmm_calloc()
- apr_rmm_malloc()
move_block()
- apr_rmm_calloc()
- apr_rmm_free()
- apr_rmm_malloc()
-------
All APR calls into RMM have to take the APR_ANYLOCK_* lock, OR the anylock can
be apr_anylock_none ONLY IF the lock is guaranteed by the caller, in our
case, uldap_cache_
The LDAP_CACHE_LOCK() directive is:

    do {
        if (st->util_ldap_cache_lock)
            apr_global_mutex_lock(st->util_ldap_cache_lock);
    } while (0);
Where st is the ldap state struct, obtained through ap_get_module_config().
"util_ldap_cache_lock" lives inside st. In the dump, even after the ldap
config merge, st does NOT contain "cache_shm", although APR_HAS_SHARED_MEMORY
was defined (cache settings are inherited in the virtual host, all servers use
the same shared memory cache).
If re-initialized (by util_ldap_ code), the lock is created with ap_global_mutex_create().
|
#17 |
> Another question that raises is, why to use fcntl as a backing mechanism for
> the
> LDAP locking ? If the lock was supposed to be guaranteed among different
> nodes,
> then backing the lock in a file, IF the name was based also on the instance
> id,
> which in this case is not, it could make sense. But in a shared-threaded only
> environment, why ?
the locking is only meant to be among the N children of a single instance / parent process. You should find the PID here is the parent pid. So I think this aspect is OK.
I have seen fcntl-related errors on other OSes, where APR might use it by default, related to EDEADLOCK being returned when two unrelated threads use two unrelated fcntl mutexes.
|
#18 |
In a local test I set: `Mutex fcntl:/tmp/run default`
globally and added an assert to the two lock/unlock macros and got several aborts. The asserts with sysvsem never fired.
I will be removing the macros and adding inline functions that log and abort on failure.
|
#19 |
[Tue May 08 08:26:42.418299 2018] [ldap:crit] [pid 68897:tid 700008b10000] (11)Resource deadlock avoided: [client 127.0.0.1:60404] LDAP cache lock failed
Changed in apache2 (Ubuntu): | |
assignee: | Rafael David Tinoco (inaddy) → nobody |
Changed in apache2 (Ubuntu): | |
status: | In Progress → Triaged |
|
#20 |
Hi Rafael -- out of curiosity did you ever find how fcntl() came to be used here?
Changed in apache2: | |
importance: | Unknown → Medium |
status: | Unknown → Incomplete |
Hello Eric,
Sorry for taking so long to provide feedback: I moved from one company to another, and another engineer is handling this case at my previous company (I also lost the emails from this thread, and only noticed it because the Debian bug was merged into this one and my name was highlighted).
Nevertheless, yes, I discovered (through the dump and the execution logic) that the end user was enforcing fcntl in the configuration file, without any particular reason for it.
I didn't have the chance to ask the end user to remove that and try again (which I think subsequent comments here might indicate as next steps), but I'm almost sure that this is indeed the issue.
Thanks for providing feedback so fast to all my investigation! The only thing left here as a *TODO* for upstream work, if I may say so, would be to understand better why fcntl causes the race (not guaranteeing atomicity).
Likely a sync and/or I/O barrier issue? Exactly as you described in your tests!
To follow up with results of testing non-file locking: yes, the end user we were working with initially had this in their config, as Rafael said:
Mutex file:${
and we asked them to update it to not use file locking for ldap, by changing to:
Mutex file:${
Mutex sem ldap-cache
After that change (done about 1 month ago) they restarted and have not seen the problem happen again. I don't have an answer as to why exactly they set the default to file locking.
Changed in apache2: | |
status: | Incomplete → Confirmed |
(I originally filed this with Debian's apache team because when I searched for related information on this issue I turned up something in their bug tracker. Here is the link for more context: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980)
The anylock structure provided by mod_ldap was set to the type apr_anylock_none, which resulted in multiple threads mutating the shared RMM state at the same time without any concurrency guards. The issue I was seeing was that all the threads on the server were stuck inside the RMM internal function find_block_of_size every 2 or 3 days, requiring the server processes to be killed and restarted.
We are using Apache on Windows.
I made the following patch which passes in a lock to the RMM pool created in mod_ldap when APR_HAS_THREADS is defined. Since doing so we have not encountered any hangs:
diff -Naur httpd-2.4.16\include\util_ldap.h httpd-2.4.16-ea\include\util_ldap.h
--- httpd-2.4.16\include\util_ldap.h    Mon Jul 14 05:07:55 2014
+++ httpd-2.4.16-ea\include\util_ldap.h Mon Aug 29 10:20:08 2016
@@ -169,6 +169,10 @@
 #if APR_HAS_SHARED_MEMORY
     apr_shm_t *cache_shm;
     apr_rmm_t *cache_rmm;
+#if APR_HAS_THREADS
+    apr_thread_mutex_t *lock;
+    apr_anylock_t cache_rmm_anylock;
+#endif
 #endif
     /* cache ald */

diff -Naur httpd-2.4.16\modules\ldap\util_ldap_cache.c httpd-2.4.16-ea\modules\ldap\util_ldap_cache.c
--- httpd-2.4.16\modules\ldap\util_ldap_cache.c    Mon Aug 19 04:45:19 2013
+++ httpd-2.4.16-ea\modules\ldap\util_ldap_cache.c Mon Aug 29 10:23:04 2016
@@ -410,6 +410,14 @@
         st->cache_shm = NULL;
         return result;
     }
+
+#if APR_HAS_THREADS
+    apr_thread_mutex_destroy(st->lock);
+    st->lock = NULL;
+    st->cache_rmm_anylock.type = apr_anylock_none;
+    st->cache_rmm_anylock.lock.pm = NULL;
+#endif
+
 #endif
     return APR_SUCCESS;
 }
@@ -436,8 +444,18 @@
     /* Determine the usable size of the shm segment. */
     size = apr_shm_size_get(st->cache_shm);
+#if APR_HAS_THREADS
+    apr_thread_mutex_create(&st->lock, APR_THREAD_MUTEX_DEFAULT, st->pool);
+    st->cache_rmm_anylock.type = apr_anylock_threadmutex;
+    st->cache_rmm_anylock.lock.tm = st->lock;
+#else
+    st->lock = NULL;
+    st->cache_rmm_anylock.type = apr_anylock_none;
+    st->cache_rmm_anylock.lock.pm = NULL;
+#endif
+
     /* This will create a rmm "handler" to get into the shared memory area */
-    result = apr_rmm_init(&st->cache_rmm, NULL,
+    result = apr_rmm_init(&st->cache_rmm, &st->cache_rmm_anylock,
                           apr_shm_baseaddr_get(st->cache_shm), size,
                           st->pool);
     if (result != APR_SUCCESS) {
Thanks!