Bug #1926265 “slapd enter in infinite loop on sched_yield syscal...” : Bionic (18.04) : Bugs : openldap package : Ubuntu

Sergio Durigan Junior (sergiodj) wrote on 2021-04-29:

#1

Thank you for taking the time to file a bug report.

This one looks like a rabbit hole :-(. I've found many (very) old reports of similar problems, but they all appear to have been fixed a while ago (before Bionic was released). I even found a possible patch (from 2005) to fix the issue, and was able to determine that Bionic's openldap already carries an improved version of the patch (unsurprisingly). I've also found an old Launchpad bug (#15270) and the related Debian bug (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=255276), which reports the same problem as you and is marked as having been fixed in Debian (also back in 2005).

I am a bit surprised that you're experiencing this problem on Bionic. I understand that it is hard to provide steps for reproducing this problem, but I would like to ask you to provide as much information as you can, please. For example:

- Your full openldap configuration (please remove any confidential bits, of course).

- Any log messages from slapd or related services.

- If you can, please install the debug symbols for openldap/slapd and run "gdb -p $PROCESS_PID" (where "$PROCESS_PID" is slapd's PID), then run a "bt" command and attach the output to this bug.

- More information about what is going on in the system when the problem happens. For example, I've read that this might happen when the system load is high; do you notice that as well?
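
The backtrace request above can also be scripted so that it is ready when the hang next occurs. The following is only a sketch: the helper name `gdb_backtrace_argv` is hypothetical, and it assumes gdb and the debug symbols are installed. It builds a non-interactive gdb command line (`-batch` with `-ex`) that attaches to the process, prints a backtrace for every thread, and detaches again.

```python
import shlex


def gdb_backtrace_argv(pid, full=True):
    """Build a gdb command line that attaches to `pid`, dumps all thread
    backtraces, and exits. The helper name is illustrative only."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    # "thread apply all" covers every slapd thread, not just the current one.
    command = "thread apply all bt full" if full else "thread apply all bt"
    return ["gdb", "-p", str(pid), "-batch", "-ex", command]


if __name__ == "__main__":
    # 12345 is a placeholder PID; substitute the PID of the running slapd.
    print(shlex.join(gdb_backtrace_argv(12345)))
```

Printing the command rather than running it keeps the sketch harmless; the resulting line can be pasted into a root shell and its output redirected to a file for attaching to the bug.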

Meanwhile, I will mark this bug as Incomplete. Feel free to revert its status back to New once you provide more info. Thanks!

Changed in openldap (Ubuntu):
status:	New → Incomplete

lincvz (cvuillemez) wrote on 2021-04-30:

#2

Thanks for your support.

> - Your full openldap configuration (please remove any confidential bits, of course).
There are lots of LDIF files containing a lot of private data. Do you need a particular part of the configuration?

> - Any log messages from slapd or related services.
olcLogLevel is "sync stats". During the last 40 minutes before the (forced) kill, slapd sent no more logs to syslog. You can see an extract below, but it doesn't help:

Apr 27 04:50:01 front01 slapd[1023]: conn=10282468 fd=732 TLS established tls_ssf=128 ssf=128
Apr 27 04:50:01 front01 slapd[1023]: conn=10282453 op=13 SRCH base="cn=Write,cn=Waiters,cn=Monitor" scope=0 deref=2 filter="(objectClass=*)"
Apr 27 04:50:01 front01 slapd[1023]: conn=10282453 op=13 SRCH attr=monitorCounter
Apr 27 05:31:59 front01 slapd[1023]: daemon: shutdown requested and initiated.
Apr 27 05:31:59 front01 slapd[1023]: conn=10281700 fd=16 closed (slapd shutdown)
[...777 more ..... ]
Apr 27 05:31:59 front01 slapd[1023]: conn=10276270 fd=854 closed (slapd shutdown)
Apr 27 05:31:59 front01 slapd[1023]: daemon: shutdown requested and initiated.
Apr 27 05:32:18 front01 slapd[122011]: @(#) $OpenLDAP: slapd  (Ubuntu) (Feb 18 2021 14:22:42) $#012#011Debian OpenLDAP Maintainers <pkg-openldap-devel@lists.alioth.debian.org>
Apr 27 05:32:18 front01 slapd[122012]: bdb_db_open: database "dc=fti,dc=net": unclean shutdown detected; attempting recovery.
Apr 27 05:32:26 front01 slapd[122012]: slapd starting
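
A silent period like the one in the extract above (04:50:01 to 05:31:59) can be located mechanically. The sketch below is an illustration only: the function name `largest_gap` is hypothetical, it assumes classic syslog timestamps at the start of each line, and the year is passed in because that format does not record one.

```python
from datetime import datetime


def largest_gap(lines, year=2021):
    """Return (gap, before, after) for the longest silence between two
    consecutive timestamped lines, or None if fewer than two lines parse."""
    stamps = []
    for line in lines:
        try:
            # "Apr 27 04:50:01" is the first 15 characters of a syslog line.
            stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
        except ValueError:
            continue  # skip lines such as "[...777 more ..... ]"
    if len(stamps) < 2:
        return None
    pairs = zip(stamps, stamps[1:])
    return max(((after - before, before, after) for before, after in pairs),
               key=lambda item: item[0])


if __name__ == "__main__":
    sample = [
        "Apr 27 04:50:01 front01 slapd[1023]: conn=10282453 op=13 SRCH attr=monitorCounter",
        "Apr 27 05:31:59 front01 slapd[1023]: daemon: shutdown requested and initiated.",
        "Apr 27 05:32:18 front01 slapd[122012]: bdb_db_open: unclean shutdown detected",
    ]
    gap, before, after = largest_gap(sample)
    print(f"longest silence: {gap} (from {before} to {after})")
```

On the two boundary lines of the extract this reports a silence of 41 minutes 58 seconds, which matches the "last 40 minutes" described above.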

>- If you can, please install the debug symbols for openldap/slapd and run "gdb -p 
> $PROCESS_PID" (where "$PROCESS_PID" is slapd's PID), then run a "bt" command and attach 
> the output to this bug.
OK, I have installed the debugging package "slapd-dbgsym" so that I can provide a backtrace next time.

>- More information about what is going on in the system when the problem happens. For 
> example, I've read that this might happen when the system load is high; do you notice that 
> as well?
The servers hosting slapd are constantly under load, but the issue occurs even when overall CPU usage is about 5%.
This may help: the problem didn't occur on the old Trusty machines (same config, same load).

lincvz (cvuillemez) wrote on 2021-04-30:

#3

I plan to migrate from the BDB backend to MDB. Maybe that could help?

Sergio Durigan Junior (sergiodj) wrote on 2021-05-03:

#4

Thanks for following up.

It is hard to say what might be happening and whether switching to MDB will help or not. I'm still puzzled that you're seeing this hang on a relatively new version of OpenLDAP. The fact that it didn't happen on Trusty may be helpful when diagnosing the issue, but I can't say for sure.

I will wait until you are able to provide a backtrace or more information. I will see about talking to other OpenLDAP experts here and see if this bug rings any bells for them.

lincvz (cvuillemez) wrote on 2021-05-07:

#5

Nice day: two slapd instances malfunctioned at the same time, maybe due to a network connectivity issue.
I have a backtrace for one of them.
Please see the attachment.
Thanks.

lincvz (cvuillemez) wrote on 2021-05-10:

#6

Attachment: 2021010_backtrace_front01_FULL.txt (56.1 KiB, text/plain)

The issue occurred again today.
This time, I am attaching a *full* backtrace.

lincvz (cvuillemez) on 2021-05-10

Changed in openldap (Ubuntu):
status:	Incomplete → New

Paride Legovini (paride) wrote on 2021-05-10:

#7

Hi, I went a bit down the rabbit hole Sergio mentioned and found the same old reports and nothing really useful. I'd also be curious to see if the problem still occurs with the newer releases.

Did you see this happen on more than one machine/deployment?

The backtrace didn't suggest anything to me, but maybe Sergio will have better use for it.

lincvz (cvuillemez) wrote on 2021-05-10:

#8

Hi, yes it can occur on any machine (slave or master) with the same configuration / OS.

Stephane Chazelas (stephane-chazelas+lp) wrote on 2021-05-14:

#9

The important backtrace in there is the one from thread 11:

#0  0x00007fb288428474 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fb2890c4518 in ?? () from /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2
No symbol table info available.
#2  0x00007fb287895848 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
No symbol table info available.
#3  0x00007fb28788f96a in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
No symbol table info available.
#4  0x00007fb287896d03 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
No symbol table info available.
#5  0x00007fb28789991c in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
No symbol table info available.
#6  0x00007fb2878a10cb in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
No symbol table info available.
#7  0x00007fb28789d572 in gnutls_handshake () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
No symbol table info available.
#8  0x00007fb289304199 in ?? () from /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2
No symbol table info available.
#9  0x00007fb289301abb in ldap_pvt_tls_accept () from /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2
No symbol table info available.
#10 0x0000556b843e6f69 in connection_read (cri=<synthetic pointer>, s=430) at ../../../../servers/slapd/connection.c:1375

debug symbols are missing there, but I have the exact same problem and get:

#0  0x00007f2a01101474 in __libc_read (fd=40, buf=0x7f29dc142ecb, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:27
#1  0x00007f2a01db0518 in sb_debug_read (sbiod=0x7f29dc10e940, buf=0x7f29dc142ecb, len=5) at ../../../../libraries/liblber/sockbuf.c:829
#2  0x00007f2a00558848 in _gnutls_stream_read (ms=0x7f29e8ffb41c, pull_func=0x7f2a01ff1da0 <tlsg_recv>, size=5, bufel=<synthetic pointer>, session=0x7f29dc008060) at buffers.c:344
#3  _gnutls_read (ms=0x7f29e8ffb41c, pull_func=0x7f2a01ff1da0 <tlsg_recv>, size=5, bufel=<synthetic pointer>, session=0x7f29dc008060) at buffers.c:424
#4  _gnutls_io_read_buffered (session=session@entry=0x7f29dc008060, total=5, recv_type=recv_type@entry=4294967295, ms=0x7f29e8ffb41c) at buffers.c:579
#5  0x00007f2a0055296a in recv_headers (ms=<optimized out>, record=0x7f29e8ffb470, htype=GNUTLS_HANDSHAKE_CLIENT_KEY_EXCHANGE, type=GNUTLS_HANDSHAKE, record_params=0x7f29dc1279f0, session=0x7f29dc008060) at record.c:1045
#6  _gnutls_recv_in_buffers (session=session@entry=0x7f29dc008060, type=type@entry=GNUTLS_HANDSHAKE, htype=htype@entry=GNUTLS_HANDSHAKE_CLIENT_KEY_EXCHANGE, ms=<optimized out>, ms@entry=0) at record.c:1173
#7  0x00007f2a00559d03 in _gnutls_handshake_io_recv_int (session=session@entry=0x7f29dc008060, htype=htype@entry=GNUTLS_HANDSHAKE_CLIENT_KEY_EXCHANGE, hsk=hsk@entry=0x7f29e8ffb580, optional=optional@entry=0) at buffers.c:1412
#8  0x00007f2a0055c91c in _gnutls_recv_handshake (session=session@entry=0x7f29dc008060, type=type@entry=GNUTLS_HANDSHAKE_CLIENT_KEY_EXCHANGE, optional=optional@entry=0, buf=buf@entry=0x7f29e8ffb830) at handshake.c:1465
#9  0x00007f2a005640cb in _gnutls_recv_client_kx_message (session=session@entry=0x7f29dc008060) at kx.c:563
#10 0x00007f2a00560572 in handshake_server (session=0x7f29dc008060) at handshake.c:3327
#11 gnutls_handshake (session=0x7f29dc008060) at handshake.c:2629
#12 0x00007f2a01ff2199 in tlsg_session_accept (session=0x7f29dc1133f0) at tls_g.c:363
#13 0x00007f2a01fefabb in ldap_pvt_tls_accept (sb=0x7f299c0051a0, ctx_arg=0x55d92cbca560) at tls2.c:425

and I've tracked it down to:

https://bugs.openldap.org/show_bug.cgi?id=8650#c12

Basically, what we see is one thread stuck in a busy loop doing read()s on the TCP socket which all return immediately with EAGAIN as the fd is in non-blocking mode.
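
The difference between spinning on EAGAIN and waiting for readiness can be shown on a local socket pair. This is a minimal sketch of the general pattern, not the OpenLDAP or GnuTLS code; the function names `spin_read` and `wait_then_read` are hypothetical, and the spinning loop is capped so the example terminates.

```python
import select
import socket


def spin_read(sock, max_attempts):
    """Retry recv() immediately after every EAGAIN, as a busy loop would.
    Returns how many attempts were wasted before data arrived or the cap hit."""
    wasted = 0
    for _ in range(max_attempts):
        try:
            sock.recv(5)
            break
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK on a non-blocking fd
            wasted += 1
    return wasted


def wait_then_read(sock, timeout):
    """Sleep in select() until the socket is readable, then read once.
    Returns the data, or None if nothing arrived within `timeout` seconds."""
    readable, _, _ = select.select([sock], [], [], timeout)
    return sock.recv(5) if readable else None


if __name__ == "__main__":
    server, client = socket.socketpair()
    server.setblocking(False)
    # The peer has sent nothing, so every immediate retry fails with EAGAIN.
    print("wasted reads:", spin_read(server, 1000))
    # select() blocks in the kernel instead of burning CPU.
    print("after waiting:", wait_then_read(server, 0.1))
    client.sendall(b"\x16\x03\x01\x00\x2e")
    print("after data:", wait_then_read(server, 1.0))
    server.close()
    client.close()
```

In the uncapped form of the first loop, a silent peer keeps one thread at 100% CPU for as long as the connection stays open, which is consistent with the behaviour described above.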

In my case, the clients go offline just after sending the TLS client hello. The busy loop lasts for about 15 minutes, probably until some timeout after which the TCP connection is eventually considered dead.
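
One plausible candidate for that timeout, which this thread does not confirm, is the Linux TCP retransmission limit. Assuming the defaults (net.ipv4.tcp_retries2 = 15, a retransmission timeout that starts at 200 ms, doubles on each retry, and is capped at 120 s), the kernel documentation quotes a hypothetical timeout of 924.6 seconds. The arithmetic can be checked as follows; the function name and parameter names are illustrative only.

```python
def retransmission_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Sum the exponentially backed-off waits for `retries` retransmissions
    plus the final wait, with each wait capped at `rto_max` seconds."""
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


if __name__ == "__main__":
    seconds = retransmission_timeout()
    print(f"{seconds:.1f} s, i.e. {seconds / 60:.1f} minutes")
```

The ten uncapped waits add up to 204.6 s and the six capped ones to 720 s, giving 924.6 s, or a little over 15 minutes. Real connections can take longer, since the kernel treats this figure as a lower bound.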

It can be reproduced by running on a client:

gdb --args ldapsearch -H ldaps://ldap.example.com -x

Then in gdb:

break write
run
continue

Then the client is paused after sending the TLS "client hello".

https://bugs.openldap.org/show_bug.cgi?id=8650#c12 explains that it's https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590a0fa6e48 that is responsible for the issue.

https://github.com/openldap/openldap/commit/4c1ab16ade18a253dd81df7e6eced4d920ac6a8e reverted that commit, but that one did not make it into bionic.

So cherry picking https://github.com/openldap/openldap/commit/4c1ab16ade18a253dd81df7e6eced4d920ac6a8e should fix it.

Stephane Chazelas (stephane-chazelas+lp) wrote on 2021-05-14:

#10

cherry picking https://github.com/openldap/openldap/commit/4c1ab16ade18a253dd81df7e6eced4d920ac6a8e should fix this particular issue but reintroduce https://bugs.openldap.org/show_bug.cgi?id=8650.

It may be necessary to pick https://github.com/openldap/openldap/commit/735e1ab14bb055344b4e767a216aa410aa7d1503 as well but as that's much more recent, I can't tell if it's valid on top of the package used in bionic. I'm going to do some tests.

Stephane Chazelas (stephane-chazelas+lp) wrote on 2021-05-14:

#11

Yes, https://github.com/openldap/openldap/commit/735e1ab14bb055344b4e767a216aa410aa7d1503 can't be directly applied there. There have been other changes in between in that section including changes in API, so it would take more effort to backport that fix.

Launchpad Janitor (janitor) wrote on 2021-05-14:

#12

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openldap (Ubuntu):
status:	New → Confirmed

Stephane Chazelas (stephane-chazelas+lp) wrote on 2021-05-14:

#13

> It can be reproduced by running on a client:
>
> gdb --args ldapsearch -H ldaps://ldap.example.com -x
>
> Then in gdb:
>
> break write
> run
> continue

I can no longer reproduce it after I rebuild and install the libldap package with https://github.com/openldap/openldap/commit/4c1ab16ade18a253dd81df7e6eced4d920ac6a8e applied.

Note that since the above is enough to render slapd unresponsive for 15 minutes, it could be considered as a serious DoS vulnerability.

Ryan Tandy (rtandy) wrote on 2021-05-14: Re: [Bug 1926265] Re: slapd enter in infinite loop on sched_yield syscall

#14

Attachment: openldap_2.4.45+dfsg-1ubuntu1.11.debdiff (2.5 KiB, text/plain; charset=us-ascii)

On Fri, May 14, 2021 at 01:36:12PM -0000, Stephane Chazelas wrote:
>The important backtrace in there is the one from thread 11:
>
>#0 0x00007fb288428474 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
>No symbol table info available.
>#1 0x00007fb2890c4518 in ?? () from /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2
>No symbol table info available.
>#2 0x00007fb287895848 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
>No symbol table info available.

This is a valid issue, but are we certain it's the same one? The
reporter talked about sched_yield and their backtraces included several
threads of back_monitor waiting on some kind of lock.

>https://bugs.openldap.org/show_bug.cgi?id=8650#c12 explains that it's
>https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590a0fa6e48
>that is responsible for the issue.
>
>https://github.com/openldap/openldap/commit/4c1ab16ade18a253dd81df7e6eced4d920ac6a8e
>reverted that commit, but that one did not make it into bionic.

Indeed. :( I didn't notice this went unfixed in an LTS, I'm sorry for
missing that.

>So cherry picking
>https://github.com/openldap/openldap/commit/4c1ab16ade18a253dd81df7e6eced4d920ac6a8e
>should fix it.

In this version it's a Debian patch, so probably just remove the
offending patch from d/patches, rather than import the revert?

On Fri, May 14, 2021 at 02:18:47PM -0000, Stephane Chazelas wrote:
>Yes,
>https://github.com/openldap/openldap/commit/735e1ab14bb055344b4e767a216aa410aa7d1503
>can't be directly applied there. There have been other changes in
>between in that section including changes in API, so it would take more
>effort to backport that fix.

Right. I'm not confident I can backport that correctly, so I'd feel
safer just doing the revert. However, sssd should also be tested, to
ensure the version in bionic isn't affected by ITS#9210
(https://bugs.openldap.org/show_bug.cgi?id=9210).

COMPLETELY UNTESTED debdiff attached.

Stephane Chazelas (stephane-chazelas+lp) on 2021-05-17

information type:

Public → Public Security

Stephane Chazelas (stephane-chazelas+lp) wrote on 2021-05-17:

#15

> This is a valid issue, but are we certain it's the same one? The
> reporter talked about sched_yield and their backtraces included several
> threads of back_monitor waiting on some kind of lock.

You're right. It may be a different issue (though possibly linked to the same root cause). In my case, all other threads are stuck on:

#0 0x00007f2a010fdad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55d92cadaa78) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55d92cadaa28, cond=0x55d92cadaa50) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x55d92cadaa50, mutex=mutex@entry=0x55d92cadaa28) at pthread_cond_wait.c:655
#3 0x00007f2a01fc7475 in ldap_pvt_thread_cond_wait (cond=cond@entry=0x55d92cadaa50, mutex=mutex@entry=0x55d92cadaa28) at ../../../../libraries/libldap_r/thr_posix.c:277
#4 0x00007f2a01fc6d1b in ldap_int_thread_pool_wrapper (xpool=0x55d92cadaa20) at ../../../../libraries/libldap_r/tpool.c:683
#5 0x00007f2a010f76db in start_thread (arg=0x7f29f99b6700) at pthread_create.c:463
#6 0x00007f2a00e1971f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Sergio Durigan Junior (sergiodj) wrote on 2021-05-19:

#16

Thanks for the further investigation, Stephane and Ryan. Much appreciated!

It would be interesting to know if lincvz could test an openldap built with Ryan's patch, to check if he can still reproduce the bug with it. I am going to prepare a PPA with Ryan's patch and let you know ASAP.

Ryan, IIUC the patch you're proposing fixes the issue experienced by Stephane, but we're not entirely sure that it's the same issue being reported in this bug. Am I right?

If that's correct, and if we verify that lincvz can still reproduce the bug even with the fix proposed by Ryan, then the best way forward would be for Stephane to open a new bug so that I can act on it.

Thanks.

Sergio Durigan Junior (sergiodj) wrote on 2021-05-19:

#17

OK, here it is:

https://launchpad.net/~sergiodj/+archive/ubuntu/openldap-bug1926265

lincvz, could you please give this a try and check if this package fixes the issue? Thank you!

Ryan Tandy (rtandy) wrote on 2021-05-19:

#18

On Wed, May 19, 2021 at 06:46:02PM -0000, Sergio Durigan Junior wrote:
>Ryan, IIUC the patch you're proposing fixes the issue experienced by
>Stephane, but we're not entirely sure that it's the same issue being
>reported in this bug. Am I right?

Yes, exactly.

(My debdiff just drops the buggy patch, at the cost of re-introducing
the original issue, which IMO has a much smaller impact than the
regression, so hopefully an acceptable trade-off.)

lincvz (cvuillemez) wrote on 2021-05-20:

#19

Thank you for the patch and your investigations. In the next few days, I cannot install the patched package on my production machines. I'll let you know when I can.

Stephane Chazelas (stephane-chazelas+lp) wrote on 2021-05-20:

#20

lincvz:
> Thank you for the patch and your investigations. In the next few days, I cannot install the patched package on my production machines. I'll let you know when I can.

Thanks.

Can you reproduce a similar issue with the modus operandi (using gdb) I describe above?

(Note that while it renders slapd unresponsive, service is restored as soon as you quit or detach gdb on the client).

If you do, it would be worth getting a backtrace of the slapd threads again, to see if you get the same ones as before or some similar to mine.

lincvz (cvuillemez) wrote on 2021-05-21:

#21

Stephane,
I can't reproduce the hang on my test machine with gdb (although your investigation seems right to me). CPU usage stays low, as usual on this machine (same slapd config as the production servers, but only 1 CPU is available).

To prove I made the right steps :-) :

$ gdb --args ldapsearch ldaps://front -x
[...]
(gdb) break write
Function "write" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (write) pending.
(gdb) run
Starting program: /usr/bin/ldapsearch ldaps://front -x
conti [Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, __GI___libc_write (fd=3, buf=0x55555578b2f0, nbytes=14) at ../sysdeps/unix/sysv/linux/write.c:27
27 ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) continue
Continuing.

Breakpoint 1, __GI___libc_write (fd=1, buf=0x55555578b2f0, nbytes=16) at ../sysdeps/unix/sysv/linux/write.c:27
27 in ../sysdeps/unix/sysv/linux/write.c

One more piece of information (maybe not important, or not related to this case):
since Monday, I have started migrating the LDAP DB backend from BDB to LMDB.
So I must wait for the next slapd hang to be sure the DB change has no impact on this issue
(in the past, 2 or 3 weeks sometimes elapsed between two slapd hangs).
Note: I reverted the DB backend on my test machine, but could not reproduce the issue there either.

Stephane Chazelas (stephane-chazelas+lp) wrote on 2021-05-21:

#22

Thanks.

I notice your first write() is on fd 3 while the second is on stdout (whereas for me the first two are on fd 3, which in my case is the LDAPS socket). For the issue to be reproduced, we need the client to be paused after having sent the TLS "client hello".

The first time the breakpoint is hit, you should be seeing something like:

```
#0  __GI___libc_write (fd=3, buf=0x555555ace4ab, nbytes=75) at ../sysdeps/unix/sysv/linux/write.c:27
#1  0x00007ffff797b408 in sb_debug_write (sbiod=0x555555ab4810, buf=0x555555ace4ab, len=75) at ../../../../libraries/liblber/sockbuf.c:854
#2  0x00007ffff6bd5a0a in _gnutls_writev_emu (fd=fd@entry=0x55555578f240, giovec=giovec@entry=0x7fffffff9b90, giovec_cnt=giovec_cnt@entry=3,
    vec=0, session=<optimized out>, session=<optimized out>) at buffers.c:447
#3  0x00007ffff6bd608c in _gnutls_writev (total=126, giovec_cnt=3, giovec=0x7fffffff9b90, session=0x55555578d3d0) at buffers.c:504
#4  _gnutls_io_write_flush (session=session@entry=0x55555578d3d0) at buffers.c:698
#5  0x00007ffff6bd7200 in _gnutls_handshake_io_write_flush (session=session@entry=0x55555578d3d0) at buffers.c:820
#6  0x00007ffff6bd9348 in _gnutls_send_handshake (session=session@entry=0x55555578d3d0, bufel=bufel@entry=0x555555acd620,
    type=type@entry=GNUTLS_HANDSHAKE_FINISHED) at handshake.c:1335
#7  0x00007ffff6bda4ed in _gnutls_send_finished (again=<optimized out>, session=0x55555578d3d0) at handshake.c:763
#8  send_handshake_final (session=session@entry=0x55555578d3d0, init=init@entry=1) at handshake.c:3098
#9  0x00007ffff6bdcd29 in handshake_client (session=0x55555578d3d0) at handshake.c:2946
#10 gnutls_handshake (session=0x55555578d3d0) at handshake.c:2626
#11 0x00007ffff7bbb183 in tlsg_session_accept (session=0x555555abb070) at tls_g.c:361
#12 0x00007ffff7bb8751 in ldap_int_tls_connect (ld=ld@entry=0x5555557890d0, conn=<optimized out>) at tls2.c:362
#13 0x00007ffff7bb9466 in ldap_int_tls_start (ld=ld@entry=0x5555557890d0, conn=conn@entry=0x555555789500, srv=srv@entry=0x555555789440)
    at tls2.c:860
#14 0x00007ffff7b912b6 in ldap_int_open_connection (ld=ld@entry=0x5555557890d0, conn=conn@entry=0x555555789500, srv=0x555555789440,
    async=async@entry=0) at open.c:448
#15 0x00007ffff7ba634d in ldap_new_connection (ld=ld@entry=0x5555557890d0, srvlist=srvlist@entry=0x555555789198, use_ldsb=use_ldsb@entry=1,
    connect=connect@entry=1, bind=bind@entry=0x0, m_req=m_req@entry=0, m_res=0) at request.c:487
#16 0x00007ffff7b9074a in ldap_open_defconn (ld=ld@entry=0x5555557890d0) at open.c:41
#17 0x00007ffff7ba7908 in ldap_send_initial_request (ld=ld@entry=0x5555557890d0, msgtype=msgtype@entry=96, dn=dn@entry=0x0, ber=0x5555557894a0,
    msgid=1) at request.c:130
#18 0x00007ffff7b9ab8f in ldap_sasl_bind (ld=0x5555557890d0, dn=0x0, mechanism=<optimized out>, cred=0x5555557692e0 <passwd>, sctrls=0x0,
    cctrls=0x0, msgidp=0x7fffffffa2dc) at sasl.c:164
#19 0x000055555555d33b in ?? ()
#20 0x000055555555844f in ?? ()
#21 0x00007ffff7388bf7 in __libc_start_main (main=0x5555555581b0, argc=4, argv=0x7fffffffe638, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffffffe628) at ../csu/libc-start.c:310
#22 0x000055555555966a in ?? ()
```

And we need to make sure the server or client doesn't reject that connection outright.

It would be worth running "tshark -i any -f 'port 636'" in the meantime to verify that the "client hello" is sent and that the server is willing to carry on with the connection.
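
For reading raw captured bytes by hand, the TLS record header is what to look for: one byte of content type (22 for handshake), two bytes of version, and a two-byte big-endian length. Those five bytes are also what the stuck server read in the backtrace earlier is waiting for (nbytes=5). The sketch below is an illustration only; `describe_tls_record` is a hypothetical helper, and it only makes sense for the unencrypted start of a handshake.

```python
import struct

HANDSHAKE = 22  # TLS record content type for handshake messages
HANDSHAKE_TYPES = {1: "client_hello", 2: "server_hello", 16: "client_key_exchange"}


def describe_tls_record(data):
    """Parse the 5-byte TLS record header at the start of `data`.
    Returns a dict for handshake records, or None for anything else."""
    if len(data) < 5:
        return None
    content_type, major, _minor, length = struct.unpack("!BBBH", data[:5])
    if content_type != HANDSHAKE or major != 3:
        return None
    # The first payload byte of a plaintext handshake record is the message type.
    msg_type = data[5] if len(data) > 5 else None
    return {"length": length, "handshake": HANDSHAKE_TYPES.get(msg_type, "unknown")}


if __name__ == "__main__":
    # Header plus first payload byte of a made-up 70-byte client hello record.
    print(describe_tls_record(bytes([22, 3, 1, 0, 70, 1])))
```

Plain LDAP traffic, which is what an ldapsearch without a proper ldaps:// URI produces, does not start with byte 22 and would return None here.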

lincvz (cvuillemez) wrote on 2021-05-21:

#23

My bad.
You're right: in my test, the second write() is performed on stdout because my ldapsearch command was wrong. I missed the '-H' arg to properly set the LDAP URI (but you did too :-p ).
Consequently, the connection used LDAP, not LDAPS, and "ldaps://..." was treated as the requested attribute :-s

Now that I have learnt to write a correct ldapsearch command and opened my eyes to read the gdb output, I can fully reproduce the issue.
Thanks.

I'll test the patched version on my test machine.

Revision history for this message

lincvz (cvuillemez) wrote on 2021-05-21:

#24

OK, the patch fixes the issue for me too in my test env.

Revision history for this message

lincvz (cvuillemez) wrote on 2021-05-21:

#25

Thanks for everything :)

Bryce Harrington (bryce) on 2021-05-24

Changed in openldap (Ubuntu Bionic):
status:	New → Triaged
importance:	Undecided → High

Revision history for this message

Bryce Harrington (bryce) wrote on 2021-05-24:

#26

If I understand the work so far, this issue is only present in Bionic, since later releases already have the release (and the subsequent fix)?

Sounds like the next action here would be to SRU the changes packaged in https://launchpad.net/~sergiodj/+archive/ubuntu/openldap-bug1926265/+packages. A possible catch, though, is that the SRU process dislikes introducing known regressions, and it sounds like this revert patch would do so. So it may be worth discussing this with an SRU administrator first. If introducing the regression is a no-go, then a Bionic-specific fix would need to be developed based on https://github.com/openldap/openldap/commit/735e1ab14bb055344b4e767a216aa410aa7d1503.

Revision history for this message

lincvz (cvuillemez) wrote on 2021-06-01:

#27

Hi Bryce, when will the fix be officially released?

Revision history for this message

Sergio Durigan Junior (sergiodj) wrote on 2021-06-01:

#28

Hi lincvz,

We are still trying to determine the best approach here. This is on my TODO list, and hopefully I can put something together by the end of this week.

Thank you for your patience.

Revision history for this message

lincvz (cvuillemez) wrote on 2021-06-03:

#29

OK, perfect :)

Revision history for this message

lincvz (cvuillemez) wrote on 2021-06-21:

#30

It seems the patch is hard to backport (?)

Revision history for this message

Sergio Durigan Junior (sergiodj) wrote on 2021-06-21:

#31

Hello lincvz,

Sorry about the delay. I have been busy with other stuff and did not have time to follow up on this bug. Here is the lay of the land right now:

1) Unfortunately, it is unlikely that we will be able to get the SRU team to accept an upload that reintroduces an issue. This means that removing the patch that causes the infinite loop is something that we would rather not do, because another problem will reappear.

2) The ideal course of action here would be to investigate whether it would be possible to backport the patch at https://github.com/openldap/openldap/commit/735e1ab14bb055344b4e767a216aa410aa7d1503 and make it fully work. This is probably a non-trivial task, as Ryan himself said.

I am still looking into a possible solution for this, but I cannot guarantee anything for now, I'm sorry (I'm very busy with other stuff, including updating OpenLDAP in Impish to 2.5!)

Given that I have provided a PPA with a package that fixes the issue for you, I would say that for now you are better off using it.

Last, but not least: if you feel like giving it a try and trying to backport https://github.com/openldap/openldap/commit/735e1ab14bb055344b4e767a216aa410aa7d1503 to the openldap version that's in Bionic, that would be awesome. We can review whatever you have and guide you through the SRU process.

Thank you.

Revision history for this message

lincvz (cvuillemez) wrote on 2021-06-22:

#32

Hi Sergio,
Thanks, I appreciate your answer.
Yes, I can use the package provided in your PPA, even if it's not very convenient to install and update it on production machines. About that, will you maintain these packages with further security updates?

Revision history for this message

Sergio Durigan Junior (sergiodj) wrote on 2021-06-23:

#33

Hi lincvz,

Unfortunately I don't plan to maintain the packages on the PPA; it would be too cumbersome to keep monitoring security updates and rebuilding the package every time. My intention with the PPA was to facilitate the diagnosis of the bug, and also to provide a hotfix for you.

I will try to talk to the SRU team again and see if we can reach some sort of consensus on how to deal with this bug, but unfortunately I cannot make promises here. Sorry about that.

Changed in openldap (Ubuntu Bionic):
importance:	High → Medium

Revision history for this message

lincvz (cvuillemez) wrote on 2021-08-06:

#34

Hi Sergio,
Sorry, I don't understand why the bug's importance has been decreased to "Medium".
Due to client-side behavior, the issue is totally unpredictable, and slapd no longer responds to requests.
Moreover, the hotfix is not suited for production machines, and can introduce a regression.
Thanks.

Revision history for this message

Paride Legovini (paride) wrote on 2022-02-03:

#35

Hi,

If I understand correctly, this is fixed in the newer Ubuntu releases (>= Focal), so I'm marking the bug task for the Ubuntu devel release accordingly. As for Bionic, as far as I can tell the situation hasn't evolved in any way, so I'm leaving the task as it is.

Changed in openldap (Ubuntu):
status:	Confirmed → Fix Released

Revision history for this message

Austin Dunham (wardred) wrote on 2022-04-12:

#36

I was able to reproduce this on our test system with comment 13.

This affects our LDAP system too.

We use LDAP for single sign on and with Radius for WiFi, so when it locks up people can't login to a lot of our systems and WiFi authentication stops.

This happens anywhere from once a week to once a month.

We have some work to do on our virtualization platform before we can update to something other than Ubuntu 18.04 LTS.

I don't see lincvz's PPA anymore. . . and if my understanding is correct that would basically make LDAP static until we did upgrade.

Is there a recommended way forward at this point if one stays on 18.04?

Revision history for this message

Sergio Durigan Junior (sergiodj) wrote on 2022-04-13:

#37

I still haven't been able to sit down and work on this issue. My comment #31 still applies, though.

I think I have deleted the PPA by mistake. I can recreate it and reupload the package, but I'm not making any promises to maintain the PPA nor to backport security fixes; it's just a workaround. Let me know if you'd like that.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2022-06-02:

#38

Waiting for an answer, setting incomplete to reflect that

Changed in openldap (Ubuntu Bionic):
status:	Triaged → Incomplete

Ubuntu
openldap package

slapd enter in infinite loop on sched_yield syscall


Affects		Status	Importance	Assigned to	Milestone
	openldap (Ubuntu)	Fix Released	Undecided	Unassigned
	Bionic	Incomplete	Medium	Unassigned
