Intermittent hangs during ldap_search_ext when TLS enabled

Bug #1921562 reported by Vincent Vanlaer on 2021-03-27
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
openldap
Fix Released
Medium
openldap (Ubuntu)
Undecided
Unassigned
Focal
Undecided
Utkarsh Gupta
Groovy
Undecided
Unassigned

Bug Description

[Impact]
========

When connecting to an LDAP server with TLS, ldap_search_ext can hang if during the initial TLS handshake a signal is received by the process. The cause of this bug is the same as https://bugs.openldap.org/show_bug.cgi?id=8650.

In our case this bug cause failures in the SSSD LDAP backend at least once per day, resulting in authentication errors followed by a sssd_be restart after a timeout has been hit.

[Test Plan]
===========

When using openldap on 20.04, this bug causes failures in the SSSD LDAP backend, resulting in authentication errors followed by a sssd_be restart after a timeout has been hit:

Mar 19 19:05:31 mail auth[867454]: pam_sss(dovecot:auth): received for user redacted: 4 (System error)
Mar 19 19:05:32 mail sssd_be[867455]: Starting up

With the patched version, this should no longer be a problem.

[Where Problems Could Occur]
============================

With this patch applied, there may be few edge cases in (and varying b/w) different versions of GnuTLS. And also some bits that are discussed in https://bugs.openldap.org/show_bug.cgi?id=8650.

But that said, the patched version is already being run in production for over two weeks time (at the time of writing - 07/04/21). So I believe the SRU will clearly benefit from this and has lower risk of regression.

[More Info]
===========

A reduced version of the patch linked above can be found attached to this bug report. This patch has been applied to version 2.4.49+dfsg-2ubuntu1.7 and has been running in production for approximately a week and the issue has no longer occurred. No other issues have appeared during this period.

Related branches

Full_Name: Ryan Tandy
Version: RE24
OS: Debian
URL:
Submission from: (NULL) (24.68.41.160)
Submitted by: ryan

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=861838

That bug's submitter seems to have unintentionally configured their slapd with
the entire list of system CAs. They're fixing it, but we have a bug here too.

When the ServerHello is larger than 16kb, gnutls_handshake can return
GNUTLS_E_AGAIN. In theory this was always possible, but I'm only seeing it
happen with gnutls 3.x and haven't the exact change responsible.

We need to loop gnutls_handshake until it completes, like we do already in the
re-handshake case.

changed notes
changed state Open to Test
moved from Incoming to Software Bugs

Committed the fix, and pinged the submitter to test it.

changed notes
changed state Test to Release

changed notes
changed state Release to Closed

Hi Ryan,

I'm running into a problem with slapd 2.4.46 hanging on Ubuntu 18.04,
which seems to be a side effect of the ITS#8650 patch:

https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590a0fa6e48

slapd will run fine for a while, but during some periods of
high-traffic, it'll hang. It'll peg the CPU at 100% and won't respond to
any new LDAP connections. After some time, it'll resume working again,
but overall it's fairly unreliable.

strace on slapd during the hang shows that it's constantly making read()
calls that return EAGAIN. After doing a gdb stack trace on slapd, I
realized that these read() calls are happening as part of the busywait
for loop in tlsg_session_accept() that repeatedly calls
gnutls_handshake() when it gets EAGAIN. When slapd recovers from this
hang state, the first message it prints is a TLS negotiation failure
error on the culprit file descriptor.

If I back out the ITS#8650 patch, the problem goes away. If I insert
sleep(1) in the for loop, slapd no longer pegs the CPU at 100%, but it
still becomes unresponsive during these high-traffic periods.

I don't know what's happening during these high-traffic periods that
causes the TLS negotiation to go astray. Unfortunately it's not easy to
reproduce this problem outside of this production environment, given the
diversity of clients running different OSes with various versions of SSL
libraries.

I'm wondering if there is a better way to handle EAGAIN returned from
gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my
simplistic attempt at inserting a sleep() call which doesn't really seem
to help. I'm wondering how the GnuTLS developers intend for people to
use gnutls_handshake() properly, so as to gracefully handle sessions
that involve long packets on the one hand, without opening up a
vulnerability to chew up lots of system resources on the other hand.

Regards,

     -Kartik

Hi Kartik,

On Fri, Aug 03, 2018 at 11:19:06AM -0400, Kartik Subbarao wrote:
>I'm running into a problem with slapd 2.4.46 hanging on Ubuntu 18.04,
>which seems to be a side effect of the ITS#8650 patch:
>
>https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590a0fa6e48
>
>slapd will run fine for a while, but during some periods of
>high-traffic, it'll hang. It'll peg the CPU at 100% and won't respond
>to any new LDAP connections. After some time, it'll resume working
>again, but overall it's fairly unreliable.

Thanks for letting me know about this. This patch is running on quite a
few systems by now, I'm sorry the problem wasn't caught sooner. :/

>I'm wondering if there is a better way to handle EAGAIN returned from
>gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my
>simplistic attempt at inserting a sleep() call which doesn't really
>seem to help. I'm wondering how the GnuTLS developers intend for
>people to use gnutls_handshake() properly, so as to gracefully handle
>sessions that involve long packets on the one hand, without opening up
>a vulnerability to chew up lots of system resources on the other hand.

Right. I mean, this is how GnuTLS' own example shows to do it:

https://gitlab.com/gnutls/gnutls/blob/master/doc/examples/ex-client-dtls.c#L73-77

We could place a limit on the number of iterations, though any such
limit would have to be arbitrary.

There might be an asynchronous GnuTLS API that could be used to avoid
tying up slapd while this is going on.

I will look at how some other GnuTLS servers deal with this...

On 08/03/2018 12:09 PM, Ryan Tandy wrote:
> Thanks for letting me know about this. This patch is running on quite
> a few systems by now, I'm sorry the problem wasn't caught sooner. :/

No worries, thanks for responding so quickly on this!

>> I'm wondering if there is a better way to handle EAGAIN returned from
>> gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my
>> simplistic attempt at inserting a sleep() call which doesn't really
>> seem to help. I'm wondering how the GnuTLS developers intend for
>> people to use gnutls_handshake() properly, so as to gracefully handle
>> sessions that involve long packets on the one hand, without opening
>> up a vulnerability to chew up lots of system resources on the other
>> hand.
>
> Right. I mean, this is how GnuTLS' own example shows to do it:
>
> https://gitlab.com/gnutls/gnutls/blob/master/doc/examples/ex-client-dtls.c#L73-77
>

Hmm, that's a head-scratcher. It doesn't seem very effective to have a
non-blocking I/O interface and then recommend wrapping it in a busywait
loop :-)

> We could place a limit on the number of iterations, though any such
> limit would have to be arbitrary.
>
> There might be an asynchronous GnuTLS API that could be used to avoid
> tying up slapd while this is going on.
>
> I will look at how some other GnuTLS servers deal with this...

Cool, thanks Ryan.

Regards,

     -Kartik

Hi Karthik,

Sorry about the lack of updates on this one.

It looks clear that my patch for this ITS was wrong and needs to be
reverted.

Looking again at the original issue, after reverting the patch, I've
found that the behaviour varies with GnuTLS version. I need to figure
out why this is, which probably means spending some time bisecting
GnuTLS changes.

If I create a patch to log some additional debug info about the GnuTLS
setup, would you be willing to run it in your environment? For example
I'm curious whether EAGAIN is being returned from the read side or write
side (guessing read, but would be nice to confirm).

thanks
Ryan

On 08/26/2018 03:07 PM, Ryan Tandy wrote:
> If I create a patch to log some additional debug info about the GnuTLS
> setup, would you be willing to run it in your environment? For example
> I'm curious whether EAGAIN is being returned from the read side or
> write side (guessing read, but would be nice to confirm).

Sure, send me the patch. If the patch just does some additional GnuTLS
logging, then I don't see a problem with running it in the production
environment and sending you the results.

Regards,

     -Kartik

fixed in master
fixed in RE24 (2.4.46)
Needs to be fixed further

changed notes
changed state Closed to Open

Made some good progress on this one this evening.

The original issue this ITS is about is that gnutls_handshake() can, in
some versions of GnuTLS, return GNUTLS_E_AGAIN even when the socket is
blocking. Specifically, this happens in the case I described with a
large CA list sent by the server.

For slapd, the patch I committed is unfortunately completely wrong. It
has been using non-blocking sockets forever, EAGAIN is expected and
handled robustly -- or it was, until I introduced the busy-loop.

For clients I'm still working on figuring out the right path forward.
There is some EAGAIN handling conditional on LDAP_USE_NON_BLOCKING_TLS
which itself is behind LDAP_DEVEL. However this code is meant for
non-blocking sockets, and in my case it ends up stuck in poll() waiting
for a notification that never arrives. In 2.4, ret == 1 simply falls
into the success case and proceeds to send data without completing the
handshake first.

It's possible that what I actually want here is a (ret > 0) case in
ldap_int_tls_start for when LDAP_USE_NON_BLOCKING_TLS is absent and
ldap_int_tls_connect returns 1. (I'd also need to adapt the non-blocking
path to be able to handle a blocking socket as well.)

But it's also possible that gnutls_handshake() returning GNUTLS_E_AGAIN
with a blocking socket is simply a GnuTLS bug that was introduced at
some point. I still need to determine exactly when and why its behaviour
changed. (It is still happening with 3.5.19.)

In any case, my patch has to be reverted, as its impact (making slapd
busy-loop) is obviously worse than the status quo (misbehaving clients
in a specific case). I have pushed that revert now and will continue
digging as time permits.

On Tue, Sep 18, 2018 at 10:55:50PM -0700, Ryan Tandy wrote:
>There is some EAGAIN handling conditional on LDAP_USE_NON_BLOCKING_TLS
>which itself is behind LDAP_DEVEL. However this code is meant for
>non-blocking sockets, and in my case it ends up stuck in poll()
>waiting for a notification that never arrives.

This turned out to be because the fd and timeout fields are only
initialized when timeout is configured and the non-blocking behaviour
was triggered because of it. Otherwise the code simply doesn't
anticipate EAGAIN could be returned and the behaviour is more or less
undefined; it ends up calling poll() with fd = -1 and a garbage timeout.

>It's possible that what I actually want here is a (ret > 0) case in
>ldap_int_tls_start for when LDAP_USE_NON_BLOCKING_TLS is absent and
>ldap_int_tls_connect returns 1. (I'd also need to adapt the
>non-blocking path to be able to handle a blocking socket as well.)

More precisely, what I actually want is a (ret > 0) case that is used
unless both USE_NON_BLOCKING_TLS is true and a timeout is configured.

If we added to ber_sockbuf_ctrl() the ability to query whether the
socket is non-blocking, for GnuTLS at least we could bypass poll() and
go straight back into ldap_int_tls_connect(). However I don't see a lot
of benefit to this as long as calling poll() on a blocking socket has no
downside.

*** Issue 9210 has been marked as a duplicate of this issue. ***

The other way we can get a non-blocking socket is if the client set one up itself and gave it to us via ldap_init_fd(). sssd does this, or used to: bug 9210.

*** Issue 9210 has been marked as a duplicate of this issue. ***

Created attachment 708
test program with non-blocking socket

Here's a test program that exercises the scenario with a non-blocking socket, similar to the case described in bug 9210. Currently it fails on 2.4 with LDAP_SERVER_DOWN and on 2.5 with LDAP_TIMEOUT, but succeeds if you comment out the fcntl(). Any patch needs to correct that as well as the scenario described here with a blocking socket.

Commits:
  • 735e1ab
by Howard Chu at 2020-04-12T22:18:51+00:00
ITS#8650 loop on incomplete TLS handshake

Vincent Vanlaer (vincenttc) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openldap (Ubuntu):
status: New → Confirmed
description: updated
Changed in openldap:
importance: Unknown → Medium
status: Unknown → Fix Released

On Sat, Mar 27, 2021 at 01:06:42AM -0000, Vincent Vanlaer wrote:
>https://git.openldap.org/openldap/openldap/-/commit/735e1ab

Note that this commit is for OpenLDAP 2.5 and needs adjustment for the
2.4 branch. The commit ids for 2.4 are:

https://git.openldap.org/openldap/openldap/-/commit/7cf7aa3141
https://git.openldap.org/openldap/openldap/-/commit/85fc8974f5

The attachment "retry-tls-connect-on-eintr-eagain.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Vincent Vanlaer (vincenttc) wrote :

Just to be sure, is there anything that I would need to do in order to have the bugfix applied in a new openldap release for Ubuntu 20.04?

Utkarsh Gupta (utkarsh) wrote :

Hello Vincent,

Since the Debian version has these fixes incorporated, this would be fixed in Hirsute already (as it's in sync (with a minor delta)). For Focal, it will need somebody affected to commit to doing the necessary QA after the update is prepared (without that QA, we won't be able to land the update).

The process is documented at https://wiki.ubuntu.com/StableReleaseUpdates#Procedure. I'll add this task to the server team's backlog. If you'd like to do it sooner, you are welcome to prepare the update yourself following the documented process.

Thanks!

Changed in openldap (Ubuntu):
status: Confirmed → Fix Released
Changed in openldap (Ubuntu Focal):
status: New → Triaged
tags: added: bitesize server-next
Utkarsh Gupta (utkarsh) wrote :

Hello Vincent,

I've uploaded a fixed package in my PPA: https://launchpad.net/~utkarsh/+archive/ubuntu/experimental-dump. Could you please test this if it work alright for you before I push this to the official archive?

Thanks!

Changed in openldap (Ubuntu Focal):
assignee: nobody → Utkarsh Gupta (utkarsh)
status: Triaged → In Progress
Utkarsh Gupta (utkarsh) on 2021-04-08
description: updated

To be clear the fix that was mentioned is in 2.4.49+dfsg-4 and later,

 442 openldap (2.4.49+dfsg-4) unstable; urgency=medium
...
 453 * Import upstream patch to properly retry gnutls_handshake() after it
 454 returns GNUTLS_E_AGAIN. (ITS#8650) (Closes: #861838)

Thereby groovy is fixed as well, marking that in the bug tasks.

Changed in openldap (Ubuntu Groovy):
status: New → Fix Released
Vincent Vanlaer (vincenttc) wrote :

I've deployed the patch, I'll let you know whether it works and if any regressions occur.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.