Intermittent hangs during ldap_search_ext when TLS enabled

Bug #1921562 reported by Vincent Vanlaer
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
openldap
Fix Released
Medium
openldap (Ubuntu)
Fix Released
Undecided
Unassigned
Focal
Fix Released
Undecided
Utkarsh Gupta
Groovy
Fix Released
Undecided
Unassigned

Bug Description

[Impact]
========

When connecting to an LDAP server with TLS, ldap_search_ext can hang if during the initial TLS handshake a signal is received by the process. The cause of this bug is the same as https://bugs.openldap.org/show_bug.cgi?id=8650.

In our case this bug cause failures in the SSSD LDAP backend at least once per day, resulting in authentication errors followed by a sssd_be restart after a timeout has been hit.

[Test Plan]
===========

When using openldap on 20.04, this bug causes failures in the SSSD LDAP backend, resulting in authentication errors followed by a sssd_be restart after a timeout has been hit:

Mar 19 19:05:31 mail auth[867454]: pam_sss(dovecot:auth): received for user redacted: 4 (System error)
Mar 19 19:05:32 mail sssd_be[867455]: Starting up

With the patched version, this should no longer be a problem.

[Where Problems Could Occur]
============================

With this patch applied, there may be few edge cases in (and varying b/w) different versions of GnuTLS. And also some bits that are discussed in https://bugs.openldap.org/show_bug.cgi?id=8650.

But that said, the patched version is already being run in production for over two weeks time (at the time of writing - 07/04/21). So I believe the SRU will clearly benefit from this and has lower risk of regression.

[More Info]
===========

A reduced version of the patch linked above can be found attached to this bug report. This patch has been applied to version 2.4.49+dfsg-2ubuntu1.7 and has been running in production for approximately a week and the issue has no longer occurred. No other issues have appeared during this period.

Related branches

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

Full_Name: Ryan Tandy
Version: RE24
OS: Debian
URL:
Submission from: (NULL) (24.68.41.160)
Submitted by: ryan

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=861838

That bug's submitter seems to have unintentionally configured their slapd with
the entire list of system CAs. They're fixing it, but we have a bug here too.

When the ServerHello is larger than 16kb, gnutls_handshake can return
GNUTLS_E_AGAIN. In theory this was always possible, but I'm only seeing it
happen with gnutls 3.x and haven't the exact change responsible.

We need to loop gnutls_handshake until it completes, like we do already in the
re-handshake case.

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

changed notes
changed state Open to Test
moved from Incoming to Software Bugs

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

Committed the fix, and pinged the submitter to test it.

Revision history for this message
In , Quanah-x (quanah-x) wrote :

changed notes
changed state Test to Release

Revision history for this message
In , Quanah-x (quanah-x) wrote :

changed notes
changed state Release to Closed

Revision history for this message
In , Kartik Subbarao (subbarao) wrote :

Hi Ryan,

I'm running into a problem with slapd 2.4.46 hanging on Ubuntu 18.04,
which seems to be a side effect of the ITS#8650 patch:

https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590a0fa6e48

slapd will run fine for a while, but during some periods of
high-traffic, it'll hang. It'll peg the CPU at 100% and won't respond to
any new LDAP connections. After some time, it'll resume working again,
but overall it's fairly unreliable.

strace on slapd during the hang shows that it's constantly making read()
calls that return EAGAIN. After doing a gdb stack trace on slapd, I
realized that these read() calls are happening as part of the busywait
for loop in tlsg_session_accept() that repeatedly calls
gnutls_handshake() when it gets EAGAIN. When slapd recovers from this
hang state, the first message it prints is a TLS negotiation failure
error on the culprit file descriptor.

If I back out the ITS#8650 patch, the problem goes away. If I insert
sleep(1) in the for loop, slapd no longer pegs the CPU at 100%, but it
still becomes unresponsive during these high-traffic periods.

I don't know what's happening during these high-traffic periods that
causes the TLS negotiation to go astray. Unfortunately it's not easy to
reproduce this problem outside of this production environment, given the
diversity of clients running different OSes with various versions of SSL
libraries.

I'm wondering if there is a better way to handle EAGAIN returned from
gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my
simplistic attempt at inserting a sleep() call which doesn't really seem
to help. I'm wondering how the GnuTLS developers intend for people to
use gnutls_handshake() properly, so as to gracefully handle sessions
that involve long packets on the one hand, without opening up a
vulnerability to chew up lots of system resources on the other hand.

Regards,

     -Kartik

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

Hi Kartik,

On Fri, Aug 03, 2018 at 11:19:06AM -0400, Kartik Subbarao wrote:
>I'm running into a problem with slapd 2.4.46 hanging on Ubuntu 18.04,
>which seems to be a side effect of the ITS#8650 patch:
>
>https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590a0fa6e48
>
>slapd will run fine for a while, but during some periods of
>high-traffic, it'll hang. It'll peg the CPU at 100% and won't respond
>to any new LDAP connections. After some time, it'll resume working
>again, but overall it's fairly unreliable.

Thanks for letting me know about this. This patch is running on quite a
few systems by now, I'm sorry the problem wasn't caught sooner. :/

>I'm wondering if there is a better way to handle EAGAIN returned from
>gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my
>simplistic attempt at inserting a sleep() call which doesn't really
>seem to help. I'm wondering how the GnuTLS developers intend for
>people to use gnutls_handshake() properly, so as to gracefully handle
>sessions that involve long packets on the one hand, without opening up
>a vulnerability to chew up lots of system resources on the other hand.

Right. I mean, this is how GnuTLS' own example shows to do it:

https://gitlab.com/gnutls/gnutls/blob/master/doc/examples/ex-client-dtls.c#L73-77

We could place a limit on the number of iterations, though any such
limit would have to be arbitrary.

There might be an asynchronous GnuTLS API that could be used to avoid
tying up slapd while this is going on.

I will look at how some other GnuTLS servers deal with this...

Revision history for this message
In , Kartik Subbarao (subbarao) wrote :

On 08/03/2018 12:09 PM, Ryan Tandy wrote:
> Thanks for letting me know about this. This patch is running on quite
> a few systems by now, I'm sorry the problem wasn't caught sooner. :/

No worries, thanks for responding so quickly on this!

>> I'm wondering if there is a better way to handle EAGAIN returned from
>> gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my
>> simplistic attempt at inserting a sleep() call which doesn't really
>> seem to help. I'm wondering how the GnuTLS developers intend for
>> people to use gnutls_handshake() properly, so as to gracefully handle
>> sessions that involve long packets on the one hand, without opening
>> up a vulnerability to chew up lots of system resources on the other
>> hand.
>
> Right. I mean, this is how GnuTLS' own example shows to do it:
>
> https://gitlab.com/gnutls/gnutls/blob/master/doc/examples/ex-client-dtls.c#L73-77
>

Hmm, that's a head-scratcher. It doesn't seem very effective to have a
non-blocking I/O interface and then recommend wrapping it in a busywait
loop :-)

> We could place a limit on the number of iterations, though any such
> limit would have to be arbitrary.
>
> There might be an asynchronous GnuTLS API that could be used to avoid
> tying up slapd while this is going on.
>
> I will look at how some other GnuTLS servers deal with this...

Cool, thanks Ryan.

Regards,

     -Kartik

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

Hi Karthik,

Sorry about the lack of updates on this one.

It looks clear that my patch for this ITS was wrong and needs to be
reverted.

Looking again at the original issue, after reverting the patch, I've
found that the behaviour varies with GnuTLS version. I need to figure
out why this is, which probably means spending some time bisecting
GnuTLS changes.

If I create a patch to log some additional debug info about the GnuTLS
setup, would you be willing to run it in your environment? For example
I'm curious whether EAGAIN is being returned from the read side or write
side (guessing read, but would be nice to confirm).

thanks
Ryan

Revision history for this message
In , Kartik Subbarao (subbarao) wrote :

On 08/26/2018 03:07 PM, Ryan Tandy wrote:
> If I create a patch to log some additional debug info about the GnuTLS
> setup, would you be willing to run it in your environment? For example
> I'm curious whether EAGAIN is being returned from the read side or
> write side (guessing read, but would be nice to confirm).

Sure, send me the patch. If the patch just does some additional GnuTLS
logging, then I don't see a problem with running it in the production
environment and sending you the results.

Regards,

     -Kartik

Revision history for this message
In , O-bugs (o-bugs) wrote :

fixed in master
fixed in RE24 (2.4.46)
Needs to be fixed further

Revision history for this message
In , Quanah-x (quanah-x) wrote :

changed notes
changed state Closed to Open

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

Made some good progress on this one this evening.

The original issue this ITS is about is that gnutls_handshake() can, in
some versions of GnuTLS, return GNUTLS_E_AGAIN even when the socket is
blocking. Specifically, this happens in the case I described with a
large CA list sent by the server.

For slapd, the patch I committed is unfortunately completely wrong. It
has been using non-blocking sockets forever, EAGAIN is expected and
handled robustly -- or it was, until I introduced the busy-loop.

For clients I'm still working on figuring out the right path forward.
There is some EAGAIN handling conditional on LDAP_USE_NON_BLOCKING_TLS
which itself is behind LDAP_DEVEL. However this code is meant for
non-blocking sockets, and in my case it ends up stuck in poll() waiting
for a notification that never arrives. In 2.4, ret == 1 simply falls
into the success case and proceeds to send data without completing the
handshake first.

It's possible that what I actually want here is a (ret > 0) case in
ldap_int_tls_start for when LDAP_USE_NON_BLOCKING_TLS is absent and
ldap_int_tls_connect returns 1. (I'd also need to adapt the non-blocking
path to be able to handle a blocking socket as well.)

But it's also possible that gnutls_handshake() returning GNUTLS_E_AGAIN
with a blocking socket is simply a GnuTLS bug that was introduced at
some point. I still need to determine exactly when and why its behaviour
changed. (It is still happening with 3.5.19.)

In any case, my patch has to be reverted, as its impact (making slapd
busy-loop) is obviously worse than the status quo (misbehaving clients
in a specific case). I have pushed that revert now and will continue
digging as time permits.

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

On Tue, Sep 18, 2018 at 10:55:50PM -0700, Ryan Tandy wrote:
>There is some EAGAIN handling conditional on LDAP_USE_NON_BLOCKING_TLS
>which itself is behind LDAP_DEVEL. However this code is meant for
>non-blocking sockets, and in my case it ends up stuck in poll()
>waiting for a notification that never arrives.

This turned out to be because the fd and timeout fields are only
initialized when timeout is configured and the non-blocking behaviour
was triggered because of it. Otherwise the code simply doesn't
anticipate EAGAIN could be returned and the behaviour is more or less
undefined; it ends up calling poll() with fd = -1 and a garbage timeout.

>It's possible that what I actually want here is a (ret > 0) case in
>ldap_int_tls_start for when LDAP_USE_NON_BLOCKING_TLS is absent and
>ldap_int_tls_connect returns 1. (I'd also need to adapt the
>non-blocking path to be able to handle a blocking socket as well.)

More precisely, what I actually want is a (ret > 0) case that is used
unless both USE_NON_BLOCKING_TLS is true and a timeout is configured.

If we added to ber_sockbuf_ctrl() the ability to query whether the
socket is non-blocking, for GnuTLS at least we could bypass poll() and
go straight back into ldap_int_tls_connect(). However I don't see a lot
of benefit to this as long as calling poll() on a blocking socket has no
downside.

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

*** Issue 9210 has been marked as a duplicate of this issue. ***

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

The other way we can get a non-blocking socket is if the client set one up itself and gave it to us via ldap_init_fd(). sssd does this, or used to: bug 9210.

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

*** Issue 9210 has been marked as a duplicate of this issue. ***

Revision history for this message
In , Ryan Tandy (rtandy) wrote :

Created attachment 708
test program with non-blocking socket

Here's a test program that exercises the scenario with a non-blocking socket, similar to the case described in bug 9210. Currently it fails on 2.4 with LDAP_SERVER_DOWN and on 2.5 with LDAP_TIMEOUT, but succeeds if you comment out the fcntl(). Any patch needs to correct that as well as the scenario described here with a blocking socket.

Revision history for this message
In , Howard Chu (hyc) wrote :

Commits:
  • 735e1ab
by Howard Chu at 2020-04-12T22:18:51+00:00
ITS#8650 loop on incomplete TLS handshake

Revision history for this message
Vincent Vanlaer (vincenttc) wrote :
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openldap (Ubuntu):
status: New → Confirmed
description: updated
Changed in openldap:
importance: Unknown → Medium
status: Unknown → Fix Released
Revision history for this message
Ryan Tandy (rtandy) wrote : Re: [Bug 1921562] [NEW] Intermittent hangs during ldap_search_ext when TLS enabled

On Sat, Mar 27, 2021 at 01:06:42AM -0000, Vincent Vanlaer wrote:
>https://git.openldap.org/openldap/openldap/-/commit/735e1ab

Note that this commit is for OpenLDAP 2.5 and needs adjustment for the
2.4 branch. The commit ids for 2.4 are:

https://git.openldap.org/openldap/openldap/-/commit/7cf7aa3141
https://git.openldap.org/openldap/openldap/-/commit/85fc8974f5

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "retry-tls-connect-on-eintr-eagain.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Revision history for this message
Vincent Vanlaer (vincenttc) wrote :

Just to be sure, is there anything that I would need to do in order to have the bugfix applied in a new openldap release for Ubuntu 20.04?

Revision history for this message
Utkarsh Gupta (utkarsh) wrote :

Hello Vincent,

Since the Debian version has these fixes incorporated, this would be fixed in Hirsute already (as it's in sync (with a minor delta)). For Focal, it will need somebody affected to commit to doing the necessary QA after the update is prepared (without that QA, we won't be able to land the update).

The process is documented at https://wiki.ubuntu.com/StableReleaseUpdates#Procedure. I'll add this task to the server team's backlog. If you'd like to do it sooner, you are welcome to prepare the update yourself following the documented process.

Thanks!

Changed in openldap (Ubuntu):
status: Confirmed → Fix Released
Changed in openldap (Ubuntu Focal):
status: New → Triaged
tags: added: bitesize server-next
Revision history for this message
Utkarsh Gupta (utkarsh) wrote :

Hello Vincent,

I've uploaded a fixed package in my PPA: https://launchpad.net/~utkarsh/+archive/ubuntu/experimental-dump. Could you please test this if it work alright for you before I push this to the official archive?

Thanks!

Changed in openldap (Ubuntu Focal):
assignee: nobody → Utkarsh Gupta (utkarsh)
status: Triaged → In Progress
Utkarsh Gupta (utkarsh)
description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

To be clear the fix that was mentioned is in 2.4.49+dfsg-4 and later,

 442 openldap (2.4.49+dfsg-4) unstable; urgency=medium
...
 453 * Import upstream patch to properly retry gnutls_handshake() after it
 454 returns GNUTLS_E_AGAIN. (ITS#8650) (Closes: #861838)

Thereby groovy is fixed as well, marking that in the bug tasks.

Changed in openldap (Ubuntu Groovy):
status: New → Fix Released
Revision history for this message
Vincent Vanlaer (vincenttc) wrote :

I've deployed the patch, I'll let you know whether it works and if any regressions occur.

Revision history for this message
Vincent Vanlaer (vincenttc) wrote :

The bug hasn't returned since I installed the fixed package and no new issues have cropped up.

Revision history for this message
Utkarsh Gupta (utkarsh) wrote :

Hi Vincent,

> The bug hasn't returned since I installed the fixed package and no
> new issues have cropped up.

Awesome, thank you for your confirmation! \o/

Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Vincent, or anyone else affected,

Accepted openldap into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/openldap/2.4.49+dfsg-2ubuntu1.8 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in openldap (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Vincent Vanlaer (vincenttc) wrote :

I've been running 2.4.49+dfsg-2ubuntu1.8 from focal-proposed for the past few days and the issue has not returned. As otherwise the issue would occur at least once per day, I consider it fixed. Furthermore, no other issues have cropped up in the meantime.

tags: added: verification-done-focal
removed: verification-needed-focal
Utkarsh Gupta (utkarsh)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openldap - 2.4.49+dfsg-2ubuntu1.8

---------------
openldap (2.4.49+dfsg-2ubuntu1.8) focal; urgency=medium

  * d/p/ITS-8650-loop-on-incomplete-TLS-handshake.patch:
    Import upstream patch to properly retry gnutls_handshake() after it
    returns GNUTLS_E_AGAIN. (ITS#8650) (LP: #1921562)

 -- Utkarsh Gupta <email address hidden> Thu, 08 Apr 2021 09:52:01 +0530

Changed in openldap (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for openldap has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.