ldap_result returns -1 when called from sssd

Bug #929888 reported by Ryan Murray on 2012-02-09
24
This bug affects 3 people
Affects Status Importance Assigned to Milestone
openldap (Fedora)
Fix Released
Critical
openldap (Ubuntu)
Critical
Unassigned

Bug Description

sssd fails to obtain ldap results and marks the ldap server as "offline" when used with libldap-2.4-2 2.4.28-1.1ubuntu2, as ldap_result always returns -1. Reverting to libldap-2.4-2 2.4.25-1.1ubuntu4 fixes the problem.

This seems to be an upstream bug, also seen in Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=771484

From the sssd log (w/ logging using gettimeofday():)

(Thu Feb 9 14:59:28:249698 2012) [sssd[be[EXAMPLE.COM]]] [sdap_get_generic_step] (6): calling ldap_search_ext with [(objectclass=*)][].
(Thu Feb 9 14:59:28:250239 2012) [sssd[be[EXAMPLE.COM]]] [sdap_get_generic_step] (8): ldap_search_ext called, msgid = 2
(Thu Feb 9 14:59:28:250282 2012) [sssd[be[EXAMPLE.COM]]] [sdap_process_result] (8): Trace: sh[0xf7cd40], connected[1], ops[0xf85c00], ldap[0xf739a0]
(Thu Feb 9 14:59:28:250317 2012) [sssd[be[EXAMPLE.COM]]] [sdap_process_result] (4): ldap_result gave -1

tcpdump at the same time, 192.168.1.2 client, 192.168.1.1 server, STARTTLS in use:

14:59:28.068940 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [S], seq 592009926, win 14600, options [mss 1460,sackOK,TS val 5276654 ecr 0,nop,wscale 7], length 0
14:59:28.070612 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [S.], seq 3103016433, ack 592009927, win 14480, options [mss 1460,sackOK,TS val 109205234 ecr 5276654,nop,wscale 6], length 0
14:59:28.070686 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [.], ack 1, win 115, options [nop,nop,TS val 5276655 ecr 109205234], length 0
14:59:28.071294 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 1:32, ack 1, win 115, options [nop,nop,TS val 5276655 ecr 109205234], length 31
14:59:28.072793 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [.], ack 32, win 227, options [nop,nop,TS val 109205234 ecr 5276655], length 0
14:59:28.072841 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [P.], seq 1:15, ack 32, win 227, options [nop,nop,TS val 109205234 ecr 5276655], length 14
14:59:28.072862 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [.], ack 15, win 115, options [nop,nop,TS val 5276655 ecr 109205234], length 0
14:59:28.079040 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 32:149, ack 15, win 115, options [nop,nop,TS val 5276657 ecr 109205234], length 117
14:59:28.097160 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [P.], seq 15:101, ack 149, win 227, options [nop,nop,TS val 109205240 ecr 5276657], length 86
14:59:28.098343 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [.], seq 101:1549, ack 149, win 227, options [nop,nop,TS val 109205240 ecr 5276657], length 1448
14:59:28.098525 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [.], ack 1549, win 137, options [nop,nop,TS val 5276662 ecr 109205240], length 0
14:59:28.099813 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [.], seq 1549:2997, ack 149, win 227, options [nop,nop,TS val 109205240 ecr 5276657], length 1448
14:59:28.099839 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [P.], seq 2997:3458, ack 149, win 227, options [nop,nop,TS val 109205240 ecr 5276657], length 461
14:59:28.099843 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [P.], seq 3458:3467, ack 149, win 227, options [nop,nop,TS val 109205240 ecr 5276657], length 9
14:59:28.099995 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [.], ack 3458, win 182, options [nop,nop,TS val 5276662 ecr 109205240], length 0
14:59:28.104322 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 149:672, ack 3467, win 182, options [nop,nop,TS val 5276663 ecr 109205240], length 523
14:59:28.104361 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 672:678, ack 3467, win 182, options [nop,nop,TS val 5276663 ecr 109205240], length 6
14:59:28.106046 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 678:811, ack 3467, win 182, options [nop,nop,TS val 5276664 ecr 109205240], length 133
14:59:28.107332 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [.], ack 811, win 260, options [nop,nop,TS val 109205243 ecr 5276663], length 0
14:59:28.240882 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [P.], seq 3467:3473, ack 811, win 260, options [nop,nop,TS val 109205276 ecr 5276663], length 6
14:59:28.240926 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [P.], seq 3473:3654, ack 811, win 260, options [nop,nop,TS val 109205276 ecr 5276663], length 181
14:59:28.241193 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [.], ack 3654, win 205, options [nop,nop,TS val 5276698 ecr 109205276], length 0

14:59:28.249698 calling ldap_search_ext (objectclass=*)

14:59:28.250206 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 811:1200, ack 3654, win 205, options [nop,nop,TS val 5276700 ecr 109205276], length 389

14:59:28.250239 ldap_search_ext called, msgid = 2
14:59:28.250282 ldap_result called
14:59:28.250317 ldap_result returns -1

14:59:28.250535 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 1200:1381, ack 3654, win 205, options [nop,nop,TS val 5276700 ecr 109205276], length 181
14:59:28.250886 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 1381:1498, ack 3654, win 205, options [nop,nop,TS val 5276700 ecr 109205276], length 117
14:59:28.251052 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [P.], seq 1498:1727, ack 3654, win 205, options [nop,nop,TS val 5276700 ecr 109205276], length 229
14:59:28.251075 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [F.], seq 1727, ack 3654, win 205, options [nop,nop,TS val 5276700 ecr 109205276], length 0
14:59:28.251295 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [.], ack 1381, win 294, options [nop,nop,TS val 109205279 ecr 5276700], length 0
14:59:28.253382 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [P.], seq 3654:3723, ack 1498, win 294, options [nop,nop,TS val 109205279 ecr 5276700], length 69
14:59:28.253430 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [R], seq 592011424, win 0, length 0
14:59:28.253477 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [F.], seq 3723, ack 1498, win 294, options [nop,nop,TS val 109205279 ecr 5276700], length 0
14:59:28.253488 IP 192.168.1.2.53857 > 192.168.1.1.389: Flags [R], seq 592011424, win 0, length 0
14:59:28.253918 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [R], seq 3103020087, win 0, length 0
14:59:28.253938 IP 192.168.1.1.389 > 192.168.1.2.53857: Flags [R], seq 3103020087, win 0, length 0

Description of problem:
sssd ID searches break with openldap-2.4.28-1.fc17.

How reproducible:
Always

Steps to Reproduce:
1. Set up sssd on a Rawhide box
2. Run getent passwd <userid>, eg. "getent passwd kdreyer"

Actual results:
(user is not found)

Expected results:
kdreyer:*:500:100:Ken Dreyer:/home/kdreyer:/bin/bash

Additional info:

My LDAP server is CentOS 5, openldap-servers-2.3.43-12.el5_7.10

(regular openldap-clients commands in 2.4.28-1.fc17, for example ldapwhoami, or ldapsearch, work fine against this server. It is only sssd that breaks here.)

In the sssd_KTDREYER.log:

(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [simple_bind_send] (0x0100): Executing simple bind as: (null)
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [simple_bind_send] (0x2000): ldap simple bind sent, msgid = 2
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_process_result] (0x2000): Trace: sh[0xb8eee4c0], connected[1], ops[0xb8f76e10], ldap[0xb8ee1ca0]
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_process_result] (0x0100): ldap_result gave -1, something bad happend!
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_handle_release] (0x2000): Trace: sh[0xb8eee4c0], connected[1], ops[0xb8f76e10], ldap[0xb8ee1ca0], destructor_lock[0], release_memory[0]
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [remove_connection_callback] (0x4000): Successfully removed connection callback.
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [fo_set_port_status] (0x0100): Marking port 636 of server 'salt.ktdreyer.com' as 'not working'

My sssd config is working fine on EL5, EL6, and F15. Thanks to sgallagh in #sssd, I found out that when I dowgrade openldap to 2.4.26-5.fc17, sssd works again.
http://koji.fedoraproject.org/koji/buildinfo?buildID=267220

There is a regression in openldap-libs somewhere between 2.4.26-5.fc17 and 2.4.28-1.fc17. We don't get any useful information from openldap, we only receive -1 back from ldap_result().

I managed to reproduce this with openldap-2.4.28-2. However, after playing around with gdb trying to find out some more info about what's going on, it started to work and I can't reproduce this anymore. I really have no idea what happened. I didn't change any setting or anything in ldap database.

Here is roughly what I did:
1. run sssd -d 10 -i
2. find sssd_be pid and run gdb
3. attach gdb to sssd_be process
4. set breakpoint at sdap_async.c:1019
5. run getent passwd <email address hidden>
6. make few steps in gdb, then continue

Any ideas about what might have gone wrong? (or more precisely, gone 'right' in this particular case)

I know that attaching gdb to a process makes the process pause, so maybe the execution order changed somehow. What really escapes me, though, is why I can't reproduce this after I restart the sssd and slapd, or even reboot the machine.

Ah ok, I removed the cache file and it's reproducible again. But still not sure why it started to work.

(In reply to comment #3)
> Ah ok, I removed the cache file and it's reproducible again. But still not sure
> why it started to work.

SSSD caches the entries internally into an on-disk cache (as you figured out). I believe that the second request came straight out of the cache without even contacting SSSD.

If you set:
entry_cache_timeout = 1
into the [domain/redhat.com] section of your sssd.conf, then all cache entries will be only valid for 1 second, so even the second request should go all the way into the LDAP code.

Actually, one more gotcha..the cache expiration time is stored in the cache itself, so you'll probably want to either remove the cache after you changed entry_cache_timeout parameter or use the sss_cache tool (available in 1.7 an later) to mark the cached entries as expired.

Created attachment 559011
SSSD log after returning expected result.

I'm pretty much hopeless on this one. I attached a log where I got the expected result from openldap; that means after running 'getent passwd <email address hidden>', sssd contacted ldap and retrieved the info. What I don't understand is why there is that 'ldap_result gave -1' at the end of the log. I removed the cache files before running the command.

Anyone from sssd team willing to give a hint on what might have gone wrong?

It looks like some kind of a weird race condition to me.
This is how I test it:
1. start slapd (no nonstandard configuration + minimal tree to get the result)
2. (re)start slapd -d 10 -i
3. getent passwd <email address hidden>
4. rm -rf /var/lib/sss/db/*
5. goto 2.

At some point, step 3 prints the expected result.

(In reply to comment #7)
> I'm pretty much hopeless on this one. I attached a log where I got the expected
> result from openldap; that means after running 'getent passwd
> <email address hidden>', sssd contacted ldap and retrieved the info. What I don't
> understand is why there is that 'ldap_result gave -1' at the end of the log. I
> removed the cache files before running the command.
>
> Anyone from sssd team willing to give a hint on what might have gone wrong?
>
> It looks like some kind of a weird race condition to me.
> This is how I test it:
> 1. start slapd (no nonstandard configuration + minimal tree to get the result)
> 2. (re)start slapd -d 10 -i
> 3. getent passwd <email address hidden>
> 4. rm -rf /var/lib/sss/db/*
> 5. goto 2.
>
> At some point, step 3 prints the expected result.

I can prepare a scratch build that would print openldap debug info to stderr. Would that be helpful?

I think it might be. Thank you.

I'm not sure what exact version are you testing with so I'm going to attach a patch that applies on the upstream 1.7 branch - that should be pretty close to what you have.

In order to use the debugging, you need to export the SSSD_DEBUG_LDAP_SEARCH variable with the desired LDAP debug level.

For example:
export SSSD_DEBUG_LDAP_SEARCH="0xffff"
would set LDAP_DEBUG_ANY

Then run SSSD on the foreground, the debug messages would simply appear on stderr.

Created attachment 559069
A non-upstream patch to enable libldap debugging

We're getting reports about this breakage all over the place. Please look into this as soon as possible.

Stephen, can you reproduce the bug? Please, can you retest with openldap from the following repository?

http://repos.fedorapeople.org/repos/jvcelak/openldap/

The latest package there (openldap-2.4.29-0.3.tc4) is possibly the future 2.4.29, which will be released soon. It is already tagged in upstream git. If it is a regression, then it might be already fixed.

Timo Aaltonen (tjaalton) on 2012-02-10
Changed in openldap (Ubuntu):
importance: Undecided → Critical
status: New → Confirmed

I have tried rebuilding SSSD against:

$ rpm -qa \*openldap\*
openldap-devel-2.4.29-0.3.tc4.fc17.x86_64
openldap-2.4.29-0.3.tc4.fc17.x86_64

and I am still seeing the bug.

Unsetting needinfo, the requested information was provided in comment #14. Please let us know if you need any additional information to produce a fix.

I have a little more information about this.

SSSD's event loop listens on the file descriptor we get from
ber_sockbuf_ctrl(sb, LBER_SB_OPT_GET_FD, &ber_fd);

We call a callback to process results from LDAP when that file descriptor becomes marked as readable by the kernel (assumption: there's data on it to be read).

However, it appears that this file descriptor is now sometimes being set "readable" without data that can be processed by ldap_result().

This is the source of the regression. Downgrading to F16's openldap works properly.

In the case above, it seems like the file descriptor is being marked readable before the response has actually arrived. Additionally, see BZ #790414 for another manifestation of this. In that particular case, we're getting it marked as readable AFTER the response has already been fully processed.

btw I bisected the version where the regression appeared and openldap-2.4.26-6.fc17 is NOT affected, while openldap-2.4.28-1.fc17 IS affected.

Thank you for the additional info. I'm already looking into this.

How do you test that the descriptor is readable?

SSSD relies on tevent (http://tevent.samba.org) for its mainloop. We register file descriptors with it and a callback to invoke on read or write readiness. Under the hood, tevent uses epoll().

Things started working again when I reverted upstream commit 06ec9f1db2bb9dc304a4adcd0d0506203cf9e6b6

I'm not actually familiar enough with openldap sources to say whether bailing out if "serviced" is not set in any of the branches is the right thing to do or whether there is another bug, but maybe this can help you triage the bug further..

Proposing as an F17 Alpha blocker so we can kick around the possible consequences of this in the Friday meeting: it's a judgement call. It's at least possible that any kind of usage of LDAP with F17 could be broken due to this bug; at minimum, we know it can break login via SSSD. Thus it's a conditional breakage of the criterion "Following on from the previous criterion, after firstboot is completed and on subsequent boots, a system installed according to any of the above criteria (or the appropriate Beta or Final criteria, when applying this criterion to those releases) must boot to a working graphical environment without unintended user intervention. This includes correctly accessing any encrypted partitions when the correct passphrase is supplied" - breaks it in the case of using SSSD.

Note that the bug can affect configurations where only the client is on F17 - you don't have to have the server running F17 to run into this bug. So that's a factor.

--
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Timo Aaltonen (tjaalton) wrote :

Ryan, can you build openldap with the proposed patch and test?

Timo Aaltonen (tjaalton) wrote :

Built it myself (64bit) and put it here:

http://kernel.ubuntu.com/~tjaalton/lp929888

install the lib and test.

Timo Aaltonen (tjaalton) wrote :

scratch that, it was tested elsewhere and didn't help :/

-1 to alpha +1 to nth +1 to beta

Jason Sharp (jsharp) wrote :

Not sure If i helps, I had sssd working in 12.04 with Alpha 1...never thought to look at the package version for openldap, but what ever version was included with Alpha 1 was ok. If that helps

Discussed at the blocker review meeting of 2012-02-17. Agreed that this is not a blocker, we don't consider SSSD auth important/popular enough to block an Alpha release, but it is accepted as NTH so long as the fix is isolated to openldap as we'd expect. If the fix requires touching more general components, we may revisit the NTH status.

NTH means that if a fix for this is made available ahead of an RC spin, we will likely choose to pull that fix into the spin, through the freeze - but we will not delay the release if no fix shows up.

--
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

We have identified the cause and we are looking for a possible solution. If we don't find any before the RC deadline, I will revert the upstream patch identified by Jakub in (comment #20). So SSSD in F17 will work.

'RC deadline' is basically today: go/no-go is wednesday, so we need to compose rc3 today really in order to have time to test it.

--
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Note this is not a release blocker

So instead of stressing everybody out ( Releng/QA/You guys) just introduce the proper fix via update before beta...

*** Bug 795502 has been marked as a duplicate of this bug. ***

http://www.openldap.org/its/index.cgi/Incoming?id=7167
contains a patch that partially resolves the issue, but I think we also have a bug in sssd where we erroneously assumed that calling ldap_result() after all results had already been returned was acceptable.
It was not, but ldap_result() used to return 0 anyways.

I think we should open a new bug for the opneldap libraries and return this one to SSSD, so we can handle the other half of the fix in SSSD itself.

Sorry disregard comment #29, I misinterpreted one reply from HYC.

The patch in http://www.openldap.org/its/index.cgi/Incoming?id=7167 should fully fix the SSSD issue w/o any code change on the SSSD side.

If this checks out we can update 7167 and consider the whole incident fixed. The patch in 7167#9 has also already been pushed to the openldap git master tree.

Scratch build with upstream patch applied:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3806953

Please, test!

It works in my test configuration. Can someone verify?

Works for me, too. Great!

Timo Aaltonen (tjaalton) wrote :

The fix to this issue got applied upstream and I uploaded it to precise (referred to the duplicate bug on the changelog, boo).

Please reopen if you still see it with libldap 2.4.28-1.1ubuntu3.

Changed in openldap (Ubuntu):
status: Confirmed → Fix Released

Resolved in:
openldap-2.4.29-3.fc17
openldap-2.4.29-3.fc18

openldap-2.4.29-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/openldap-2.4.29-3.fc17

Package openldap-2.4.29-3.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing openldap-2.4.29-3.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-2113/openldap-2.4.29-3.fc17
then log in and leave karma (feedback).

I confirmed that openldap-2.4.29-3.fc18 fixes the original bug. Thanks all for working on this!

*** Bug 790414 has been marked as a duplicate of this bug. ***

openldap-2.4.29-3.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.

Changed in openldap (Fedora):
importance: Unknown → Critical
status: Unknown → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.