Hangs at "Starting kernel log daemon..." with LDAP

Bug #382615 reported by Daniel Richard G.
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
sysklogd (Ubuntu)
New
Undecided
Unassigned

Bug Description

Binary package hint: sysklogd

This is similar to bug #218320, but occurs in a different context. The failure mode here is a bit complicated, so bear with me as I describe it.

Ubuntu Jaunty system. sysklogd and klogd packages at version 1.5-5ubuntu3. passwd and group lookups handled by LDAP.

In certain circumstances (which I will explain later), the system hangs at "Starting kernel log daemon..." on bootup. Logins are not even possible at this point---you enter "root", the password, and it just hangs there. Networking is not up.

I modified one of the /etc/event.d/tty* scripts to start up a root shell directly on the console. Examination of the process table at the point of the hang showed that the start-stop-daemon(8) command that actually invokes /sbin/klogd, from /etc/rc2.d/S11klogd (a.k.a. /etc/init.d/klogd) had become stuck.

I modified the init.d script to run the problematic start-stop-daemon(8) invocation through strace(1). Rebooted, another hang. The last part of the generated strace log:

----8<----
socket(PF_FILE, 0x80002 /* SOCK_??? */, 0) = 3
connect(3, {sa_family=AF_FILE, path="/dev/log"...}, 110) = 0
send(3, "<83>Jun 1 16:16:37 start-stop-daemon: nss_ldap: could not connect to any LDAP server as cn=nssldap,ou=Accounts,dc=example,dc=com - Can't contact LDAP server"..., 158, MSG_NOSIGNAL) = 158
time(NULL) = 1243887397
send(3, "<86>Jun 1 16:16:37 start-stop-daemon: nss_ldap: failed to bind to LDAP server ldaps://172.25.85.16/: Can't contact LDAP server"..., 127, MSG_NOSIGNAL) = 127
time(NULL) = 1243887397
send(3, "<86>Jun 1 16:16:37 start-stop-daemon: nss_ldap: reconnecting to LDAP server..."..., 79, MSG_NOSIGNAL
---->8----

The system cannot connect to the LDAP server because the network is not up. But networking should be up already at this point, yes? Well, I also see this in the process table (via "pstree -ap"):

----8<----
init,1
[...]
  |-ifup,1120 --allow auto eth0
  | `-sh,2107 -c ...
  | `-dhclient3,2110 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0
---->8----

What is dhclient3 doing? Attaching with strace(1) shows

----8<----
Process 2110 attached - interrupt to quit
send(6, "<30>Jun 1 16:16:39 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7"..., 88, MSG_NOSIGNAL <unfinished ...>
---->8----

Looks like it's trying to talk to /dev/log, just as start-stop-daemon(8) was doing. Now, /dev/log is a socket handled by syslogd(8). So it seems syslogd(8) might be slacking off, causing the holdups in these two programs. If we attach to syslogd(8) with strace(1), what do we see?

----8<----
Process 2272 attached - interrupt to quit
send(17, "<83>Jun 1 16:57:32 syslogd: nss_ldap: could not connect to any LDAP server as cn=nssldap,ou=Accounts,dc=example,dc=com - Can't contact LDAP server"..., 148, MSG_NOSIGNAL <unfinished ...>
---->8----

Uh oh. Is syslogd(8) itself similarly stuck, trying to talk to... er... itself?

I didn't really dig further, but you get the idea of some sort of reentrancy-deadlock with LDAP lookups that horks up syslogd(8), and in turn, everything else. There are two interesting points of note to add:

1. If you send SIGTERM to syslogd(8), killing it, the logjam clears up completely. Networking comes up, the boot process finishes, and you can log in. No syslog, but hey, working system.

2. The reason why this normally doesn't happen is because of nssldap-update-ignoreusers(8). This command is automatically invoked as part of the shutdown process; it appends a line to /etc/ldap.conf specifying users (or groups?) to be ignored by LDAP, presumably avoiding the superfluous lookups that produce the type of errors seen in the above excerpts. My system encountered this because it was "ghosted" from another system that had not undergone a reboot after installing LDAP, so the particular line was missing from /etc/ldap.conf, and so the problematic lookups were occurring. (If I reboot the system normally after SIGTERM'ing syslogd(8) as above, the problem goes away.)

I think it's not enough to say that just running nssldap-update-ignoreusers(8) fixes the problem and thus there is no bug; this is clearly a race condition that syslogd(8) right now does not handle gracefully, and ought to do so. It's entirely possible that other manifestations of the hanging behavior, like bug #218320, have the same underlying cause. I believe the ignoreusers bit should ultimately be there just to avoid spurious LDAP errors, not because its absence breaks syslog and system bootup.

Revision history for this message
Daniel Richard G. (skunk) wrote :

A couple points worth noting:

1. The deadlock is somewhat timing-dependent. It happens every time on a bare-metal PC, but I am unable to reproduce it on a slower Xen guest system.

2. For whatever reason, this deadlock never happened on Intrepid, all other circumstances being equal (same PC, same LDAP setup, no ignoreusers line in /etc/ldap.conf).

Revision history for this message
Mary Strimel (mary-strimel) wrote :

I am experiencing this situation on about my 5th reboot after a bare-metal install of Jaunty 64 bit. Doing "killall syslogd" from a terminal gets the boot process a little farther along, but then it hangs on something like "kinit: no resume image found, resuming normal boot ..."

Revision history for this message
Daniel Richard G. (skunk) wrote :

Hi Mary,

I've kind of forgotten about this report, since the bug hasn't bitten me in a while. I *think* the workaround is to run nssldap-update-ignoreusers(8), which adds an "ignoreusers" line to /etc/ldap.conf that prevents LDAP from being accessed for any system user (which I believe was what caused the problem in the first place).

Could you check to see if that's already in place? IIRC, that program is automatically run in one of the shutdown scripts, but if the system has never gotten that far to begin with, it might not be there.

Revision history for this message
Patrick Benton (pbenton) wrote :

I have 9.10.
I have the ignore users line and the ignore users commend runs when the system shuts down. I can hit my power button and it shuts down gracefully.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.