Hangs at "Starting kernel log daemon..." with LDAP
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
sysklogd (Ubuntu) |
New
|
Undecided
|
Unassigned |
Bug Description
Binary package hint: sysklogd
This is similar to bug #218320, but occurs in a different context. The failure mode here is a bit complicated, so bear with me as I describe it.
Ubuntu Jaunty system. sysklogd and klogd packages at version 1.5-5ubuntu3. passwd and group lookups handled by LDAP.
In certain circumstances (which I will explain later), the system hangs at "Starting kernel log daemon..." on bootup. Logins are not even possible at this point---you enter "root", the password, and it just hangs there. Networking is not up.
I modified one of the /etc/event.d/tty* scripts to start up a root shell directly on the console. Examination of the process table at the point of the hang showed that the start-stop-
I modified the init.d script to run the problematic start-stop-
----8<----
socket(PF_FILE, 0x80002 /* SOCK_??? */, 0) = 3
connect(3, {sa_family=AF_FILE, path="/
send(3, "<83>Jun 1 16:16:37 start-stop-daemon: nss_ldap: could not connect to any LDAP server as cn=nssldap,
time(NULL) = 1243887397
send(3, "<86>Jun 1 16:16:37 start-stop-daemon: nss_ldap: failed to bind to LDAP server ldaps:/
time(NULL) = 1243887397
send(3, "<86>Jun 1 16:16:37 start-stop-daemon: nss_ldap: reconnecting to LDAP server..."..., 79, MSG_NOSIGNAL
---->8----
The system cannot connect to the LDAP server because the network is not up. But networking should be up already at this point, yes? Well, I also see this in the process table (via "pstree -ap"):
----8<----
init,1
[...]
|-ifup,1120 --allow auto eth0
| `-sh,2107 -c ...
| `-dhclient3,2110 -e IF_METRIC=100 -pf /var/run/
---->8----
What is dhclient3 doing? Attaching with strace(1) shows
----8<----
Process 2110 attached - interrupt to quit
send(6, "<30>Jun 1 16:16:39 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7"..., 88, MSG_NOSIGNAL <unfinished ...>
---->8----
Looks like it's trying to talk to /dev/log, just as start-stop-
----8<----
Process 2272 attached - interrupt to quit
send(17, "<83>Jun 1 16:57:32 syslogd: nss_ldap: could not connect to any LDAP server as cn=nssldap,
---->8----
Uh oh. Is syslogd(8) itself similarly stuck, trying to talk to... er... itself?
I didn't really dig further, but you get the idea of some sort of reentrancy-deadlock with LDAP lookups that horks up syslogd(8), and in turn, everything else. There are two interesting points of note to add:
1. If you send SIGTERM to syslogd(8), killing it, the logjam clears up completely. Networking comes up, the boot process finishes, and you can log in. No syslog, but hey, working system.
2. The reason why this normally doesn't happen is because of nssldap-
I think it's not enough to say that just running nssldap-
A couple points worth noting:
1. The deadlock is somewhat timing-dependent. It happens every time on a bare-metal PC, but I am unable to reproduce it on a slower Xen guest system.
2. For whatever reason, this deadlock never happened on Intrepid, all other circumstances being equal (same PC, same LDAP setup, no ignoreusers line in /etc/ldap.conf).