nscd stopped responding after LDAP server crash

Bug #322800 reported by gcc on 2009-01-29
This bug affects 2 people
Affects: glibc (Ubuntu)
Importance: Undecided
Assigned to: Unassigned
Nominated for Dapper by gcc
Nominated for Hardy by Dave Lehman

Bug Description

Binary package hint: nscd

After our LDAP server hung and had to be restarted, nscd on one of our boxes got into a bad state. It had hundreds of threads running and hundreds of connections to its own nscd socket, and it kept polling something without appearing to make any progress. This made the host system very slow, because many commands require NSS lookups, which the system attempts to send to nscd, but nscd never answered them.

root@net-backup:/home/chris# lsb_release -rd
Description: Ubuntu 6.06.2 LTS
Release: 6.06

ii nscd 2.3.6-0ubuntu2 GNU C Library: Name Service Cache Daemon
ii libnss-ldap 238-1.1ubuntu1 NSS module for using LDAP as a naming servic

gcc (chris+ubuntu-qwirx) wrote :

I did enter nscd as the package originally, I swear

gcc (chris+ubuntu-qwirx) wrote :

Something similar happens on Hardy Heron. I just found nscd spinning at 100% CPU doing this:

[pid 5411] accept(9, 0, NULL) = -1 EMFILE (Too many open files)
[pid 5411] epoll_wait(10, {{EPOLLRDNORM, {u32=9, u64=9}}}, 100, 29988) = 1
[pid 5411] time(NULL) = 1234268027
[pid 5411] accept(9, 0, NULL) = -1 EMFILE (Too many open files)
[pid 5411] epoll_wait(10, {{EPOLLRDNORM, {u32=9, u64=9}}}, 100, 29988) = 1
[pid 5411] time(NULL) = 1234268027

Clearly if it gets an error from accept() it should at least log an error and wait a bit, not just spin in a tight loop.
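
To illustrate the kind of handling meant here, below is a minimal sketch in Python (not nscd's actual C code) of an accept loop that logs the error and backs off when accept() fails for lack of resources. The function name accept_with_backoff, the delay values, and the set of errno codes treated as "resource exhaustion" are all assumptions made for the example.

```python
import errno
import logging
import time

# Errors meaning "out of descriptors or memory"; an immediate retry cannot help.
RESOURCE_ERRNOS = {errno.EMFILE, errno.ENFILE, errno.ENOBUFS, errno.ENOMEM}


def accept_with_backoff(listener, sleep=time.sleep, base_delay=0.1, max_delay=5.0):
    """Accept one connection; on resource exhaustion, log and wait before retrying.

    `listener` is any object with a socket-like accept() method.
    `sleep` is injectable so the backoff can be observed without real waiting.
    """
    delay = base_delay
    while True:
        try:
            return listener.accept()
        except OSError as exc:
            if exc.errno not in RESOURCE_ERRNOS:
                raise  # unrelated failure: let the caller deal with it
            logging.warning("accept() failed: %s; retrying in %.1fs", exc, delay)
            sleep(delay)
            # Double the wait each time, up to a ceiling, instead of spinning.
            delay = min(delay * 2, max_delay)
```

Backing off only stops the busy loop. It does not free any descriptors, so the leaked connections would still need to be closed for the daemon to recover.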

The problem is that it has opened too many unix sockets:

nscd 5411 root 1018u unix 0xd97dee00 3767356 /var/run/nscd/socket
nscd 5411 root 1019u unix 0xd976d600 3767367 /var/run/nscd/socket
nscd 5411 root 1020u unix 0xf4187e00 3767369 /var/run/nscd/socket
nscd 5411 root 1021u unix 0xd97de200 3767375 /var/run/nscd/socket
nscd 5411 root 1022u unix 0xf70dc000 3767438 /var/run/nscd/socket
nscd 5411 root 1023u unix 0xd9508e00 3767440 /var/run/nscd/socket

There are 33 threads running at the moment according to /proc/5411/task, so there must be a file descriptor leak somewhere in nscd.
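
The comparison above (open descriptors versus threads) can be scripted. This is a rough sketch that assumes a Linux-style /proc layout; the function name fd_summary and the proc_root parameter are made up for the example, and reading another user's process will normally need root.

```python
import os


def fd_summary(pid, proc_root="/proc"):
    """Count open descriptors, socket descriptors, and threads for one process."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    task_dir = os.path.join(proc_root, str(pid), "task")

    targets = []
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink such as "socket:[3767356]" or "/dev/null".
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # descriptor was closed while we were scanning

    return {
        "open_fds": len(targets),
        "sockets": sum(t.startswith("socket:") for t in targets),
        "threads": len(os.listdir(task_dir)),
    }
```

Calling fd_summary(5411) on the affected host would give the three counts in one go. A socket count far above the thread count, as in the lsof listing above where descriptor numbers run up to 1023, is what points at a leak rather than at genuine load.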

chris@fen-ndiyo3(~)$ lsb_release -rd
Description: Ubuntu 8.04.2
Release: 8.04

ii nscd 2.7-10ubuntu4 GNU C Library: Name Service Cache Daemon
ii libnss-ldap 258-1ubuntu3 NSS module for using LDAP as a naming servic

Michael Jeanson (mjeanson) wrote :

Same thing here, on Hardy.

# lsb_release -rd
Description: Ubuntu 8.04.3 LTS
Release: 8.04

ii nscd 2.7-10ubuntu5 GNU C Library: Name Service Cache Daemon
ii libnss-ldap 258-1ubuntu3 NSS module for using LDAP as a naming service
ii linux-image-2.6.24-24-server 2.6.24-24.59 Linux kernel image for version 2.6.24 on x86

I found references to a similar problem in Red Hat Bugzilla #496201, which was related to a kernel bug, but that bug should not affect the Hardy kernel.

Kasper Souren (guaka) wrote :

Same problem on 9.04.

# lsb_release -rd
Description: Ubuntu 9.04
Release: 9.04

# dpkg -l|grep libnss-ldap
ii libnss-ldap 261-2.1ubuntu1 NSS module for using LDAP as a naming servic

# dpkg -l|grep nscd
ii nscd 2.9-4ubuntu6.1 GNU C Library: Name Service Cache Daemon

Right after installing:

# lsof |grep nscd |wc
    133 1193 13026
