nfs/rpc.statd becomes unresponsive

Bug #964750 reported by hewbert
This bug affects 2 people
Affects: nfs-utils (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We've tested this on Ubuntu 10.04 LTS, Ubuntu 11.10, and Debian 6.0.4, all x64, fully updated, and each a fairly vanilla installation.

Condition summary:
We have ~1600 users with "live" network home directories, all on Mac clients, with typically around 150 simultaneous connections. Under these conditions, we can reliably make NFS unresponsive within a couple of hours just by logging in ~150 users and opening Word (for example). There's no clear indication of what exactly causes the failures. Because the Mac clients run 10.6 or earlier, they use NFSv3.

We've tested using one physical server, with a hardware RAID and an ext3 filesystem. We've also tested on two separate VMs with ext4. All systems in question used LVM.

Here's what our server logs indicate when the failures happen:
Mar 23 15:40:47 debfs mountd[2365]: authenticated mount request from 172.30.109.132:1020 for /srv/homes (/srv/homes)
Mar 23 15:40:58 debfs mountd[2365]: authenticated mount request from 172.30.109.73:1020 for /srv/homes (/srv/homes)
Mar 23 15:41:06 debfs rpc.statd[741]: Received erroneous SM_UNMON request from debfs for 172.30.106.249
Mar 23 15:41:06 debfs mountd[2365]: authenticated mount request from 172.30.109.27:1020 for /srv/homes (/srv/homes)
Mar 23 15:41:09 debfs mountd[2365]: authenticated mount request from 172.30.109.63:1020 for /srv/homes (/srv/homes)
Mar 23 15:41:14 debfs rpc.statd[741]: Received erroneous SM_UNMON request from debfs for 172.30.106.249
** Mar 23 15:41:19 debfs kernel: [ 8395.736310] statd: server rpc.statd not responding, timed out
** Mar 23 15:41:19 debfs kernel: [ 8395.736331] lockd: cannot unmonitor hs13406s4354.dsdk12.schoollocal
Mar 23 15:41:38 debfs rpc.statd[741]: Received erroneous SM_UNMON request from debfs for 172.30.137.223
Mar 23 15:41:52 debfs rpc.statd[741]: Received erroneous SM_UNMON request from debfs for 172.30.137.223
Mar 23 15:41:54 debfs kernel: [ 8430.737038] statd: server rpc.statd not responding, timed out
Mar 23 15:41:54 debfs kernel: [ 8430.737054] lockd: cannot unmonitor hslib23s5174.dsdk12.schoollocal
Mar 23 15:42:10 debfs rpc.statd[741]: Received erroneous SM_UNMON request from debfs for 172.30.110.25
Mar 23 15:42:15 debfs rpc.statd[741]: Received erroneous SM_UNMON request from debfs for 172.30.110.25
Mar 23 15:42:29 debfs kernel: [ 8465.737071] statd: server rpc.statd not responding, timed out
Mar 23 15:42:29 debfs kernel: [ 8465.737090] lockd: cannot unmonitor MS20603S4451.dsdk12.schoollocal
Mar 23 15:42:31 debfs rpc.statd[741]: Received erroneous SM_UNMON request from debfs for 172.30.110.20
Mar 23 15:42:40 debfs rpc.statd[741]: Received erroneous SM_UNMON request from debfs for 172.30.110.20
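
For anyone trying to compare runs, a minimal Python sketch for tallying these messages follows. It assumes the message formats are exactly those in the excerpt above; the function name `summarize` and the returned keys are illustrative, not part of nfs-utils.

```python
import re
from collections import Counter

# Patterns mirror the message formats in the excerpt above (an assumption:
# other kernel or nfs-utils versions may word these messages differently).
TIMEOUT_RE = re.compile(
    r"kernel: \[\s*(\d+\.\d+)\] statd: server rpc\.statd not responding"
)
UNMON_FAIL_RE = re.compile(r"lockd: cannot unmonitor (\S+)")
ERRONEOUS_RE = re.compile(
    r"rpc\.statd\[\d+\]: Received erroneous SM_UNMON request from \S+ for (\S+)"
)


def summarize(lines):
    """Summarize statd/lockd failure messages from an iterable of syslog lines."""
    timeouts = []          # kernel uptime stamps of each statd timeout
    stuck_hosts = []       # hosts that lockd could not unmonitor
    erroneous = Counter()  # client address -> erroneous SM_UNMON count
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append(float(m.group(1)))
            continue
        m = UNMON_FAIL_RE.search(line)
        if m:
            stuck_hosts.append(m.group(1))
            continue
        m = ERRONEOUS_RE.search(line)
        if m:
            erroneous[m.group(1)] += 1
    # Seconds between consecutive timeouts, from the kernel uptime stamps.
    gaps = [round(b - a, 1) for a, b in zip(timeouts, timeouts[1:])]
    return {
        "timeouts": len(timeouts),
        "gaps_seconds": gaps,
        "stuck_hosts": stuck_hosts,
        "erroneous_unmon": dict(erroneous),
    }
```

Feeding it the lines of a saved log (for example, `summarize(open(path))` with `path` pointing at wherever syslog is kept) gives a count of timeouts and the spacing between them. In the excerpt above the three timeouts are logged at 15:41:19, 15:41:54, and 15:42:29.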

Upon closer examination, the [lockd] kernel thread sits in the 'D' (uninterruptible sleep) state while this is going on. Usually, my only recourse is to reboot the server.
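
One way to capture that state without watching `ps` by hand is to read `/proc/<pid>/stat` directly. The sketch below is a rough illustration, assuming a Linux `/proc` filesystem; the helper names `parse_stat` and `uninterruptible` are made up for this example.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen].strip())
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/net
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible():
        print(pid, comm)
```

Run periodically during a test session, this would show whether lockd is the only task stuck in 'D' or whether others are blocked at the same moment.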

I've tried different values for RPCNFSDCOUNT in /etc/default/nfs-kernel-server, and have NEED_STATD=yes in /etc/default/nfs-common. Otherwise, everything is pretty well stock.

Here's the /etc/exports:
/srv/homes 172.30.0.0/16(insecure_locks,insecure,rw,sync,no_root_squash,no_subtree_check)
I've tested with 'insecure_locks' and without. The 'insecure' option is to make the Mac clients happy.

Unfortunately, modifying the NFS options on the clients would be rather difficult in our environment.

More information can be provided as needed.

hewbert (josh-hewbert) wrote :

Okay, I've tested these same conditions on a vanilla CentOS 6.2 system, kernel 2.6.32-220.el6.x86_64. I don't know how far upstream this goes, but I'm running the same test under FreeBSD/FreeNAS and will update this report with the results.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nfs-utils (Ubuntu):
status: New → Confirmed