Ubuntu
linux package

"nfs: server not responding, timed out", then nfs4 locks up

Bug #477711 reported by Arne Christensen on 2009-11-07

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Invalid	Low	Unassigned

Bug Description

Even after applying kernel patches as described in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/379385 I occasionally experience the following problem on my nfs4 client computer. "Occasionally" means that recently (and according to syslog) it happened on Oct 14, Nov 3, Nov 4 and Nov 7. It typically happens after the nfs4 client computer goes out of sleep. I understand why sleep may upset the nfs connection, but it clearly recovers "most of the time".

The computers serving as nfs4 client and server both run a fully updated Jaunty (09.04). Exception: The client is a few security patches behind because I'm running a home-built kernel with the patch mentioned in http://bugzilla.kernel.org/attachment.cgi?id=18427 (comment #19).

The problem is that nfs4 locks up so that all accesses to my nfs4 mount point get delayed indefinitely, with no way of interrupting it. syslog from a typical failure is:

Nov 3 07:07:41 helle kernel: [13262.273844] WARNING: at kernel/power/main.c:177 suspend_test_finish+0x80/0x90()
Nov 3 07:07:41 helle kernel: [13262.273847] Component: resume devices
Nov 3 07:07:41 helle kernel: [13262.273851] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat usb_storage i915 drm binfmt_misc bridge stp bnep input_polldev video output nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipt_REJECT ipt_LOG xt_limit xt_tcpudp xt_state ipt_addrtype ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device psmouse iTCO_wdt iTCO_vendor_support intel_agp snd ppdev serio_raw agpgart pcspkr soundcore snd_page_alloc parport_pc parport usbhid r8169 mii
Nov 3 07:07:41 helle kernel: [13262.273940] Pid: 11870, comm: pm-suspend Tainted: G W 2.6.28.10-nfspatch.2 #1
Nov 3 07:07:41 helle kernel: [13262.273944] Call Trace:
Nov 3 07:07:41 helle kernel: [13262.273955] [<c0138c00>] warn_slowpath+0x60/0x80
Nov 3 07:07:41 helle kernel: [13262.273965] [<c015201a>] ? down_trylock+0x2a/0x40
Nov 3 07:07:41 helle anacron[12181]: Anacron 2.3 started on 2009-11-03
Nov 3 07:07:41 helle kernel: [13262.273970] [<c013925d>] ? try_acquire_console_sem+0xd/0x30
Nov 3 07:07:41 helle kernel: [13262.273979] [<c02baab0>] ? kobject_put+0x20/0x50
Nov 3 07:07:41 helle kernel: [13262.273988] [<c04ee341>] ? printk+0x18/0x1f
Nov 3 07:07:41 helle anacron[12181]: Will run job `cron.daily' in 5 min.
Nov 3 07:07:41 helle kernel: [13262.273993] [<c01629f0>] suspend_test_finish+0x80/0x90
Nov 3 07:07:41 helle anacron[12181]: Jobs will be executed sequentially
Nov 3 07:07:41 helle kernel: [13262.273999] [<c0162abe>] suspend_devices_and_enter+0xbe/0x1c0
Nov 3 07:07:41 helle kernel: [13262.274004] [<c04ee341>] ? printk+0x18/0x1f
Nov 3 07:07:41 helle kernel: [13262.274009] [<c0162d24>] enter_state+0x134/0x1a0
Nov 3 07:07:41 helle kernel: [13262.274015] [<c0162e0d>] state_store+0x7d/0xc0
Nov 3 07:07:41 helle kernel: [13262.274019] [<c0162d90>] ? state_store+0x0/0xc0
Nov 3 07:07:41 helle kernel: [13262.274025] [<c02ba984>] kobj_attr_store+0x24/0x30
Nov 3 07:07:41 helle kernel: [13262.274031] [<c01fe2d2>] sysfs_write_file+0x92/0xf0
Nov 3 07:07:41 helle kernel: [13262.274038] [<c01b18dc>] vfs_write+0x9c/0x170
Nov 3 07:07:41 helle kernel: [13262.274043] [<c01fe240>] ? sysfs_write_file+0x0/0xf0
Nov 3 07:07:41 helle kernel: [13262.274048] [<c01b1a6d>] sys_write+0x3d/0x70
Nov 3 07:07:41 helle kernel: [13262.274055] [<c0103eeb>] sysenter_do_call+0x12/0x2f
Nov 3 07:07:41 helle kernel: [13262.274059] ---[ end trace 208e2da6917d6433 ]---
Nov 3 07:07:41 helle kernel: [13262.274172] PM: Finishing wakeup.
Nov 3 07:07:41 helle kernel: [13262.274177] Restarting tasks ... done.
Nov 3 07:07:41 helle acpid: client 3045[0:0] has disconnected
Nov 3 07:07:41 helle acpid: client connected from 3045[0:0]
Nov 3 07:07:41 helle NetworkManager: <info> Waking up...
Nov 3 07:07:41 helle NetworkManager: <info> (ttyS0): ignoring due to lack of mobile broadband capabilties
Nov 3 07:07:41 helle NetworkManager: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Nov 3 07:07:41 helle NetworkManager: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Nov 3 07:10:01 helle /USR/SBIN/CRON[12341]: (root) CMD ([ -x /usr/sbin/update-motd ] && /usr/sbin/update-motd 2>/dev/null)
Nov 3 07:10:44 helle kernel: [13445.292025] nfs: server bbb not responding, timed out
Nov 3 07:10:44 helle kernel: [13445.292052] Error: state recovery failed on NFSv4 server bbb with error 5
Nov 3 07:12:44 helle kernel: [13565.292027] nfs: server bbb not responding, still trying

After 46 minutes I may get:

Nov 3 07:58:46 helle kernel: [16327.748034] nfs: server bbb not responding, still trying

However this has been seen only a few times because in most cases the computer has been powered down and rebooted long before...

While the client is "nfs-dead" it is still possible to ssh to it and do all kinds of things (e.g. inspecting syslog!)

The nfs4 mount point is defined thusly in /etc/fstab:
bbb:/helle-nfs /home/helle nfs4 bg 0 0

"Standard" documentation (uname-a.log, version.log, dmesg.log and lspci-vnvn.log) attached.