"nfs: server not responding, timed out", then nfs4 locks up

Bug #477711 reported by Arne Christensen
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Invalid
Low
Unassigned

Bug Description

Even after applying kernel patches as described in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/379385 I occasionally experience the following problem on my nfs4 client computer. "Occasionally" means that recently (and according to syslog) it happened on Oct 14, Nov 3, Nov 4 and Nov 7. It typically happens after the nfs4 client computer goes out of sleep. I understand why sleep may upset the nfs connection, but it clearly recovers "most of the time".

The computers serving as nfs4 client and server both run a fully updated Jaunty (09.04). Exception: The client is a few security patches behind because I'm running a home-built kernel with the patch mentioned in http://bugzilla.kernel.org/attachment.cgi?id=18427 (comment #19).

The problem is that nfs4 locks up so that all accesses to my nfs4 mount point get delayed indefinitely, with no way of interrupting it. syslog from a typical failure is:

Nov 3 07:07:41 helle kernel: [13262.273844] WARNING: at kernel/power/main.c:177 suspend_test_finish+0x80/0x90()
Nov 3 07:07:41 helle kernel: [13262.273847] Component: resume devices
Nov 3 07:07:41 helle kernel: [13262.273851] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat usb_storage i915 drm binfmt_misc bridge stp bnep input_polldev video output nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipt_REJECT ipt_LOG xt_limit xt_tcpudp xt_state ipt_addrtype ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device psmouse iTCO_wdt iTCO_vendor_support intel_agp snd ppdev serio_raw agpgart pcspkr soundcore snd_page_alloc parport_pc parport usbhid r8169 mii
Nov 3 07:07:41 helle kernel: [13262.273940] Pid: 11870, comm: pm-suspend Tainted: G W 2.6.28.10-nfspatch.2 #1
Nov 3 07:07:41 helle kernel: [13262.273944] Call Trace:
Nov 3 07:07:41 helle kernel: [13262.273955] [<c0138c00>] warn_slowpath+0x60/0x80
Nov 3 07:07:41 helle kernel: [13262.273965] [<c015201a>] ? down_trylock+0x2a/0x40
Nov 3 07:07:41 helle anacron[12181]: Anacron 2.3 started on 2009-11-03
Nov 3 07:07:41 helle kernel: [13262.273970] [<c013925d>] ? try_acquire_console_sem+0xd/0x30
Nov 3 07:07:41 helle kernel: [13262.273979] [<c02baab0>] ? kobject_put+0x20/0x50
Nov 3 07:07:41 helle kernel: [13262.273988] [<c04ee341>] ? printk+0x18/0x1f
Nov 3 07:07:41 helle anacron[12181]: Will run job `cron.daily' in 5 min.
Nov 3 07:07:41 helle kernel: [13262.273993] [<c01629f0>] suspend_test_finish+0x80/0x90
Nov 3 07:07:41 helle anacron[12181]: Jobs will be executed sequentially
Nov 3 07:07:41 helle kernel: [13262.273999] [<c0162abe>] suspend_devices_and_enter+0xbe/0x1c0
Nov 3 07:07:41 helle kernel: [13262.274004] [<c04ee341>] ? printk+0x18/0x1f
Nov 3 07:07:41 helle kernel: [13262.274009] [<c0162d24>] enter_state+0x134/0x1a0
Nov 3 07:07:41 helle kernel: [13262.274015] [<c0162e0d>] state_store+0x7d/0xc0
Nov 3 07:07:41 helle kernel: [13262.274019] [<c0162d90>] ? state_store+0x0/0xc0
Nov 3 07:07:41 helle kernel: [13262.274025] [<c02ba984>] kobj_attr_store+0x24/0x30
Nov 3 07:07:41 helle kernel: [13262.274031] [<c01fe2d2>] sysfs_write_file+0x92/0xf0
Nov 3 07:07:41 helle kernel: [13262.274038] [<c01b18dc>] vfs_write+0x9c/0x170
Nov 3 07:07:41 helle kernel: [13262.274043] [<c01fe240>] ? sysfs_write_file+0x0/0xf0
Nov 3 07:07:41 helle kernel: [13262.274048] [<c01b1a6d>] sys_write+0x3d/0x70
Nov 3 07:07:41 helle kernel: [13262.274055] [<c0103eeb>] sysenter_do_call+0x12/0x2f
Nov 3 07:07:41 helle kernel: [13262.274059] ---[ end trace 208e2da6917d6433 ]---
Nov 3 07:07:41 helle kernel: [13262.274172] PM: Finishing wakeup.
Nov 3 07:07:41 helle kernel: [13262.274177] Restarting tasks ... done.
Nov 3 07:07:41 helle acpid: client 3045[0:0] has disconnected
Nov 3 07:07:41 helle acpid: client connected from 3045[0:0]
Nov 3 07:07:41 helle NetworkManager: <info> Waking up...
Nov 3 07:07:41 helle NetworkManager: <info> (ttyS0): ignoring due to lack of mobile broadband capabilties
Nov 3 07:07:41 helle NetworkManager: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Nov 3 07:07:41 helle NetworkManager: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Nov 3 07:10:01 helle /USR/SBIN/CRON[12341]: (root) CMD ([ -x /usr/sbin/update-motd ] && /usr/sbin/update-motd 2>/dev/null)
Nov 3 07:10:44 helle kernel: [13445.292025] nfs: server bbb not responding, timed out
Nov 3 07:10:44 helle kernel: [13445.292052] Error: state recovery failed on NFSv4 server bbb with error 5
Nov 3 07:12:44 helle kernel: [13565.292027] nfs: server bbb not responding, still trying

After 46 minutes I may get:

Nov 3 07:58:46 helle kernel: [16327.748034] nfs: server bbb not responding, still trying

However this has been seen only a few times because in most cases the computer has been powered down and rebooted long before...

While the client is "nfs-dead" it is still possible to ssh to it and do all kinds of things (e.g. inspecting syslog!)

The nfs4 mount point is defined thusly in /etc/fstab:
bbb:/helle-nfs /home/helle nfs4 bg 0 0

"Standard" documentation (uname-a.log, version.log, dmesg.log and lspci-vnvn.log) attached.

Revision history for this message
Arne Christensen (arne-chr) wrote :
Revision history for this message
Arne Christensen (arne-chr) wrote :
Revision history for this message
Arne Christensen (arne-chr) wrote :
Revision history for this message
Arne Christensen (arne-chr) wrote :
Revision history for this message
Arne Christensen (arne-chr) wrote :

After reporting this bug, I switched to using UDP rather than TCP, and I have not seen the problem for 3 weeks now!

The exact change is that I added ",proto=udp" to /etc/fstab on the client like this:
bbb:/helle-nfs /home/helle nfs4 bg,proto=udp 0 0

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Arne,
     That is great news. I'll change this bug to invalid since you have resolved your issue. Please feel free to reopen it should this behavior return.

Thanks!

-JFo

Changed in linux (Ubuntu):
status: New → Invalid
importance: Undecided → Low
Revision history for this message
Arne Christensen (arne-chr) wrote :

Well, if TCP is supported (and even is the default!) I think it should work. I see switching to UDP as a workaround that I was lucky to find. There is no sense in accepting that a provided feature (NFS4 over TCP) does not work reliably. It would be better remove it, so other people don't fall into this trap (now I'm a bit provocative, I know)!
The best would be to keep this bug report open until more data can be gathered which might lead to a proper fix. A fortunate side effect of the present bug report is that it serves as documentation of a workaround for others that just need to get the thing to work. That's part of the reason I wrote a descriptive title.
On a side note, I realize that there may be an underlying problem in the TCP implementation, but NFS4 should not block completely even in that case.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.