"nfs: server not responding, timed out", then nfs4 locks up
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) |
Invalid
|
Low
|
Unassigned |
Bug Description
Even after applying kernel patches as described in https:/
The computers serving as nfs4 client and server both run a fully updated Jaunty (09.04). Exception: The client is a few security patches behind because I'm running a home-built kernel with the patch mentioned in http://
The problem is that nfs4 locks up so that all accesses to my nfs4 mount point get delayed indefinitely, with no way of interrupting it. syslog from a typical failure is:
Nov 3 07:07:41 helle kernel: [13262.273844] WARNING: at kernel/
Nov 3 07:07:41 helle kernel: [13262.273847] Component: resume devices
Nov 3 07:07:41 helle kernel: [13262.273851] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat usb_storage i915 drm binfmt_misc bridge stp bnep input_polldev video output nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipt_REJECT ipt_LOG xt_limit xt_tcpudp xt_state ipt_addrtype ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device psmouse iTCO_wdt iTCO_vendor_support intel_agp snd ppdev serio_raw agpgart pcspkr soundcore snd_page_alloc parport_pc parport usbhid r8169 mii
Nov 3 07:07:41 helle kernel: [13262.273940] Pid: 11870, comm: pm-suspend Tainted: G W 2.6.28.
Nov 3 07:07:41 helle kernel: [13262.273944] Call Trace:
Nov 3 07:07:41 helle kernel: [13262.273955] [<c0138c00>] warn_slowpath+
Nov 3 07:07:41 helle kernel: [13262.273965] [<c015201a>] ? down_trylock+
Nov 3 07:07:41 helle anacron[12181]: Anacron 2.3 started on 2009-11-03
Nov 3 07:07:41 helle kernel: [13262.273970] [<c013925d>] ? try_acquire_
Nov 3 07:07:41 helle kernel: [13262.273979] [<c02baab0>] ? kobject_
Nov 3 07:07:41 helle kernel: [13262.273988] [<c04ee341>] ? printk+0x18/0x1f
Nov 3 07:07:41 helle anacron[12181]: Will run job `cron.daily' in 5 min.
Nov 3 07:07:41 helle kernel: [13262.273993] [<c01629f0>] suspend_
Nov 3 07:07:41 helle anacron[12181]: Jobs will be executed sequentially
Nov 3 07:07:41 helle kernel: [13262.273999] [<c0162abe>] suspend_
Nov 3 07:07:41 helle kernel: [13262.274004] [<c04ee341>] ? printk+0x18/0x1f
Nov 3 07:07:41 helle kernel: [13262.274009] [<c0162d24>] enter_state+
Nov 3 07:07:41 helle kernel: [13262.274015] [<c0162e0d>] state_store+
Nov 3 07:07:41 helle kernel: [13262.274019] [<c0162d90>] ? state_store+
Nov 3 07:07:41 helle kernel: [13262.274025] [<c02ba984>] kobj_attr_
Nov 3 07:07:41 helle kernel: [13262.274031] [<c01fe2d2>] sysfs_write_
Nov 3 07:07:41 helle kernel: [13262.274038] [<c01b18dc>] vfs_write+
Nov 3 07:07:41 helle kernel: [13262.274043] [<c01fe240>] ? sysfs_write_
Nov 3 07:07:41 helle kernel: [13262.274048] [<c01b1a6d>] sys_write+0x3d/0x70
Nov 3 07:07:41 helle kernel: [13262.274055] [<c0103eeb>] sysenter_
Nov 3 07:07:41 helle kernel: [13262.274059] ---[ end trace 208e2da6917d6433 ]---
Nov 3 07:07:41 helle kernel: [13262.274172] PM: Finishing wakeup.
Nov 3 07:07:41 helle kernel: [13262.274177] Restarting tasks ... done.
Nov 3 07:07:41 helle acpid: client 3045[0:0] has disconnected
Nov 3 07:07:41 helle acpid: client connected from 3045[0:0]
Nov 3 07:07:41 helle NetworkManager: <info> Waking up...
Nov 3 07:07:41 helle NetworkManager: <info> (ttyS0): ignoring due to lack of mobile broadband capabilties
Nov 3 07:07:41 helle NetworkManager: <info> Unmanaged Device found; state CONNECTED forced. (see http://
Nov 3 07:07:41 helle NetworkManager: <info> Unmanaged Device found; state CONNECTED forced. (see http://
Nov 3 07:10:01 helle /USR/SBIN/
Nov 3 07:10:44 helle kernel: [13445.292025] nfs: server bbb not responding, timed out
Nov 3 07:10:44 helle kernel: [13445.292052] Error: state recovery failed on NFSv4 server bbb with error 5
Nov 3 07:12:44 helle kernel: [13565.292027] nfs: server bbb not responding, still trying
After 46 minutes I may get:
Nov 3 07:58:46 helle kernel: [16327.748034] nfs: server bbb not responding, still trying
However this has been seen only a few times because in most cases the computer has been powered down and rebooted long before...
While the client is "nfs-dead" it is still possible to ssh to it and do all kinds of things (e.g. inspecting syslog!)
The nfs4 mount point is defined thusly in /etc/fstab:
bbb:/helle-nfs /home/helle nfs4 bg 0 0
"Standard" documentation (uname-a.log, version.log, dmesg.log and lspci-vnvn.log) attached.
After reporting this bug, I switched to using UDP rather than TCP, and I have not seen the problem for 3 weeks now!
The exact change is that I added ",proto=udp" to /etc/fstab on the client like this:
bbb:/helle-nfs /home/helle nfs4 bg,proto=udp 0 0