NFS4 kills system (no reboot possible)

Bug #578866 reported by H.-Dirk Schmitt
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

I have migrated from an NFS3 infrastructure to NFS4 (without Kerberos) to work around #525154.

Setup:
* ubuntu/lucid amd64
   autofs5/lucid uptodate 5.0.4-3.1ubuntu5
   linux-image-generic/lucid uptodate 2.6.32.22.23 [since lucid I have had trouble booting the system with a -server kernel, so I am currently using the -generic kernel]
   nfs-kernel-server/lucid uptodate 1:1.2.0-4ubuntu4
* create /srv/nfs4 and export it via NFS4
* have bind mounts from /srv/nfs4 to the traditional mount points of the exported shares
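
The export layout in the setup list above could look roughly like the following sketch. It only prints candidate lines for /etc/fstab and /etc/exports and does not change the system. The function name print_nfs4_layout, the directory /home/share, the client range 192.168.1.0/24, the export options, and the direction of the bind mount (real directory bound into the export root) are all assumptions for illustration; none of them are taken from the reporter's configuration.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not apply) an NFSv4 pseudo-root layout.
# $1 = real data directory, $2 = NFSv4 export root, $3 = allowed clients.
print_nfs4_layout() {
    real_dir=$1
    export_root=$2
    clients=$3
    name=$(basename "$real_dir")

    echo "# /etc/fstab: bind the real directory into the export root"
    echo "$real_dir $export_root/$name none bind 0 0"
    echo "# /etc/exports: fsid=0 marks the NFSv4 pseudo-root"
    echo "$export_root $clients(rw,fsid=0,no_subtree_check,sync)"
    echo "$export_root/$name $clients(rw,nohide,no_subtree_check,sync)"
}
```

For example, `print_nfs4_layout /home/share /srv/nfs4 192.168.1.0/24` prints one fstab line and two exports lines that can be reviewed before anything is copied into the real files.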

With NFS4 I can't use bind mounts with autofs out of the box.
So the server itself also has to access the "shared" drives via NFS4.

If I copy a big file (e.g. a CD image) to a share that is mounted locally via NFS4, the system blocks after a short time:
* The load average keeps growing without bound
* After some time I see messages about blocked tasks related to nfs or to accessing nfs shares, both on the local server and on all clients accessing this server
* shutdown/reboot also blocks and never completes
  To reboot the system I have to do a hard reboot from the server console
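
Tasks blocked like this normally sit in uninterruptible sleep (state D), and Linux counts such tasks toward the load average, which fits the growing load described above. A minimal sketch for listing them, assuming a procps-style ps; the function name list_blocked_tasks is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads "STAT PID COMMAND" lines on stdin and prints
# "PID COMMAND" for every process whose state starts with D.
# Only the first word of the command name is kept.
list_blocked_tasks() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Usage (the trailing "=" suppresses the ps column headers):
#   ps -eo stat=,pid=,comm= | list_blocked_tasks
```

Run on the server while the copy is stuck, this would be expected to show the nfsd threads and the copying process, matching the blocked-task messages in the kernel log.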

The problem doesn't occur if I:
* copy only smaller files
* copy files from a client to the server (e.g. a 160 GB hdd image was copied without error)
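
To probe the size dependence noted above without filling the disk, a bounded write can stand in for the copy. This is only a sketch, assuming GNU dd and a share that is already mounted at a path of your choice; the function name write_test_file and the default of 700 MiB (roughly one CD image) are assumptions, not values from this report.

```shell
#!/bin/sh
# Hypothetical helper: writes a zero-filled test file of a fixed size.
# $1 = directory on the mounted share, $2 = size in MiB (default 700).
write_test_file() {
    target_dir=$1
    size_mb=${2:-700}
    # conv=fsync makes dd flush the data to the share before it exits.
    dd if=/dev/zero of="$target_dir/testfile" bs=1M count="$size_mb" conv=fsync
}

# Usage, e.g. comparing a small and a large write:
#   write_test_file /mnt 10
#   write_test_file /mnt 700
```

Because the size is bounded, a run that completes leaves a file of known size behind, which makes it easier to tell a hang apart from a full disk.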

The problem occurs on both nfs servers.

Hardware:
2 similar boxes with:
* TYAN Thunder K8WE S2895
* 2 Opteron K8 CPUs
* SCSI and SATA hdds

[As a workaround I will create a different autofs configuration with explicit local bind mounts.]
---
Architecture: amd64
DistroRelease: Ubuntu 10.04
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=de_DE:de:en_US:en
 PATH=(custom, no user)
 LANG=de_DE.UTF-8
 SHELL=/bin/bash
 LC_PAPER=de_DE.UTF-8
ProcVersionSignature: Ubuntu 2.6.32-24.39-server 2.6.32.15+drm33.5
Regression: Yes
Reproducible: Yes
Tags: lucid regression-release needs-upstream-testing
Uname: Linux 2.6.32-24-server x86_64
UserGroups:

Steve Langasek (vorlon)
affects: nfs-utils (Ubuntu) → linux (Ubuntu)
Jeremy Foshee (jeremyfoshee) wrote :

Hi H.-Dirk,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/releases/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 578866

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
H.-Dirk Schmitt (dirk-computer42) wrote :

I retested the issue with the current lucid 10.04.1 release on the server:

mount -t nfs4 -o rw,rsize=8192,wsize=8192 $(hostname):/share /mnt/
dd if=/dev/zero of=/mnt/testfile

The load increases constantly.
After some minutes the following log messages occur:

Aug 10 21:08:21 pluto kernel: [ 3600.620029] INFO: task kswapd0:36 blocked for more than 120 seconds.
Aug 10 21:08:21 pluto kernel: [ 3600.622490] INFO: task kswapd1:37 blocked for more than 120 seconds.
Aug 10 21:08:21 pluto kernel: [ 3600.625284] INFO: task nfsd:4222 blocked for more than 120 seconds.
Aug 10 21:08:21 pluto kernel: [ 3600.628445] INFO: task nfsd:4223 blocked for more than 120 seconds.
Aug 10 21:08:21 pluto kernel: [ 3600.632244] INFO: task nfsd:4224 blocked for more than 120 seconds.
Aug 10 21:08:21 pluto kernel: [ 3600.636158] INFO: task nfsd:4226 blocked for more than 120 seconds.
Aug 10 21:08:21 pluto kernel: [ 3600.638376] INFO: task nfsd:4227 blocked for more than 120 seconds.
Aug 10 21:08:21 pluto kernel: [ 3600.639152] INFO: task dd:23415 blocked for more than 120 seconds.
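
To see at a glance which tasks are affected, messages like the ones above can be tallied per task name. A minimal sketch, assuming the message format shown in this log; the function name count_hung_tasks is hypothetical, and task names containing spaces or colons are not handled:

```shell
#!/bin/sh
# Hypothetical helper: counts "blocked for more than N seconds" messages
# per task name. Reads the files given as arguments, or stdin if none.
count_hung_tasks() {
    sed -nE 's/.*INFO: task ([^:]+):[0-9]+ blocked for more than [0-9]+ seconds\..*/\1/p' "$@" |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage:
#   count_hung_tasks /var/log/kern.log
```

Each output line is a task name followed by the number of matching messages, so a log dominated by nfsd entries points at the server side of the mount.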

tags: added: apport-collected
description: updated
Changed in linux (Ubuntu):
status: Incomplete → New
H.-Dirk Schmitt (dirk-computer42) wrote :

This seems to be fixed on 2.6.32-28-server #55-Ubuntu SMP Mon Jan 10 23:57:16 UTC 2011 x86_64 GNU/Linux
and on natty kernels.

Changed in linux (Ubuntu):
status: New → Fix Released
Harry Flink (to-harryflink-from-launchpad) wrote :

I tried to run "bzr status" on our development branch of a couple hundred megabytes on our new NFSv4 share. This causes Bazaar on the client to hang and produces errors on the latest Ubuntu 10.10 AMD64 kernel, 2.6.35-27-generic. I wonder whether this is the same bug and whether it has been fixed in the 2.6.35-27 kernels.

This message appears repeatedly (every 120 seconds) in the syslog:
Mar 14 17:25:46 hf kernel: [ 1080.078609] INFO: task bzr:2348 blocked for more than 120 seconds.
Mar 14 17:25:46 hf kernel: [ 1080.078611] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 17:25:46 hf kernel: [ 1080.078614] bzr D 00000000ffffa30f 0 2348 2347 0x00000000
Mar 14 17:25:46 hf kernel: [ 1080.078618] ffff880341823c08 0000000000000086 ffff880300000000 0000000000015980
Mar 14 17:25:46 hf kernel: [ 1080.078623] ffff880341823fd8 0000000000015980 ffff880341823fd8 ffff880344e3db80
Mar 14 17:25:46 hf kernel: [ 1080.078626] 0000000000015980 0000000000015980 ffff880341823fd8 0000000000015980
Mar 14 17:25:46 hf kernel: [ 1080.078630] Call Trace:
Mar 14 17:25:46 hf kernel: [ 1080.078637] [<ffffffff81101c50>] ? sync_page+0x0/0x50
Mar 14 17:25:46 hf kernel: [ 1080.078641] [<ffffffff81589093>] io_schedule+0x73/0xc0
Mar 14 17:25:46 hf kernel: [ 1080.078644] [<ffffffff81101c8d>] sync_page+0x3d/0x50
Mar 14 17:25:46 hf kernel: [ 1080.078647] [<ffffffff8158996f>] __wait_on_bit+0x5f/0x90
Mar 14 17:25:46 hf kernel: [ 1080.078650] [<ffffffff81101e43>] wait_on_page_bit+0x73/0x80
Mar 14 17:25:46 hf kernel: [ 1080.078655] [<ffffffff8107faf0>] ? wake_bit_function+0x0/0x40
Mar 14 17:25:46 hf kernel: [ 1080.078659] [<ffffffff8110c605>] ? pagevec_lookup_tag+0x25/0x40
Mar 14 17:25:46 hf kernel: [ 1080.078662] [<ffffffff8110230d>] filemap_fdatawait_range+0x10d/0x1a0
Mar 14 17:25:46 hf kernel: [ 1080.078666] [<ffffffff811023cb>] filemap_fdatawait+0x2b/0x30
Mar 14 17:25:46 hf kernel: [ 1080.078669] [<ffffffff811026d4>] filemap_write_and_wait+0x44/0x50
Mar 14 17:25:46 hf kernel: [ 1080.078684] [<ffffffffa1802fec>] nfs_setattr+0x14c/0x160 [nfs]
Mar 14 17:25:46 hf kernel: [ 1080.078688] [<ffffffff8116c69b>] notify_change+0x16b/0x310
Mar 14 17:25:46 hf kernel: [ 1080.078692] [<ffffffff81152694>] do_truncate+0x64/0xa0
Mar 14 17:25:46 hf kernel: [ 1080.078695] [<ffffffff8115292b>] T.784+0xeb/0x110
Mar 14 17:25:46 hf kernel: [ 1080.078698] [<ffffffff8115295e>] sys_ftruncate+0xe/0x10
Mar 14 17:25:46 hf kernel: [ 1080.078703] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
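
When comparing traces like this one against other reports, it can help to reduce them to the bare function names. A minimal sketch, assuming the `[<address>] symbol+0xoffset/0xsize` frame format seen in the trace above; the function name trace_symbols is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: prints one symbol name per call-trace frame, in the
# order logged (innermost frame first). Frames marked "?" are included.
# Reads the files given as arguments, or stdin if none.
trace_symbols() {
    sed -nE 's/.*\[<[0-9a-f]+>\] (\? )?([^ +]+)\+0x.*/\2/p' "$@"
}

# Usage:
#   grep 'hf kernel' /var/log/syslog | trace_symbols
```

Applied to the trace above, the list ends with system_call_fastpath and includes nfs_setattr, which makes it easy to diff against a trace from another kernel version.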
