nfsd from nfs-kernel-server very slow and system load from 25%-100% from nfsd

Bug #879334 reported by Vagelis Nonas on 2011-10-21
This bug affects 39 people
Affects            Status        Importance  Assigned to  Milestone
linux (Debian)     Fix Released  Unknown
linux (Ubuntu)                   Undecided   Unassigned
nfs-utils (Ubuntu)               Undecided   Unassigned

Bug Description

I have a diskless Ubuntu 10.10 machine which I boot regularly via PXE from another Ubuntu machine, where I have the root filesystem of the diskless machine exported over NFS.

I set it up about a year ago using 10.10. In the meantime the server machine was upgraded to 11.04 and, as of yesterday, to 11.10.

After the upgrade to 11.10 the diskless machine is dead slow (most of the time it won't even boot completely) and the load on the server machine is high (25%-100% as shown by top). If I restart the NFS server in the middle of the diskless computer's boot, the client proceeds a bit further and then gets stuck again. I have to restart the NFS server 3-4 times in order to get the GDM login screen on the client machine.

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: nfs-kernel-server 1:1.2.4-1ubuntu2
ProcVersionSignature: Ubuntu 3.0.0-12.20-generic 3.0.4
Uname: Linux 3.0.0-12-generic i686
ApportVersion: 1.23-0ubuntu3
Architecture: i386
Date: Fri Oct 21 12:53:02 2011
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: nfs-utils
UpgradeStatus: Upgraded to oneiric on 2011-10-20 (1 days ago)

Vagelis Nonas (vnonas) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nfs-utils (Ubuntu):
status: New → Confirmed
Tom Vijlbrief (tvijlbrief) wrote :

while true;
do
sleep 1
rm -f data*
echo data > data1
echo data > data2
done

This file access pattern, which results from a more complex script that writes more data, hangs my client ("nfs server not responding") and creates looping nfsd processes on the server.

Tom Vijlbrief (tvijlbrief) wrote :

Looks similar to bug #585657 (similar dmesg on the client), which was solved for 10.10.

This morning my workstation hung on reading a large file after
I wrote a big file.

Looping nfsd daemons on the server...

Vagelis Nonas (vnonas) wrote :

I have made a fresh install of the diskless machine with the latest ubuntu (11.10) and the problem has gone away.

However, now that you mention big files: I noticed that in the old diskless file system (Ubuntu 10.10) there are many hidden .nfsxxxxxxxxxx files, some of them quite big. I suppose they are leftovers.

Here are a few examples (the size appears before the date):
115351481 16252 -rw-r--r-- 1 root root 16642048 Feb 26 2011 /mnt1/ubuntu/var/cache/apt/.nfs0000000002a50011000000e2
115351436 16792 -rw-r--r-- 1 root root 17236093 Sep 11 15:57 /mnt1/ubuntu/var/cache/apt/.nfs0000000006e01f8c00000063
115351478 16020 -rw-r--r-- 1 root root 16404288 Dec 20 2010 /mnt1/ubuntu/var/cache/apt/.nfs000000000057c00900000032
115351480 16208 -rw-r--r-- 1 root root 16594500 Feb 16 2011 /mnt1/ubuntu/var/cache/apt/.nfs0000000002a5001200000036
115351483 16704 -rw-r--r-- 1 root root 17102785 May 5 2011 /mnt1/ubuntu/var/cache/apt/.nfs0000000002a5004d00000032
115351460 16824 -rw-r--r-- 1 root root 17283067 Oct 22 07:45 /mnt1/ubuntu/var/cache/apt/.nfs0000000006e01fa40000009d
115351839 16820 -rw-r--r-- 1 root root 17284552 Oct 13 07:26 /mnt1/ubuntu/var/cache/apt/.nfs0000000006e0211f0000001b
115351479 16152 -rw-r--r-- 1 root root 16536128 Feb 2 2011 /mnt1/ubuntu/var/cache/apt/.nfs0000000002a5000f00000026
115351470 16804 -rw-r--r-- 1 root root 17258153 Sep 17 07:51 /mnt1/ubuntu/var/cache/apt/.nfs0000000006e01fae000000bd
115351755 16820 -rw-r--r-- 1 root root 17261588 Oct 5 07:33 /mnt1/ubuntu/var/cache/apt/.nfs0000000006e020cb00000019
115351482 16380 -rw-r--r-- 1 root root 16769594 Apr 22 2011 /mnt1/ubuntu/var/cache/apt/.nfs0000000002a5004c0000001b
115352738 4 -rw-r--r-- 1 root root 1681 Feb 2 2011 /mnt1/ubuntu/etc/.nfs0000000006e024a200000026
115353070 2097156 -rw-r--r-- 1 root root 2147483648 Mar 18 2011 /mnt1/ubuntu/etc/.nfs00000000001b807b00000028


I wonder if your fresh install uses different mount options, e.g. NFS 4, while the old install used version 3?

I have an old exports file on my server, so I expect it uses 3. Did you convert your exports?

Vagelis Nonas (vnonas) wrote :

I don't think my exports use NFS v3 options; this is my exports file (unchanged from the previous install):

/mnt1/ubuntu_new 192.168.1.46(rw,no_root_squash,sync,no_subtree_check) 192.168.1.194(rw,no_root_squash,sync,no_subtree_check)
/mnt1/mac 192.168.1.47(rw,no_root_squash,sync,no_subtree_check) 192.168.1.50(rw,no_root_squash,sync,no_subtree_check)

However, I can run another test with the old file system, after deleting the big hidden .nfsxxxxxxxx files, and see if it defaults to NFS v3, if you think it might be of use to you.

Tom Vijlbrief (tvijlbrief) wrote :

According to https://help.ubuntu.com/community/SettingUpNFSHowTo

you need fsid in exports to use nfs 4. So I think we both use 4.

I wonder what is different after your reinstall...
Tom Vijlbrief (tvijlbrief) wrote :

I meant I think we both use 3.

Vagelis Nonas (vnonas) wrote :

Yes, you are right, it must be v3 that we both use. I'll test the old file system again tomorrow and post back my observations.

Do you know why those hidden .nfsxxxxxxxxxxxxx files are there? I noticed they exist in the new file system too. Are they leftovers, or are they necessary for the operation of NFS? Do you have such files on your server in the file systems you export?

Vagelis Nonas (vnonas) wrote :

The hidden .nfsxxxxxxx files can be removed without problems; they are "leftovers" created by NFS protocol inefficiencies.
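Assuming no client still holds these files open, the stale leftovers can be swept on the server with a small find-based helper. This is only a sketch: the path in the example comes from this report, and the age threshold is an arbitrary heuristic, not a guarantee that a file is unused.

```shell
#!/bin/sh
# clean_stale_nfs DIR DAYS: delete .nfs* silly-rename leftovers under
# DIR that have not been modified for more than DAYS days. The age
# threshold is only a heuristic -- a client could in principle still
# hold such a file open -- so run this on the server in a quiet period.
clean_stale_nfs() {
    dir=$1
    days=$2
    find "$dir" -name '.nfs*' -type f -mtime +"$days" -delete
}

# Example, using a path from this report:
#   clean_stale_nfs /mnt1/ubuntu 7
```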

Tom Vijlbrief (tvijlbrief) wrote :

The .nfs files are created when you remove (unlink(2) in Unix lingo) a
file while it is still open (in use) by a program. This is quite
normal behaviour for Unix programs.
In a local file system the name is merely removed from the directory
(the file can no longer be opened by other programs); the file is not
really destroyed until the kernel detects that the last program
accessing it has closed it.
NFS implements the remove by having the client rename the file to
.nfsXXX on the server, so that the client can keep using it. The
server does not know when the last accessing client is done, so it
keeps the file until a cleanup job is run.
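The unlink-while-open semantics described above can be seen on any local file system with a few lines of Python (purely illustrative; on an NFS mount, this same pattern is what makes the client create a .nfsXXXXXXXX file on the server instead):

```python
import os
import tempfile

# Create a file, keep it open, then unlink it: the directory entry
# disappears immediately, but the data survives until the last close.
fd, path = tempfile.mkstemp()
os.write(fd, b"still here")
os.unlink(path)                 # name removed from the directory
print(os.path.exists(path))     # False: no longer reachable by name
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 32)          # ...yet the open fd still reads it
print(data)                     # b'still here'
os.close(fd)                    # kernel frees the storage here
```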


Nobuaki Nakamura (yubird) wrote :

This happens with kernel 3.0.0-14-server too.

Tom Vijlbrief (tvijlbrief) wrote :

I converted my NFS server (which started life many Ubuntus ago,
probably as a 7.04 server) to v4 exports, which solved my problem.

A newer Ubuntu server installation works fine as a v3 server, so the
problem is probably caused by some old leftover configuration
files....
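For reference, the v4-style conversion described above hinges on marking one export as the NFSv4 pseudo-root with fsid=0. A minimal sketch (the path and subnet here are illustrative, not the actual configuration from this report):

```
# /etc/exports on the server: fsid=0 makes /export the NFSv4 root
/export  192.168.1.0/24(rw,fsid=0,no_subtree_check,sync)
```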


Ivan Frederiks (idfred) wrote :

Got exactly the same symptoms after upgrading from 11.04 to 12.04.

Originally the NFS server was set up on Ubuntu 11.04 i386.

The line in /etc/exports looks like:
/srv/share 192.168.2.0/24(rw,no_root_squash,async,no_subtree_check)

Diskless client is Debian 6.0.5 i386

nfs-kernel-server:
  Installed: 1:1.2.5-3ubuntu3
  Candidate: 1:1.2.5-3ubuntu3
  Version table:
 *** 1:1.2.5-3ubuntu3 0
        500 http://de.archive.ubuntu.com/ubuntu/ precise/main i386 Packages

tags: added: precise
removed: running-unity
Jeff Ebert (jeffrey-ebertland) wrote :

This thread on linux-nfs seems to be the same issue:
http://www.spinics.net/lists/linux-nfs/msg30552.html

Also, Bug #1006446 seems to be the same issue (marked as Duplicate).

Jeff Ebert (jeffrey-ebertland) wrote :

Another thread on linux-nfs that appears to be the same issue, this one using kernel version 3.3.3.
http://www.spinics.net/lists/linux-nfs/msg29935.html

No response to this thread, however.

Jeff Ebert (jeffrey-ebertland) wrote :

There is an upstream bug here:
https://bugzilla.kernel.org/show_bug.cgi?id=40912

I have tried the latest mainline kernel (3.5.0) using the instructions here:
https://wiki.ubuntu.com/KernelTeam/GitKernelBuild

I still see the high CPU load on the NFS server.

I then reversed the patch suggested in the above bug.

$ git show 9660439861aa8dbd5e2b8087f33e20760c2c9afc
commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
Author: Olga Kornievskaia <email address hidden>
Date: Tue Oct 21 14:13:47 2008 -0400

    svcrpc: take advantage of tcp autotuning

I also reversed the patch mentioned here manually, since I could not find the commit hash for it:
http://lists.openwall.net/netdev/2012/01/20/81

Unfortunately, this patched version of 3.5.0 does not boot. I may have screwed up something else along the way, but I wanted to report this in case somebody has more time to experiment.

This particular patch looks like an ongoing source of problems for nfsd. It was reverted once before, in 2009, due to performance issues.

commit 7f4218354fe312b327af06c3d8c95ed5f214c8ca
Author: J. Bruce Fields <email address hidden>
Date: Wed May 27 18:51:06 2009 -0400

    nfsd: Revert "svcrpc: take advantage of tcp autotuning"

    This reverts commit 47a14ef1af48c696b214ac168f056ddc79793d0e "svcrpc:
    take advantage of tcp autotuning", which uncovered some further problems
    in the server rpc code, causing significant performance regressions in
    common cases.

    We will likely reinstate this patch after releasing 2.6.30 and applying
    some work on the underlying fixes to the problem (developed by Trond).

    Reported-by: Jeff Moyer <email address hidden>
    Cc: Olga Kornievskaia <email address hidden>
    Cc: Jim Rees <email address hidden>
    Cc: Trond Myklebust <email address hidden>
    Signed-off-by: J. Bruce Fields <email address hidden>

It was reintroduced in May 2011 in commit a74d70b63f1a0230831bcca3145d85ae016f9d4c.

Hope this helps somebody...

Jeff Ebert (jeffrey-ebertland) wrote :

I reverted to linux-image-2.6.38-15-generic-pae (2.6.38-15.61) and the NFS performance is back to normal, and the CPU load dropped down to almost nothing, as before. This is clearly a linux kernel regression.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 879334

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Karsten Suehring (suehring) wrote :

I have added logs of a test setup in Bug #1077612 and steps to reproduce in Bug #1006446 (before noticing that it was marked as a duplicate).

Citing from the summary in Bug #1077612:

I tested with upstream Debian in my virtual machines: Squeeze has a server load of 7-10% (which seems high, but might be related to using a VM). When upgraded to Debian Wheezy the load goes up to 40% as in Ubuntu 12.04. When I boot the old 2.6 kernel from Squeeze, the load goes back to the original values.

On Ubuntu 12.04 I tried several share and mount options. The only change that showed an effect was mounting with -o proto=udp, which reduced the load to around 15%; still more than with the old kernel, but much better than the 40% with TCP.

(end cite)
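The proto=udp workaround Karsten mentions can be made persistent on the client via /etc/fstab. A sketch only; the server name and mount point are illustrative:

```
# /etc/fstab on the client: force NFSv3 over UDP for this mount
server:/srv/share  /mnt/share  nfs  vers=3,proto=udp,rw  0  0
```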

I'm rather surprised that more people are not running into this issue, because it seems to be a show-stopper for Ubuntu NFS servers.

Since then I have been forced to leave the Ubuntu server platform and
move to Red Hat Enterprise Linux.
Canonical needs to understand that these are critical bugs that should be
fixed in hours or days, not weeks or months.

On Sun, Nov 11, 2012 at 1:24 PM, Karsten Suehring <<email address hidden>
> wrote:

> ** Bug watch added: Debian Bug tracker #692957
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=692957
>
> ** Also affects: linux (Debian) via
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=692957
> Importance: Unknown
> Status: Unknown

Vagelis Nonas (vnonas) wrote :

I agree 100%. It's been over a year since the initial report. Personally, I don't think it is going to be solved any time soon.

So the bottom line is that if you need a production NFS server, you had better use an "old stable" kernel. This looks really sad to me, because I can see that neither Canonical nor the mainline kernel developers can fix a bug introduced (probably) back in 2008, affecting a very important piece of functionality (NFS servers and clients).

Karsten Suehring (suehring) wrote :

I'm adding some more test data here:

As a workaround I tried to install an old Ubuntu 2.6 kernel (linux-image-2.6.35-31-generic_2.6.35-31.63_amd64.deb) into 12.04.1.

I saw a number of locking issues reported and thought these might be caused by using the kernel in the wrong environment. But now, after downgrading the servers back to 10.10 while keeping the clients at 12.04.1, I still see kernel messages like the following:

[ 5474.132324] ------------[ cut here ]------------
[ 5474.132346] WARNING: at /build/buildd/linux-2.6.35/net/sunrpc/sched.c:597 rpc_exit_task+0x5c/0x60 [sunrpc]()
[ 5474.132349] Hardware name: PowerEdge R710
[ 5474.132351] Modules linked in: ipmi_si mpt2sas raid_class mptctl ipmi_devintf ipmi_msghandler dell_rbu nfsd autofs4 xfs exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc joydev ftdi_sio usbhid hid bnx2 usbserial shpchp psmouse i7core_edac serio_raw edac_core hed lp power_meter parport dcdbas ses enclosure mptsas mptscsih mptbase usb_storage scsi_transport_sas megaraid_sas [last unloaded: ipmi_si]
[ 5474.132386] Pid: 1746, comm: rpciod/16 Tainted: G W 2.6.35-32-server #67-Ubuntu
[ 5474.132388] Call Trace:
[ 5474.132399] [<ffffffff810616df>] warn_slowpath_common+0x7f/0xc0
[ 5474.132403] [<ffffffff8106173a>] warn_slowpath_null+0x1a/0x20
[ 5474.132414] [<ffffffffa016bd4c>] rpc_exit_task+0x5c/0x60 [sunrpc]
[ 5474.132426] [<ffffffffa016c52e>] __rpc_execute+0x5e/0x280 [sunrpc]
[ 5474.132437] [<ffffffffa016c7f0>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
[ 5474.132448] [<ffffffffa016c805>] rpc_async_schedule+0x15/0x20 [sunrpc]
[ 5474.132455] [<ffffffff8107b395>] run_workqueue+0xc5/0x1a0
[ 5474.132460] [<ffffffff8107b513>] worker_thread+0xa3/0x110
[ 5474.132464] [<ffffffff810801a0>] ? autoremove_wake_function+0x0/0x40
[ 5474.132468] [<ffffffff8107b470>] ? worker_thread+0x0/0x110
[ 5474.132472] [<ffffffff8107fc26>] kthread+0x96/0xa0
[ 5474.132477] [<ffffffff8100aea4>] kernel_thread_helper+0x4/0x10
[ 5474.132481] [<ffffffff8107fb90>] ? kthread+0x0/0xa0
[ 5474.132484] [<ffffffff8100aea0>] ? kernel_thread_helper+0x0/0x10
[ 5474.132487] ---[ end trace 5a3838b115992a79 ]---
[ 6091.800511] ------------[ cut here ]------------
[ 6091.800532] WARNING: at /build/buildd/linux-2.6.35/net/sunrpc/sched.c:597 rpc_exit_task+0x5c/0x60 [sunrpc]()
[ 6091.800536] Hardware name: PowerEdge R710
[ 6091.800537] Modules linked in: ipmi_si mpt2sas raid_class mptctl ipmi_devintf ipmi_msghandler dell_rbu nfsd autofs4 xfs exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc joydev ftdi_sio usbhid hid bnx2 usbserial shpchp psmouse i7core_edac serio_raw edac_core hed lp power_meter parport dcdbas ses enclosure mptsas mptscsih mptbase usb_storage scsi_transport_sas megaraid_sas [last unloaded: ipmi_si]
[ 6091.800572] Pid: 1744, comm: rpciod/14 Tainted: G W 2.6.35-32-server #67-Ubuntu
[ 6091.800575] Call Trace:
[ 6091.800585] [<ffffffff810616df>] warn_slowpath_common+0x7f/0xc0
[ 6091.800590] [<ffffffff8106173a>] warn_slowpath_null+0x1a/0x20
[ 6091.800601] [<ffffffffa016bd4c>] rpc_exit_task+0x5c/0x60 [sunrpc]
[ 6091.800612] [<ffffffffa016c52e>] __rpc_execute+0x5e/0x280 [sunrpc]
[ 6091.800623] [<ffffffffa016c7f0>] ? rpc...


Changed in linux (Debian):
status: Unknown → Incomplete
Gordon Dracup (gordon-dracup) wrote :

I am not sure if this is related, but I recently upgraded my server from 10.04 to 12.04 LTS. Any large files on the server, e.g. ISO files, showed incorrect file sizes when opened in Nautilus. These large files were unusable from the clients (also running 12.04), although they were fine on the server. It is an old 32-bit server with an Athlon processor, only used for backups and serving audio, video, etc.

Solved the problem by moving to NFSv4. Changed the exports file on the server to:

/nfs/srv 192.168.xx.0/24(rw,fsid=0,insecure,no_subtree_check,async)

and the fstab on the clients to:

192.168.xx.x:/ /nfs/srv nfs4 _netdev,auto 0 0

Only been running with this setup for a couple of days, but so far, so good.

Apologies if this is unrelated to this bug. I wasn't sure what to do with this information, as it is possibly of use to others out there.

Giuseppe Vacanti (gvacanti) wrote :

Running Ubuntu 12.04, 3.2.0-35-generic-pae, when clients access data on NFS mounted partitions the load on the server goes through the roof (>50). I'm testing during the holiday period when there is nobody else running anything on the machines. Had this problem with NFS3 and moved to NFS4 hoping to fix it, but it is still there. Adding my comment to keep the pressure going.

pascal (pascal-pascallen) wrote :

Same here.
Mounting homedirs is a pain.
Rsync of homedirs for laptops is taking ages.
Loads of >20.
Bug confirmed.

Sven Rudolph (rudolph) wrote :

Same here.
NFSv4-mounted home dirs (from an Ubuntu 12.04 LTS NFSv4 server) become very slow. Eventually the client machine freezes completely -> reset button.
Using openSUSE 12.2 as the NFSv4 client produces no problems at all.
This is reproducible.

Same on some of our servers with NFSv4 mounted directories. All Ubuntu 12.04 LTS, NFSv4 servers and clients.
Frequent messages in /var/log/syslog:
[...]
Jan 25 11:49:39 xxx kernel: [ 8996.289241] Call Trace:
Jan 25 11:49:39 xxx kernel: [ 8996.289246] [<ffffffff81659ebf>] schedule+0x3f/0x60
Jan 25 11:49:39 xxx kernel: [ 8996.289249] [<ffffffff8165acc7>] __mutex_lock_slowpath+0xd7/0x150
Jan 25 11:49:39 xxx kernel: [ 8996.289253] [<ffffffff8165a8da>] mutex_lock+0x2a/0x50
Jan 25 11:49:39 xxx kernel: [ 8996.289256] [<ffffffff81186404>] do_last+0x2b4/0x730
Jan 25 11:49:39 xxx kernel: [ 8996.289260] [<ffffffff81187c21>] path_openat+0xd1/0x3f0
Jan 25 11:49:39 xxx kernel: [ 8996.289263] [<ffffffff81183565>] ? putname+0x35/0x50
Jan 25 11:49:39 xxx kernel: [ 8996.289266] [<ffffffff81187fc3>] ? user_path_at_empty+0x63/0xa0
Jan 25 11:49:39 xxx kernel: [ 8996.289275] [<ffffffffa01337db>] ? nfs_attribute_cache_expired+0x1b/0x70 [nfs]
Jan 25 11:49:39 xxx kernel: [ 8996.289279] [<ffffffff81188062>] do_filp_open+0x42/0xa0
Jan 25 11:49:39 xxx kernel: [ 8996.289284] [<ffffffff81319c11>] ? strncpy_from_user+0x31/0x40
Jan 25 11:49:39 xxx kernel: [ 8996.289287] [<ffffffff811833aa>] ? do_getname+0x10a/0x180
Jan 25 11:49:39 xxx kernel: [ 8996.289291] [<ffffffff8165bdce>] ? _raw_spin_lock+0xe/0x20
Jan 25 11:49:39 xxx kernel: [ 8996.289294] [<ffffffff81195377>] ? alloc_fd+0xf7/0x150
Jan 25 11:49:39 xxx kernel: [ 8996.289298] [<ffffffff81177688>] do_sys_open+0xf8/0x240
Jan 25 11:49:39 xxx kernel: [ 8996.289301] [<ffffffff811777f0>] sys_open+0x20/0x30
Jan 25 11:49:39 xxx kernel: [ 8996.289304] [<ffffffff816643c2>] system_call_fastpath+0x16/0x1b
[...]

Eventually the servers which act as NFS clients freeze completely -> remote reset. (It's just a test system.)
This is reproducible.
Needless to say, this disqualifies Ubuntu 12.04 LTS as an NFS client.

A fix of this bug would be highly appreciated!

Doug Schaapveld (djschaap) wrote :

I am still seeing slow NFS performance and high CPU with 3.5.0-22, but I found a thread suggesting a fix went into 3.5.4 in September. I haven't been able to test it myself yet.

J. Bruce Fields (4):
nfsd4: fix security flavor of NFSv4.0 callback
svcrpc: fix BUG() in svc_tcp_clear_pages
svcrpc: fix svc_xprt_enqueue/svc_recv busy-looping
svcrpc: sends on closed socket should stop immediately

http://lwn.net/Articles/516478/

cheryl (exwyeorzee) wrote :
Download full text (9.9 KiB)

Sorry, I did not know what to do with my report, so I am attaching it here since it seems to be the same problem.

I am running desktop Ubuntu 12.04 LTS on 4 separate gigabit-networked machines as my whole-home media center, with the tuner installed in the 'server' (a desktop install running the NFS server) under MythTV, and 3 desktop-install NFS clients in separate rooms. I had to upgrade the pre-existing 'server' (my learning platform) from 11.10 to 12.04 to match the clients, because MythTV does not interoperate between differing versions and I did not want to downgrade the clients to 11.10; I wanted a long-term network install that is reliable and low-maintenance.

Now I have terrible network performance. The tuner works fine within the 'server', and I can view shows on the server, channel surf, record, play back, etc. with no problems. Over the NFS network at the desktop clients, the media center system is almost completely broken.

If I open any media file stored on the server from a client over the network (viewing a video, listening to ripped audio, or even opening a text file), or if I attempt to edit the commercials out of shows within the MythTV editor on a client, the client will pause/hang for at least 30 seconds while 'loading' the file. It will then finally start streaming the media sequentially with OK performance on one or maybe two clients at most, but when using VideoLAN VLC to view server media files on a client, I had to increase the buffer tenfold (from roughly 3 to 30 seconds of standard-definition programming) to avoid long stuttering pauses in playback. Within the MythTV frontend on the client side, video editing over the network is abominably slow: the editor needs tenths of seconds, then seconds, then minutes, then hours to respond to each keypress, getting slower all the time until it eventually grinds to a halt.

Listing directories, editing files, and viewing media with any of the text editors or media players I have installed all have at least 30 seconds of delay on 'opening' (sending a command from a terminal window, a Nautilus window, a text editor, or whatever), and the entire network eventually grinds to a standstill, with MythTV locked in an unusable state on the clients even though it still works fine on the server.

My server is a Core 2 Duo and so is my main media center client. The server is fully populated with 8 GB of memory and terabytes of storage; the client is sparsely populated with 2 GB of memory. I realize this is underpowered for HDTV media applications, but surely a Core 2 Duo should be able to serve at least one standard-definition media file at a time without any performance issues at all, and should handle text editors with its eyes closed. I also have an i7 laptop client with 8 GB of memory and a terabyte of storage that suffers from the same poor network performance, even after disabling the troublesome Broadcom wireless power management, and even after plugging in the gigabit wired connection and disabling wireless.

I have no security at all configured on t...

Tom Vijlbrief (tvijlbrief) wrote :
Download full text (12.2 KiB)

@cheryl

Converting your exports and mounts to NFS version 4 will probably fix your issue. I had similar issues, and that fixed it for me and others.
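For reference, a minimal NFSv4 conversion along the lines Tom suggests might look like the following sketch (the export path, subnet, and mount point are illustrative, not taken from this thread):

```
# server /etc/exports — export a single NFSv4 root (fsid=0)
/srv/nfs4  192.168.1.0/24(rw,sync,fsid=0,no_subtree_check)

# client /etc/fstab — mount relative to that root with the nfs4 type
server:/  /mnt/nfs4  nfs4  rw,hard  0  0
```

After editing /etc/exports, running `exportfs -ra` on the server re-reads the export table.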

Torsten Bronger (bronger) wrote :

System load figures are hard to compare. I have an AMD Turion II processor (approximately twice as fast as an Atom), and I manage 40 MB/s of NFS throughput with the CPU at 80%. Is that already an unusually high number, meaning that I may be affected by this bug?

I'm using Ubuntu Server 13.04.

Karsten Suehring (suehring) wrote :

I did some testing with a newer kernel on Debian a while ago. The 3.x series did seem to have higher load, but as far as I could test, it did not kill the server. If you have multiple clients connected to the server, a good test would be to start several writes (e.g. dd from /dev/random) over NFS and see whether the server can handle that. It did not take many writes to completely kill my server on 12.04.
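Karsten's suggested stress test might be sketched like this. The loop of parallel writers stands in for multiple clients; TARGET defaults to a temp directory here so the sketch is safe to run, but in a real test it would be an NFS mount point, and /dev/urandom replaces /dev/random so the writers do not block on entropy:

```shell
# Parallel-write stress test sketch (TARGET is a placeholder path).
TARGET=${TARGET:-$(mktemp -d)}    # replace with an NFS mount point
for i in 1 2 3 4; do
    # each writer streams 16 MiB of random data concurrently
    dd if=/dev/urandom of="$TARGET/stress-$i" bs=1M count=16 2>/dev/null &
done
wait
ls -l "$TARGET"
```

While this runs against a real NFS mount, watching `top` on the server shows whether nfsd load climbs as described in this bug.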

It would be good news if this issue were resolved simply by upgrading to the newer kernel versions in later releases, even without Ubuntu proactively working on a fix. But it would still remain an issue in the "Long Term Support" release 12.04.

My bottom line here is that Ubuntu is apparently caring more about mobile phones now than servers which is especially unfortunate after many people finally talked Dell into more support for Ubuntu. I did not even get a reply from the Ubuntu sales department (except for an automatic reply which promised to do so within two days) when we offered to pay for resolving the issue...

I have been struggling at work with the same problem for the last few months, and I accidentally stumbled on a fix that seems to work under Ubuntu 12.04 (3.5.0-32-generic 64-bit kernel):

1. On the clients, set rsize=8192,wsize=8192 in /etc/fstab (smaller values of rsize/wsize will also work, but reduce throughput).

Previously, with rsize=wsize=32K, any 2 clients writing large files to our server (1 Gbit NICs on the clients, 2x1 Gbit NICs on the server) would freeze it: all clients would appear to hang when accessing the NFS server, but the hang would eventually resolve itself (after several hours of writes at around 100k/sec with jbd2 on the server grinding away continuously, or within minutes if one of the 2 writing clients temporarily had its Ethernet turned off and then back on after the other client's write had completed).

Now 4 clients can each write to the server at 53 MB/sec, saturating the server bandwidth at about 210 MB/sec, and the network and server remain responsive from other interactive clients.
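As a sketch, the client-side /etc/fstab entry from step 1 might look like the following (the server name, export path, and mount point are placeholders; only the rsize/wsize options come from the reported fix):

```
# client /etc/fstab — hypothetical names, reduced NFS block sizes
server:/export  /mnt/nfs  nfs  rw,hard,rsize=8192,wsize=8192  0  0
```

The change takes effect after unmounting and remounting the export on each client.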

Changed in linux (Debian):
status: Incomplete → Fix Released
masakre (informatikoa1) wrote :

Is this bug solved? I am having similar issues with my Ubuntu Server 12.04.4 64-bit. I am using LDAP authentication with NFS to mount the users' home directories. The server uses two gigabit Ethernet links with bonding, there are 20 users, and it has 2x4 processors with 12 GB of RAM (it is an Acer Gateway GR320 F1 server). I moved the server to 12.04 because we were running an old version (7.10) and having some problems, but since the change the clients are too slow. I have noticed that the server's processor load is high, and I think this bug could be affecting me.

Sorry about my English, and thank you in advance :)

dilan (dilanasanga-x) wrote :

Hi All,

I have also set up the same environment at my office for developers working with Java, PHP, etc., and it was very slow initially.

Later I learned that programs like NetBeans, Firefox, and Chrome, as well as the users' login sessions, create their caches inside the users' home directories (as hidden directories), so when a lot of users are logged in, a huge amount of disk I/O results from those data operations, because all users are writing to home directories on the same disk (I have RAID 1 set up there).

What I did was simply move all those cache directories to the users' local hard disks and create symbolic links to them. This gave a significant performance improvement.

Like this:
root@rcapladm:/home/dilan# pwd
/home/dilan

root@rcapladm:/home/dilan# ls -la

lrwxrwxrwx 1 dilan users 25 May 5 09:55 .local -> /rcapl/home/dilan/.local
lrwxrwxrwx 1 dilan users 27 May 5 09:55 .mozilla -> /rcapl/home/dilan/.mozilla
lrwxrwxrwx 1 dilan users 25 Aug 18 08:37 .mysql -> /rcapl/home/dilan/.mysql
lrwxrwxrwx 1 dilan users 29 Aug 18 08:38 .mysqlgui -> /rcapl/home/dilan/.mysqlgui/
lrwxrwxrwx 1 dilan users 28 May 5 09:55 .netbeans -> /rcapl/home/dilan/.netbeans
lrwxrwxrwx 1 dilan users 34 May 5 09:55 .netbeans-derby -> /rcapl/home/dilan/.netbeans-derby

Because their caching (which accounts for most of the disk I/O) now happens on the local machine instead of the server, the server no longer sees heavy disk I/O. Compiling a Java application with "mvn clean compile" was still slow, however, so I also applied a simple trick in their pom.xml to put the compilation output directory on the local machine rather than the server. That made compilation fast as well, and all their source code remains on the server, protected.
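The relocate-and-symlink trick described above can be sketched as follows. It runs in a temp sandbox so it is safe to execute; in real use the paths would be the NFS-mounted home directory and a directory on the local disk:

```shell
# Sketch: move a cache directory off the shared (NFS) home onto local
# disk and leave a symlink where it used to live. All paths are
# placeholders created inside a temp sandbox.
SANDBOX=$(mktemp -d)
NFS_HOME="$SANDBOX/nfs-home"     # stands in for the NFS-mounted home
LOCAL_DISK="$SANDBOX/local"      # stands in for the local hard disk

mkdir -p "$NFS_HOME/.mozilla" "$LOCAL_DISK"
echo profile > "$NFS_HOME/.mozilla/prefs.js"

# Relocate the cache and symlink it back into the home directory.
mv "$NFS_HOME/.mozilla" "$LOCAL_DISK/.mozilla"
ln -s "$LOCAL_DISK/.mozilla" "$NFS_HOME/.mozilla"
ls -la "$NFS_HOME"
```

Applications keep using `~/.mozilla` as before, but the actual reads and writes land on the local disk instead of the NFS export.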

I don't claim the system is 100% perfect; a little slowness remains.

My biggest problem is still this: if a user's network connection goes away, the NFS mount suddenly disappears and the system gets stuck with increasing load. I cannot see any program consuming the load when I check with "top", but the system gets stuck, and even when the network comes back, it fails to remount the user's home automatically 90 times out of 100. We cannot even do it manually, because when I type "df -h" it shows nothing and just hangs trying to read the mount information.

Any idea or solution?

Thanks all.

Specs of my NFS & LDAP server:

Intel Core 2 Duo, 4 GB RAM, 500 GB HD (normal) with RAID 1.
Around 10 users working in this environment.
