nfs kernel server is very slow and causing high cpu load

Bug #1071978 reported by Tim Lunn
This bug affects 3 people
Affects: nfs-utils (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I have been noticing nfs-kernel-server causing very high CPU load, and transfers are incredibly slow. In fact, for big files it tends to hang up completely and never finish.

I did briefly try the 3.5 backport kernel; however, it also seemed to have the same issue.

This may be a duplicate of bug #879334; however, that bug is missing log files, and apport-collect told me to file a new bug report instead.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: nfs-kernel-server 1:1.2.5-3ubuntu3.1
ProcVersionSignature: Ubuntu 3.2.0-32.51-generic 3.2.30
Uname: Linux 3.2.0-32-generic x86_64
ApportVersion: 2.0.1-0ubuntu14
Architecture: amd64
Date: Sat Oct 27 14:50:09 2012
InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Beta amd64 (20120328)
ProcEnviron:
 LANGUAGE=en_AU:en
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_AU.UTF-8
 SHELL=/bin/bash
SourcePackage: nfs-utils
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Tim Lunn (darkxst) wrote :

Looks like ubuntu-bug didn't actually pick up any logs after all, so I will attach a few. Let me know if you want anything else.

Revision history for this message
Dave Gilbert (ubuntu-treblig) wrote :

You say nfs-kernel-server is causing high CPU load - how do you know it's that? top?
What type of traffic is it transferring: large files, or lots of small files?

Please add:
    * Your /etc/exports from the server
    * The output of cat /proc/mounts on the client, to show how it's mounted
    * While it's being slow, run vmstat 5 for 30 seconds or so and paste the output

Dave
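
For reference, the requested information can be gathered with commands roughly like the following (a sketch; the paths are the ones from this bug, so adjust to your setup):

    # On the server: the export configuration
    cat /etc/exports

    # On the client: how the share is actually mounted
    grep nfs /proc/mounts

    # On the server, while a transfer is being slow:
    # print CPU/IO stats every 5 seconds, stop after ~30 seconds with Ctrl-C
    vmstat 5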

Revision history for this message
Dave Gilbert (ubuntu-treblig) wrote :

oops, that's vmstat 5 on the server

Revision history for this message
Tim Lunn (darkxst) wrote :

Yes, from top. It seems to have high load for most transfers; however, the hangups seem to happen either with very big files (~5+ GB) or with transfers of many small files.

#exports
/media/store 192.168.1.0/24(rw,no_subtree_check)

#cat /proc/mounts
servs:/media/store/ /media/store nfs4 rw,relatime,vers=4.0,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.13,local_lock=none,addr=192.168.1.15 0 0
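
For context, that client entry corresponds to a mount along these lines (a sketch only; the rsize/wsize values shown in /proc/mounts may have been negotiated by the client rather than specified explicitly):

    # On the client (sketch; adjust options as needed)
    sudo mount -t nfs4 -o proto=tcp,hard,rsize=8192,wsize=8192 servs:/media/store/ /media/store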

Revision history for this message
Dave Gilbert (ubuntu-treblig) wrote :

OK, let's see what we've got; if I'm reading this vmstat correctly, it's split pretty
much between system time and wait (for IO):

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 5 1 43964 120664 346700 2267880 0 0 17 18 16 13 2 0 97 0
 0 0 43964 192872 343108 2216144 0 0 178 79081 13851 18234 2 30 45 23
 1 1 43968 321924 339260 2051828 0 2 937 78577 14664 21679 3 34 33 30
 1 4 43968 144504 329904 2311564 0 0 124 104639 21611 34747 3 46 8 44
 1 2 43968 145264 319808 2323488 0 0 328 108663 24725 40773 3 39 19 40
 1 4 43968 146860 317844 2325192 0 0 140 183331 38848 64357 2 46 4 47

Now, the way I'm reading that, it's shifting about 100-180 MB/s (the 'bo' column), and reading your logs I think that's
going to a spinning disc, so that's a perfectly respectable transfer rate; I wouldn't actually have expected more than 100 MB/s,
though (i.e. 1000 Mbps), assuming gigabit Ethernet. What type of transfer were you doing at the time - a large file or lots of small
ones?

The 'wait' time doesn't worry me; it seems reasonable if it's waiting for the disk.

To be honest I'd expect shifting 100MB/s over NFS would be pushing one of your cores pretty hard, so I'm not too surprised your machine is sluggish with that kind of load.

There are two things which might get some more detail:
   1) Try to capture what happens during a 'hangup': is it a complete hang of the server? Are there any log messages? If you watch a vmstat 1 during the period of the hangup, what's going on (does the IO ever drop to being much lower)?

   2) Try to use something to see what's eating your system time (I'd bet on the rtl ether card, perhaps?); the perf command is good for seeing where system time is going:
       a) Start your load going
       b) Run sudo perf record -a
            (it generates a log file)
       c) After 30 seconds or so, Ctrl-C it
       d) Run sudo perf report --stdio > myperfreport
       e) Attach the myperfreport to this bug report so we can see what your kernel is doing.
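
As a single sequence, roughly (assuming the NFS load is already running; on Ubuntu, perf is provided by the linux-tools packages for your kernel):

    # Record a system-wide profile while the transfer is running;
    # this writes a perf.data file in the current directory
    sudo perf record -a
    # ...let it run for ~30 seconds, then press Ctrl-C...

    # Convert the recording to a plain-text report and attach it to the bug
    sudo perf report --stdio > myperfreport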

Revision history for this message
Dave Gilbert (ubuntu-treblig) wrote :

Oh, the other thing you might want to check is whether you're running with jumbo frames or not; they should help the load at both ends.
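
A quick way to check and (temporarily) change this - the interface name eth0 is an assumption, and everything on the path, NICs and switch included, must support the larger MTU:

    # Show the current MTU (1500 = standard frames)
    ip link show eth0

    # Temporarily enable jumbo frames; revert with "mtu 1500"
    sudo ip link set dev eth0 mtu 9000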

Revision history for this message
Tim Lunn (darkxst) wrote :

Yes, it is over a gigabit link.

Once I hit a hangup, IO drops right off to around 1 MB/s:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 2490892 112292 87920 807736 0 0 0 1520 1689 2452 1 1 53 45
 1 1 2490892 110960 87920 808820 0 0 0 1672 1831 2632 2 2 50 47
 1 1 2490892 110340 87920 809580 0 0 0 1384 1551 2213 1 1 51 47
 1 1 2490892 109440 87936 810288 0 0 0 1388 1548 2390 2 1 45 52
 1 6 2490892 108572 87944 811148 0 0 0 1472 1555 2298 2 2 49 47

Revision history for this message
Tim Lunn (darkxst) wrote :

So I cannot reproduce the hangup issue when transferring many small files (source code). Running with a mix of files all greater than 50 MB, it always seems to hang up on the big files (i.e. >5 GB). At this point (the hangup) it seems from the client that the mount has gone stale; however, there is still a small amount of data trickling through on the running transfer.

Revision history for this message
Dave Gilbert (ubuntu-treblig) wrote :

OK, so we've got to be careful to keep the debugging together; I think you really have two separate things going on:
  a) Slow when transferring lots of stuff (as per your vmstat in #9)
  b) 'Hang-ups'

Is that perf report from the 'hang-up' state? It doesn't seem to have much system/kernel CPU usage, so I'm assuming that perf report (#13) corresponds to the vmstat in #12 - is that correct? (That perf doesn't seem to show much of anything doing work, which again makes sense if it's the hang-up state, when it's mostly idle/wait.)

OK, so for (b) - have you got a dmesg from the server after a hangup?
The other thing I can suggest for (b) is to try reducing the 'dirty_ratio'; it causes the kernel to start writing out to disk sooner;
try (as root):
    echo 2 > /proc/sys/vm/dirty_ratio

and see if that makes any difference to the hang-up case.

Now, I think it's still worth investigating the (a) case - i.e. the high CPU load when transferring a lot; if you can get a perf report for that case, it would be interesting to tie things up.

Dave

Revision history for this message
Tim Lunn (darkxst) wrote :

Yes, perf report (#13) corresponds to the vmstat in #12.

Here is another perf report corresponding to the high-load case (a) in #9.

I just get permission denied (even as root) when trying to change that dirty_ratio file.

Revision history for this message
Dave Gilbert (ubuntu-treblig) wrote :

Note that sudo echo 2 > /proc/sys/vm/dirty_ratio will give you the permission denied (because the > redirection happens in your own, non-root shell),
so if you do

echo 2 > /proc/sys/vm/dirty_ratio

from a root shell (e.g. sudo -s) it should work.
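
Something along these lines should also work without opening an interactive root shell:

    # Run the whole command line, including the redirection, as root
    sudo sh -c 'echo 2 > /proc/sys/vm/dirty_ratio'

    # Or let tee (running as root) do the write
    echo 2 | sudo tee /proc/sys/vm/dirty_ratio

    # Or set it via sysctl
    sudo sysctl -w vm.dirty_ratio=2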

Revision history for this message
Tim Lunn (darkxst) wrote :

Hmm, yeah, got that eventually. However, changing the dirty ratio appears to make no difference to my hang-ups.

Also, in the process (as per your earlier suggestion) I discovered the switch in my router does not support jumbo frames.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nfs-utils (Ubuntu):
status: New → Confirmed