Copying large files to NFS mount blocks system

Bug #591947 reported by KÁDÁR Balázs
This bug affects 4 people
Affects: nfs-utils (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

The system becomes completely unresponsive for several seconds, then the screen updates and the mouse can be moved for a few seconds before the freeze repeats.

The server runs Debian Lenny:
Linux gurul 2.6.32-00007-g56678ec #1 PREEMPT Mon Feb 8 03:49:55 PST 2010 armv5tel GNU/Linux
unfs3 0.9.21+dfsg-1

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: nfs-common 1:1.2.0-4ubuntu4
ProcVersionSignature: Ubuntu 2.6.32-22.36-generic 2.6.32.11+drm33.2
Uname: Linux 2.6.32-22-generic i686
NonfreeKernelModules: nvidia
Architecture: i386
Date: Wed Jun 9 23:06:06 2010
InstallationMedia: Kubuntu 10.04 LTS "Lucid Lynx" - Release i386 (20100427)
ProcEnviron:
 LANGUAGE=
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: nfs-utils

Revision history for this message
KÁDÁR Balázs (balazs-kadar) wrote :

Syslog contains lots of error messages starting with:

Jun 9 22:21:11 mithrim kernel: [53384.935724] rpciod/0: page allocation failure. order:0, mode:0x4020
Jun 9 22:21:14 mithrim kernel: [53384.935733] Pid: 911, comm: rpciod/0 Tainted: P 2.6.32-22-generic #36-Ubuntu

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

The calculations that set vm.min_free_kbytes are too parsimonious. This leads to log messages that start with the text:

fooprog: page allocation failure. order:0, mode:0x4020

and go on for dozens of lines.
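
A minimal sketch, not part of the original report, of how these lines could be picked out of a syslog and counted per process. The function names and the regular expression are illustrative assumptions based only on the message format quoted above.

```python
import re
from collections import Counter

# Matches the first line of a failure report, e.g.
#   rpciod/0: page allocation failure. order:0, mode:0x4020
# The pattern is an assumption derived from the messages quoted in this bug.
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def parse_failure(line):
    """Return (command, order, mode) for a failure line, or None otherwise."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    return match.group("comm"), int(match.group("order")), int(match.group("mode"), 16)


def count_by_command(lines):
    """Count page allocation failures per reporting command name."""
    counts = Counter()
    for line in lines:
        parsed = parse_failure(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts
```

Passing the lines of a syslog file to count_by_command() gives a quick view of which kernel threads or programs hit the failure most often.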

By doubling the value set in vm.min_free_kbytes I was able to squelch those messages.

See https://gist.github.com/790577 https://gist.github.com/792128 https://gist.github.com/790584 for log messages

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Run:
sysctl vm.min_free_kbytes
Take the number of KB it prints and multiply it by two.

Then run:
sysctl -w vm.min_free_kbytes=new-number-of-KB

Substitute the value you calculated for "new-number-of-KB". In my case it was 16266 KB, doubled to 32532 KB.

To make this persist across reboots, put the setting in a file such as /etc/sysctl.d/e1000e-bug-fix.conf:

#
# double amount of memory kept free
#
# 16266 KB -> 32532 KB
#
vm.min_free_kbytes = 32532
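
The doubling and the config fragment above can be sketched as a small script. This is only an illustration: the function names are hypothetical, and it assumes the current value is readable from /proc/sys/vm/min_free_kbytes, the file behind the sysctl key.

```python
from pathlib import Path

# Assumed location of the live value; it backs the vm.min_free_kbytes sysctl.
PROC_PATH = Path("/proc/sys/vm/min_free_kbytes")


def read_current_kb(path=PROC_PATH):
    """Read the current vm.min_free_kbytes value in KB."""
    return int(path.read_text().strip())


def doubled_conf(current_kb):
    """Render a sysctl.d fragment that doubles the given value."""
    new_kb = current_kb * 2
    return (
        "#\n"
        "# double amount of memory kept free\n"
        "#\n"
        f"# {current_kb} KB -> {new_kb} KB\n"
        "#\n"
        f"vm.min_free_kbytes = {new_kb}\n"
    )
```

Calling doubled_conf(read_current_kb()) on the affected machine yields text that can be saved under /etc/sysctl.d/; writing the file and applying it still need root.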

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

This is probably not a regression. I'm seeing both Lucid and Jaunty KVM guests with this problem too. The KVM host is running Lucid.

Jaunty VM guest with virtio IRQ page allocation failure: https://gist.github.com/793522

Lucid VM guest with virtio IRQ page allocation failure: https://gist.github.com/793545

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Additional report of this issue http://ubuntuforums.org/showthread.php?t=1452659

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Another report with the same pattern: a network driver IRQ occurs just before the page allocation failure.

http://ubuntuforums.org/showthread.php?p=10393393

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

I found a Karmic KVM VM guest with virtio where it also happened.
https://gist.github.com/793807

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

This is how the default value of min_free_kbytes is calculated. Note that the comments mention network bandwidth.

https://gist.github.com/793880

The comments say:
 min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
 min_free_kbytes = sqrt(lowmem_kbytes * 16)

So perhaps
 min_free_kbytes = sqrt(lowmem_kbytes * 32)
is more realistic in terms of what is actually needed to prevent this from happening?
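
To compare the two formulas numerically, here is a rough sketch. The function name is hypothetical, integer square root is used as an approximation of the kernel's arithmetic, and any clamping the kernel applies to the result is ignored.

```python
import math


def min_free_kbytes(lowmem_kbytes, factor=16):
    """Approximate min_free_kbytes = sqrt(lowmem_kbytes * factor).

    factor=16 corresponds to the formula quoted from the kernel comments;
    other factors are what-if values, not kernel behaviour.
    """
    return math.isqrt(lowmem_kbytes * factor)


if __name__ == "__main__":
    lowmem = 1024 * 1024  # 1 GiB of low memory, expressed in KB
    for factor in (16, 32, 64):
        print(factor, min_free_kbytes(lowmem, factor))
```

With 1 GiB of low memory, a factor of 16 gives 4096 KB and a factor of 32 gives 5792 KB, an increase of about 1.41x. Because the factor sits under the square root, a factor of 64 would be needed to match the doubling used in the manual workaround earlier in this report.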

Revision history for this message
Divinsa Development (dev-divinsa) wrote :

Reproduced multiple times on 10.04

On 10.04 this is also happening to us with vm.min_free_kbytes set to 11140:

# sysctl vm.min_free_kbytes
vm.min_free_kbytes = 11140

We run multiple (10+) 10.04 instances on EC2 and have reproduced this over 15 times, most often ending in hung, non-responsive servers rather than a recovery.

It's fairly easy to reproduce by raising the MTU to 9000 from the default 1500 and moving large files, at which point the system hangs or crashes.

I'm now increasing vm.min_free_kbytes to 32252 to see how we fare at that point.

We would love to use jumbo frames as well, but they cause a crash within a few days (when we get high NFS load, and therefore high network load).

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

@Divinsa Development: for virtual machines there is a second issue. See:
https://bugs.launchpad.net/bugs/579276

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

The Lucid proposed kernel with the virtio-net NAPI patch passed all of the QA Team's regression testing:
https://wiki.ubuntu.com/QATeam/KernelSRU-lucid-2.6.32-30.59
