Memory leaks when using NFS

Bug #1047566 reported by Dmitry Nikiforov
This bug affects 15 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

When Ubuntu Server 12.04 (kernels 3.2.0-23 and 3.2.0-29, x86_64, with or without the latest updates) is used as an NFS server under fairly heavy read activity from clients (no writing), available memory is quickly exhausted. The exported volume holds a lot of small files split into many subdirectories (about 5-10 files or subdirectories per directory, in a tree-like structure not unlike that of a Squid proxy cache). No single process shows that much memory in use, nor do the "buffers" or "cached" figures in the output of the "free" command. The server eventually runs out of memory and crashes.

slabtop shows that the majority of memory is being used by idr_layer_cache (3.6G on a server with 4G of RAM shortly before the kernel started killing processes and eventually crashed).
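For anyone who wants the same ranking without an interactive slabtop session, here is a minimal sketch that sorts slab caches by approximate size. It assumes the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...) and that you can read /proc/slabinfo, which normally requires root. The function name top_slabs is made up for illustration. The figure is num_objs * objsize, so it may come out slightly lower than slabtop's CACHE SIZE, which counts whole slabs.

```shell
#!/bin/sh
# Sketch: rank slab caches by approximate memory use (num_objs * objsize).
# top_slabs is a hypothetical helper name, not part of any package.
# Usage: top_slabs [slabinfo-file] [count]
top_slabs() {
    local file="${1:-/proc/slabinfo}" count="${2:-10}"
    # slabinfo 2.x rows: name active_objs num_objs objsize objperslab ...
    # Skip the version line and the "# name" legend line.
    awk '!/^slabinfo/ && !/^#/ { printf "%d %s\n", $3 * $4 / 1024, $1 }' "$file" |
        sort -rn | head -n "$count"
}
```

Running `top_slabs` as root prints "KiB name" pairs, largest first; on an affected server idr_layer_cache would be expected near the top.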

The filesystem being shared is ext4. Clients (also the same version of Ubuntu Server) mount the volume in read-only mode, with default options.

P.S. Also tried i386 version, with the same result.

Revision history for this message
Dmitry Nikiforov (dmitryn) wrote :

Also, the next top memory user in slabtop is ext4_inode_cache - usually at about 1/4 to 1/2 of idr_layer_cache.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status: New → Confirmed
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1047566/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Revision history for this message
Dmitry Nikiforov (dmitryn) wrote :

One more detail: about 1-2% of the files on the volume are being replaced by a process, running locally on the NFS server, every 5 minutes. The total number of files on the volume is about 8 million.

Andrew Martin (asmartin)
affects: ubuntu → nfs-utils (Ubuntu)
Revision history for this message
Andrew Martin (asmartin) wrote :

I am also experiencing this bug on 12.04 amd64 with 3.2.0-26-generic and 3.2.0-27-generic. I am running an NFS server and a Samba server. Similar to the original reporter I am serving many small files from an ext4 partition to a number of NFS clients. I believe they are almost exclusively using NFSv3. This server has 6GB of RAM, which idr_layer_cache will consume entirely until the OOM killer is invoked. The only workaround I have found so far is rebooting the server. What other debug information can I provide to help resolve this bug?

Attached is the output of nfsstat.

Revision history for this message
Dmitry Nikiforov (dmitryn) wrote :

I was hesitant to say that this is definitely an NFS issue, since it might also be an EXT4 issue (judging by the fact that ext4_inode_cache is the second largest slab)...

Need to test this without EXT4.

Revision history for this message
Andrew Martin (asmartin) wrote :

I spoke with someone else who had this same issue on both 3.2.0-29-generic and 3.2.0-30-generic (Ubuntu-specific kernels). He was running ext3, so this is not exclusively an ext4 issue. This person reports that switching to the mainline kernel, version 3.2.27-generic, appears to resolve the problem.

Revision history for this message
Taylan Develioglu (tdevelioglu) wrote :

linux-image-3.2.0-0.bpo.3-amd64 3.2.23-1~bpo60+2

We are experiencing the exact same issue.

slabtop reports 14G+ of idr_layer_cache objects after a while.

Revision history for this message
Brian Norris (computersforpeace) wrote :

I think Andrew is referring to my team in comment #7 (my NFS exports are on an ext3 partition). I can confirm that mainline kernel 3.2.27-generic does *not* resolve the leak, as I recently noticed the leak again. I think it was a NFSv3 vs. v4 issue, as many of my clients moved back to v3 as an attempt to resolve the issue. But a few rarely-utilized clients remained, and it seems that NFSv4 activity from these clients correlates with memory leakage in idr_layer_cache.

So, I've seen the leak on all the following:

3.2.0-30-generic
3.2.0-29-generic
3.2.27-030227-generic

The last kernel is a vanilla build from the Ubuntu PPA:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2.27-precise/

So, I think that this is an upstream leak, at least in the 3.2.x stable branch.
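Since the leak seems to track NFSv4 activity, it may help to check which protocol version each client mount actually negotiated. A minimal sketch, assuming the client lists its NFS mounts in /proc/mounts with a vers= (or nfsvers=) option; the function name nfs_mount_versions is made up for illustration.

```shell
#!/bin/sh
# Sketch: print "mountpoint version" for every NFS mount on a client.
# nfs_mount_versions is a hypothetical helper name.
# Usage: nfs_mount_versions [mounts-file]
nfs_mount_versions() {
    local file="${1:-/proc/mounts}"
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk '$3 == "nfs" || $3 == "nfs4" {
        ver = "unknown"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^(nfs)?vers=/) {
                sub(/^(nfs)?vers=/, "", opts[i])
                ver = opts[i]
            }
        print $2, ver
    }' "$file"
}
```

If a mount shows "unknown", `nfsstat -m` on the client is another place to look.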

Revision history for this message
Kendall Hopkins (softwareelves) wrote :

I'm also experiencing the same issue, though in my case `fsnotify_event` is also leaking a similar amount of RAM. I'm using NFSv4 on ext4 with kernel 3.2.0-24-virtual on EC2.

Revision history for this message
Taylan Develioglu (tdevelioglu) wrote :

Looks like this was fixed in 3.2.33

commit f42ce0ca9eaf8a71f95dd0909c3ade7ab9cd824d
Author: J. Bruce Fields <email address hidden>
Date: Wed Aug 29 15:21:58 2012 -0700

    nfsd4: fix nfs4 stateid leak

    commit cf9182e90b2af04245ac4fae497fe73fc71285b4 upstream.

    Processes that open and close multiple files may end up setting this
    oo_last_closed_stid without freeing what was previously pointed to.
    This can result in a major leak, visible for example by watching the
    nfsd4_stateids line of /proc/slabinfo.

    Reported-by: Cyril B. <email address hidden>
    Tested-by: Cyril B. <email address hidden>
    Signed-off-by: J. Bruce Fields <email address hidden>
    Signed-off-by: Ben Hutchings <email address hidden>
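
The commit message suggests watching the nfsd4_stateids line of /proc/slabinfo. A minimal sketch of that, assuming the slabinfo 2.x layout where the second column is the active object count; slab_active_objs is a made-up helper name, and reading /proc/slabinfo normally requires root.

```shell
#!/bin/sh
# Sketch: print the active object count of one named slab cache.
# slab_active_objs is a hypothetical helper name.
# Usage: slab_active_objs CACHE_NAME [slabinfo-file]
slab_active_objs() {
    local name="$1" file="${2:-/proc/slabinfo}"
    # Column 1 is the cache name, column 2 is active_objs.
    awk -v n="$name" '$1 == n { print $2 }' "$file"
}

# Example polling loop, left commented out so sourcing this file has no side effects:
# while sleep 60; do
#     echo "$(date +%T) $(slab_active_objs nfsd4_stateids)"
# done
```

A count that only ever rises under steady NFSv4 load would be consistent with the leak described in the commit.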

Revision history for this message
Arthur Zalevsky (aozalevsky) wrote :

Confirmed. With kernel 3.2.34-030234-generic x86_64 from the mainline kernel PPA, there have been no memory leaks for 3 days.

Revision history for this message
Brian Norris (computersforpeace) wrote :

I believe the Ubuntu 3.2.0-34 kernel now should contain the upstream commit (see bug #1075355). I'm not using NFSv4 anymore (because of this bug), but I may switch back temporarily to try to confirm this fix myself...

Can anyone else confirm that this is fixed in the repos?

Revision history for this message
Esa Ollitervo (fixie-c) wrote :

I've recently been wondering why my kvm server runs out of memory when I try to run minecraft server on one virtual machine. It looks like the problem is caused by this bug. I'm now running 3.2.0-36 on the host and virtual machines but I still cannot run minecraft on the virtual machine without idr_layer_cache growing endlessly until the server runs out of memory.

The server is running stable if I don't try to run minecraft. If I do it usually runs out of memory in a few hours.

On the server I have six virtual machines running.
The server has 18GB memory and virtual machines use only 8GB.
I have exported home directories from the server to virtual machines using NFSv3.
The filesystem for home directories is formatted as XFS.

On the minecraft virtual machine I have 2GB memory and an LVM partition for minecraft and other stuff.
The lvm partition resides on the same physical disk as the home directories.

Revision history for this message
Esa Ollitervo (fixie-c) wrote :

Oh I forgot to mention that another virtual machine has problems even when there's still memory left.
I cannot log in via ssh on the virtual machine and on the console there are a few messages like this:

task xx:xx blocked for more than 120 seconds.

Revision history for this message
Stephen Mercier (stephen-mercier) wrote :

Running Ubuntu 12.04.1 Server

uname -a = 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

We are using NFSv4 on top of ext4, and we are also seeing the exact same behavior. The server had 16GB of memory, and appears to gobble up about 0.5 - 1GB per day until running out of memory and eventually crashing. We tried backing off to NFSv3 at one point, but with no luck. Is there any progress being made with regard to this issue? Is there any assistance I can provide?

Revision history for this message
Daniel Jarman (daniel-jarman) wrote :

Also confirmed on 12.04.2 server

Upgrading to the quantal release kernel 3.5 didn't fix the problem:

uname -a
3.5.0-23-generic #35~precise1-Ubuntu SMP Fri Jan 25 17:13:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I have tried both NFSv3 and v4 mounts with the same result. When the problem occurs I have observed excessive I/O from jbd2, with slabtop reporting idr_layer_cache continuously growing. /proc/meminfo also shows SUnreclaim growing until the system crashes, confirming that this is a leak...
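
To track the SUnreclaim figure mentioned above over time, something like the following sketch could be run from cron. It assumes the usual "Name:   value kB" layout of /proc/meminfo; meminfo_kb is a made-up helper name.

```shell
#!/bin/sh
# Sketch: print one /proc/meminfo field in kB.
# meminfo_kb is a hypothetical helper name.
# Usage: meminfo_kb FIELD [meminfo-file]
meminfo_kb() {
    local field="$1" file="${2:-/proc/meminfo}"
    # Lines look like "SUnreclaim:      3740000 kB"; match on the "Name:" token.
    awk -v f="$field:" '$1 == f { print $2 }' "$file"
}

# Example, left commented out:
# echo "$(date '+%F %T') SUnreclaim=$(meminfo_kb SUnreclaim) kB" >> /var/tmp/sunreclaim.log
```

The log path in the commented example is only a placeholder.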

Is changing to the mainline kernel the only proposed fix for this issue?

Revision history for this message
Anders Hall (a.hall) wrote :

We have seen this for many months now. The only workaround we have found is, as mentioned, to reboot when memory is close to exhaustion.

The release containing the fix quoted below did not work.

"Processes that open and close multiple files may end up setting this
    oo_last_closed_stid without freeing what was previously pointed to.
    This can result in a major leak, visible for example by watching the
    nfsd4_stateids line of /proc/slabinfo"

This micro machine on EC2 will soon crash. We don't have that many files on NFS and mostly read from it. We also load a few large files when processes start (700 MB or so, read once).

uname -a
Linux ip-10-48-5-128 3.2.0-36-virtual #57-Ubuntu SMP Tue Jan 8 22:04:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
855315 855315 100%    0.53K  57021       15    456168K idr_layer_cache
 55040  55040 100%    0.02K    215      256       860K kmalloc-16

Is there any way to solve this on the client side by changing how read/write operations are done?

Dave Chiluk (chiluk)
Changed in nfs-utils (Ubuntu):
assignee: nobody → Dave Chiluk (chiluk)
Dave Chiluk (chiluk)
affects: nfs-utils (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Dave Chiluk (chiluk) → nobody
assignee: nobody → Dave Chiluk (chiluk)
Dave Chiluk (chiluk)
tags: added: precise quantal
Changed in linux (Ubuntu):
importance: Undecided → Medium
Revision history for this message
penalvch (penalvch) wrote :

Dmitry Nikiforov, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest server release of Ubuntu? ISO images are available from http://releases.ubuntu.com/raring/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.11-rc7

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

tags: added: needs-kernel-logs needs-upstream-testing
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Dave Chiluk (chiluk) wrote :

I could not reproduce this problem, and will need some sort of reproduction procedure in order to pursue it further.

I created a filesystem with 680666 randomly named 4k files, spread out among 6811 nested directories at a max depth of 5 levels of directories.
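
For anyone trying to repeat this, a tree like the one described above could be generated along these lines. This is only a sketch: make_tree is a made-up helper, the fan-out and file counts are illustrative rather than the values actually used, and the files get sequential names instead of random ones.

```shell
#!/bin/sh
# Sketch: build a nested directory tree filled with 4 KiB files.
# make_tree is a hypothetical helper; names are sequential, not random.
# Usage: make_tree ROOT DEPTH DIRS_PER_LEVEL FILES_PER_DIR
make_tree() {
    local root="$1" depth="$2" dirs="$3" files="$4" i
    mkdir -p "$root"
    i=0
    while [ "$i" -lt "$files" ]; do
        # 4096 bytes of random data per file
        head -c 4096 /dev/urandom > "$root/f$i"
        i=$((i + 1))
    done
    if [ "$depth" -gt 0 ]; then
        i=0
        while [ "$i" -lt "$dirs" ]; do
            # Recurse one level down with the same fan-out
            make_tree "$root/d$i" $((depth - 1)) "$dirs" "$files"
            i=$((i + 1))
        done
    fi
}
```

For example, `make_tree /nfs/test 5 4 10` would create 1365 directories holding 13650 files; larger arguments get closer to the scale reported in this thread.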

I exported the filesystem using the following options
/nfs 192.168.122.1/24(rw,no_root_squash,no_subtree_check,async)

And mounted it with the following mount options. (rw,vers=4,addr=192.168.122.186,clientaddr=192.168.122.115)

I then spawned 10 threads on the client running
<<<<<<<<<<<<<<<<<<<<<<<<
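#!/bin/bash
# Read every regular file under the current directory once, discarding the data.
# IFS= and -r keep leading whitespace and backslashes in file names intact.
find ./ -type f | while IFS= read -r line
do
        cat "$line" > /dev/null
done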
>>>>>>>>>>>>>>>>>>>>>>>>

I could not get the idr_layer_cache to increase above 890k. I am running the following kernel- 3.2.0-53-virtual.
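
How the 10 readers were launched is not stated above. One possible way to drive that many concurrent readers from a single shell is sketched below. read_all and run_readers are made-up names, and this variant uses find -exec instead of the per-file loop, so it is not identical to the script shown.

```shell
#!/bin/sh
# Sketch: run N concurrent readers over a directory tree.
# read_all and run_readers are hypothetical helper names.

# Read every regular file under DIR once and print the total bytes read.
read_all() {
    find "$1" -type f -exec cat {} + | wc -c
}

# Usage: run_readers DIR [N]   (N defaults to 10)
run_readers() {
    local dir="$1" n="${2:-10}" i=0
    while [ "$i" -lt "$n" ]; do
        read_all "$dir" &
        i=$((i + 1))
    done
    wait   # block until every background reader has finished
}
```

Calling `run_readers /mnt/nfs 10` on the client (the mount point here is a placeholder) prints one byte count per reader once all of them have finished.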

Revision history for this message
FxMulder (fxmulder) wrote :

I have 3 web servers running kernel 3.2.0-55-generic on ubuntu 12.04.3 and using nfsv4 which seem to ramp up in memory usage which I can't account for in /proc/meminfo. I also have 4 web servers running ubuntu 12.10 and kernel 3.5.0-18-generic with nfsv4 and those machines run fine.

Revision history for this message
penalvch (penalvch) wrote :

FxMulder, so your hardware may be tracked, could you please file a new report from one of the web servers via a terminal:
ubuntu-bug linux

Revision history for this message
John Greenhalgh (john-greenhalgh) wrote :

Has there been any progress on this bug? I'm seeing the same issue as the first poster, with a very similar setup (thousands of files with heavy reading - serving PHP files and static HTML/images). The NFS server is running 12.04.4 LTS with kernel 3.2.0-67-virtual #101-Ubuntu SMP in EC2. SUnreclaim grows to about 7GB of memory in about 1 week. Slabtop reports increasing idr_layer_cache.

Revision history for this message
Anders Hall (a.hall) wrote :

Hey John. We had the same problem with extremely low load. Some form of leak. We never found a good solution (besides rebooting when the leak becomes critical, as a workaround). For the exact same purpose (code/file transfers), running on Ubuntu 14.04 is much better. I recommend you upgrade unless you can fix it yourself. Admins seldom pick up these older bugs.

Dave Chiluk (chiluk)
Changed in linux (Ubuntu):
assignee: Dave Chiluk (chiluk) → nobody
Revision history for this message
John Greenhalgh (john-greenhalgh) wrote :

Thanks Anders. The upgrade to 14.04 worked. I upgraded to 3.13.0-34-generic and SUnreclaim is no longer climbing. So, if anyone has a similar issue to me, the upgrade to the 14.04 kernel seems to work - or at least it did in my case.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired