inotify fd leak

Bug #1101666 reported by Adar Dembo on 2013-01-19
This bug affects 38 people
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Unassigned
Precise          Medium       Unassigned
Quantal          Medium       Unassigned

Bug Description

I'm running Ubuntu 12.04 in a VM. After a recent kernel upgrade, I'm finding that I can reliably put the system in a state where the inotify_init() syscall fails with EMFILE, even though /proc/*/fd shows fewer "anon_inode:inotify" entries than /proc/sys/fs/inotify/max_user_instances. Unfortunately, the only way I know to reproduce this is to run some internal Python unit tests that exercise pyinotify. After a few such invocations, there appears to be a leak.

Perhaps a regression of CVE-2010-4250?

adar@adar-dev:~$ for foo in /proc/*/fd/*; do readlink -f $foo; done | grep inotify | sort | wc -l
24

adar@adar-dev:~$ cat /proc/sys/fs/inotify/max_user_instances
128

adar@adar-dev:~$ cat inotify_test.c
#include <stdio.h>
#include <sys/inotify.h>

int main(void) {
  /* inotify_init() returns a new fd, or -1 with errno set on failure. */
  int fd = inotify_init();
  if (fd == -1) {
    perror("inotify_init");
    return 1;
  }
  return 0;
}
adar@adar-dev:~$ gcc inotify_test.c -o inotify_test
adar@adar-dev:~$ ./inotify_test
inotify_init: Too many open files
---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 2.0.1-0ubuntu17.1
Architecture: amd64
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: AudioPCI [Ensoniq AudioPCI], device 0: ES1371/1 [ES1371 DAC2/ADC]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: adar 2363 F.... pulseaudio
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Card0.Amixer.info:
 Card hw:0 'AudioPCI'/'Ensoniq AudioPCI ENS1371 at 0x2040, irq 16'
   Mixer name : 'Cirrus Logic CS4297A rev 3'
   Components : 'AC97a:43525913'
   Controls : 24
   Simple ctrls : 13
DistroRelease: Ubuntu 12.04
HibernationDevice: RESUME=UUID=69da1950-dcd9-4f58-bcfc-c575290982a5
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release amd64 (20111012)
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
 Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
MachineType: VMware, Inc. VMware Virtual Platform
MarkForUpload: True
Package: linux (not installed)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 svgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-36-generic root=UUID=a63e42ec-99f0-4141-a1f2-f0f1b0cc3dbf ro quiet splash
ProcVersionSignature: Ubuntu 3.2.0-36.57-generic 3.2.35
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-36-generic N/A
 linux-backports-modules-3.2.0-36-generic N/A
 linux-firmware 1.79.2
RfKill:

Tags: precise running-unity
Uname: Linux 3.2.0-36-generic x86_64
UpgradeStatus: Upgraded to precise on 2012-05-09 (254 days ago)
UserGroups: adm admin cdrom dialout lpadmin plugdev sambashare wireshark
dmi.bios.date: 07/02/2012
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: 6.00
dmi.board.name: 440BX Desktop Reference Platform
dmi.board.vendor: Intel Corporation
dmi.board.version: None
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 1
dmi.chassis.vendor: No Enclosure
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd07/02/2012:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:
dmi.product.name: VMware Virtual Platform
dmi.product.version: None
dmi.sys.vendor: VMware, Inc.

Adar Dembo (adembo) wrote :

I should add that when I reboot the system, the problem goes away until I run my pyinotify unit tests again.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1101666

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected precise running-unity
description: updated

Eugene Crosser (crosser) wrote :

I see this problem in quantal after kernel upgrade to 3.5.0-22-generic #34-Ubuntu. Perhaps it's due to one of these:

10 days ago, Lino Sanfilippo: fsnotify: pass group to fsnotify_destroy_mark()
10 days ago, Lino Sanfilippo: fsnotify: use reference counting for groups
10 days ago, Chris J Arges: Revert "UBUNTU: SAUCE: fsnotify: use reference counting...
10 days ago, Chris J Arges: Revert "UBUNTU: SAUCE: fsnotify: pass group to fsnotify...
2012-12-14, Lino Sanfilippo: UBUNTU: SAUCE: fsnotify: pass group to fsnotify_destroy...
2012-12-14, Lino Sanfilippo: UBUNTU: SAUCE: fsnotify: use reference counting for...

Eugene Crosser (crosser) wrote :

And by the way, two unexpected manifestations are:

1. The sound applet disappears from the panel, and Sound settings do not show any devices.
2. The Dropbox applet turns red and reports "Cannot access Dropbox folder".

So there may be a lot of duplicate reports in distant areas of Launchpad...

I would like to find out what goes wrong with inotify's user instance counting, but unfortunately I don't know how to reproduce this. @adar, could you be a little more precise about what your Python unit tests look like? Thanks.

Eugene Crosser (crosser) wrote :

#!/bin/bash

i=0
while [ $i -lt 1000 ];do
 i=$((i+1))
 echo -n "$i: "
 tail -f /etc/hosts >/dev/null &
 pid=$!
 sleep 1
 kill $pid
done

Running it gives this result:
[...]
65: /tmp/t.sh: line 11: 17838 Terminated tail -f /etc/hosts > /dev/null
66: /tmp/t.sh: line 11: 17840 Terminated tail -f /etc/hosts > /dev/null
67: /tmp/t.sh: line 11: 17842 Terminated tail -f /etc/hosts > /dev/null
68: tail: inotify cannot be used, reverting to polling: Too many open files
/tmp/t.sh: line 11: 17844 Terminated tail -f /etc/hosts > /dev/null
69: tail: inotify cannot be used, reverting to polling: Too many open files
/tmp/t.sh: line 11: 17847 Terminated tail -f /etc/hosts > /dev/null
70: tail: inotify cannot be used, reverting to polling: Too many open files
/tmp/t.sh: line 11: 17850 Terminated tail -f /etc/hosts > /dev/null
71: tail: inotify cannot be used, reverting to polling: Too many open files
^C
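
When comparing runs of the loop above across machines, it can help to pull out the iteration at which tail first falls back to polling. The following is a minimal sketch, assuming the script's combined stdout and stderr is piped in and that the message format matches the output shown above; the function name first_polling_fallback is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the number of the first iteration at which
# tail reported that inotify was unavailable. Reads the repro script's
# combined output on stdin and prints nothing if no fallback occurred.
first_polling_fallback() {
  # Lines of interest look like "68: tail: inotify cannot be used, ...".
  awk -F': ' '/^[0-9]+: tail: inotify cannot be used/ { print $1; exit }'
}
```

For example, `/tmp/t.sh 2>&1 | first_polling_fallback` would report the iteration for a run of the script above; the 2>&1 matters because tail writes the message to stderr.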

Adar Dembo (adembo) wrote :

Thanks, Eugene. I had begun putting together a repro case with pyinotify, but a shell + tail certainly takes the cake.

Adar Dembo (adembo) wrote :

I can confirm that Eugene's script yields the same results in my VM as well:

...
62: ./eugene_script.sh: line 11: 10940 Terminated tail -f /etc/hosts > /dev/null
63: ./eugene_script.sh: line 11: 10943 Terminated tail -f /etc/hosts > /dev/null
64: tail: inotify cannot be used, reverting to polling: Too many open files
./eugene_script.sh: line 11: 10945 Terminated tail -f /etc/hosts > /dev/null
65: tail: inotify cannot be used, reverting to polling: Too many open files
./eugene_script.sh: line 11: 10947 Terminated tail -f /etc/hosts > /dev/null

Leonid Evdokimov (darkk) wrote :

I confirm Eugene's words:

darkk@darkk-ya-thinkpad:~$ cat /proc/sys/fs/inotify/max_user_instances
128
darkk@darkk-ya-thinkpad:~$ ls -l /proc/*/fd/ 2>/dev/null | grep -c anon_inode:inotify
36
darkk@darkk-ya-thinkpad:~$ sudo ls -l /proc/*/fd/ 2>/dev/null | grep -c anon_inode:inotify
51
darkk@darkk-ya-thinkpad:~$ uname -a
Linux darkk-ya-thinkpad 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
darkk@darkk-ya-thinkpad:~$ uptime
 16:57:08 up 4 days, 4:55, 15 users, load average: 0.60, 0.97, 0.82
darkk@darkk-ya-thinkpad:~$ tail -f /etc/hosts >/dev/null
tail: inotify cannot be used, reverting to polling: Too many open files
^C
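
The two counts above differ because an unprivileged user cannot read every /proc/<pid>/fd directory. A small function makes the same count easier to repeat. This is only a sketch: the name count_inotify_fds and the optional root argument are illustrative, and it assumes inotify descriptors appear as symlinks to "anon_inode:inotify", as in the listings in this report.

```shell
#!/bin/bash
# Hypothetical helper: count inotify instances visible under a /proc-like
# tree (default: /proc). Entries the caller cannot read are skipped, so an
# unprivileged run may report fewer instances than a run as root.
count_inotify_fds() {
  local root="${1:-/proc}" n=0 link
  for link in "$root"/[0-9]*/fd/*; do
    # readlink prints the symlink target without resolving it further.
    if [ "$(readlink "$link" 2>/dev/null)" = "anon_inode:inotify" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

Running `count_inotify_fds` and comparing the result with `cat /proc/sys/fs/inotify/max_user_instances` gives the same comparison as above. Note that the limit applies per user, while this count covers every process the caller can inspect.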

Joseph Salisbury (jsalisbury) wrote :

Does this only happen when running Precise in a VM? Do you happen to know if the bug also exists on bare metal?

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: regression-update
tags: added: kernel-da-key
tags: added: needs-bisect quantal
Joseph Salisbury (jsalisbury) wrote :

@Adar, do you happen to know the last "Good" and first "Bad" kernel versions? If we can identify these two versions, I can perform a kernel bisect to identify the commit that introduced this bug.

It appears you are running 3.2.0-36.57-generic. If this is correct, can you test the following two kernels and report back:

3.2.34: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2.34-precise/
3.2.35: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2.35-precise/

Changed in linux (Ubuntu Precise):
status: New → Incomplete
Changed in linux (Ubuntu Quantal):
status: New → Incomplete
Changed in linux (Ubuntu Precise):
importance: Undecided → Medium
Changed in linux (Ubuntu Quantal):
importance: Undecided → Medium
Leonid Evdokimov (darkk) wrote :

@Joseph, I see the bug on real hw, but I use quantal

Eugene Crosser (crosser) wrote :

Happens on real amd64 iron here.

Eugene Crosser (crosser) wrote :

Running quantal.

The problem does NOT happen on this:
Linux pccross 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

The problem does happen on this:
Linux pccross 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Joseph Salisbury (jsalisbury) wrote :

I'm going to build a test kernel with the fsnotify patches reverted. I'll post the kernel shortly.

This may also be related to bug 1101355 and bug 1101797

Changed in linux (Ubuntu Precise):
status: Incomplete → In Progress
status: In Progress → Confirmed
Changed in linux (Ubuntu Quantal):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Precise):
status: Confirmed → In Progress
Changed in linux (Ubuntu Quantal):
status: Confirmed → In Progress
Changed in linux (Ubuntu Precise):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Quantal):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Precise):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Quantal):
assignee: Joseph Salisbury (jsalisbury) → nobody
Joseph Salisbury (jsalisbury) wrote :

Actually the quickest test would be to boot the 3.2.0-36.56 kernel and see if it also exhibits this bug. It can be downloaded from:

https://launchpad.net/ubuntu/+source/linux/3.2.0-36.56

From that link, select your particular arch under the "Builds" section.

Thanks in advance!

Changed in linux (Ubuntu Precise):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Quantal):
assignee: nobody → Joseph Salisbury (jsalisbury)
tags: removed: needs-bisect
Joseph Salisbury (jsalisbury) wrote :

One more request in addition to testing 3.2.0-36.56: can folks affected by this bug also test the v3.2.0-35 kernel, which can be downloaded from:

https://launchpad.net/ubuntu/+source/linux/3.2.0-35.55

Again, from that link, select your particular arch under the "Builds" section.

Thanks in advance!

markusj (markusj) wrote :

@jsalisbury: As stated in #1101797 comment 4: Yes, 3.2.0-36 is also affected; it introduced the bug into the 3.2 kernel line. The previous kernel, 3.2.0-35, is (still) working fine.

Joseph Salisbury (jsalisbury) wrote :

@markusj, There are actually two versions of 3.2.0-36. It appears 3.2.0-36.57 has the bug, since that is what the bug was reported against. However, there is also 3.2.0-36.56. I was curious to see if that version has the bug since it contains a different version of the fsnotify patches.

Changed in linux (Ubuntu Precise):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Quantal):
assignee: Joseph Salisbury (jsalisbury) → nobody

First, thanks to Eugene and Adar for reporting and investigating.
I have had some time to debug this and found the cause of the bug. Apparently something went wrong when the patch series mentioned above was ported from mainline to the Ubuntu kernel: the function fsnotify_destroy() is never called in the Ubuntu kernel. That function ensures that all pending events are flushed, which releases the references those events hold on an fsnotify group.
The fix is therefore to call fsnotify_destroy() in inotify_release(). Otherwise there will always be references held on the inotify group and the group will never be destroyed, so sooner or later the number of live groups exceeds the allowed maximum.
The same flaw can be found in the fanotify code. I will attach a patch that should fix the ref counts for both inotify and fanotify.
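
The ref-count reasoning above can be illustrated with a toy model. This is not kernel code and not the attached patch: every name below is hypothetical. It only mimics the bookkeeping described above, in which the open fd holds one reference on the group, each pending event holds another, and the per-user instance count drops only once the last reference is gone.

```shell
#!/bin/bash
# Toy model only; all names are hypothetical and nothing here is kernel code.
refs=0        # references currently held on the group
pending=0     # queued events, each pinning the group with one reference
instances=0   # per-user instance count that the limit is checked against

group_create() { refs=1; pending=0; instances=$((instances + 1)); }
queue_event()  { pending=$((pending + 1)); refs=$((refs + 1)); }

# Drop one reference; the instance is only given back at zero.
put_ref() {
  refs=$((refs - 1))
  [ "$refs" -gt 0 ] || instances=$((instances - 1))
}

# Flushing drops the reference held by each pending event.
flush_events() {
  while [ "$pending" -gt 0 ]; do
    pending=$((pending - 1))
    put_ref
  done
}

# Closing the fd: pass "flush" to model the fixed path, nothing for the bug.
release() {
  if [ "${1:-}" = "flush" ]; then flush_events; fi
  put_ref
}
```

In this model, `group_create; queue_event; release` leaves instances at 1 because the unflushed event still pins the group, while `group_create; queue_event; release flush` brings it back to 0.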

The attachment "inotify/fanotify ref count fix" of this bug report has been identified as being a patch. The ubuntu-reviewers team has been subscribed to the bug report so that they can review the patch. In the event that this is in fact not a patch you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are member of the ubuntu-reviewers team please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch
Luis Henriques (henrix) wrote :

Lino, thanks a lot for your time. Unfortunately, the patchset that introduced this regression has been reverted from the Ubuntu kernels for the time being. We will be bringing the patches back soon. We are currently testing your patch and reviewing our backport.

Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel for Precise in -proposed solves the problem (3.2.0-37.58). Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-precise' to 'verification-done-precise'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-precise
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel for Quantal in -proposed solves the problem (3.5.0-23.35). Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-quantal' to 'verification-done-quantal'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-quantal
markusj (markusj) on 2013-01-25
tags: added: verification-done-precise
removed: verification-needed-precise
markusj (markusj) wrote :

3.2.0-37.58 (precise) works for me. I will test the backported quantal kernel as soon as it becomes available.

Eugene Crosser (crosser) wrote :

3.5.0-23-generic #35-Ubuntu fixes the problem for me on quantal.

tags: added: verification-done-quantal
removed: verification-needed-quantal
markusj (markusj) wrote :

The backported kernel 3.5.0-23.35~precise1 works fine, too.

Dan Kegel (dank) wrote :

I may be suffering from this. I'm on Ubuntu 12.04, uname says 3.2.0-36-generic-pae #57-Ubuntu,
# tail -f /var/log/syslog
complains
tail: inotify cannot be used, reverting to polling: Too many open files
and (oddly) dbus-daemon is using 79% of CPU according to top, and 17% according to ps.
This is on a desktop with three users logged into it; my son complained of slowness during minecraft.
Don't know how repeatable it is yet.

Dennis Schridde (devurandom) wrote :

I am using linux-image-3.2.0-37-generic (Version: 3.2.0-37.58) now and did not experience the problem anymore (so far). At the very least the new version does not appear to be harmful on my machine (i.e. I did not notice any new bugs).

Lennart Karssen (l.c.karssen) wrote :

I see this bug too on real hardware: Linux epib-genstat3 3.2.0-36-generic #57-Ubuntu SMP Tue Jan 8 21:44:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
This is on a server and showed up after 9 days of uptime. Furthermore, it doesn't only show up when running tail -f, but also keeps libvirt-bin in a restart loop. /var/log/libvirt/libvirtd.log keeps repeating (with a different pid each time of course):
2013-01-30 13:43:35.046+0000: 25431: info : libvirt version: 0.9.8
2013-01-30 13:43:35.046+0000: 25431: error : umlStartup:471 : cannot initialize inotify
2013-01-30 13:43:35.046+0000: 25431: error : virStateInitialize:854 : Initialization of UML state driver failed
2013-01-30 13:43:35.046+0000: 25431: error : daemonRunStateInit:1157 : Driver state initialization failed

Unfortunately, since this is a production server I can't easily reboot with a new kernel...

Lennart Karssen (l.c.karssen) wrote :

Well, for now running
sysctl fs.inotify.max_user_instances=256
to increase the maximum number of instances (was 128) helps.

Chris J Arges (arges) wrote :

The patches that cause this issue were reverted.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Precise):
status: In Progress → Fix Committed
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Quantal):
status: In Progress → Fix Released
Chris J Arges (arges) wrote :

Ok I posted a new fix here with the proper patch set:
http://people.canonical.com/~arges/lp1101666.1/

You should follow the new bug 1110605, where I have the test cases posted.
Let me know if there are any issues when testing on P/Q.

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Robin (robingape) wrote :

On a little-used desktop, the script from comment 25 failed after 117 iterations; on the principal desktop, which has seemed slow recently, the script failed on the first iteration. Both machines are running kernel 3.5.0-22:
uname -a
Linux ionman 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

The new kernel 3.5.0-23 behaves correctly against the script.
