upstart using 100% CPU when running into open file limit

Bug #1300663 reported by Robert Collins
This bug affects 1 person
Affects   Status        Importance  Assigned to  Milestone
tripleo   Fix Released  Critical    Unassigned
upstart   New           Undecided   Unassigned

Bug Description

We deployed a 10 node cluster and threw nodepool load at it on the HP servers.

After a few hours happily working, hypervisors all dropped offline.

Investigation showed nova-compute blocked on writing to stdout, and init consuming 100% CPU.

nova-compute
cat /proc/3398/stack
[<ffffffff81434183>] n_tty_write+0x1e3/0x450
[<ffffffff81430f78>] tty_write+0x148/0x2a0
[<ffffffff811a7d6d>] vfs_write+0xbd/0x1e0
[<ffffffff811a87a9>] SyS_write+0x49/0xa0
[<ffffffff816f841d>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

from top:
    1 root 20 0 48828 24m 1436 R 100 0.0 119:07.00 init

strace -p 1
pipe(0x7fff5fddb900) = -1 EMFILE (Too many open files)
pipe(0x7fff5fddb900) = -1 EMFILE (Too many open files)
pipe(0x7fff5fddb900) = -1 EMFILE (Too many open files)
pipe(0x7fff5fddb900) = -1 EMFILE (Too many open files)
pipe(0x7fff5fddb900) = -1 EMFILE (Too many open files)
...

Revision history for this message
Robert Collins (lifeless) wrote :

ls /proc/1/fd/ | wc -l
1023

description: updated
Revision history for this message
Robert Collins (lifeless) wrote :

Derek points out:
 # ps -ef | grep flock | wc
 999 8992 65940

root 309 1 0 06:51 ? 00:00:00 [flock] <defunct>
root 330 1 0 06:41 ? 00:00:00 [flock] <defunct>
root 355 1 0 06:31 ? 00:00:00 [flock] <defunct>
root 357 1 0 05:52 ? 00:00:00 flock -x /etc/network/interfaces /usr/local/sbin/dhcp-all-interfaces.sh
root 421 1 0 06:51 ? 00:00:00 [flock] <defunct>
root 450 1 0 05:52 ? 00:00:00 [flock] <defunct>

Note that that's a service we use to configure new interfaces.
description "DHCP any connected, but unconfigured network interfaces"

start on starting network-interface
instance $INTERFACE

task

exec /usr/local/sbin/dhcp-all-interfaces.sh

So this will run on each new hotplugged VM tap port.

Revision history for this message
Robert Collins (lifeless) wrote :

# dpkg -l upstart
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=================================-=====================-=====================-========================================================================
ii upstart 1.10-0ubuntu7 amd64 event-based init daemon

Revision history for this message
Robert Collins (lifeless) wrote :

22:07 < rbasak> lifeless: could that be an errant upstart job, perhaps, causing some sort of loop? If it is an upstart bug, you haven't provided the release
                or version of upstart or anything.
22:08 < lifeless> rbasak: oh sorry ! still gathering data but saucy
22:08 < lifeless> rbasak: so yes, certainly an errant job, but that should never be able to wedge upstart
22:08 < lifeless> rbasak: upstarts job is to be unwedgable ;)
22:10 < rbasak> jodh: ^^
22:11 < rbasak> lifeless: agreed, but is upstart actually wedged there? Or is it trying as hard as it can to do what an errant upstart job might have said,
                while still being able to process other things?
22:14 < lifeless> rbasak: service nova-compute stop hangs
22:14 < lifeless> rbasak: even though the nova-compute process can be killed (have done so) and is now a zombie
22:15 < lifeless> rbasak: also can't reboot the machine

Revision history for this message
James Hunt (jamesodhunt) wrote :

The fact that pipe(2) is failing shows that this system is clearly short of resources. Is some process consuming more and more fds, thus denying PID 1 the ability to create more? What are the limits on this server? Can you increase them?

Also, it would be useful to see the output of 'ls -al /proc/1/fd/'.

Revision history for this message
James Hunt (jamesodhunt) wrote :

Specifically, PID 1 has reached its maximum number of open files, but you should be able to raise the value to the hard limit by modifying /proc/1/limits.

Please also attach the output of 'cat /proc/1/stack'.

Revision history for this message
James Hunt (jamesodhunt) wrote :

Looks like upstart may be attempting to call job_process_spawn() repeatedly. Since failure to create a pipe is considered by upstart to be a temporary failure, and since on IRC you say that upstart is essentially unresponsive (which would align with my theory regarding job_process_spawn()), your only course of action is to free up at least 2 file descriptors from PID 1.

Hence, I think all you can do here is to look at /proc/1/fd and kill one or more of the processes associated with the 1024 fds that PID 1 has open. Presumably a lot of these fds are associated with job log files.

If this is really a server system, you are going to have to raise the per process limits once you have rebooted.

Revision history for this message
Robert Collins (lifeless) wrote :

/proc/1/fd/ has files 0 through 1023 inclusive; I can grab the exact details if that will be useful.

echo 2048 > /proc/1/limits gave an error; did you have a specific method for altering it?

Revision history for this message
James Hunt (jamesodhunt) wrote :

It should be possible to tweak nofile in the initramfs if you are using one?

Yup - the writeable limits file trick doesn't work in Ubuntu. However, this does: http://people.canonical.com/~jhunt/c/prlimit-set-nofile.c

Revision history for this message
James Hunt (jamesodhunt) wrote :

Note that util-linux 2.21 provides a prlimit command to avoid the need for the custom C prog above.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

James, I tried raising the limit with prlimit, and it caused a kernel panic with "something is trying to kill init". This only happened once, and it is possible something else was responsible (the OOM killer? unlikely, as this box had 96G of RAM, mostly unused). Anyway, something to consider when assessing the severity of this bug.

summary: - upstart using 100% CPU
+ upstart using 100% CPU when running into open file limit
Revision history for this message
James Hunt (jamesodhunt) wrote :

Hi Clint - kernel panic? I think that may be unrelated. At least I've just raised nofile to 10240:10240 for PID 1 under trusty using my C program in #10 with no ill effects.

Changed in tripleo:
status: Triaged → Fix Released