Call to fork/clone fails with EAGAIN (before encountering resource limits)

Bug #1624043 reported by Andrew
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

I wrote a test program that forks processes until the fork calls start to fail. It forks around 12000 processes and then the fork calls start failing with EAGAIN. According to the fork man page, there are four conditions that could cause EAGAIN to be returned:

- the RLIMIT_NPROC soft resource limit, which limits the number of processes and threads for a real user ID, was reached
- the kernel's system-wide limit on the number of processes and threads, /proc/sys/kernel/threads-max, was reached
- the maximum number of PIDs, /proc/sys/kernel/pid_max, was reached
- the caller is operating under the SCHED_DEADLINE scheduling policy and does not have the reset-on-fork flag set

On my machine:
 - Before running the program, ~250 processes / ~500 threads are running (as determined by ps)
 - RLIMIT_NPROC (soft and hard) is 31616
 - threads-max is 63233
 - pid_max is 32768
 - the program runs with the SCHED_NORMAL scheduling policy (so, not SCHED_DEADLINE)
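For anyone wanting to compare against their own machine, here is a rough sketch (not the attached program) that prints the same four values on Linux. The helper names read_int and fork_limits are made up for illustration; the procfs paths are the ones named in the man page excerpt above.

```python
import os
import resource


def read_int(path):
    """Return the integer stored in a procfs file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def fork_limits():
    """Collect the limits that the fork man page lists as causes of EAGAIN."""
    # RLIM_INFINITY shows up as -1 in Python.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        "threads_max": read_int("/proc/sys/kernel/threads-max"),
        "pid_max": read_int("/proc/sys/kernel/pid_max"),
        # Numeric policy of the calling process: 0 is SCHED_OTHER
        # (SCHED_NORMAL inside the kernel), 6 is SCHED_DEADLINE.
        "sched_policy": os.sched_getscheduler(0),
    }


if __name__ == "__main__":
    for name, value in fork_limits().items():
        print(f"{name}: {value}")
```

None of these values covers a cgroup limit, so a fork can still fail with EAGAIN while all of them look comfortable.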

It seems strange that the fork calls fail after ~12000 forks; I would expect them to fail only at 31616, the RLIMIT_NPROC value. Some more technical details:

 - Reproducible on Ubuntu 16.04.1 running with kernel 4.4.0-36-generic.
 - Reproducible when tested with mainline kernel 4.8.0-040800rc6-generic
 - Doesn't occur on Ubuntu 12.04 running with kernel 3.2.0-23-generic
 - Monitoring thread usage, it appears to fail at exactly the 12,500 thread mark
 - According to strace, clone is the syscall actually being used behind the scenes (per the clone man page, it should have the same EAGAIN error semantics).
 - From using systemtap and ftrace, it looks like copy_process in _do_fork returns an error when this case is hit. Maybe from sched_trace? It's hard to tell - the ftrace output doesn't seem complete.

I'm attaching the test fork program I've been using, which has some code to also print the aforementioned values.
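The attachment itself is not reproduced on this page. As a stand-in, here is a minimal sketch of the same idea in Python; it is an assumption about the approach, not the original test program. The function name fork_until_failure and the max_children safety cap are hypothetical, added so the sketch does not exhaust the machine by default.

```python
import errno
import os
import signal


def fork_until_failure(max_children=100):
    """Fork idle children until fork() fails or max_children is reached.

    Returns (number_of_children_forked, errno_name_or_None).
    """
    children = []
    failure = None
    try:
        while len(children) < max_children:
            try:
                pid = os.fork()
            except OSError as exc:
                # For the case in this report, this is expected to be "EAGAIN".
                failure = errno.errorcode.get(exc.errno, str(exc.errno))
                break
            if pid == 0:
                # Child: stay alive, doing nothing, until the parent kills it.
                try:
                    signal.pause()
                finally:
                    os._exit(0)
            children.append(pid)
    finally:
        # Parent: clean up every child so nothing is left behind.
        for pid in children:
            os.kill(pid, signal.SIGKILL)
        for pid in children:
            os.waitpid(pid, 0)
    return len(children), failure


if __name__ == "__main__":
    count, err = fork_until_failure(max_children=50)
    print(f"forked {count} children, failure: {err}")
```

To reproduce the behaviour described here, max_children would have to be raised above the expected limit (for example 40000), which is only sensible on a test machine.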

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-36-generic 4.4.0-36.55
ProcVersionSignature: Ubuntu 4.4.0-36.55-generic 4.4.16
Uname: Linux 4.4.0-36-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: shockwave 3454 F.... pulseaudio
 /dev/snd/controlC0: shockwave 3454 F.... pulseaudio
CurrentDesktop: GNOME-Flashback:Unity
Date: Thu Sep 15 11:07:16 2016
EcryptfsInUse: Yes
HibernationDevice: RESUME=UUID=281b1d85-c94d-43ee-a37b-736e550f48e7
InstallationDate: Installed on 2016-09-12 (3 days ago)
InstallationMedia: Ubuntu 16.04.1 LTS "Xenial Xerus" - Release amd64 (20160719)
IwConfig:
 lo no wireless extensions.

 eno1 no wireless extensions.
MachineType: Dell Inc. Precision T1600
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-36-generic root=UUID=e15cb9c6-c1a4-4313-9067-340edb6098a1 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-36-generic N/A
 linux-backports-modules-4.4.0-36-generic N/A
 linux-firmware 1.157.3
RfKill:

SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 04/11/2011
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A02
dmi.board.name: 06NWYK
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 6
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrA02:bd04/11/2011:svnDellInc.:pnPrecisionT1600:pvr01:rvnDellInc.:rn06NWYK:rvrA00:cvnDellInc.:ct6:cvr:
dmi.product.name: Precision T1600
dmi.product.version: 01
dmi.sys.vendor: Dell Inc.

Andrew (andrew56) wrote :
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.8 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8-rc6

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Andrew (andrew56)
tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Andrew (andrew56) wrote :

After doing a bit more research I stumbled upon this:

http://unix.stackexchange.com/questions/253903/creating-threads-fails-with-resource-temporarily-unavailable-with-4-3-kernel
https://news.ycombinator.com/item?id=11675129

Looks like systemd imposes another, smaller limit on the number of processes that a user can run:

cat /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max
12288
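Since the pids controller is hierarchical, the limit that bites may sit on a parent of the cgroup the process is actually in (for example a session scope underneath user-1000.slice). The sketch below is one way to find the smallest pids.max along that chain. The function names are hypothetical, and it assumes the usual mount points: /sys/fs/cgroup/pids for cgroup v1 and /sys/fs/cgroup for the unified v2 hierarchy.

```python
import os


def parse_pids_cgroup(cgroup_text):
    """Find the cgroup holding the pids controller in /proc/<pid>/cgroup text.

    Returns ("v1", path) for a v1 pids hierarchy, ("v2", path) for the
    unified hierarchy, or (None, None) if neither is present.
    """
    unified = None
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _, controllers, path = parts
        if "pids" in controllers.split(","):
            return "v1", path
        if controllers == "":
            unified = path
    return ("v2", unified) if unified is not None else (None, None)


def effective_pids_max(root, cgroup_path):
    """Walk from cgroup_path up to root; return the smallest numeric pids.max.

    Returns None if every level is unlimited ("max") or has no pids.max file.
    """
    limit = None
    rel = cgroup_path.strip("/")
    while True:
        try:
            with open(os.path.join(root, rel, "pids.max")) as f:
                value = f.read().strip()
        except OSError:
            value = "max"
        if value != "max":
            limit = int(value) if limit is None else min(limit, int(value))
        if not rel:
            break
        rel = os.path.dirname(rel)
    return limit


if __name__ == "__main__":
    with open("/proc/self/cgroup") as f:
        version, path = parse_pids_cgroup(f.read())
    if version is None:
        print("no pids controller found")
    else:
        root = "/sys/fs/cgroup/pids" if version == "v1" else "/sys/fs/cgroup"
        print(f"cgroup {version} {path}: pids.max = {effective_pids_max(root, path)}")
```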

It would have been nice if the 'fork' man page mentioned that this could be a cause for failure. :(

Andrew (andrew56) wrote :

Thinking about an appropriate resolution for this, could the 'fork' man page be updated to mention this systemd limit as a cause of EAGAIN being returned?

Andrew (andrew56) wrote :

As of at least 18.04, the fork man page mentions that EAGAIN will be returned in the case where "the PID limit (pids.max) imposed by the cgroup "process number" (PIDs) controller was reached." Thus, I changed this bug's status to 'Fix Released'.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc