[library] Raise max file descriptors and process file limits for radosgw

Bug #1331554 reported by Denis Ipatov on 2014-06-18
This bug affects 3 people
Affects             Importance  Assigned to
Fuel for OpenStack  Medium      Radoslaw Zarzynski
5.0.x               Medium      Dmitry Borodaenko
5.1.x               Medium      Dmitry Borodaenko
6.0.x               Medium      Dmitry Borodaenko
7.0.x               Medium      MOS Ceph
8.0.x               Medium      MOS Ceph
Mitaka              Medium      Radoslaw Zarzynski

Bug Description

Radosgw will hang if it runs out of file descriptors. We should raise the limits to keep this from happening again.

[root@node-37 ~]# /etc/init.d/ceph-radosgw start
Starting radosgw instance(s)...
runuser: cannot set user id: Resource temporarily unavailable
Starting client.radosgw.gateway... [FAILED]

The following change fixes this problem:

vi /etc/security/limits.conf
apache soft nproc 2047
apache hard nproc 16384
apache soft nofile 2048
apache hard nofile 65536

Then start radosgw.

Changed in fuel:
importance: Undecided → Medium
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 5.1
tags: added: ceph
Changed in fuel:
status: New → Confirmed
Changed in fuel:
status: Confirmed → Triaged
Dmitry Ilyin (idv1985) on 2014-07-15
summary: - Raise max file descriptors and process file limits for radosgw
+ [library] Raise max file descriptors and process file limits for radosgw
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Borodaenko (dborodaenko)
Changed in fuel:
milestone: 5.1 → 6.0
Dmitry Borodaenko (angdraug) wrote :

Raised priority to High and retargeted for 5.1 since a High priority duplicate was raised.

Changed in fuel:
importance: Medium → High
milestone: 6.0 → 5.1
Dmitry Borodaenko (angdraug) wrote :

The original bug description is incorrect: limits defined in /etc/security/limits.conf have no bearing on services started by sysvinit in CentOS or upstart in Ubuntu.

According to Debian Wiki, every service is responsible for setting its own limits:
https://wiki.debian.org/Limits

For example, in Debian (and Ubuntu) packages for Apache, limits can be set by uncommenting this line in /etc/apache2/envvars (honored by apachectl):
APACHE_ULIMIT_MAX_FILES='ulimit -n 65536'

A more generic solution on Ubuntu is to use "limit nofile" upstart option like this:
https://github.com/ceph/ceph/blob/firefly/src/upstart/radosgw.conf#L9

On CentOS, there's no corresponding option since the init script doesn't use apachectl. Corresponding ulimit commands will have to be patched directly into the init script in the httpd package. Likewise for Ceph (including RadosGW) on CentOS, upstream init scripts include hardcoded ulimit commands:
https://github.com/ceph/ceph/blob/firefly/src/init-radosgw.sysv#L83
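
The mechanism the init scripts rely on can be sketched as follows: a limit set with ulimit in a shell is inherited by whatever that shell starts next. This is an illustration only; run_with_nofile is a hypothetical helper, not something shipped by Ceph, Apache, or any init system, and 256 is an arbitrary value chosen so that the call works without root.

```shell
# run_with_nofile LIMIT COMMAND [ARGS...]
# Hypothetical helper: set the soft open-files limit inside a subshell, then
# exec the command so that it inherits the new limit. Raising the soft limit
# above the hard limit fails, and only root can raise the hard limit.
run_with_nofile() (
    limit="$1"
    shift
    ulimit -Sn "$limit" || exit 1
    exec "$@"
)

# The child shell reports the soft limit it inherited from the wrapper.
run_with_nofile 256 sh -c 'ulimit -n'
```

The limit is changed in a subshell, so the calling shell keeps its own limits.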

Dmitry Borodaenko (angdraug) wrote :

The limits observed for Apache and RadosGW on MOS 5.1 don't match the values defined in their sysvinit and upstart init scripts.

CentOS sysvinit script for RadosGW:
daemon --user="$user" "ulimit -n 32768; $RADOSGW -n $name"

Actual soft/hard limit is 1024/4096 for RadosGW:
Max open files 1024 4096 files

...and 1024/1024 for Apache (init script doesn't set limits):
Max open files 1024 1024 files

Ubuntu upstart script for RadosGW (sysvinit script in the same deb does not set limits):
limit nofile 8096 65536

Actual soft/hard limit is 1024/4096 (same as CentOS).

...and 8192/8192 for Apache (init script doesn't set limits):
Max open files 8192 8192 files

In HA deployments, /etc/security/limits.conf is set (by corosync Puppet module) to:
* soft nofile 102400
* hard nofile 112640

As expected based on the previous comment, these values have no effect on Apache and RadosGW.
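
The observed values above come from /proc/<pid>/limits. A small sketch for pulling the soft and hard values of one row out of that file is shown here; proc_limit is a hypothetical helper name, and the column layout is assumed to match what the Linux kernel prints (limit name, soft limit, hard limit, units).

```shell
# proc_limit NAME FILE
# Hypothetical helper: print "<soft> <hard>" for the row of a
# /proc/<pid>/limits-style FILE whose name starts with NAME,
# for example "Max open files" or "Max processes".
proc_limit() {
    awk -v name="$1" '
        index($0, name) == 1 {
            # Everything after the limit name: soft, hard, units.
            rest = substr($0, length(name) + 1)
            split(rest, field, " ")
            print field[1], field[2]
            exit
        }
    ' "$2"
}

# Usage on a node (not run here; the pgrep pattern is taken from this thread):
#   proc_limit "Max open files" "/proc/$(pgrep -P 1 radosgw)/limits"
```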

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Aleksandr Didenko (adidenko)
Sergii Golovatiuk (sgolovatiuk) wrote :

I think the main point is that RadosGW shouldn't share a user with Apache. For example, if we switch from Apache to nginx, package changes could remove the apache user.

On CentOS, radosgw shares a user with httpd.

On Ubuntu, radosgw runs as root:
root 20207 0.0 0.2 1631600 5488 ? Ssl Sep09 0:52 /usr/bin/radosgw -n client.radosgw.gateway

which is quite bad for security reasons.

I think we should introduce a dedicated radosgw user with its own limits on both CentOS and Ubuntu.

Aleksandr Didenko (adidenko) wrote :

The problem here is caused by the "nproc" ulimit, not "nofile". Because CentOS uses the same "apache" user for both httpd and rados-gw, and Puppet sets high values for Apache workers and WSGI processes/threads on powerful hardware, any new process started as the "apache" user (such as rados-gw) hits the "nproc" limit right away:

# /etc/init.d/ceph-radosgw start
Starting radosgw instance(s)...
runuser: cannot set user id: Resource temporarily unavailable
Starting client.radosgw.gateway... [FAILED]
/usr/bin/radosgw is not running.

# su - apache
su: cannot set user id: Resource temporarily unavailable

This problem does not affect httpd itself, because it sets a high nproc value:

# grep processes /proc/$(pgrep -P 1 httpd)/limits
Max processes 14865 14865 processes

So I agree with Sergii: using a separate user for rados-gw on Ubuntu and CentOS would be the best solution here.
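
To see how close a user is to its nproc limit, the tasks charged to that user can be counted from /proc. On Linux the limit counts threads as well as processes, per real UID, so this sketch sums the Threads field of every process owned by the given UID. count_tasks is a hypothetical helper, and it assumes a Linux /proc layout.

```shell
# count_tasks UID
# Hypothetical helper: sum the thread counts of all processes whose real UID
# is UID, based on /proc/<pid>/status (Linux only).
count_tasks() {
    for status in /proc/[0-9]*/status; do
        # A process may exit between the glob and the read; skip it quietly.
        cat "$status" 2>/dev/null
    done | awk -v uid="$1" '
        /^Name:/    { owner = "" }      # first line of each status record
        /^Uid:/     { owner = $2 }      # second field is the real UID
        /^Threads:/ { if (owner == uid) total += $2 }
        END         { print total + 0 }
    '
}

# Tasks currently charged to the user running this shell.
count_tasks "$(id -u)"
```

On an affected CentOS node, running count_tasks "$(id -u apache)" as root and comparing the result with the "Max processes" row for the service would show whether the shared user is the bottleneck.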

Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Dmitry Borodaenko (dborodaenko)
Changed in fuel:
milestone: 5.1 → 6.0
tags: added: release-notes
Changed in fuel:
milestone: 5.1 → 6.0
Christopher Aedo (docaedo) wrote :

On CentOS, without changing radosgw to run as a different user, the process limit can be raised by modifying /etc/security/limits.d/90-nproc.conf:

# grep processes /proc/$(pgrep -P 1 radosgw)/limits
Max processes 1024 22948 processes

# sed -i 's/1024/2048/' /etc/security/limits.d/90-nproc.conf

# /etc/init.d/ceph-radosgw restart
Starting client.radosgw.gateway... [ OK ]

# grep processes /proc/$(pgrep -P 1 radosgw)/limits
Max processes 2048 22948 processes
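
The sed command above replaces the first "1024" on any line of the file. A narrower variant that only touches the wildcard "soft nproc" entry could look like the sketch below. set_soft_nproc is a hypothetical helper, and it assumes the entry uses the usual "* soft nproc <value>" layout separated by whitespace.

```shell
# set_soft_nproc FILE VALUE
# Hypothetical helper: rewrite only the value of "* soft nproc ..." lines in a
# limits.d-style FILE, leaving every other entry untouched.
# VALUE is expected to be a number or "unlimited".
set_soft_nproc() {
    file="$1"
    value="$2"
    tmp="$(mktemp)" || return 1
    sed -E "s/^(\*[[:space:]]+soft[[:space:]]+nproc[[:space:]]+)[^[:space:]]+/\1${value}/" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    result=$?
    rm -f "$tmp"
    return "$result"
}

# Usage on a node (not run here):
#   set_soft_nproc /etc/security/limits.d/90-nproc.conf 2048
```

Writing the result back with cat keeps the original file's owner and permissions.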

Christopher Aedo (docaedo) wrote :

Additionally regarding CentOS, the hard limit is set in the same file (/etc/security/limits.d/90-nproc.conf)

# cat 90-nproc.conf
* soft nproc 2048
* hard nproc unlimited
root soft nproc unlimited

# grep processes /proc/$(pgrep -P 1 radosgw)/limits
Max processes 2048 unlimited processes

Christopher Aedo (docaedo) wrote :

On Ubuntu, changes in /etc/security/limits.conf only take effect after logging out and back in (or after restarting the system). To raise or remove the process limit for the root user, which radosgw runs as, create or adjust the nproc entry in /etc/security/limits.conf.

Set to 2001 in /etc/security/limits.conf, radosgw picks this up:
# grep process /proc/$(pgrep radosgw)/limits
Max processes 2001 unlimited processes

Change to unlimited, log out, log in, then restart radosgw
# sed -i 's/2001/unlimited/' /etc/security/limits.conf
# exit
logout, then log back in
# /etc/init.d/radosgw restart
Starting client.radosgw.gateway...
2014-10-06 18:10:32.871016 7f18e6e1a780 -1 WARNING: libcurl doesn't support curl_multi_wait()
2014-10-06 18:10:32.871022 7f18e6e1a780 -1 WARNING: cross zone / region transfer performance may be affected
/usr/bin/radosgw is running.
# grep process /proc/$(pgrep radosgw)/limits
Max processes unlimited unlimited processes

Christopher Aedo (docaedo) wrote :

Beware - the proposed fix for https://bugs.launchpad.net/fuel/+bug/1376564 (https://review.openstack.org/#/c/126254/) removes /etc/security/limits.d/90-nproc.conf on CentOS, which will affect the resolution proposed here.

Christopher Aedo (docaedo) wrote :

Update - these changes for CentOS persist after reboot. They do NOT persist on Ubuntu, as upstart ignores the limits.conf file.

Christopher Aedo (docaedo) wrote :

On Ubuntu, to make this change persist across reboots, the radosgw init script needs an extra line in its start block:

# grep -B 3 -A 2 ulimit /etc/init.d/radosgw

case "$1" in
    start)
        ulimit -p 5123 #adjust max number of processes for radosgw
        for name in `ceph-conf --list-sections $PREFIX`;
        do

NOTE - since upstart uses dash, the ulimit flag to adjust number of processes (-p) is different than when run under bash (-u).
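
Because the flag differs between shells, a script that may run under either one could try both spellings. The sketch below is one way to do that; get_max_procs and set_max_procs are hypothetical helpers, and the fallback order assumes that only bash-style (-u) and dash-style (-p) shells need to be supported.

```shell
# Hypothetical helpers: try the bash spelling (-u) first, then fall back to
# the dash spelling (-p). In bash, -p means pipe size, so -u must come first.
get_max_procs() {
    ulimit -Su 2>/dev/null || ulimit -Sp 2>/dev/null
}

set_max_procs() {
    ulimit -Su "$1" 2>/dev/null || ulimit -Sp "$1" 2>/dev/null
}

# Print the soft process limit of the current shell.
get_max_procs
```

In an init script, set_max_procs 5123 would take the place of the hardcoded "ulimit -p 5123" line shown above.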

Dmitry Borodaenko (angdraug) wrote :

Related nproc problem is addressed in https://review.openstack.org/125971

Matthew Mosesohn (raytrac3r) wrote :

Is this still an open problem?

Dmitry Borodaenko (angdraug) wrote :

Good question. The nofile limit is still a problem, but the root cause that got this bug raised to High priority turned out to be the nproc limit, and that's now fixed. I'm still not happy with the way we handle these limits, but I don't think it's High priority anymore.

Ryan Moe (rmoe) wrote :

Medium priority bugs are not backported to stable releases.

Mike Scherbakov (mihgen) on 2015-04-30
no longer affects: fuel/6.1.x
Changed in fuel:
status: Triaged → Won't Fix
Pawel Stefanski (pejotes) wrote :

For the 7.0 build we have
Max processes 11806 11806 processes
Max open files 1024 4096 files

VERSION:
  feature_groups:
    - experimental
  production: "docker"
  release: "7.0"
  openstack_version: "2014.2.2-7.0"
  api: "1.0"
  build_number: "28"
  build_id: "2015-07-07_05-28-31"
  nailgun_sha: "d040c5cebc9cdd24ef20cb7ecf0a337039baddec"
  python-fuelclient_sha: "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45"
  astute_sha: "9cbb8ae5adbe6e758b24b3c1021aac1b662344e8"
  fuel-library_sha: "08ca3e768ad0701db909002b291ff4cc1b435eed"
  fuel-ostf_sha: "a752c857deafd2629baf646b1b3188f02ff38084"
  fuelmain_sha: "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"

Dmitry Pyzhov (dpyzhov) on 2015-10-12
Changed in fuel:
milestone: 6.1 → 8.0
status: Won't Fix → Triaged
no longer affects: fuel/8.0.x
Dmitry Pyzhov (dpyzhov) on 2015-10-22
tags: added: area-mos
Roman Podoliaka (rpodolyaka) wrote :

We no longer fix Medium bugs in 8.0, closing as Won't Fix

Roman Podoliaka (rpodolyaka) wrote :

Per https://bugs.launchpad.net/fuel/+bug/1331554/comments/14 and https://bugs.launchpad.net/fuel/+bug/1331554/comments/16 removing the release-notes tag, as the original problem (the nproc limit) was fixed. It's not clear whether we need to do anything else.

tags: removed: release-notes