RabbitMQ can run out of file descriptors

Bug #1279594 reported by Ryan Moe on 2014-02-13
This bug affects 2 people
Affects: Fuel for OpenStack
Importance: High
Assigned to: Dmitry Burmistrov

Bug Description

The default number of file descriptors for RabbitMQ is 1024. When there are a lot of queues or many connections, this limit can be exceeded.

Explanation: RabbitMQ does not work as expected.

Behaviour: when RabbitMQ runs short of file descriptors (FDs), it still looks like a working node, and the 'rabbitmqctl cluster_status' command shows all nodes participating in the cluster.

Attempts to connect to the RabbitMQ ports will succeed.

But the services will not be able to communicate with each other. If you try to create a volume or spawn an instance, the operation will fail: the task will hang in 'scheduling' or another early step.

To confirm this, run:

rabbitmqctl status

If total_used is close to or has reached total_limit, the number of file descriptors should be increased.
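
The comparison above can be scripted. The following is a minimal sketch, not part of the original report: it assumes the status output contains Erlang-style terms of the form {total_limit,N} and {total_used,N}, and the function name fd_usage_percent is hypothetical.

```shell
# Sketch: print file descriptor usage as a whole percentage.
# Reads `rabbitmqctl status` output on stdin.
# Assumption: the output contains {total_limit,N} and {total_used,N} terms.
fd_usage_percent() {
    status=$(cat)
    # Extract the first numeric value found for each term.
    limit=$(printf '%s\n' "$status" | sed -n 's/.*{total_limit,\([0-9][0-9]*\)}.*/\1/p' | head -n 1)
    used=$(printf '%s\n' "$status" | sed -n 's/.*{total_used,\([0-9][0-9]*\)}.*/\1/p' | head -n 1)
    if [ -z "$limit" ] || [ -z "$used" ] || [ "$limit" -eq 0 ]; then
        echo "could not parse file descriptor counters from input" >&2
        return 1
    fi
    # Integer division, so the result is rounded down.
    echo $(( used * 100 / limit ))
}

# Usage on a broker node:
#   rabbitmqctl status | fd_usage_percent
```

A value close to 100 would indicate that the node is about to hit the problem described here.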
Here is a permanent fix for the issue. Follow these steps:

Check the /etc/default/rabbitmq-server file for a ulimit -n <somenumber> line.
If the line is commented out and/or sets a limit that is too low (one that has already been reached), change it to:
       ulimit -n 102400
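
The check in this step can be automated. This is a minimal sketch under the assumption that the setting appears on its own line as ulimit -n followed by a number; the function name has_fd_limit is hypothetical.

```shell
# Sketch: succeed only if FILE contains an uncommented `ulimit -n N` line
# with N greater than or equal to MIN.
# Lines starting with '#' are ignored because the pattern is anchored
# to optional leading whitespace followed directly by `ulimit`.
has_fd_limit() {
    file=$1
    min=$2
    # If the setting appears several times, the last one wins when the file is sourced.
    value=$(sed -n 's/^[[:space:]]*ulimit[[:space:]][[:space:]]*-n[[:space:]][[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | tail -n 1)
    [ -n "$value" ] && [ "$value" -ge "$min" ]
}

# Usage:
#   has_fd_limit /etc/default/rabbitmq-server 102400 || echo "limit missing or too low"
```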

After that, restart rabbitmq-server on all nodes using the following command:

service rabbitmq-server restart

Once the service has been restarted on all nodes, check that the FD limit has increased by running the following command on each node:

rabbitmqctl status

total_limit should now show the increased value.

tags: added: customer-found
Changed in fuel:
milestone: none → 4.1
importance: Undecided → Critical
status: New → Triaged
assignee: nobody → Ryan Moe (rmoe)
tags: added: release-notes
Mike Scherbakov (mihgen) wrote :

For this bug:
* I assume that it does not affect most of the users
* it does not lead to complete failure of a feature
* simple workaround is provided

With all the above, in full accordance with https://wiki.openstack.org/wiki/Bugs, https://wiki.openstack.org/wiki/BugTriage:
> Critical if the bug prevents a key feature from working properly (regression) for all users (or without a simple workaround) or result in data loss

changing importance to High.

Changed in fuel:
importance: Critical → High
Ryan Moe (rmoe) wrote :

Because of the way we have rabbit configured in haproxy this issue will bring down the entire cluster. The way we have haproxy configured is that the primary controller is the only active member and the others are backups. When rabbit runs out of file descriptors it will still accept connections but it can't actually do anything with them. This means that haproxy will never think the node is down and will continue to funnel all connections to the primary controller. Eventually none of the OpenStack services can talk to rabbit and the entire cluster stops working.

Ryan is correct. It leads to the loss of service for the entire cloud.

Vladimir Kuklin (vkuklin) wrote :

I think this bug is going to be fixed by the RabbitMQ version upgrade, which depends on the fix for https://bugs.launchpad.net/fuel/+bug/1278336

Bogdan Dobrelya (bogdando) wrote :

Just to note, with the current RabbitMQ version 2.8.7 we have:
Ubuntu:
grep -e Limit -e files /proc/$(pgrep rabbitmq-server)/limits
Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 4096                 files

Centos:
grep -e Limit -e files /proc/$(cat /var/run/rabbitmq/pid)/limits
Limit                     Soft Limit           Hard Limit           Units
Max open files            102400               112640               files

RH:???

So the issue affects Ubuntu only.
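
For comparing nodes, the soft limit can be pulled out of the limits file directly. This is a minimal sketch, assuming the standard /proc/<pid>/limits layout shown above; the function name soft_nofile_limit is hypothetical.

```shell
# Sketch: print the soft "Max open files" limit.
# Reads the content of /proc/<pid>/limits on stdin.
# In that file the row is "Max open files <soft> <hard> files",
# so the soft limit is the fourth whitespace-separated field.
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Usage (pid file path as in the CentOS command above):
#   soft_nofile_limit < /proc/"$(cat /var/run/rabbitmq/pid)"/limits
```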

Mike Scherbakov (mihgen) wrote :

So let's fix it for Ubuntu then. I see no point in upgrading puppet right now: it has some issues with Nailgun, and it is pretty risky anyway at the end of the release cycle.

Aleksandr Didenko (adidenko) wrote :

We deploy the /etc/security/limits.conf file with higher-than-standard values only in HA mode (during the corosync_setup stage). So a simple CentOS env also has the 1024 open files limit; I've just checked on a live env. I don't have a live HA Ubuntu env to check, but it should also have the 102400 open files limit, since we deploy the /etc/security/limits.conf file there as well.

In my opinion, the best way to solve this is to include a "/etc/default/rabbitmq-server" file with reasonable defaults in our rabbitmq packages. This file is already sourced by the current init scripts, so we just need to add it.

Bogdan Dobrelya (bogdando) wrote :

+1 to shipping /etc/default/rabbitmq-server with 'ulimit -n 102400' in the RPM/DEB as well

Changed in fuel:
status: Triaged → Fix Committed
Mike Scherbakov (mihgen) on 2014-03-02
Changed in fuel:
status: Fix Committed → In Progress
milestone: 4.1 → 4.1.1

Reviewed: https://review.openstack.org/76906
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=44802a6c7c8a9a5e1ea027108a2c069c9d85b793
Submitter: Jenkins
Branch: master

commit 44802a6c7c8a9a5e1ea027108a2c069c9d85b793
Author: Roman Vyalov <email address hidden>
Date: Thu Feb 27 20:16:13 2014 +0400

    Remove version for rabbitmq-server

    Related-Bug: #1279594

    Change-Id: I8809e0b332365b77d4b161e4b0fe0dcfe2afb6d0

Changed in fuel:
status: In Progress → Fix Committed
Mike Scherbakov (mihgen) wrote :

Is it fixed in 4.1.1? Reopening unless you specify the changeset ported to stable/4.1.

Changed in fuel:
status: Fix Committed → In Progress
tags: added: backports-4.1.1
Andrew Woodward (xarses) on 2014-04-04
tags: added: ha
Roman Vyalov (r0mikiam) on 2014-04-07
Changed in fuel:
assignee: Ryan Moe (rmoe) → Dmitry Burmistrov (dburmistrov)
Changed in fuel:
status: In Progress → Fix Committed

verified on
{
    "build_id": "2014-06-04_09-16-08",
    "mirantis": "yes",
    "build_number": "341",
    "nailgun_sha": "a828d6b7610f872980d5a2113774f1cda6f6810b",
    "ostf_sha": "c959aa55f83fe2555cf2d382559271c7a9b17467",
    "fuelmain_sha": "7ed0f85acc0bab4b9157703a618b8cc9fd7de3e1",
    "astute_sha": "55df06b2e84fa5d71a1cc0e78dbccab5db29d968",
    "release": "4.1B",
    "fuellib_sha": "0e96fc5a340cd57f75c454ea8536471379299494"
}

Changed in fuel:
status: Fix Committed → Fix Released