cinder-volume keeps opening Ceph clients until the maximum number of open files is reached

Bug #1446682 reported by Imri Zvik on 2015-04-21
This bug affects 4 people
Affects: Cinder
Status: Invalid
Importance: High
Assigned to: Unassigned

Bug Description

In Juno, we see that cinder-volume with an RBD backend keeps opening connections to Ceph until it reaches the maximum number of opened files.
With lsof we see that it keeps opening new Ceph admin sockets (e.g. /var/run/ceph/ceph-client.cinder.21428.128790752.asok).

In the logs we constantly see this:

2015-04-21 08:57:06.913 21428 DEBUG cinder.manager [-] Notifying Schedulers of capabilities ... _publish_service_capabilities /usr/lib/python2.7/site-packages/cinder/manager.py:128
2015-04-21 08:57:06.918 21428 DEBUG cinder.openstack.common.periodic_task [-] Running periodic task VolumeManager._report_driver_status run_periodic_tasks /usr/lib/python2.7/site-packages/cinder/openstack/common/periodic_task.py:193
2015-04-21 08:57:06.919 21428 INFO cinder.volume.manager [-] Updating volume status
2015-04-21 08:57:06.919 21428 DEBUG cinder.volume.drivers.rbd [-] opening connection to ceph cluster (timeout=-1). _connect_to_rados /usr/lib/python2.7/site-packages/cinder/volume/drivers/rbd.py:291

Jon Bernard (jbernard) wrote :

I assume you trigger this simply by creating a large number of volumes? Or is there something more to your scenario? I'll try to reproduce this myself, but any additional info you can give would be helpful. Thanks for the bug report!

Changed in cinder:
status: New → Incomplete
Imri Zvik (imrizvik) wrote :

The connections accumulate over time. I don't think it is related to activity (which is very low on our cluster, since it is not fully production yet). We currently have 32 volumes and maybe a few more snapshots. After a few days of running, it opened over 100k Ceph admin sockets.

Imri Zvik (imrizvik) wrote :

By the way, we're using openstack-cinder-2014.2.1-1.el7.centos.noarch from the RDO repositories

Eric Harney (eharney) on 2015-04-21
tags: added: ceph drivers
Josh Durgin (jdurgin) wrote :

Are the admin socket connections still open, (e.g. shown in lsof) or just the files left around? The latter would probably be a bug in librados.

Reaching max open files is common for ceph because of the low default limits, and the way its network layer works. Increasing the limits via ulimit causes no harm and is generally recommended. This is done for ceph daemons by the init scripts. Cinder's init script could change this too.
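Josh's ulimit suggestion can also be applied from inside a Python service at startup. A minimal sketch using the standard library's resource module (this only raises the soft limit toward the existing hard limit, mitigating the EMFILE symptom; it does not address the leak itself):

```python
import resource

# Raise this process's soft open-file limit to its hard limit, as Josh
# suggests doing for ceph daemons via their init scripts. An unprivileged
# process may raise its soft limit up to, but not beyond, the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
new_soft, new_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
```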

Jon Bernard (jbernard) on 2015-04-21
Changed in cinder:
status: Incomplete → Confirmed
importance: Undecided → Medium

Imri Zvik (imrizvik) wrote :

They are most definitely opened (i.e. shown in lsof).
Also, we've raised the limit to 100k (and it reached that too).


Josh Durgin (jdurgin) wrote :

Ok, definitely a serious bug then. Which version of librados are you using, and what is in your ceph.conf and cinder.conf on the node running cinder-volume?

Jon Bernard (jbernard) on 2015-04-21
Changed in cinder:
importance: Medium → High
Imri Zvik (imrizvik) wrote :

cinder.conf

Imri Zvik (imrizvik) wrote :

ceph.conf

Ceph versions:
ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
librados2-0.80.8-0.el7.centos.x86_64

Josh Durgin (jdurgin) wrote :

The configuration looks normal. 0.80.8 is the first version of librados to remove __del__() from rados.Rados() (since it can't be relied on, users need to explicitly shutdown() anyway). I suspect that may have uncovered a bug in cinder. Could you try with rados.py from ceph 0.80.7, and see if the problem still occurs?
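The failure mode Josh describes can be illustrated without librados: once __del__ is gone from the bindings, any code path that connects but never calls shutdown() leaks the admin socket on every periodic task. FakeRados below is a hypothetical stand-in for rados.Rados, used only to contrast the leaky pattern with explicit cleanup:

```python
class FakeRados(object):
    """Hypothetical stand-in for rados.Rados; counts open admin sockets."""
    open_sockets = 0

    def connect(self):
        FakeRados.open_sockets += 1   # admin socket opened on connect

    def shutdown(self):
        FakeRados.open_sockets -= 1   # closed only on explicit shutdown


def report_status_leaky(n):
    # Relies on garbage collection for cleanup; with __del__ removed from
    # the bindings, nothing ever closes the socket.
    for _ in range(n):
        client = FakeRados()
        client.connect()


def report_status_fixed(n):
    for _ in range(n):
        client = FakeRados()
        client.connect()
        try:
            pass  # query cluster stats here
        finally:
            client.shutdown()   # explicit cleanup, as librados now requires


report_status_leaky(3)
leaked = FakeRados.open_sockets       # one leaked socket per periodic run
FakeRados.open_sockets = 0
report_status_fixed(3)
remaining = FakeRados.open_sockets    # sockets released each iteration
```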

Imri Zvik (imrizvik) wrote :

That is a bit problematic, as we are constantly adding more production load into it, and I don't want to rock the boat too much.
Were you able to reproduce it?

Josh Durgin (jdurgin) wrote :

Jon couldn't reproduce it with his normal environment. I'm not sure if he's tried RDO yet. Is there anything else customized about ceph, cinder, or python where you're running cinder-volume?

Jon Bernard (jbernard) wrote :

I just tested RDO Juno using the same versions and things appear to be stable. It must be something beyond idle volumes leaking file descriptors. Is there any additional info you can offer? Can you share the exact command you're running to detect orphaned sockets, where they're located, what process created them, and how frequently they're added? A complete process list might be helpful, along with a complete list of OpenStack packages and versions. Anything that could help me reach a similar machine state in my test env where the problem occurs. Thanks!

Imri Zvik (imrizvik) wrote :

I used lsof -np <pid> on the child process of cinder-volume.
The output shows many open unix sockets and FIFO pipes, e.g.:
cinder-vo 47681 cinder *868u unix 0xffff8820f16f1a40 0t0 1726751631 /var/run/ceph/ceph-client.cinder.47681.349704224.asok
cinder-vo 47681 cinder *869r FIFO 0,8 0t0 1726800997 pipe

I currently have about 30k of these in that process after running for about 10 days, and the count is constantly increasing (2 per minute, the same rate at which the periodic task entries appear in the log file).
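The same count can be obtained without lsof by reading /proc directly (Linux-specific). A small sketch; the pid 47681 in the comment is just the example process from the lsof output above:

```python
import os

def count_fds(pid):
    """Count open file descriptors of a process via /proc (Linux-only)."""
    return len(os.listdir('/proc/%d/fd' % pid))

# e.g. sample count_fds(47681) periodically to measure the leak rate
# of the cinder-volume child process.
```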

Jon Bernard (jbernard) wrote :

Ok, I do see the issue now. I'll try to look closer at this over the weekend or, failing that, next week. With the client admin socket enabled in ceph.conf, the issue reproduces immediately.

Imri Zvik (imrizvik) wrote :

Thanks. It seems that the admin socket is not being torn down on shutdown().

Jon Bernard (jbernard) wrote :

Yep. Further, it appears that librados is not close()'ing the listening socket in the destructor. This patch should resolve the issue:

    https://github.com/ceph/ceph/pull/4541

Are you able to test this in your environment?

Jon Bernard (jbernard) wrote :

Another thought: given the short lifespan of cinder's update-volume connection, the value of an admin socket would be very small. Is disabling the admin socket for the cinder client an option? That may be a sufficient workaround until the issue is resolved upstream.

Imri Zvik (imrizvik) wrote :

I cannot test this patch, but I think that disabling the admin socket for the cinder client is a good workaround.
Since this might affect unaware Ceph users, I think there should be a prominent warning in the documentation, or the admin socket should be disabled when the Rados object is initialized (perhaps with a configuration knob to control this):
--- ./rbd.py	2015-05-03 07:39:27.108588156 -0500
+++ /usr/lib/python2.7/site-packages/cinder/volume/drivers/rbd.py	2015-05-05 05:09:07.836816398 -0500
@@ -291,7 +291,8 @@
                    (self.configuration.rados_connect_timeout))
 
         client = self.rados.Rados(rados_id=self.configuration.rbd_user,
-                                  conffile=self.configuration.rbd_ceph_conf)
+                                  conffile=self.configuration.rbd_ceph_conf,
+                                  conf=dict(admin_socket=''))
         if pool is not None:
             pool = strutils.safe_encode(pool)
         else:
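If patching rbd.py is not an option, an alternative sketch of the same workaround, assuming the deployment uses a dedicated client.cinder cephx user and that the admin socket was enabled in a broader section of ceph.conf, is to override the option with an empty value in that client's section, which disables it:

```ini
; ceph.conf on the cinder-volume node (hypothetical fragment)
[client.cinder]
    admin socket =
```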

Ivan Kolodyazhny (e0ne) on 2015-07-21
Changed in cinder:
assignee: nobody → Ivan Kolodyazhny (e0ne)
Jon Bernard (jbernard) wrote :

FYI, this is fixed in librados now, a backport or upgrading to the new bits should resolve this issue if you're seeing it yourself.

https://github.com/ceph/ceph/pull/4541

Ivan Kolodyazhny (e0ne) on 2015-09-21
Changed in cinder:
assignee: Ivan Kolodyazhny (e0ne) → nobody
jingtao liang (liang-jingtao) wrote :

Hi, this bug is also affecting me. Can you tell me the solution or a workaround? Thanks.

Sean McGinnis (sean-mcginnis) wrote :

Closing as fixed in librados.

Changed in cinder:
status: Confirmed → Invalid