Ceph driver should support retries on rados_connect_timeout

Bug #1462970 reported by Ivan Kolodyazhny
Affects: Cinder
Status: Fix Released
Importance: Undecided
Assigned to: Ivan Kolodyazhny
Milestone: 7.0.0

Bug Description

The RBD driver supports the rados_connect_timeout param for connecting to rados. It would be good to have retry support if a connection timeout occurs.
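
A minimal sketch of the requested behavior, assuming the python-rados bindings; the helper name and the retries/interval parameters are illustrative, not actual Cinder options:

    import time

    import rados  # python-rados bindings

    def connect_with_retries(conffile='/etc/ceph/ceph.conf',
                             timeout=5, retries=3, interval=1):
        """Connect to a Ceph cluster, retrying failed attempts.

        Hypothetical helper: 'retries' and 'interval' are illustrative
        names, not real Cinder configuration options.
        """
        for attempt in range(retries + 1):
            client = rados.Rados(conffile=conffile)  # fresh handle per attempt
            try:
                # rados.Rados.connect() takes an optional timeout in seconds.
                client.connect(timeout=timeout)
                return client
            except rados.Error:
                client.shutdown()
                if attempt == retries:
                    raise
                time.sleep(interval)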

Ivan Kolodyazhny (e0ne)
description: updated
Changed in cinder:
status: New → Triaged
assignee: nobody → Ivan Kolodyazhny (e0ne)
Revision history for this message
Josh Durgin (jdurgin) wrote :

librados already retries internally. Doing it at the cinder layer when timeouts are enabled is equivalent to using a longer timeout. It'd be better imo to make the rbd driver work as you like even when timeouts are disabled.
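
(As a rough illustration, assuming naive back-to-back retries with no backoff: three attempts with a 10-second rados_connect_timeout block for up to 30 seconds in total, the same as a single 30-second timeout.)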

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

Josh, thanks for the comment about librados.

"librados already retries internally. Doing it at the cinder layer when timeouts are enabled is equivalent to using a longer timeout"

Unfortunately, I couldn't find any documentation on how librados handles connection timeouts and retries. Could you please share a link?
I don't know C++ well, so it's pretty hard for me to find this in the sources.

"It'd be better imo to make the rbd driver work as you like even when timeouts are disabled."
What do you mean? How should we handle the case when the connection 'hangs' or something similar happens?

Revision history for this message
Josh Durgin (jdurgin) wrote :

There aren't good docs for this behavior of librados. Basically, anything to do with the cluster may block and retry internally, to hide the complexity of retrying everything from library users. This can happen for many reasons: network connections can flap, switches can stop working, daemons can be restarted, an I/O needs to be resent to a different OSD when a disk fails, an object can be unavailable for writes if not enough copies of it are available, or the cluster can be temporarily full. In each of these cases (and many more), librados retries on behalf of the library user. Timeouts can still be implemented at a higher level, on top of rados, and keeping them there avoids the extra configuration needed to make librados-level timeouts match the higher-level ones.

The timeout options for librados were added to simplify use cases where library users want to fail eventually rather than block until the cluster becomes available. It makes sense to time out in cases like updating storage usage, which doesn't need to be strictly up to date and will be re-run at a higher level (in this case by a periodic task). Off the top of my head, I'm not sure what other cases would make sense in cinder. Config issues like a missing or incorrect ceph key tend to be immediate failures anyway.
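
A sketch of the usage-update case described above, assuming the python-rados bindings; get_usage is a hypothetical helper, not Cinder's actual code:

    import rados  # python-rados bindings

    def get_usage(conffile='/etc/ceph/ceph.conf', timeout=5):
        """Report cluster usage, failing fast instead of blocking.

        An occasional timeout is acceptable here: the caller (e.g. a
        periodic stats task) will simply run again later.
        """
        client = rados.Rados(conffile=conffile)
        try:
            client.connect(timeout=timeout)  # raise rather than block forever
            stats = client.get_cluster_stats()
            return stats['kb_avail'], stats['kb_used']
        finally:
            client.shutdown()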

Did you have particular things in mind for retrying, or a particular failure condition where connect() wouldn't eventually work?

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

Josh,
I'm hitting an issue where the connection to Ceph hangs. I don't have clear steps to reproduce, so I haven't filed a bug yet. You can find some more details here: https://bugs.launchpad.net/mos/6.1.x/+bug/1459781. I'll file a bug against Cinder once I get more details.

Revision history for this message
Josh Durgin (jdurgin) wrote :

It's not clear whether it's hanging while trying to connect or at a later step. I'd suggest enabling Ceph client logging for cinder-volume (log file = /var/log/cinder/rbd.$pid.$cctid.log, debug ms = 1, debug rbd = 20 in the [client] section of ceph.conf; see the snippet below).
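
Written out as a ceph.conf snippet (settings verbatim from the suggestion above; $pid and $cctid are metavariables that Ceph expands at runtime):

    [client]
    log file = /var/log/cinder/rbd.$pid.$cctid.log
    debug ms = 1
    debug rbd = 20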

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/190579

Changed in cinder:
status: Triaged → In Progress
Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in cinder:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: liberty-1 → 7.0.0