Volume stuck in 'creating' following VolumeBackendAPIException

Bug #1211839 reported by Edward Hope-Morley
Affects: Cinder
Status: Fix Released
Importance: Medium
Assigned to: Edward Hope-Morley
Milestone: 2014.2

Bug Description

With the RBDDriver, if Cinder is unable to connect to the Ceph cluster, the following exception is raised but the volume stays in the 'creating' state:

2013-08-13 16:33:29.009 ERROR cinder.service [req-38634e4f-1e8a-499e-88d2-014731209952 None None] Unhandled exception
2013-08-13 16:33:29.009 TRACE cinder.service Traceback (most recent call last):
2013-08-13 16:33:29.009 TRACE cinder.service File "/opt/stack/cinder/cinder/service.py", line 228, in _start_child
2013-08-13 16:33:29.009 TRACE cinder.service self._child_process(wrap.server)
2013-08-13 16:33:29.009 TRACE cinder.service File "/opt/stack/cinder/cinder/service.py", line 205, in _child_process
2013-08-13 16:33:29.009 TRACE cinder.service launcher.run_server(server)
2013-08-13 16:33:29.009 TRACE cinder.service File "/opt/stack/cinder/cinder/service.py", line 96, in run_server
2013-08-13 16:33:29.009 TRACE cinder.service server.start()
2013-08-13 16:33:29.009 TRACE cinder.service File "/opt/stack/cinder/cinder/service.py", line 385, in start
2013-08-13 16:33:29.009 TRACE cinder.service self.manager.init_host()
2013-08-13 16:33:29.009 TRACE cinder.service File "/opt/stack/cinder/cinder/volume/manager.py", line 149, in init_host
2013-08-13 16:33:29.009 TRACE cinder.service self.driver.check_for_setup_error()
2013-08-13 16:33:29.009 TRACE cinder.service File "/opt/stack/cinder/cinder/volume/drivers/rbd.py", line 262, in check_for_setup_error
2013-08-13 16:33:29.009 TRACE cinder.service raise exception.VolumeBackendAPIException(data=msg)
2013-08-13 16:33:29.009 TRACE cinder.service VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: error connecting to ceph cluster
2013-08-13 16:33:29.009 TRACE cinder.service
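
For context, here is a minimal sketch (not the actual driver code) of the connect-and-check pattern involved, assuming the python-rados bindings; the paths and helper name are illustrative:

    import rados

    from cinder import exception

    def _try_connect(conf_file='/etc/ceph/ceph.conf', user=None):
        # Illustrative only: the real driver reads rbd_ceph_conf / rbd_user
        # from cinder.conf.
        client = rados.Rados(conffile=conf_file, rados_id=user)
        client.connect()   # raises rados.Error if the cluster is unreachable
        return client

    def check_for_setup_error():
        # Simplified version of the check that produced the traceback above:
        # any rados failure is re-raised as VolumeBackendAPIException.
        try:
            client = _try_connect()
            client.shutdown()
        except rados.Error:
            msg = "error connecting to ceph cluster"
            raise exception.VolumeBackendAPIException(data=msg)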

Tags: ceph drivers rbd
Revision history for this message
Eric Harney (eharney) wrote :

Sounds like the same issue as bug 1242942.

tags: added: ceph rbd
Mike Perez (thingee)
Changed in cinder:
status: New → Triaged
Revision history for this message
Hua Zhang (zhhuabj) wrote :

I also hit the same problem today using the latest master branch code. Does anyone know the reason?

Revision history for this message
Hua Zhang (zhhuabj) wrote :

This is caused by the wrong user; changing to 'rbd_user = admin' works, per https://ceph.com/docs/master/rados/api/python/

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Or make sure the right permissions are configured for non-admin users.
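
As a quick sanity check outside of Cinder, a small python-rados snippet like the one below can confirm that the configured user is actually able to reach the cluster; the conffile path and user name here are just examples:

    import rados

    # Example values; match these to rbd_ceph_conf and rbd_user in cinder.conf.
    client = rados.Rados(conffile='/etc/ceph/ceph.conf', rados_id='admin')
    client.connect()             # fails if the user or keyring is wrong
    print(client.list_pools())   # e.g. ['rbd', 'volumes']
    client.shutdown()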

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Ok I've had another look at this (Juno). Some thoughts:

 * The error above, i.e. "error connecting to ceph cluster", is the result of the volume service being restarted and the rbd driver not being able to connect to Ceph. I notice that the rbd driver does not set a timeout value when calling rados.connect(), so the default (which appears to be 10 minutes) elapses before the error above is triggered. Although this is unrelated to a create operation, if a create happens to be issued before the volume driver finishes its initialisation, the volume will remain in the 'creating' state since the scheduler has nowhere to send that operation.

* A slightly different case is where the rbd driver has previously been started/initialised successfully but the Ceph cluster subsequently becomes unreachable. When a create is issued it does reach the driver, but when a connection is attempted the timeout (10 minutes by default) elapses before a rados.TimedOut is raised. I then see that the volume stays in 'creating' until get_volume_stats() is called, at which point it finally goes to 'error'.

So, a couple of improvements I think we could make here:

1. Allow the rados connection timeout to be configurable. This is useful for testing and makes timeouts predictable and controllable.
2. Make sure the create fails properly when the timeout occurs, i.e. volume status -> 'error'. A rough sketch of both changes follows below.
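
A rough sketch of what (1) and (2) could look like, assuming the python-rados bindings; the timeout value, helper name and option plumbing are illustrative, not the final patch:

    import rados

    from cinder import exception

    def _connect_to_rados(timeout=None, conf_file='/etc/ceph/ceph.conf',
                          user=None):
        client = rados.Rados(conffile=conf_file, rados_id=user)
        if timeout is not None:
            # (1) pass a configurable timeout instead of relying on the
            # long librados default
            client.connect(timeout=timeout)
        else:
            client.connect()
        return client

    def create_volume(volume):
        try:
            client = _connect_to_rados(timeout=30)  # value from config
            # ... create the RBD image for 'volume' here ...
            client.shutdown()
        except rados.Error:
            # (2) surface the failure so the volume manager can move the
            # volume to 'error' instead of leaving it in 'creating'
            raise exception.VolumeBackendAPIException(
                data="error connecting to ceph cluster")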

Changed in cinder:
status: Triaged → Confirmed
importance: Undecided → Medium
assignee: nobody → Edward Hope-Morley (hopem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/103424

Changed in cinder:
status: Confirmed → In Progress
tags: added: drivers
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/103424
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=02e4afbd8108ca26af09ccaebb8b1c8b72ea3f3f
Submitter: Jenkins
Branch: master

commit 02e4afbd8108ca26af09ccaebb8b1c8b72ea3f3f
Author: Edward Hope-Morley <email address hidden>
Date: Sun Jun 29 19:08:46 2014 +0100

    Ensure rbd connect exception is properly caught

    If the rbd driver fails to connect to Ceph the exception
    was not being properly caught resulting in the volume
    remaining in the 'creating' state until the corresponding
    task eventually times out (on top of the time it took
    for the connect to fail).

    Also added config option for rados connect timeout.

    DocImpact: new config option 'rados_connect_timeout'
    Closes-Bug: 1211839
    Change-Id: I5e6eaaaf6bed3e139ff476ecf9510ebe214a83f9
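
For anyone looking for the resulting option, it is set per backend in cinder.conf; the section name, user and timeout value below are only examples (a negative value leaves the librados behaviour unchanged):

    [ceph-rbd]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_user = cinder
    rbd_ceph_conf = /etc/ceph/ceph.conf
    # Timeout in seconds for connecting to the Ceph cluster; fail fast
    # instead of waiting on the librados default.
    rados_connect_timeout = 5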

Changed in cinder:
status: In Progress → Fix Committed
Changed in cinder:
milestone: none → juno-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: juno-2 → 2014.2