During evacuate, cinder-api tries to call cinder-volume on the halted node

Bug #1491276 reported by Eduard Biceri-Matei
Affects: OpenStack Compute (nova) | Status: Invalid | Importance: Undecided | Assigned to: Unassigned | Milestone: none

Bug Description

Three-node devstack Juno (2014.2.4) setup with shared storage.

One instance was running on the second node of the cluster.
After halting that host, the instance still showed status "Running".
Tried to evacuate the host but got the following errors:

(n-cpu log)

2015-09-01 19:19:50.291 DEBUG nova.openstack.common.lockutils [req-56b30762-dad3-410d-a50d-a74318c43198 admin admin] Semaphore / lock released "update_usage" from (pid=16720) inner /opt/stack/nova/nova/openstack/common/lockutils.py:275
2015-09-01 19:19:50.296 ERROR oslo.messaging.rpc.dispatcher [req-56b30762-dad3-410d-a50d-a74318c43198 admin admin] Exception during message handling: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-dcf02ffd-0916-45b1-b2a3-4623e71d340c)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 430, in decorated_function
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/server.py", line 139, in inner
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return func(*args, **kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/exception.py", line 88, in wrapped
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher payload)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/exception.py", line 71, in wrapped
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return f(self, context, *args, **kw)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 314, in decorated_function
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher LOG.warning(msg, e, instance_uuid=instance_uuid)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 286, in decorated_function
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 364, in decorated_function
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 342, in decorated_function
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 330, in decorated_function
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 2889, in rebuild_instance
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher self._rebuild_default_impl(**kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 2725, in _rebuild_default_impl
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher detach_block_devices(context, bdms)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 2867, in detach_block_devices
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher destroy_bdm=False)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 4710, in _detach_volume
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher self.volume_api.terminate_connection(context, volume_id, connector)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/volume/cinder.py", line 185, in wrapper
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher res = method(self, ctx, volume_id, *args, **kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/volume/cinder.py", line 360, in terminate_connection
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher connector)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/cinderclient/v1/volumes.py", line 331, in terminate_connection
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher {'connector': connector})
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/cinderclient/v1/volumes.py", line 250, in _action
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return self.api.client.post(url, body=body)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/cinderclient/client.py", line 305, in post
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher return self._cs_request(url, 'POST', **kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/cinderclient/client.py", line 269, in _cs_request
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher **kwargs)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/cinderclient/client.py", line 252, in request
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher raise exceptions.from_response(resp, body)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher ClientException: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-dcf02ffd-0916-45b1-b2a3-4623e71d340c)
2015-09-01 19:19:50.296 TRACE oslo.messaging.rpc.dispatcher

(c-api log)

2015-09-01 19:19:49.696 ERROR cinder.api.middleware.fault [req-dcf02ffd-0916-45b1-b2a3-4623e71d340c 13e9d15ce670486a852427a9f94a19d2 ac5dec9bc10b4ac8b7b3ee79e3a06f0e] Caught error: Timed out waiting for a reply to message ID e83aa194d740457494288b61a1251caa
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault Traceback (most recent call last):
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/opt/stack/cinder/cinder/api/middleware/fault.py", line 76, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return req.get_response(self.application)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1320, in send
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault application, catch_exc_info=False)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1284, in call_application
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault app_iter = application(self.environ, start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault resp = self.call_func(req, *args, **self.kwargs)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return self.func(req, *args, **kwargs)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/osprofiler/web.py", line 99, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return request.get_response(self.application)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1320, in send
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault application, catch_exc_info=False)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1284, in call_application
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault app_iter = application(self.environ, start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/keystonemiddleware/auth_token.py", line 823, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return self._call_app(env, start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/keystonemiddleware/auth_token.py", line 758, in _call_app
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return self._app(env, _fake_start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/routes/middleware.py", line 136, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault response = self.app(environ, start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault resp = self.call_func(req, *args, **self.kwargs)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return self.func(req, *args, **kwargs)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/opt/stack/cinder/cinder/api/openstack/wsgi.py", line 897, in __call__
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault content_type, body, accept)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/opt/stack/cinder/cinder/api/openstack/wsgi.py", line 945, in _process_stack
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault action_result = self.dispatch(meth, request, action_args)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/opt/stack/cinder/cinder/api/openstack/wsgi.py", line 1021, in dispatch
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return method(req=request, **action_args)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/opt/stack/cinder/cinder/api/contrib/volume_actions.py", line 218, in _terminate_connection
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault self.volume_api.terminate_connection(context, volume, connector)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/opt/stack/cinder/cinder/volume/api.py", line 87, in wrapped
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault return func(self, context, target_obj, *args, **kwargs)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/opt/stack/cinder/cinder/volume/api.py", line 519, in terminate_connection
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault force)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/opt/stack/cinder/cinder/volume/rpcapi.py", line 170, in terminate_connection
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault connector=connector, force=force)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault retry=self.retry)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault timeout=timeout, retry=retry)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 408, in send
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault retry=retry)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 397, in _send
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault result = self._waiter.wait(msg_id, timeout)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 285, in wait
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault reply, ending = self._poll_connection(msg_id, timeout)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 235, in _poll_connection
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault % msg_id)
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault MessagingTimeout: Timed out waiting for a reply to message ID e83aa194d740457494288b61a1251caa
2015-09-01 19:19:49.696 TRACE cinder.api.middleware.fault

So it looks like the c-api on the first node is trying to call the c-vol of the halted node to terminate_connection, and because that node is down the message times out.

Instead, it should simply "rebuild" the instance on another available node.
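
To make the failure mode concrete, here is a minimal, self-contained sketch (assumed names throughout; it paraphrases the call chain in the tracebacks above, not the actual Juno source): cinder-api pins the terminate_connection RPC to the host recorded on the volume, so if that host is down the call can only time out, and the timeout surfaces to nova as an HTTP 500.

class MessagingTimeout(Exception):
    pass

def rpc_call(host, method):
    # stand-in for the oslo.messaging client used by cinder-api;
    # node2 is the halted node in this reproduction
    alive_hosts = {"node1", "node3"}
    if host not in alive_hosts:
        # the real client first blocks for rpc_response_timeout seconds
        raise MessagingTimeout(
            "Timed out waiting for a reply to message ID ...")
    return "ok"

def terminate_connection(volume, connector):
    # cinder routes the call to volume['host'], no matter which
    # API node received the HTTP request
    return rpc_call(volume["host"], "terminate_connection")

volume = {"id": "vol-1", "host": "node2"}  # owned by the halted node
try:
    terminate_connection(volume, {"ip": "10.0.0.1"})
except MessagingTimeout as exc:
    # cinder-api's fault middleware turns this into the HTTP 500 that
    # nova's cinderclient call re-raises as ClientException
    print("HTTP 500: %s" % exc)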

Revision history for this message
Gorka Eguileor (gorka) wrote :

What do you mean by "shared storage"? Is this some kind of Active-Active cluster configuration?

Revision history for this message
Eduard Biceri-Matei (cubusbacau) wrote :

No, it's not Active-Active.
Shared storage means the volumes are available on all nodes but active on only one. We use our own cinder driver to enable this, so, for example, we can do live migration without having to copy the volume over from one node to another.

We try to do the same thing with evacuate: since the volume is available on every node, it should be easy to just rebuild the instance and start it.

PS: Looking closer at the chain of events, this looks more like an error in nova (n-cpu called c-api to terminate_connection, even though the evacuate command's help says "Evacuate server from failed host", so it shouldn't try to connect to the failed host). Not sure, though, which component is to blame here.
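
For context, a toy sketch of the kind of driver being described (hypothetical names; the reporter's actual driver is not included in this bug): with the filesystem mounted on every node, connection setup and teardown carry no per-host export state.

class SharedFSDriver(object):
    """Toy cinder-style driver for a filesystem mounted on all nodes."""

    def initialize_connection(self, volume, connector):
        # any host may attach; just hand back the shared path
        return {"driver_volume_type": "local",
                "data": {"device_path": "/shared/%s" % volume["id"]}}

    def terminate_connection(self, volume, connector, force=False):
        # nothing host-specific to tear down on shared storage
        pass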

Revision history for this message
Eduard Biceri-Matei (cubusbacau) wrote :

(bump)

Revision history for this message
Scott DAngelo (scott-dangelo) wrote :

Nova must call cinder's terminate_connection (and later on, in rebuild_instance, it will call cinder's detach) when it is moving the attached volume to a new host. It can then re-attach the volume to the new Nova instance on the new host. Without calling terminate_connection() and detach(), nova will never be able to re-attach the volume.
If the c-vol node is down, the terminate_connection call to that node will not succeed. I'm not sure we'd want to ignore this, since the export of the volume will persist if/when that c-vol node comes back up.
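
As a rough sketch of that hand-off sequence, using the python-cinderclient v1 calls visible in the nova traceback (endpoint, credentials, volume ID, and connector contents are all placeholders):

from cinderclient.v1 import client

# placeholder credentials and endpoint
c = client.Client("admin", "password", "admin",
                  "http://keystone.example:5000/v2.0")

volume_id = "VOLUME-UUID"  # placeholder
old_connector = {"host": "node2", "ip": "10.0.0.2"}  # failed host
new_connector = {"host": "node3", "ip": "10.0.0.3"}  # evacuation target

# tear down the old host's connection first...
c.volumes.terminate_connection(volume_id, old_connector)
c.volumes.detach(volume_id)

# ...only then can the volume be exported to and attached on the new host
conn_info = c.volumes.initialize_connection(volume_id, new_connector)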

Revision history for this message
Eduard Biceri-Matei (cubusbacau) wrote :

IMHO nova should not even try to terminate_connection, as this is sure to fail (we know the node is down; otherwise the "evacuate host" option would not be present in Horizon).
Instead, it should try to initialize_connection on the new host and fail if that doesn't succeed (and let the cinder driver sort out the issues to make sure it does). Our driver is able to handle a new connection coming from a new host (it doesn't use exports, since the volumes are already available on every node).

Otherwise, how is "evacuate host" supposed to work?

Revision history for this message
Scott DAngelo (scott-dangelo) wrote :

terminate_connection is not sure to fail, because it is more common to run cinder-volume on a separate machine from the nova-compute node.
In that scenario, the nova-compute host is down but the cinder-volume node is up, and everything proceeds as designed.
If the original connection is still present (because terminate_connection was never called), it will be picked up by the original nova-compute host when it restarts and does an iSCSI rescan (in the iSCSI case), which can lead to data corruption.
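
A toy illustration of that corruption scenario (assumed data structures, nothing from the real drivers): if the old export is never removed, two hosts end up seeing the same LUN once the failed node comes back.

# persisted export state on the backend: (volume, host) pairs
exports = {("vol-1", "node2")}

def evacuate_without_terminate(volume, new_host):
    # the old export is never removed, so the returning node2 will
    # rediscover the LUN on its next iSCSI rescan
    exports.add((volume, new_host))
    return exports

print(evacuate_without_terminate("vol-1", "node3"))
# {('vol-1', 'node2'), ('vol-1', 'node3')} -> two hosts, one LUN:
# concurrent writers and a real risk of data corruption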

Revision history for this message
Eduard Biceri-Matei (cubusbacau) wrote :

OK, so I should try to run cinder-volume on a different node.
Although in our case that might bring other problems, as the backend is a local filesystem (distributed using an external component) on each compute node, and we need cinder-volume to run commands locally.

So this ticket is invalid.

Thanks.

no longer affects: cinder
Changed in nova:
status: New → Invalid