Cinder volume stuck in detaching on error exception

Bug #1888665 reported by Jacolex
This bug affects 8 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Lee Yarwood
Milestone: (none)

Bug Description

Hello!

This issue has been annoying me and my users for a long time, so I decided to report the problem with detaching volumes from instances. When an exception occurs, the volume gets stuck in the detaching state and the detach procedure is never retried.
For example, below is the exception:

2020-07-23 09:36:37.964 37220 ERROR nova.volume.cinder [req-1397b7e6-e111-44e6-87df-3057b30bfa0b 13622fba11a84996a70019d553060c8c b5fec10eaaa948ee86d61d7db3fb42e7 - default default] Delete attachment failed for attachment 35653c6c-12c5-4d10-a3d4-508b45e2b1d6. Error: Gateway Time-out (HTTP 504) Code: 504: cinderclient.exceptions.ClientException: Gateway Time-out (HTTP 504)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server [req-1397b7e6-e111-44e6-87df-3057b30bfa0b 13622fba11a84996a70019d553060c8c b5fec10eaaa948ee86d61d7db3fb42e7 - default default] Exception during message handling: cinderclient.exceptions.ClientException: Gateway Time-out (HTTP 504)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 79, in wrapped
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server function_name, call_dict, binary, tb)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server self.force_reraise()
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server raise value
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 69, in wrapped
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/utils.py", line 1372, in decorated_function
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 219, in decorated_function
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server kwargs['instance'], e, sys.exc_info())
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server self.force_reraise()
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server raise value
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 207, in decorated_function
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 6198, in detach_volume
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server do_detach_volume(context, volume_id, instance, attachment_id)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 6196, in do_detach_volume
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server attachment_id=attachment_id)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 6147, in _detach_volume
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server attachment_id=attachment_id, destroy_bdm=destroy_bdm)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/virt/block_device.py", line 461, in detach
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server attachment_id, destroy_bdm)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/virt/block_device.py", line 442, in _do_detach
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server volume_api.attachment_delete(context, self['attachment_id'])
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/volume/cinder.py", line 397, in wrapper
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server res = method(self, ctx, *args, **kwargs)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/volume/cinder.py", line 448, in wrapper
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server res = method(self, ctx, attachment_id, *args, **kwargs)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/volume/cinder.py", line 889, in attachment_delete
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server 'code': getattr(ex, 'code', None)})
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server self.force_reraise()
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server raise value
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/volume/cinder.py", line 882, in attachment_delete
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server attachment_id)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinderclient/api_versions.py", line 423, in substitution
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return method.func(obj, *args, **kwargs)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinderclient/v3/attachments.py", line 44, in delete
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return self._delete("/attachments/%s" % base.getid(attachment))
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinderclient/base.py", line 309, in _delete
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server resp, body = self.api.client.delete(url)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 223, in delete
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return self._cs_request(url, 'DELETE', **kwargs)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 205, in _cs_request
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server return self.request(url, method, **kwargs)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 191, in request
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server raise exceptions.from_response(resp, body)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server cinderclient.exceptions.ClientException: Gateway Time-out (HTTP 504)
2020-07-23 09:36:38.019 37220 ERROR oslo_messaging.rpc.server

After that the volume can't be detached without resetting its state back to "in-use". I suggest adding a timeout or retry logic to the attach/detach procedures.
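
To illustrate the idea (only a rough sketch, not Nova's actual code; the helper name, retry count and delay are made up), retrying the attachment delete on a transient error with python-cinderclient could look roughly like this:

import time

from cinderclient import exceptions as cinder_exc


def delete_attachment_with_retry(cinder, attachment_id, retries=3, delay=5):
    # Retry the attachment delete a few times so a single transient
    # gateway timeout does not leave the volume stuck in 'detaching'.
    for attempt in range(1, retries + 1):
        try:
            cinder.attachments.delete(attachment_id)
            return
        except cinder_exc.ClientException as exc:
            # Only retry on errors that look transient (5xx); give up
            # after the last attempt.
            code = getattr(exc, 'code', None)
            if code not in (500, 502, 503, 504) or attempt == retries:
                raise
            time.sleep(delay)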

Jacolex (jacolex)
description: updated
vinay harsha mitta (vinay7) wrote :

Hi,
could you share the steps to reproduce this issue?

Jacolex (jacolex) wrote :

Hello Vinay,

The issue occurs when some critical problem happens; for example, many parallel attach/detach/create volume requests can slow down the responses of cinder-api and cinder-volume. To reproduce the problem I suggest stopping the cinder-volume service for a couple of seconds during a detach. After nova-compute throws the exception, the cinder-volume service can be started again, but the detach procedure is never retried by nova-compute.

Lee Yarwood (lyarwood) wrote :

Please use the provided template when creating bugs in future.

Can you confirm which versions of Nova and Cinder are used in this environment?

The volume staying in a `detaching` state after this exception is a valid Nova bug IMHO.

However `cinderclient.exceptions.ClientException: Gateway Time-out (HTTP 504)` suggests that Cinder API is timing out while waiting for Cinder Volume to remove the volume attachment on the backend. Can you trace a request through on the Cinder side to confirm this and maybe create a fresh bug for any issues you discover there?

I'll use this bug to ensure we call the roll detaching API so that the volume remains marked as `in-use`.
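
Roughly what I have in mind, purely as a sketch with python-cinderclient calls rather than the actual patch:

from cinderclient import exceptions as cinder_exc


def delete_attachment_with_rollback(cinder, volume_id, attachment_id):
    # If deleting the attachment fails, roll the volume back from
    # 'detaching' to 'in-use' via Cinder's os-roll_detaching action
    # instead of leaving it stuck in 'detaching'.
    try:
        cinder.attachments.delete(attachment_id)
    except cinder_exc.ClientException:
        cinder.volumes.roll_detaching(volume_id)
        raise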

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Lee Yarwood (lyarwood)
Jacolex (jacolex) wrote :

Hello Lee,
I'm glad this will be worked on. Regarding the Cinder timeouts, I'll try to dive in and check the issue, and following your suggestion I'll also try changing the timeouts on HAProxy and the Cinder API.

Doug Szumski (dszumski) wrote :

From what I can see, no timeout is set when the Cinder client is created in Nova, and the Python requests library defaults to no timeout, so where is the timeout coming from? Looking at a similar issue in my case, HAProxy appears to be causing it: I see some long-running requests in Cinder that exceed 10s, and HAProxy has `timeout http-request 10s` set. Do you have a similar setup, Jacolex, or is Nova talking directly to Cinder in your case?
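
For reference, and purely as a sketch (the auth URL, credentials and microversion below are placeholders, and Nova builds its client differently), a client-side timeout can be set on the keystoneauth session the Cinder client is built on; it is forwarded to python-requests for every call:

from cinderclient import client as cinder_client
from keystoneauth1 import loading, session

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://controller:5000/v3',  # placeholder endpoint
    username='admin', password='secret',   # placeholder credentials
    project_name='admin',
    user_domain_name='Default',
    project_domain_name='Default')

# The timeout is applied to every request made on this session; without
# it, a request to a hung backend can block indefinitely.
sess = session.Session(auth=auth, timeout=30)
cinder = cinder_client.Client('3.44', session=sess)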

Doug Szumski (dszumski) wrote :

Ah, sorry, I just saw that you mentioned you have HAProxy in the middle!

Jacolex (jacolex) wrote :

Yes, I have HAProxy, but I think that's not the point. Developers shouldn't assume that the Cinder API is always aware of a failure. I had a disaster last week when something went wrong with the network and OpenStack had poor connectivity to Ceph. During that failure cinder-volume was flapping (the Ceph volume service was going up and down continuously) and the compute nodes were also unable to handle attach/detach requests properly. Attaching was a minor problem, but detaching was a disaster: some detaches went OK, some did not, and almost every detach got stuck in the detaching state, regardless of the actual status of the task. There is OpenShift running on top of OpenStack, so many attach/detach requests occur all the time. After that I was left with volumes in an inconsistent state, which I had been cleaning up until today. It wouldn't have happened if detach error handling were done properly.

Sven Kieske (s-kieske) wrote :

Any update on this?

Satish Patel (satish-txt) wrote :

Any update?

I am getting the same issue with the Zed OpenStack release.

Andrés Castro (acastro25) wrote :

I have the same issue with the 2023.1 version.
