Volume detach fails if there are multiple BDM entries

Bug #1709287 reported by Radoslav Gerganov
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Steps to reproduce:
1. Attaching a volume to an instance fails with an RPC timeout when nova-api calls nova-compute to create the BDM.
2. Attaching the same volume to the same instance a second time succeeds.
3. There are now two BDMs for this volume, and one of them has an empty connection_info. Detaching the volume then fails with an error caused by the stale BDM entry created in step 1:

[req-b14eb2a2-10bc-4b1a-b62f-ead07947eb66 7c0126911c154f3db23e4f013c70f5aa b006cefe78734655ad29cf49445f2f67 - - -] Exception during message handling: <type 'NoneType'> can't be decoded
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
    incoming.message))
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 154, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 110, in wrapped
    payload)
  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 89, in wrapped
    return f(self, context, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 395, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 383, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 466, in decorated_function
    instance=instance)
  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 456, in decorated_function
    *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4976, in detach_volume
    attachment_id=attachment_id)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4906, in _detach_volume
    connection_info = jsonutils.loads(bdm.connection_info)
  File "/usr/lib/python2.7/dist-packages/oslo_serialization/jsonutils.py", line 229, in loads
    return json.loads(encodeutils.safe_decode(s, encoding), **kwargs)
  File "/usr/lib/python2.7/dist-packages/oslo_utils/encodeutils.py", line 39, in safe_decode
    raise TypeError("%s can't be decoded" % type(text))
TypeError: <type 'NoneType'> can't be decoded
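
The last frames show the failure mode: the stale entry's connection_info is None, and decoding it raises. As a minimal illustration only, the sketch below uses the standard-library json module in place of oslo's jsonutils (an assumption, so that the snippet is self-contained), and parse_connection_info is a hypothetical helper, not part of nova and not the fix proposed for this bug.

```python
import json


def parse_connection_info(raw):
    """Return the decoded connection_info, or None if the BDM never got one.

    Hypothetical helper for illustration; not part of nova.
    """
    if raw is None:
        # A BDM left behind by a timed-out attach has no connection_info.
        return None
    return json.loads(raw)


# Stand-in values: one stale entry and one completed attachment.
stale = parse_connection_info(None)
attached = parse_connection_info('{"driver_volume_type": "iscsi"}')
```

With a guard like this, the caller can tell "never connected" apart from a real attachment instead of hitting the TypeError.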

Catching the timeout and then deleting the BDM entry is not a reliable fix, because the entry may be created after the timeout fires (we have seen this in our environment). We could also accidentally delete an entry created by a concurrent attach request.
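
Since cleaning up at attach time is racy, one possible alternative is to tolerate the stale entry at detach time. The following is only a sketch under stated assumptions: BDM is a hypothetical stand-in record rather than nova's BlockDeviceMapping object, and split_bdms is an invented helper that separates a volume's entries into those with connection_info and those without.

```python
from typing import List, NamedTuple, Optional, Tuple


class BDM(NamedTuple):
    """Hypothetical stand-in for a block device mapping row."""
    id: int
    volume_id: str
    connection_info: Optional[str]


def split_bdms(bdms: List[BDM], volume_id: str) -> Tuple[List[BDM], List[BDM]]:
    """Split a volume's BDMs into (usable, stale) by connection_info."""
    usable, stale = [], []
    for bdm in bdms:
        if bdm.volume_id != volume_id:
            continue
        # An empty or missing connection_info marks an entry that never connected.
        (usable if bdm.connection_info else stale).append(bdm)
    return usable, stale


bdms = [
    BDM(1, "vol-1", None),                               # left by the timed-out attach
    BDM(2, "vol-1", '{"driver_volume_type": "iscsi"}'),  # second, successful attach
]
usable, stale = split_bdms(bdms, "vol-1")
```

This does not remove the race described above: an entry that looks stale may belong to an attach request that is still in flight, so deleting it would still need some form of coordination.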

Matt Riedemann (mriedem) wrote :

Need more information about the environment - which version of nova? If master, what is the git hash? Which virt driver?

Changed in nova:
status: New → Incomplete
Matt Riedemann (mriedem) wrote :

Can you confirm that it's the reserve_block_device_name RPC call to the compute that times out?

I think Nikola Dipanov and later Lee Yarwood had worked on a change in the API that might have helped here, but now I'm having a hard time finding it.

Matt Riedemann (mriedem) wrote :

This was the change I was looking for: https://review.openstack.org/#/c/290793/

Radoslav Gerganov (rgerganov) wrote :

@Matt We hit this problem using stable/ocata but I believe it is also present in master. Bug 1427060 describes a similar problem but most importantly the proposed patch will also fix this bug.

jcat (jcat) wrote :

Hi,

Can we get some clarification here please?

This bug was declared a duplicate of #1427060, on the grounds that its proposed patch would fix this issue. However, that proposed patch has been abandoned. Therefore there is no certainty that the eventual resolution for that bug will also resolve this bug.

Can anyone confirm what the current status is?

jcat (jcat) wrote :

I've removed the duplicate flag for now.

Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired