Volume detach failure because of invalid bdm.connection_info

Bug #1327218 reported by Mark McLoughlin on 2014-06-06
This bug affects 19 people
Affects / Importance / Assigned to:
- OpenStack Compute (nova): High, Matt Riedemann
- OpenStack Compute (nova) Juno series: Undecided, Unassigned
- nova (Ubuntu): Undecided, Unassigned
- nova (Ubuntu) Trusty series: High, Unassigned

Bug Description

Example of this here:

http://logs.openstack.org/33/97233/1/check/check-grenade-dsvm/f7b8a11/logs/old/screen-n-cpu.txt.gz?level=TRACE#_2014-06-02_14_13_51_125

  File "/opt/stack/old/nova/nova/compute/manager.py", line 4153, in _detach_volume
    connection_info = jsonutils.loads(bdm.connection_info)
  File "/opt/stack/old/nova/nova/openstack/common/jsonutils.py", line 164, in loads
    return json.loads(s)
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer
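The last frame is the giveaway: bdm.connection_info is still NULL in the database, so the field comes back as None and json.loads() rejects it. A minimal standalone reproduction (on Python 2 the message is "expected string or buffer"; Python 3 mentions NoneType instead):

```python
import json

# bdm.connection_info was never saved, so the object field is None
connection_info = None

try:
    json.loads(connection_info)
except TypeError as exc:
    # Python 2: "expected string or buffer"
    # Python 3: "the JSON object must be str, bytes or bytearray, not NoneType"
    print("TypeError: %s" % exc)
```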

This was in grenade with stable/icehouse nova commit 7431cb9.

There's nothing unusual about the test which triggers this: it simply attaches a volume to an instance, waits for it to show up in the instance, and then tries to detach it.
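That sequence can be sketched as a simple poll-then-detach loop (an illustrative helper, not tempest's actual code):

```python
import time

def wait_for_volume_status(get_status, wanted='in-use', timeout=60, interval=1.0):
    """Illustrative poll helper (not tempest's actual code): block until
    the volume reaches the wanted status or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_status() == wanted:
            return
        time.sleep(interval)
    raise RuntimeError("volume never reached status %r" % wanted)

# The failing sequence: attach, wait for 'in-use', detach immediately.
# Because nova's attach is an RPC cast, the volume can report 'in-use'
# before bdm.connection_info is saved, so the detach can still hit NULL.
statuses = iter(['attaching', 'in-use'])
wait_for_volume_status(lambda: next(statuses), interval=0)
```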

logstash query for this:

  message:"Exception during message handling" AND message:"expected string or buffer" AND message:"connection_info = jsonutils.loads(bdm.connection_info)" AND tags:"screen-n-cpu.txt"

It seems to be very rare, though.

Mark McLoughlin (markmc) wrote :

Looks in the same ballpark as bug #1302774

description: updated
Mark McLoughlin (markmc) wrote :

Here's a case of the DiskNotFound traceback from bug #1302774 and the traceback from this bug in the same log:

http://logs.openstack.org/33/96333/2/check/check-grenade-dsvm/105af93/logs/old/screen-n-cpu.txt.gz?level=TRACE

Nikola Đipanov (ndipanov) wrote :

Looking at logstash, it looks like this did not get fixed by the cinder fix https://review.openstack.org/#/c/90353/. That fix landed on June 10th but we have a hit on June 11th.

Let's monitor this a bit more to be sure, but it seems there are more races in there.

Joe Gordon (jogo) wrote :

No hits for this one; marking as resolved.

Changed in nova:
status: Triaged → Fix Committed
Thierry Carrez (ttx) on 2014-09-05
Changed in nova:
milestone: none → juno-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2014-10-16
Changed in nova:
milestone: juno-3 → 2014.2
Matt Riedemann (mriedem) wrote :

In one case, we're attaching the encrypted luks volume to the instance here:

http://logs.openstack.org/93/156693/7/check/check-tempest-dsvm-postgres-full/d3b26e8/logs/screen-n-cpu.txt.gz#_2015-03-12_16_38_09_061

We initialize the connection and get the connection_info back here:

http://logs.openstack.org/93/156693/7/check/check-tempest-dsvm-postgres-full/d3b26e8/logs/screen-n-cpu.txt.gz#_2015-03-12_16_38_11_064

I see an os-attach call here:

http://logs.openstack.org/93/156693/7/check/check-tempest-dsvm-postgres-full/d3b26e8/logs/screen-n-cpu.txt.gz#_2015-03-12_16_38_15_223

We start detaching the volume here:

http://logs.openstack.org/93/156693/7/check/check-tempest-dsvm-postgres-full/d3b26e8/logs/screen-n-cpu.txt.gz#_2015-03-12_16_38_16_902

We're failing to detach the volume here:

http://logs.openstack.org/93/156693/7/check/check-tempest-dsvm-postgres-full/d3b26e8/logs/screen-n-cpu.txt.gz#_2015-03-12_16_38_17_567

And six minutes later we're terminating the bdm for that volume here:

http://logs.openstack.org/93/156693/7/check/check-tempest-dsvm-postgres-full/d3b26e8/logs/screen-n-cpu.txt.gz#_2015-03-12_16_44_54_876

After failing to detach, I'm also seeing the same volume_id showing up in the logs in other test runs:

VolumesV1SnapshotTestJSON:

http://logs.openstack.org/93/156693/7/check/check-tempest-dsvm-postgres-full/d3b26e8/logs/screen-n-cpu.txt.gz#_2015-03-12_16_42_03_507

TestMinimumBasicScenario:

http://logs.openstack.org/93/156693/7/check/check-tempest-dsvm-postgres-full/d3b26e8/logs/screen-n-cpu.txt.gz#_2015-03-12_16_44_40_119

Matt Riedemann (mriedem) wrote :

This should help with some of the confusion when tracing through the n-cpu debug logs:

https://review.openstack.org/#/c/164259/

The test that fails is for cryptsetup, but the volume type had 'luks' in the name, which is confusing.

Fix proposed to branch: master
Review: https://review.openstack.org/164330

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Matt Riedemann (mriedem) on 2015-03-13
tags: added: juno-backport-potential
haruka tanizawa (h-tanizawa) wrote :

This patch https://review.openstack.org/#/c/163937/ also seems to avoid a NULL connection_info.
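For illustration, a defensive guard along those lines might look like this (a hypothetical helper, not the linked patch's actual code):

```python
import json

def load_connection_info(raw):
    """Hypothetical defensive guard (not the linked patch's actual code):
    a BDM whose connection_info was never saved yields None, so fall back
    to an empty dict instead of letting json.loads() raise TypeError in
    the detach path."""
    if raw is None:
        return {}
    return json.loads(raw)
```

Whether to fall back silently or raise a clearer error is a judgment call; the fix that eventually merged instead removes the race on the attach side.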

Reviewed: https://review.openstack.org/164330
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6fb2ef96d6aaf9ca0ad394fd7621ef1e6003f5a1
Submitter: Jenkins
Branch: master

commit 6fb2ef96d6aaf9ca0ad394fd7621ef1e6003f5a1
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 18 12:42:42 2015 -0700

    Save bdm.connection_info before calling volume_api.attach_volume

    There is a race in attach/detach of a volume where the volume status
    goes to 'in-use' before the bdm.connection_info data is stored in the
    database. Since attach is a cast, the caller can see the volume go to
    'in-use' and immediately try to detach the volume and blow up in the
    compute manager because bdm.connection_info isn't yet stored in the
    database.

    This fixes the issue by saving the connection_info immediately before
    calling volume_api.attach_volume (which sets the volume status to
    'in-use').

    Closes-Bug: #1327218

    Change-Id: Ib95c8f7b66aca0c4ac7b92d140cbeb5e85c2717f
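The ordering change in this commit can be sketched with stand-in classes (illustrative only, not nova's actual objects):

```python
import json

class FakeBDM(object):
    """Stand-in for nova's BlockDeviceMapping object."""
    def __init__(self):
        self.connection_info = None
        self.saved = False
    def save(self):
        self.saved = True  # would persist the row to the nova database

class FakeVolumeAPI(object):
    """Stand-in for the cinder volume API."""
    def __init__(self):
        self.status = 'attaching'
    def attach_volume(self):
        # Once this runs, a caller polling cinder sees 'in-use' and may
        # immediately start a detach.
        self.status = 'in-use'

def attach(bdm, volume_api, connection_info):
    # The fix: persist connection_info *before* the volume flips to
    # 'in-use', so a racing detach never reads NULL from the database.
    bdm.connection_info = json.dumps(connection_info)
    bdm.save()
    volume_api.attach_volume()

bdm, api = FakeBDM(), FakeVolumeAPI()
attach(bdm, api, {'driver_volume_type': 'iscsi'})
assert api.status == 'in-use'
assert bdm.saved and bdm.connection_info is not None
```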

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2015-03-20
Changed in nova:
milestone: none → kilo-3
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/166017
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bbf6348997fee02f9dadd556565f44005e2c7f23
Submitter: Jenkins
Branch: stable/juno

commit bbf6348997fee02f9dadd556565f44005e2c7f23
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 18 12:42:42 2015 -0700

    Save bdm.connection_info before calling volume_api.attach_volume

    There is a race in attach/detach of a volume where the volume status
    goes to 'in-use' before the bdm.connection_info data is stored in the
    database. Since attach is a cast, the caller can see the volume go to
    'in-use' and immediately try to detach the volume and blow up in the
    compute manager because bdm.connection_info isn't yet stored in the
    database.

    This fixes the issue by saving the connection_info immediately before
    calling volume_api.attach_volume (which sets the volume status to
    'in-use').

    Closes-Bug: #1327218

    Conflicts:
            nova/tests/unit/compute/test_compute.py
            nova/tests/unit/virt/test_block_device.py
            nova/virt/block_device.py

    NOTE(mriedem): The block_device conflicts are due to using dot
    notation when accessing object fields and in kilo the context is
    no longer passed to bdm.save(). The test conflicts are due to moving
    the test modules in kilo and passing the context on save().

    Change-Id: Ib95c8f7b66aca0c4ac7b92d140cbeb5e85c2717f
    (cherry picked from commit 6fb2ef96d6aaf9ca0ad394fd7621ef1e6003f5a1)

tags: added: in-stable-juno
Thierry Carrez (ttx) on 2015-04-30
Changed in nova:
milestone: kilo-3 → 2015.1.0
Changed in nova (Ubuntu Trusty):
assignee: nobody → Edward Hope-Morley (hopem)
importance: Undecided → High
Xi Yang (xi-yang) wrote :

I am not sure whether this bug is fixed; I can reproduce it on the master branch.

My test environment is set up with commit 172d4a00ce609da7ea6d8d97f635e6c9afecb373.

2015-08-12 23:23:15.793 INFO nova.compute.manager [req-6e6eb653-cb52-47a2-93e5-d520633e4e10 admin admin] [instance: 0030bb4e-10f3-4300-9ab7-2a2bd609678f] Detach volume fe82d173-1500-4bfb-a541-3046b46c8be0 from mountpoint /dev/vdb
 ----BlockDeviceMapping(boot_index=None,connection_info=None,created_at=2015-08-12T13:38:16Z,delete_on_termination=False,deleted=False,deleted_at=None,destination_type='volume',device_name='/dev/vdb',device_type=None,disk_bus=None,guest_format=None,id=2685,image_id=None,instance=<?>,instance_uuid=0030bb4e-10f3-4300-9ab7-2a2bd609678f,no_device=False,snapshot_id=None,source_type='volume',updated_at=None,volume_id='fe82d173-1500-4bfb-a541-3046b46c8be0',volume_size=None) --- (This is printed by me)
2015-08-12 23:23:15.808 ERROR oslo_messaging.rpc.dispatcher [req-6e6eb653-cb52-47a2-93e5-d520633e4e10 admin admin] Exception during message handling: <type 'NoneType'> can't be decoded
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher executor_callback))
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher executor_callback)
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/nova/nova/exception.py", line 89, in wrapped
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher payload)
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 119, in __exit__
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/nova/nova/exception.py", line 72, in wrapped
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher return f(self, context, *args, **kw)
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher File "/opt/stack/nova/nova/compute/manager.py", line 364, in decorated_function
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 119, in __exit__
2015-08-12 23:23:15.808 136462 ERROR oslo_messaging.rpc.dis...


Liam Young (gnuoy) wrote :

The fix went into 2015.1.0, and 2015.1.1 is now in the cloud archive.

Changed in nova (Ubuntu):
status: New → Fix Released
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu Trusty):
status: New → Confirmed
George (lmihaiescu) wrote :

Do you know when/if this will be backported to Juno?

Thank you.

Edward Hope-Morley (hopem) wrote :

@lmihaiescu Hi, this patch is already in stable/juno and as such will be included in the next point release of Juno (2014.2.4) but is not yet targeted for SRU into Juno.

Changed in nova (Ubuntu Trusty):
assignee: Edward Hope-Morley (hopem) → nobody