tempest

Failed to detach volume because of volume not found error prevents vm teardown

Bug #1302774 reported by Sean Dague on 2014-04-04

This bug affects 4 people

Affects		Status	Importance	Assigned to	Milestone
	Cinder	Fix Released	High	Nikola Đipanov
	tempest	Fix Released	High	Sean Dague

Bug Description

When running the boto tests in the gate we get a periodic race on Cinder volumes that can be seen as follows:

Attach the volume (vdh) - http://logs.openstack.org/29/84829/6/check/check-dg-tempest-dsvm-full-reexec/7e595ac/logs/screen-n-cpu.txt.gz?level=INFO#_2014-04-04_11_38_35_995

Detach the volume (vdh) - http://logs.openstack.org/29/84829/6/check/check-dg-tempest-dsvm-full-reexec/7e595ac/logs/screen-n-cpu.txt.gz?level=INFO#_2014-04-04_11_38_41_477

Stack trace horribly because the volume is not found - http://logs.openstack.org/29/84829/6/check/check-dg-tempest-dsvm-full-reexec/7e595ac/logs/screen-n-cpu.txt.gz?level=INFO#_2014-04-04_11_38_43_866

Stack trace horribly because we try again? - http://logs.openstack.org/29/84829/6/check/check-dg-tempest-dsvm-full-reexec/7e595ac/logs/screen-n-cpu.txt.gz?level=INFO#_2014-04-04_11_42_29_353

Because of this we end up with a volume in an undeletable state, and the tests fail in making it go away (it remains marked in-use even though the guest that was using it is gone) - http://logs.openstack.org/29/84829/6/check/check-dg-tempest-dsvm-full-reexec/7e595ac/console.html#_2014-04-04_11_42_47_940

Logstash for these results is: http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRGlza05vdEZvdW5kOiBObyBkaXNrIGF0XCIgQU5EIHRhZ3M6c2NyZWVuLW4tY3B1LnR4dCIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5NjYzNDM5MjIyOX0=

Tags:

Revision history for this message

Sean Dague (sdague) wrote on 2014-04-04:

Possibly a Tempest bug as well, as maybe there is a better tear down path here.

Tracy Jones (tjones-i) on 2014-04-08

tags:

added: volumes

Matt Riedemann (mriedem) on 2014-04-18

tags:	added: ec2
Changed in nova:
importance:	Undecided → High
tags:	added: gate-failure

Revision history for this message

Matt Riedemann (mriedem) wrote on 2014-04-18:

This should help with diagnosis: https://review.openstack.org/#/c/88541/

Revision history for this message

Attila Fazekas (afazekas) wrote on 2014-04-23:

Testing in addition http://docs.aws.amazon.com/AWSEC2/latest/APIReference/ApiReference-ItemType-AttachmentSetItemResponseType.html in the http://docs.aws.amazon.com/AWSEC2/latest/APIReference/ApiReference-ItemType-DescribeVolumesSetItemResponseType.html may help if the attachmentSet attribute filled properly.

Another little different but may be related failure:
http://logs.openstack.org/65/86265/1/check/check-grenade-dsvm-partial-ncpu/7c60194/console.html

Revision history for this message

Openstack Gerrit (openstack-gerrit) wrote on 2014-04-24: Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/89723
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=3b563355ec49568b4c1098a0915527b107359056
Submitter: Jenkins
Branch: master

commit 3b563355ec49568b4c1098a0915527b107359056
Author: Sean Dague <email address hidden>
Date: Tue Apr 22 20:00:46 2014 -0400

fix test_compute_with_volumes

    The test_integration_1 test was waiting on volume being in an
    'in-use' state before proceeding. However the 'in-use' state
    doesn't actually mean that the volume is attached, as 'attaching'
    gets mapped to 'in-use' in nova.

    This makes it possible for the test to proceed when the volume is
    in an attaching state, and not actually ever attached. Which means
    you can get to the detach phase on an non attached volume, and
    thus explode.

Use the attachment status to filter out when then volume is not
actually ready.

In the process, make the test name something that's actually
descriptive for what it does.

Related-Bug: #1302774

Change-Id: Iaeb6a42a39b4adf8cb4bdff64efe00bf14821725

Revision history for this message

Nikola Đipanov (ndipanov) wrote on 2014-04-25:

Looking at this a bit more, this seems like a nova race. The fact that detach_call gets to the compute service when the volume is in the wrong state (what the stack trace tells us) makes me think that.

Looking deeper, this seems to be a race condition in Nova API, which is actually caused by the nova cinder API class.

This is probably from the time when Cinder was being split out.

Basically instead of actually sending a cinder check_detach API request, nova will just check the status of the volume dict passed into the cinder.API.check_detach. In case of ec2 API we first get the volume, and then get it's instance since ec2 API does not expect the instance to be specified (see nova.api.ec2.cloud.CloudController.detach_volume). This will then hit the database, which is serialized due to eventlet in the gate... thus making the race even bigger.

An easy way to reproduce the usse would be to make nova.compute.api.API.detach_volume artificially take longer for one request and then fire off two requests. Both will get to the compute host (race!), and one will stacktrace.

I will submit a nova patch as soon as I figure out the proper checks that need to be replaced by actual API calls.

Revision history for this message

Nikola Đipanov (ndipanov) wrote on 2014-04-25:

Ooops looks like this is a bit trickier than I thought, since check_attach and check_detach and check_detach are actually not exposed in the API.

The best solution would be to completely drop those two calls from nova as they are inherently racy (the result is outdated as soon as we get it). and make sure that reserve_volume (already does) and begin_detaching error out when the status is not as expected, so it will require some cinder work as well.

Nikola Đipanov (ndipanov) on 2014-04-25

Changed in nova:
assignee:	nobody → Nikola Đipanov (ndipanov)
Changed in cinder:
assignee:	nobody → Nikola Đipanov (ndipanov)

Revision history for this message

Openstack Gerrit (openstack-gerrit) wrote on 2014-04-25: Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/90353

Changed in cinder:
status:	New → In Progress

Sean Dague (sdague) on 2014-04-25

Changed in nova:
status:	New → Confirmed
Changed in cinder:
importance:	Undecided → High
Changed in tempest:
assignee:	nobody → Sean Dague (sdague)
status:	New → Fix Released
importance:	Undecided → High

Revision history for this message

Openstack Gerrit (openstack-gerrit) wrote on 2014-04-25: Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/90410

Revision history for this message

Openstack Gerrit (openstack-gerrit) wrote on 2014-04-29: Fix merged to tempest (master)

Reviewed: https://review.openstack.org/90410
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=1ad0f2b8b3f3959a6a5d031c1ac0b5a82d714e61
Submitter: Jenkins
Branch: master

commit 1ad0f2b8b3f3959a6a5d031c1ac0b5a82d714e61
Author: Nikola Dipanov <email address hidden>
Date: Mon Apr 28 12:33:12 2014 +0200

Remove {begin,roll}_detaching volume API tests

    In preparation for changes to the cinder API in
    https://review.openstack.org/#/c/90353/ , we need to make sure volumes
    are in the expected state before we can call begin_detaching on it.

    Adding this kind of check to the existing two tests would make them
    scenario tests really, and also begin and roll detaching are already
    tested as part of standard attach and detach scenario tests, plus are
    kind of an implementation detail (even though, sadly, exposed as a
    public API) and need not be tested by tempest tests.

Partial-bug: #1302774
Change-Id: I59e88344e83b8c0665a04c8ef55126a8f27bbf57

Revision history for this message

Mark McLoughlin (markmc) wrote on 2014-06-06:

#10

bug #1327218 may have the same underlying cause as this one, talking to Nikola we're not 100% sure yet

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-10: Fix merged to cinder (master)

#11

Reviewed: https://review.openstack.org/90353
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=d955c1da4490987233e252f94dffcfa106746c14
Submitter: Jenkins
Branch: master

commit d955c1da4490987233e252f94dffcfa106746c14
Author: Nikola Dipanov <email address hidden>
Date: Fri Apr 25 13:38:31 2014 +0200

Make begin_detaching fail if volume not "in-use"

    Like it's counterpart from Nova's volume-attach functionality standpoint
    - reserve_volume, begin_detaching should fail if the volume is not in
      the correct state to be detached. This will prevent nova from
    attempting to check and then detach, which is inherently racy.

Change-Id: Ie87eb0c9aea068affc94032ab2e53a91888d272a
Partial-bug: #1302774

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-18:

#12

Reviewed: https://review.openstack.org/90354
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=27fe5638caed1544220943b07895c8afc70dab32
Submitter: Jenkins
Branch: master

commit 27fe5638caed1544220943b07895c8afc70dab32
Author: Nikola Dipanov <email address hidden>
Date: Fri Apr 25 18:39:43 2014 +0200

Remove check_{attach,detach} from volumes API

    These methods are not called from anywhere and are likely just a relic
    from splitting Cinder out from Nova. Moreover their usage from the API
    should really be discouraged as they are inherently racy (the result is
    immediately outdated since it is not atomic with anything that might
    follow).

Partial-bug: #1302774
Change-Id: I5d79b22b2a1d3c9be429a13d094328fc2075fcd2

Revision history for this message

Matt Riedemann (mriedem) wrote on 2014-08-18:

#13

Sean fixed the invalid assumption in the Tempest tests so I'm marking this as closed and we can put priority on bug 1327218 which is still hitting in the gate.

Changed in nova:
importance:	High → Medium
assignee:	Nikola Đipanov (ndipanov) → nobody
status:	Confirmed → Invalid
no longer affects:	nova
Changed in cinder:
status:	In Progress → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.