Failed to detach volume because of volume not found error prevents vm teardown

Bug #1302774 reported by Sean Dague
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
High
Nikola Đipanov
tempest
Fix Released
High
Sean Dague
Revision history for this message
Sean Dague (sdague) wrote :

Possibly a Tempest bug as well, as maybe there is a better tear down path here.

Tracy Jones (tjones-i)
tags: added: volumes
Matt Riedemann (mriedem)
tags: added: ec2
Changed in nova:
importance: Undecided → High
tags: added: gate-failure
Revision history for this message
Matt Riedemann (mriedem) wrote :

This should help with diagnosis: https://review.openstack.org/#/c/88541/

Revision history for this message
Attila Fazekas (afazekas) wrote :
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/89723
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=3b563355ec49568b4c1098a0915527b107359056
Submitter: Jenkins
Branch: master

commit 3b563355ec49568b4c1098a0915527b107359056
Author: Sean Dague <email address hidden>
Date: Tue Apr 22 20:00:46 2014 -0400

    fix test_compute_with_volumes

    The test_integration_1 test was waiting on volume being in an
    'in-use' state before proceeding. However the 'in-use' state
    doesn't actually mean that the volume is attached, as 'attaching'
    gets mapped to 'in-use' in nova.

    This makes it possible for the test to proceed when the volume is
    in an attaching state, and not actually ever attached. Which means
    you can get to the detach phase on an non attached volume, and
    thus explode.

    Use the attachment status to filter out when then volume is not
    actually ready.

    In the process, make the test name something that's actually
    descriptive for what it does.

    Related-Bug: #1302774

    Change-Id: Iaeb6a42a39b4adf8cb4bdff64efe00bf14821725

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Looking at this a bit more, this seems like a nova race. The fact that detach_call gets to the compute service when the volume is in the wrong state (what the stack trace tells us) makes me think that.

Looking deeper, this seems to be a race condition in Nova API, which is actually caused by the nova cinder API class.

This is probably from the time when Cinder was being split out.

Basically instead of actually sending a cinder check_detach API request, nova will just check the status of the volume dict passed into the cinder.API.check_detach. In case of ec2 API we first get the volume, and then get it's instance since ec2 API does not expect the instance to be specified (see nova.api.ec2.cloud.CloudController.detach_volume). This will then hit the database, which is serialized due to eventlet in the gate... thus making the race even bigger.

An easy way to reproduce the usse would be to make nova.compute.api.API.detach_volume artificially take longer for one request and then fire off two requests. Both will get to the compute host (race!), and one will stacktrace.

I will submit a nova patch as soon as I figure out the proper checks that need to be replaced by actual API calls.

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Ooops looks like this is a bit trickier than I thought, since check_attach and check_detach and check_detach are actually not exposed in the API.

The best solution would be to completely drop those two calls from nova as they are inherently racy (the result is outdated as soon as we get it). and make sure that reserve_volume (already does) and begin_detaching error out when the status is not as expected, so it will require some cinder work as well.

Changed in nova:
assignee: nobody → Nikola Đipanov (ndipanov)
Changed in cinder:
assignee: nobody → Nikola Đipanov (ndipanov)
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/90353

Changed in cinder:
status: New → In Progress
Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
Changed in cinder:
importance: Undecided → High
Changed in tempest:
assignee: nobody → Sean Dague (sdague)
status: New → Fix Released
importance: Undecided → High
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/90410

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/90410
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=1ad0f2b8b3f3959a6a5d031c1ac0b5a82d714e61
Submitter: Jenkins
Branch: master

commit 1ad0f2b8b3f3959a6a5d031c1ac0b5a82d714e61
Author: Nikola Dipanov <email address hidden>
Date: Mon Apr 28 12:33:12 2014 +0200

    Remove {begin,roll}_detaching volume API tests

    In preparation for changes to the cinder API in
    https://review.openstack.org/#/c/90353/ , we need to make sure volumes
    are in the expected state before we can call begin_detaching on it.

    Adding this kind of check to the existing two tests would make them
    scenario tests really, and also begin and roll detaching are already
    tested as part of standard attach and detach scenario tests, plus are
    kind of an implementation detail (even though, sadly, exposed as a
    public API) and need not be tested by tempest tests.

    Partial-bug: #1302774
    Change-Id: I59e88344e83b8c0665a04c8ef55126a8f27bbf57

Revision history for this message
Mark McLoughlin (markmc) wrote :

bug #1327218 may have the same underlying cause as this one, talking to Nikola we're not 100% sure yet

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/90353
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=d955c1da4490987233e252f94dffcfa106746c14
Submitter: Jenkins
Branch: master

commit d955c1da4490987233e252f94dffcfa106746c14
Author: Nikola Dipanov <email address hidden>
Date: Fri Apr 25 13:38:31 2014 +0200

    Make begin_detaching fail if volume not "in-use"

    Like it's counterpart from Nova's volume-attach functionality standpoint
    - reserve_volume, begin_detaching should fail if the volume is not in
      the correct state to be detached. This will prevent nova from
    attempting to check and then detach, which is inherently racy.

    Change-Id: Ie87eb0c9aea068affc94032ab2e53a91888d272a
    Partial-bug: #1302774

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/90354
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=27fe5638caed1544220943b07895c8afc70dab32
Submitter: Jenkins
Branch: master

commit 27fe5638caed1544220943b07895c8afc70dab32
Author: Nikola Dipanov <email address hidden>
Date: Fri Apr 25 18:39:43 2014 +0200

    Remove check_{attach,detach} from volumes API

    These methods are not called from anywhere and are likely just a relic
    from splitting Cinder out from Nova. Moreover their usage from the API
    should really be discouraged as they are inherently racy (the result is
    immediately outdated since it is not atomic with anything that might
    follow).

    Partial-bug: #1302774
    Change-Id: I5d79b22b2a1d3c9be429a13d094328fc2075fcd2

Revision history for this message
Matt Riedemann (mriedem) wrote :

Sean fixed the invalid assumption in the Tempest tests so I'm marking this as closed and we can put priority on bug 1327218 which is still hitting in the gate.

Changed in nova:
importance: High → Medium
assignee: Nikola Đipanov (ndipanov) → nobody
status: Confirmed → Invalid
no longer affects: nova
Changed in cinder:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.