Volumes tests fails often with rbd backend

Bug #1464259 reported by Matt Riedemann
This bug affects 7 people
Affects                     Status         Importance   Assigned to      Milestone
Cinder                      Fix Released   High         Unassigned
OpenStack Compute (nova)    Fix Released   High         Matt Riedemann
  Liberty                   Fix Released   High         Matt Riedemann
tempest                     Invalid        Undecided    Unassigned

Bug Description

http://logs.openstack.org/02/173802/5/check/check-tempest-dsvm-full-ceph/a72aac1/logs/screen-n-api.txt.gz?level=TRACE#_2015-06-11_09_04_19_511

2015-06-11 09:04:19.511 ERROR nova.api.ec2 [req-0ac81d78-2717-4dd2-80e2-d94363b55ac8 EC2VolumesTest-442487008 EC2VolumesTest-1066393631] Unexpected InvalidInput raised: Invalid input received: Invalid volume: Volume still has 1 dependent snapshots. (HTTP 400) (Request-ID: req-4586b5d2-7212-4ddd-af79-43ad8ba7ea58)
2015-06-11 09:04:19.511 ERROR nova.api.ec2 [req-0ac81d78-2717-4dd2-80e2-d94363b55ac8 EC2VolumesTest-442487008 EC2VolumesTest-1066393631] Environment: {"HTTP_AUTHORIZATION": "AWS4-HMAC-SHA256 Credential=a5e9253350ce4a249ddce8b7c1c798c2/20150611/0/127/aws4_request,SignedHeaders=host;x-amz-date,Signature=304830ed947f7fba3143887b08d1e47faa18d4b59782c0992727cb7593f586b4", "SCRIPT_NAME": "", "REQUEST_METHOD": "POST", "HTTP_X_AMZ_DATE": "20150611T090418Z", "PATH_INFO": "/", "SERVER_PROTOCOL": "HTTP/1.0", "CONTENT_LENGTH": "60", "HTTP_USER_AGENT": "Boto/2.38.0 Python/2.7.6 Linux/3.13.0-53-generic", "RAW_PATH_INFO": "/", "REMOTE_ADDR": "127.0.0.1", "wsgi.url_scheme": "http", "SERVER_PORT": "8773", "CONTENT_TYPE": "application/x-www-form-urlencoded; charset=UTF-8", "HTTP_HOST": "127.0.0.1:8773", "SERVER_NAME": "127.0.0.1", "GATEWAY_INTERFACE": "CGI/1.1", "REMOTE_PORT": "45819", "HTTP_ACCEPT_ENCODING": "identity"}

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRUMyVm9sdW1lc1Rlc3RcIiBBTkQgbWVzc2FnZTpcIlVuZXhwZWN0ZWQgSW52YWxpZElucHV0IHJhaXNlZDogSW52YWxpZCBpbnB1dCByZWNlaXZlZDogSW52YWxpZCB2b2x1bWU6IFZvbHVtZSBzdGlsbCBoYXMgMSBkZXBlbmRlbnQgc25hcHNob3RzXCIgQU5EIHRhZ3M6XCJzY3JlZW4tbi1hcGkudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MzQwMzAyMTUwODd9

10 hits in 7 days, check and gate, hitting on the ceph and glusterfs jobs.

Tags: ceph
Matt Riedemann (mriedem)
summary: - tempest.thirdparty.boto.test_ec2_volumes.EC2VolumesTest fails with
- shared storage backend
+ EC2VolumesTest fails with shared storage backend
Changed in nova:
status: New → Confirmed
tags: added: ceph ec2 glusterfs
Revision history for this message
Matt Riedemann (mriedem) wrote : Re: EC2VolumesTest fails with shared storage backend

Here is the 400 from cinder-api:

http://logs.openstack.org/02/173802/5/check/check-tempest-dsvm-full-ceph/a72aac1/logs/screen-c-api.txt.gz#_2015-06-11_09_04_19_506

2015-06-11 09:04:19.506 INFO cinder.volume.api [req-4586b5d2-7212-4ddd-af79-43ad8ba7ea58 EC2VolumesTest-1066393631] Unable to delete volume: 1ca1133b-915c-4f13-b2e5-5acdaa61075b, volume currently has snapshots.

Here is the more specific error in the cinder-volume logs:

http://logs.openstack.org/02/173802/5/check/check-tempest-dsvm-full-ceph/a72aac1/logs/screen-c-vol.txt.gz#_2015-06-11_09_01_02_096

2015-06-11 09:01:02.096 ERROR cinder.volume.manager [req-accfae1c-8732-4051-8707-c86e0ced69ee EC2VolumesTest-1066393631] Snapshot(cgsnapshot_id=None,created_at=2015-06-11T09:00:56Z,deleted=False,deleted_at=None,display_description=None,display_name=None,encryption_key_id=None,id=4465feb0-0a59-4aca-9491-36cd797a8329,metadata={},progress='100%',project_id=e3dfaf596203409784675b0aa176135e,provider_id=None,provider_location=None,status='deleting',updated_at=2015-06-11T09:00:57Z,user_id=217de41df5f54ee8b71bb4f88cb388e5,volume=Volume(1ca1133b-915c-4f13-b2e5-5acdaa61075b),volume_id=1ca1133b-915c-4f13-b2e5-5acdaa61075b,volume_size=1,volume_type_id=7902593b-4e35-410d-83b4-f8cf8abcee89)Delete snapshot failed, due to snapshot busy.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like the rbd driver in cinder just needs a retry loop like the one that was added to nova here:

https://review.openstack.org/#/c/169446/
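
For illustration only, a minimal sketch of such a retry loop, assuming the python-rbd bindings (rbd.ImageBusy, is_protected_snap/unprotect_snap/remove_snap); this is not the actual nova change or the eventual cinder fix:

    import time

    import rbd  # python-rbd bindings


    def delete_snapshot_with_retries(image, snap_name, retries=10, delay=1.0):
        """Unprotect and remove an RBD snapshot, retrying while it is still busy."""
        for attempt in range(1, retries + 1):
            try:
                if image.is_protected_snap(snap_name):
                    image.unprotect_snap(snap_name)  # fails while a clone still exists
                image.remove_snap(snap_name)
                return
            except rbd.ImageBusy:
                if attempt == retries:
                    raise
                time.sleep(delay)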

Revision history for this message
Matt Riedemann (mriedem) wrote :

The glusterfs failure is actually not this issue; that was a check queue change where everything failed, so this is just the ceph issue.

no longer affects: nova
summary: - EC2VolumesTest fails with shared storage backend
+ EC2VolumesTest fails with rbd backend
Changed in cinder:
status: New → Triaged
importance: Undecided → Low
tags: removed: ec2 glusterfs
Revision history for this message
Deepak C Shetty (dpkshetty) wrote : Re: EC2VolumesTest fails with rbd backend

FWIW, I got a similar (yet different) failure on my patch: the same test case, but the failure was in tempest cleanup/teardown.

See http://logs.openstack.org/13/172813/8/check/check-tempest-dsvm-full-ceph/9731e7e/console.html

<snip>
2015-06-17 09:10:18.541 | {1} tearDownClass (tempest.thirdparty.boto.test_ec2_volumes.EC2VolumesTest) [0.000000s] ... FAILED

2015-06-17 09:13:50.494 | ==============================
2015-06-17 09:13:50.494 | Failed 1 tests - output below:
2015-06-17 09:13:50.494 | ==============================
2015-06-17 09:13:50.494 |
2015-06-17 09:13:50.495 | tearDownClass (tempest.thirdparty.boto.test_ec2_volumes.EC2VolumesTest)
2015-06-17 09:13:50.495 | -----------------------------------------------------------------------
2015-06-17 09:13:50.495 |
2015-06-17 09:13:50.495 | Captured traceback:
2015-06-17 09:13:50.495 | ~~~~~~~~~~~~~~~~~~~
2015-06-17 09:13:50.495 | Traceback (most recent call last):
2015-06-17 09:13:50.495 | File "tempest/test.py", line 310, in tearDownClass
2015-06-17 09:13:50.495 | six.reraise(etype, value, trace)
2015-06-17 09:13:50.495 | File "tempest/test.py", line 293, in tearDownClass
2015-06-17 09:13:50.495 | teardown()
2015-06-17 09:13:50.495 | File "tempest/thirdparty/boto/test.py", line 283, in resource_cleanup
2015-06-17 09:13:50.496 | raise exceptions.TearDownException(num=fail_count)
2015-06-17 09:13:50.496 | tempest.exceptions.TearDownException: 2 cleanUp operation failed

</snip>

Revision history for this message
Jon Bernard (jbernard) wrote :

It looks like the root cause (for RBD) is in the test's resource cleanup step: it does not schedule waits for volume status before issuing the next command. It schedules client.delete_volume, which only checks for a 202 response code:

    https://git.openstack.org/cgit/openstack/tempest/tree/tempest/services/volume/json/volumes_client.py#n94
    https://git.openstack.org/cgit/openstack/tempest/tree/tempest/thirdparty/boto/test_ec2_volumes.py#n67

I suppose these operations are not order-dependent with the default backend, but with RBD there is enough delay to cause problems. The snapshot from which the second volume was created is deleted before the new volume completes, and that delete fails. Then nova attempts to delete the original volume, which fails because a snapshot still exists.

It seems to me that properly scheduling the waits in this test is the right way to go, rather than introducing retries in the driver.
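
As a rough illustration of "properly scheduling the waits" (these are not tempest's actual helpers; the sketch uses plain python-cinderclient calls and arbitrary timeouts): poll the volume until it reaches a stable state, or is gone, before issuing the next operation.

    import time

    from cinderclient import exceptions as cinder_exc


    def wait_for_volume_status(client, volume_id, status, timeout=196, interval=2):
        """Poll a volume until it reaches the given status or the timeout expires."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            volume = client.volumes.get(volume_id)
            if volume.status == status:
                return volume
            if volume.status == 'error':
                raise RuntimeError('volume %s went to error' % volume_id)
            time.sleep(interval)
        raise RuntimeError('timed out waiting for volume %s to become %s'
                           % (volume_id, status))


    def wait_for_volume_deletion(client, volume_id, timeout=196, interval=2):
        """Poll until cinder no longer knows about the volume."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                client.volumes.get(volume_id)
            except cinder_exc.NotFound:
                return
            time.sleep(interval)
        raise RuntimeError('timed out waiting for volume %s to be deleted' % volume_id)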

Matt Riedemann (mriedem)
Changed in tempest:
status: New → Triaged
Revision history for this message
Sean Dague (sdague) wrote :

This is a top-5 race bug at the moment; it needs to be addressed.

summary: - EC2VolumesTest fails with rbd backend
+ Volumes tests fails often with rbd backend
Changed in cinder:
importance: Low → High
Jon Bernard (jbernard)
Changed in cinder:
assignee: nobody → Jon Bernard (jbernard)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/254303

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tempest (master)

Change abandoned by Jon Bernard (<email address hidden>) on branch: master
Review: https://review.openstack.org/254303

Matt Riedemann (mriedem)
Changed in tempest:
status: Triaged → Invalid
Revision history for this message
Matt Riedemann (mriedem) wrote :

Discussed this in cinder IRC today:

http://eavesdrop.openstack.org/irclogs/%23openstack-cinder/%23openstack-cinder.2015-12-07.log.html#t2015-12-07T18:36:19

In tracing down http://logs.openstack.org/20/218120/3/check/gate-tempest-dsvm-full-ceph/2349f2d/console.html and the ebs test failure, the test basically does this:

1. create a volume from an image
2. boot a server from that volume
3. create snapshot from the server (creates image and volume snapshots)
4. delete the server
5. create 2nd server from the image snapshot
6. cleanup

The failure (race) is that when deleting the volume snapshot, there is a backing volume dependent on that snapshot which has not been deleted yet. That 2nd volume is attached to the 2nd server (nova does this because the image metadata carries the BDM snapshot info).

It sounds like there is a timing issue in the cinder rbd driver when deleting the dependent volume (for the snapshot) such that it's not gone when we try to delete the volume snapshot (it's considered busy), so that fails.

So as noted in comment 2, it sounds like when we delete a volume snapshot, the cinder rbd driver needs to wait for the backing volume to be deleted (from the ceph backend) before it can delete the volume snapshot.
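
A hypothetical sketch of that driver-side wait, assuming the python-rbd bindings behave as described here (list_children() reporting clones of the currently selected snapshot); this is not the cinder rbd driver's code, just the shape of the idea:

    import time

    import rbd


    def wait_then_delete_snap(image, snap_name, timeout=60, interval=2):
        """Wait for clones of the snapshot to go away, then delete it."""
        deadline = time.time() + timeout
        image.set_snap(snap_name)        # assumption: list_children() works on the current snap
        try:
            while image.list_children() and time.time() < deadline:
                time.sleep(interval)
        finally:
            image.set_snap(None)         # back to the image head
        image.unprotect_snap(snap_name)
        image.remove_snap(snap_name)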

Revision history for this message
Matt Riedemann (mriedem) wrote :

If the backing volume delete from ceph takes longer than the default 196 seconds (that tempest waits for), then we might need to just increase the delete timeout in the ceph job (maybe to 300 seconds) until https://review.openstack.org/#/c/205282/ lands.

Revision history for this message
Matt Riedemann (mriedem) wrote :

We could also just look at setting rbd_flatten_volume_from_snapshot=True in cinder.conf for the ceph job, since if I understand that code correctly, it should disassociate the backing volume from the snapshot in the ceph backend, which should speed this up.
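
For context, a rough sketch of what flattening buys here: copying the shared data into the clone so it no longer depends on its parent snapshot, which can then be deleted straight away. The pool and image names below are illustrative (cinder names rbd volumes volume-<id>; the id here is the volume from the logs above); the rados/rbd calls are the standard python bindings.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('volumes')  # assumed cinder rbd pool name
        try:
            image = rbd.Image(ioctx, 'volume-19efe5d8-fa71-4569-b806-dfe2e0080b7f')
            try:
                image.flatten()  # detach the clone from its parent snapshot
            finally:
                image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()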

Matt Riedemann (mriedem)
Changed in nova:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

Added nova since I think this is a bug in nova.

In the test_create_ebs_image_and_check_boot test scenario:

1. create volume1 from an image
2. boot server1 from volume1 (wait for the server to be ACTIVE)
3. create snapshot from server1 (creates image and volume snapshots)
4. delete server1
5. create server2 from the image snapshot (don't wait for it to be ACTIVE)
6. delete server2 (could still be building/attaching volumes in the background)
7. cleanup

Because the image snapshot has the volume1 snapshot BDM information, when we create server2 it constructs a snapshot bdm from the image meta here:

https://github.com/openstack/nova/blob/master/nova/compute/api.py#L729

And when booting server2 we use that snapshot bdm to create volume2:

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1727

https://github.com/openstack/nova/blob/master/nova/virt/block_device.py#L506

https://github.com/openstack/nova/blob/master/nova/virt/block_device.py#L391

That's where we see this logged:

http://logs.openstack.org/20/218120/3/check/gate-tempest-dsvm-full-ceph/2349f2d/logs/screen-n-cpu.txt.gz#_2015-12-07_05_51_06_776

2015-12-07 05:51:06.776 INFO nova.virt.block_device [req-0467c440-1f5b-4a74-9ba8-d266d6b0b182 tempest-TestVolumeBootPatternV2-1635783358 tempest-TestVolumeBootPatternV2-232134174] [instance: 91817c91-6305-4e44-9f53-0eca5a27aa8d] Booting with volume None at /dev/vda

Since the snapshot bdm doesn't have a volume_id, it logs None.

This is the volume that's created though:

http://logs.openstack.org/20/218120/3/check/gate-tempest-dsvm-full-ceph/2349f2d/logs/screen-n-cpu.txt.gz#_2015-12-07_05_51_07_467

2015-12-07 05:51:07.467 DEBUG keystoneclient.session [req-0467c440-1f5b-4a74-9ba8-d266d6b0b182 tempest-TestVolumeBootPatternV2-1635783358 tempest-TestVolumeBootPatternV2-232134174] RESP: [202] Content-Length: 797 X-Compute-Request-Id: req-05b0da1b-3c95-41cd-a3c2-919b1ca9662d Connection: keep-alive Date: Mon, 07 Dec 2015 05:51:07 GMT Content-Type: application/json X-Openstack-Request-Id: req-05b0da1b-3c95-41cd-a3c2-919b1ca9662d
RESP BODY: {"volume": {"status": "creating", "user_id": "f3dc192b88054a3d80283f9696183e4a", "attachments": [], "links": [{"href": "http://127.0.0.1:8776/v2/37d3bb6fbd8b459ebd1459afe434900a/volumes/19efe5d8-fa71-4569-b806-dfe2e0080b7f", "rel": "self"}, {"href": "http://127.0.0.1:8776/37d3bb6fbd8b459ebd1459afe434900a/volumes/19efe5d8-fa71-4569-b806-dfe2e0080b7f", "rel": "bookmark"}], "availability_zone": "nova", "bootable": "false", "encrypted": false, "created_at": "2015-12-07T05:51:07.178269", "description": "", "updated_at": null, "volume_type": "ceph", "name": "", "replication_status": "disabled", "consistencygroup_id": null, "source_volid": null, "snapshot_id": "366fab34-8494-47ee-925e-1505b7521744", "multiattach": false, "metadata": {}, "id": "19efe5d8-fa71-4569-b806-dfe2e0080b7f", "size": 1}}
 _http_log_response /usr/local/lib/python2.7/dist-packages/keystoneclient/session.py:216

Since the test immediately deletes the volume after it's created (but before it's fully active), we see things in the logs like this:

2015-12-07 05:51:15.251 DEBUG nova.compute.claims [req-0467c440-1f5b-4a74-9ba8-d266...


Revision history for this message
Matt Riedemann (mriedem) wrote :

This is where we get the VolumeNotFound because we're passing bdm.volume_id of None to volume_api.delete:

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2316

And this is where we log that we're terminating the BDM, but we actually aren't, since volume_id isn't set on it:

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2342
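
A loose sketch of that failure mode (not the actual nova code; this function is a simplified stand-in for the real cleanup path): with a stale BDM whose volume_id is still None, cleanup has nothing valid to hand to cinder, so the real volume is left behind.

    def _cleanup_volumes(volume_api, context, bdms):
        """Delete the volumes behind an instance's BDMs (simplified stand-in)."""
        for bdm in bdms:
            if bdm.volume_id is None:
                # Stale BDM from the API: the attach had not yet saved the
                # volume_id, so there is nothing we know how to delete.
                continue
            try:
                volume_api.delete(context, bdm.volume_id)
            except Exception:
                # nova logs and carries on with the remaining volumes
                pass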

Revision history for this message
Matt Riedemann (mriedem) wrote :

volume2 (created from the volume snapshot) is attached to server2 here:

http://logs.openstack.org/20/218120/3/check/gate-tempest-dsvm-full-ceph/2349f2d/logs/screen-n-cpu.txt.gz#_2015-12-07_05_51_12_600

2015-12-07 05:51:12.600 DEBUG keystoneclient.session [req-0467c440-1f5b-4a74-9ba8-d266d6b0b182 tempest-TestVolumeBootPatternV2-1635783358 tempest-TestVolumeBootPatternV2-232134174] REQ: curl -g -i -X POST http://127.0.0.1:8776/v2/37d3bb6fbd8b459ebd1459afe434900a/volumes/19efe5d8-fa71-4569-b806-dfe2e0080b7f/action -H "User-Agent: python-cinderclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}3f5cbbbb6773d2637527121a8ae0faa71adbc434" -d '{"os-attach": {"instance_uuid": "91817c91-6305-4e44-9f53-0eca5a27aa8d", "mountpoint": "/dev/vda", "mode": "rw"}}' _http_log_request /usr/local/lib/python2.7/dist-packages/keystoneclient/session.py:198

We set the volume_id on the bdm here:

https://github.com/openstack/nova/blob/master/nova/virt/block_device.py#L396

And that would be updated in the database once the attach method is done.

We start deleting the instance here (1 second later):

http://logs.openstack.org/20/218120/3/check/gate-tempest-dsvm-full-ceph/2349f2d/logs/screen-n-cpu.txt.gz#_2015-12-07_05_51_13_998

But I'm still not sure why the snapshot bdm isn't updated with the volume_id in the database after we create it. Maybe we should save that off immediately, before trying to attach the volume, in case we need to tear down.
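
A hedged sketch of that "save it off immediately" idea (the helper name and the create/attach signatures below are made up for illustration, not nova's real code path): persist the freshly created volume's id on the BDM before starting the attach, so a racing terminate can still see it.

    def create_volume_then_attach(volume_api, context, bdm, instance):
        """Create a volume from the BDM's snapshot and attach it (illustrative only)."""
        volume = volume_api.create(context, bdm.volume_size, '', '',
                                   snapshot=bdm.snapshot_id)
        bdm.volume_id = volume['id']
        bdm.save()  # make the id visible to terminate_instance immediately
        volume_api.attach(context, bdm.volume_id, instance.uuid, bdm.device_name)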

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/254428

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/254428
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4f2a46987cf705d5dea84e97ef2006342cc5d9c4
Submitter: Jenkins
Branch: master

commit 4f2a46987cf705d5dea84e97ef2006342cc5d9c4
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 7 14:49:18 2015 -0800

    Make sure bdm.volume_id is set after auto-creating volumes

    The test_create_ebs_image_and_check_boot test in Tempest does the
    following:

    1. create volume1 from an image
    2. boot server1 from volume1 with delete_on_termination=True and wait
       for the server to be ACTIVE
    3. create snapshot from server1 (creates image and volume snapshots)
    4. delete server1
    5. create server2 from the image snapshot (don't wait for it to be
       ACTIVE - this auto-creates volume2 from the volume snapshot in
       cinder and attaches server2 to it)
    6. delete server2 (could still be building/attaching volumes in the
       background)
    7. cleanup

    There is a race when booting server2, which creates and attaches
    volume2, and deleting server2 before it's active.

    The volume attach completes and updates the bdm.volume_id in the DB before
    we get to _shutdown_instance, but after the delete request is in the API.
    The compute API gets potentially stale BDMs and passes those over RPC to
    the compute. So we add a check in _shutdown_instance to see if we have
    potentially stale volume BDMs and refresh that list if so.

    The instance.uuid locks in build_and_run_instance and terminate_instance
    create the mutex on the compute host such that the bdm.volume_id should
    be set in the database after the volume attach and before terminate_instance
    gets the lock. The bdm.volume_id could still be None in _shutdown_instance
    if the volume create fails, but we don't have anything to teardown in cinder
    in that case anyway.

    In the case of the race bug, deleting the volume snapshot in cinder fails
    because volume2 was never deleted by nova, so the test fails in
    teardown. Note that there is still potential for a race here, this does
    not eliminate it, but should narrow the race window.

    This also cleans up the logging in attach_block_devices since there
    may not be a volume_id at that point (depending on bdm.source_type).

    Closes-Bug: #1464259

    Change-Id: Ib60d60a5af35be89ad8afbcf44fcffe0b0ce2876
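
Paraphrasing the approach in the commit message above as a sketch (not the exact merged code; the helper name is made up, the BlockDeviceMappingList lookup is nova's standard object API): if any volume BDM handed to the compute is missing its volume_id, re-read the BDM list from the database before tearing things down.

    from nova import objects


    def _refresh_bdms_if_stale(context, instance, bdms):
        """Return a fresh BDM list when the one passed over RPC looks stale."""
        if any(bdm.is_volume and bdm.volume_id is None for bdm in bdms):
            return objects.BlockDeviceMappingList.get_by_instance_uuid(
                context, instance.uuid)
        return bdms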

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/258118

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/258695

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/258118
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e8693e995daae1ed395e6f0d895674d86b76b4c2
Submitter: Jenkins
Branch: master

commit e8693e995daae1ed395e6f0d895674d86b76b4c2
Author: Matt Riedemann <email address hidden>
Date: Tue Dec 15 11:18:49 2015 -0800

    Refresh stale volume BDMs in terminate_connection

    Change Ib60d60a5af35be89ad8afbcf44fcffe0b0ce2876 meant
    to address a race in deleting an instance while booting from
    volume such that the BDMs passed to terminate_instance from
    the API were stale (no volume_id set yet in the database).

    Without the volume_id, _shutdown_instance can't detach the
    volumes and _cleanup_volumes can't delete them.

    The change fixed the race in _shutdown_instance by refreshing
    the BDM list if necessary, but it didn't fix the problem of
    deleting the volumes in _cleanup_volumes, which is still
    getting a stale list of BDMs from the compute API.

    This moves the refresh BDM logic to the top-level method
    terminate_instance so that the refreshed BDMs are passed
    to both _shutdown_instance and _cleanup_volumes.

    There is some cleanup that can be done with
    _delete_instance but since we'd like to backport this change
    and the previous one, the cleanup is left for a later patch.

    Change-Id: Ib0980a92ebdfa9c83496cd29bfea47fc42d538c5
    Closes-Bug: #1464259

Revision history for this message
Flavio Percoco (flaper87) wrote :

Just a heads up, we hit this in Glance's stable/liberty.

http://logs.openstack.org/14/258314/2/check/gate-tempest-dsvm-full-ceph/1222c6a/console.html#_2015-12-18_03_40_06_979

2015-12-18 03:40:06.979 | Captured traceback-1:
2015-12-18 03:40:06.979 | ~~~~~~~~~~~~~~~~~~~~~
2015-12-18 03:40:06.979 | Traceback (most recent call last):
2015-12-18 03:40:06.979 | File "tempest/scenario/manager.py", line 104, in delete_wrapper
2015-12-18 03:40:06.979 | delete_thing(*args, **kwargs)
2015-12-18 03:40:06.979 | File "tempest/services/volume/base/base_volumes_client.py", line 106, in delete_volume
2015-12-18 03:40:06.979 | resp, body = self.delete("volumes/%s" % str(volume_id))
2015-12-18 03:40:06.979 | File "/opt/stack/new/tempest/.tox/all/local/lib/python2.7/site-packages/tempest_lib/common/rest_client.py", line 290, in delete
2015-12-18 03:40:06.980 | return self.request('DELETE', url, extra_headers, headers, body)
2015-12-18 03:40:06.980 | File "/opt/stack/new/tempest/.tox/all/local/lib/python2.7/site-packages/tempest_lib/common/rest_client.py", line 639, in request
2015-12-18 03:40:06.980 | resp, resp_body)
2015-12-18 03:40:06.980 | File "/opt/stack/new/tempest/.tox/all/local/lib/python2.7/site-packages/tempest_lib/common/rest_client.py", line 697, in _error_checker
2015-12-18 03:40:06.980 | raise exceptions.BadRequest(resp_body, resp=resp)
2015-12-18 03:40:06.980 | tempest_lib.exceptions.BadRequest: Bad request
2015-12-18 03:40:06.980 | Details: {u'message': u'Invalid volume: Volume still has 1 dependent snapshots.', u'code': 400}

Revision history for this message
Matt Riedemann (mriedem) wrote :

Flavio, there is a backport to stable/liberty proposed.

Revision history for this message
Matt Riedemann (mriedem) wrote :

So far since the nova change merged in master, logstash is looking better:

http://status.openstack.org/elastic-recheck/index.html#1464259

There is still a race here, but it seems to have been mitigated a bit.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/258695
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=288c87f531d8ffbaeae3923335077b1968174a22
Submitter: Jenkins
Branch: stable/liberty

commit 288c87f531d8ffbaeae3923335077b1968174a22
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 7 14:49:18 2015 -0800

    Refresh stale volume BDMs in terminate_connection

    NOTE(mriedem): This is a two-part change that is squashed for stable,
    where the second change basically moves the code from the first change
    up a level to actually fix the bug in nova, else it was only partially
    fixed.

    --

    Make sure bdm.volume_id is set after auto-creating volumes

    The test_create_ebs_image_and_check_boot test in Tempest does the
    following:

    1. create volume1 from an image
    2. boot server1 from volume1 with delete_on_termination=True and wait
       for the server to be ACTIVE
    3. create snapshot from server1 (creates image and volume snapshots)
    4. delete server1
    5. create server2 from the image snapshot (don't wait for it to be
       ACTIVE - this auto-creates volume2 from the volume snapshot in
       cinder and attaches server2 to it)
    6. delete server2 (could still be building/attaching volumes in the
       background)
    7. cleanup

    There is a race when booting server2, which creates and attaches
    volume2, and deleting server2 before it's active.

    The volume attach completes and updates the bdm.volume_id in the DB before
    we get to _shutdown_instance, but after the delete request is in the API.
    The compute API gets potentially stale BDMs and passes those over RPC to
    the compute. So we add a check in _shutdown_instance to see if we have
    potentially stale volume BDMs and refresh that list if so.

    The instance.uuid locks in build_and_run_instance and terminate_instance
    create the mutex on the compute host such that the bdm.volume_id should
    be set in the database after the volume attach and before terminate_instance
    gets the lock. The bdm.volume_id could still be None in _shutdown_instance
    if the volume create fails, but we don't have anything to teardown in cinder
    in that case anyway.

    In the case of the race bug, deleting the volume snapshot in cinder fails
    because volume2 was never deleted by nova, so the test fails in
    teardown. Note that there is still potential for a race here, this does
    not eliminate it, but should narrow the race window.

    This also cleans up the logging in attach_block_devices since there
    may not be a volume_id at that point (depending on bdm.source_type).

    Conflicts:
            nova/compute/manager.py

    NOTE(mriedem): The conflict is due to 48a07fe48 not being in liberty.

    Closes-Bug: #1464259

    Change-Id: Ib60d60a5af35be89ad8afbcf44fcffe0b0ce2876
    (cherry picked from commit 4f2a46987cf705d5dea84e97ef2006342cc5d9c4)

    *** This is the 2nd commit message:

    Refresh stale volume BDMs in terminate_connection

    Change Ib60d60a5af35be89ad8afbcf44fcffe0b0ce2876 meant
    to address a race in deleting an i...


Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/nova 12.0.1

This issue was fixed in the openstack/nova 12.0.1 release.

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0b2

This issue was fixed in the openstack/nova 13.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/289252

Changed in cinder:
assignee: Jon Bernard (jbernard) → Michal Jura (mjura)
status: Triaged → In Progress
Revision history for this message
Michal Jura (mjura) wrote :

@Emilien Macchi (emilienm): Please try this fix: https://review.openstack.org/289252

Revision history for this message
Michal Jura (mjura) wrote :

I've opened another bug report for this follow-up issue: https://bugs.launchpad.net/cinder/+bug/1554045

Michal Jura (mjura)
Changed in cinder:
status: In Progress → Fix Released
assignee: Michal Jura (mjura) → nobody
Revision history for this message
Daniel Smith (daniel-smith-v) wrote :

Hello.

Today we hit this issue on the Liberty release, with Ceph for both Glance and Cinder.

The use case was "create a cinder volume from a glance image":

cinder create --image <id> --name "MyVol" 40

The cinder volume says "available"; however, when you try to boot, you can see that the volume is in fact not finished being copied from glance into cinder (one issue).

Then, trying to delete this volume, it says deleting, then goes back to available, and in the cinder-volume logs I have the same error (ImageBusy). However, there is no indication of how to get around this; I have waited about 3 hours now, so there is no way it is still "copying" at this point.

Is there a workaround at this point?

Should I report a bug against Cinder for reporting that a volume is available (when doing glance-to-cinder volume creation) when in fact it's not?

Many thanks!
Daniel

