ceph: TestVolumeBootPattern.test_create_ebs_image_and_check_boot times out failing to delete volume because its snapshot is busy

Bug #1648885 reported by Matt Riedemann
This bug affects 6 people
Affects: Cinder
Status: In Progress
Importance: Undecided
Assigned to: Jon Bernard

Bug Description

Seen in the ceph job here:

http://logs.openstack.org/60/406260/11/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/bd86755/console.html#_2016-12-09_19_33_41_200131

2016-12-09 19:33:41.200131 | tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_create_ebs_image_and_check_boot[compute,id-36c34c67-7b54-4b59-b188-02a2f458a63b,image,volume]
2016-12-09 19:33:41.200170 | ------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-12-09 19:33:41.200185 |
2016-12-09 19:33:41.200196 | Captured traceback:
2016-12-09 19:33:41.200206 | ~~~~~~~~~~~~~~~~~~~
2016-12-09 19:33:41.200218 | Traceback (most recent call last):
2016-12-09 19:33:41.200241 | File "tempest/lib/common/rest_client.py", line 862, in wait_for_resource_deletion
2016-12-09 19:33:41.200257 | raise exceptions.TimeoutException(message)
2016-12-09 19:33:41.200274 | tempest.lib.exceptions.TimeoutException: Request timed out
2016-12-09 19:33:41.200309 | Details: (TestVolumeBootPattern:_run_cleanups) Failed to delete volume-snapshot 2415b67c-2cb9-4427-96b5-f581a4b4e48c within the required time (196 s).
2016-12-09 19:33:41.200315 |
2016-12-09 19:33:41.200320 |
2016-12-09 19:33:41.200329 | Captured traceback-1:
2016-12-09 19:33:41.200337 | ~~~~~~~~~~~~~~~~~~~~~
2016-12-09 19:33:41.200349 | Traceback (most recent call last):
2016-12-09 19:33:41.200372 | File "tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc
2016-12-09 19:33:41.200385 | return func(*args, **kwargs)
2016-12-09 19:33:41.200407 | File "tempest/lib/services/volume/v2/volumes_client.py", line 90, in delete_volume
2016-12-09 19:33:41.200424 | resp, body = self.delete("volumes/%s" % volume_id)
2016-12-09 19:33:41.200442 | File "tempest/lib/common/rest_client.py", line 307, in delete
2016-12-09 19:33:41.200461 | return self.request('DELETE', url, extra_headers, headers, body)
2016-12-09 19:33:41.200479 | File "tempest/lib/common/rest_client.py", line 664, in request
2016-12-09 19:33:41.200492 | self._error_checker(resp, resp_body)
2016-12-09 19:33:41.200512 | File "tempest/lib/common/rest_client.py", line 766, in _error_checker
2016-12-09 19:33:41.200527 | raise exceptions.BadRequest(resp_body, resp=resp)
2016-12-09 19:33:41.200542 | tempest.lib.exceptions.BadRequest: Bad request
2016-12-09 19:33:41.200592 | Details: {u'code': 400, u'message': u'Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group or have snapshots.'}

From the cinder-volume logs:

http://logs.openstack.org/60/406260/11/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/bd86755/logs/screen-c-vol.txt.gz?level=TRACE#_2016-12-09_19_14_00_798

2016-12-09 19:14:00.798 ERROR cinder.volume.manager [req-0de4aa10-cc59-4987-af55-33eb3492e0a9 tempest-TestVolumeBootPattern-1554064796] [snapshot-2415b67c-2cb9-4427-96b5-f581a4b4e48c] Delete snapshot failed, due to snapshot busy.

We likely have a race here, so we'd need to trace the requests from the failure and see what's blowing up and in what order. Possibly we need to retry the snapshot delete until it's no longer busy, or time out in c-vol.
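
To illustrate the retry idea, here is a minimal sketch of a retry-with-timeout wrapper around the busy snapshot delete. The wrapper itself is hypothetical and does not exist in Cinder; the only real behaviour it relies on is the driver raising SnapshotIsBusy, which is what produces the "Delete snapshot failed, due to snapshot busy" message above.

import time

from cinder import exception


def delete_snapshot_with_retry(driver, snapshot, timeout=60, interval=2):
    """Hypothetical sketch: retry a busy snapshot delete until it succeeds
    or the timeout expires.
    """
    deadline = time.time() + timeout
    while True:
        try:
            driver.delete_snapshot(snapshot)
            return
        except exception.SnapshotIsBusy:
            if time.time() >= deadline:
                # Give up and let the caller log the failure, as the
                # volume manager does today.
                raise
            time.sleep(interval)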

Tags: ceph rbd
Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: added: rbd
Matt Riedemann (mriedem)
Changed in cinder:
status: New → Confirmed
Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

Please do not exclude this test.
It is base functionality for the ec2-api project.

Jon Bernard (jbernard)
Changed in cinder:
assignee: nobody → Jon Bernard (jbernard)
Revision history for this message
Jon Bernard (jbernard) wrote :

Andrey, don't worry, we'll find a solution without excluding it.

From what I've gathered so far, the test fails in cleanup when trying to delete a volume, and that volume has one or more snapshots. Those snapshots have previously received delete requests, so I suspect they're in a 'deleting' state, but since delete_snapshot() is async, the volume delete is winning the race and thus we see a failure here. I need to dig a bit more to be sure. Comments and suggestions are welcome.
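
As a rough sketch of the test-side alternative (not a proposed patch, just illustrating the ordering), the cleanup could wait for each snapshot to actually disappear before issuing the volume delete, using the same tempest client helpers that appear in the traceback above:

from tempest.lib.common.utils import test_utils


def cleanup_volume_with_snapshots(volumes_client, snapshots_client,
                                  volume_id, snapshot_ids):
    # Delete every snapshot first and block until the backend has really
    # removed it, so the volume delete cannot race with an in-flight
    # (asynchronous) snapshot delete.
    for snap_id in snapshot_ids:
        test_utils.call_and_ignore_notfound_exc(
            snapshots_client.delete_snapshot, snap_id)
        snapshots_client.wait_for_resource_deletion(snap_id)

    # Only now is it safe to delete the volume itself.
    test_utils.call_and_ignore_notfound_exc(
        volumes_client.delete_volume, volume_id)
    volumes_client.wait_for_resource_deletion(volume_id)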

Revision history for this message
Jon Bernard (jbernard) wrote :

There is also a chance that the snapshot is 'busy' because it has dependent volumes - so this patch [1] may address this issue.

1: https://review.openstack.org/#/c/281550/

Revision history for this message
Feodor Tersin (ftersin) wrote :

Right. The volume is deleted by Nova since it was attached with the delete_on_termination flag, and Nova does not wait for the volume deletion to finish. So when Tempest's _run_cleanups comes to delete the snapshot, the snapshot may be 'busy'.
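
A sketch of what waiting for that volume deletion could look like on the test side (client method names are the standard tempest compute/volume ones; this only illustrates the ordering, it is not an actual patch):

def delete_server_and_wait_for_volume(servers_client, volumes_client,
                                      server_id, volume_id):
    # Terminating the instance makes Nova issue the volume delete (the
    # volume was attached with delete_on_termination=True), but Nova
    # returns before that delete has finished.
    servers_client.delete_server(server_id)

    # Block until Cinder reports the volume gone; only then will a
    # snapshot delete no longer find an RBD image depending on it.
    volumes_client.wait_for_resource_deletion(volume_id)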

Changed in cinder:
status: Confirmed → In Progress
Revision history for this message
Feodor Tersin (ftersin) wrote :

Even when waiting for the instance to reach the 'active' state, the problem still reproduces. See the logs on https://review.openstack.org/#/c/410338/6.

Look at the c-vol trace fragment below. Here req-b895f42e-870c-4d30-a98a-ce0efaf837e7 corresponds to the delete volume operation (called via instance termination), and req-c615ee35-7e0c-4490-8e57-1339bcc8a970 corresponds to the delete snapshot operation (called via image deletion during test cleanup).

--- volume deletion started ---
2016-12-14 11:40:20.931 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:26.194 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] volume has no backup snaps _delete_backup_snaps /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:528
2016-12-14 11:40:26.267 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] deleting rbd volume volume-e0e726f4-f7a2-4e89-8de5-307f02639a18 delete_volume /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:649
--- snapshot deletion started ---
2016-12-14 11:40:26.275 DEBUG cinder.volume.drivers.rbd [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:26.382 INFO cinder.volume.drivers.rbd [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] Image volumes/volume-e0e726f4-f7a2-4e89-8de5-307f02639a18 is dependent on the snapshot snapshot-af5114b1-1096-45df-b007-fec7558b7779.
2016-12-14 11:40:31.177 ERROR cinder.volume.manager [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] [snapshot-af5114b1-1096-45df-b007-fec7558b7779] Delete snapshot failed, due to snapshot busy.
--- ^^^ snapshot deletion failed ---
--- volume deletion continued ---
2016-12-14 11:40:31.772 DEBUG cinder.quota [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] Created reservations ['cc95a20b-d1a3-4663-9776-32196739384b', '781df0f0-a697-40ff-88c1-27c33a66b3d1', '05042d6a-15dc-42c7-b3d4-cd7004b7cac7', 'e7616d2e-1e5a-41a5-b796-fe964ea1c214'] reserve /opt/stack/new/cinder/cinder/quota.py:1025
2016-12-14 11:40:31.791 DEBUG cinder.volume.utils [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] Can not find volume e0e726f4-f7a2-4e89-8de5-307f02639a18 at notify usage _usage_from_volume /opt/stack/new/cinder/cinder/volume/utils.py:96
2016-12-14 11:40:31.806 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:31.834 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to...

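As a side note, the dependency the driver reports above ("Image volumes/volume-... is dependent on the snapshot ...") can be confirmed by listing the snapshot's clones with the python rados/rbd bindings. This is only a standalone diagnostic sketch; the pool, volume, and snapshot names are taken from the log above:

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('volumes')
    try:
        image = rbd.Image(ioctx,
                          'volume-e0e726f4-f7a2-4e89-8de5-307f02639a18',
                          snapshot='snapshot-af5114b1-1096-45df-b007-fec7558b7779',
                          read_only=True)
        try:
            # A non-empty result means a clone (here, the volume whose
            # deletion is still in flight) depends on the snapshot, which
            # is exactly the condition reported as "snapshot busy".
            print(image.list_children())
        finally:
            image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()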

Revision history for this message
Jon Bernard (jbernard) wrote :

Here we see both volume and snapshot deletion operations begin, and then we see snapshot deletion fail before volume deletion completes. Given the current logic, this is expected behaviour. So we have two immediate choices:

1. Postpone the snapshot deletion in the driver until the final volume finishes deletion
2. Change the tempest test to not submit both operations concurrently.

A third option would be:

3. Change the cinder API to reject busy snapshot delete requests.

This one has a larger impact and will require project (cinder) agreement. I think it all depends on how many patches are being affected by this issue.
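
For option 1, a minimal sketch of what the driver-side wait could look like, using the python rbd bindings (illustrative only; the helper below is hypothetical and is not the patch that was ultimately proposed):

import time


def wait_until_snapshot_unreferenced(image, timeout=120, interval=2):
    """Poll an rbd.Image opened at the snapshot until no clone depends on it.

    ``image`` is assumed to be an rbd.Image opened with snapshot= set to the
    snapshot in question.  Returns True once list_children() is empty (i.e.
    the dependent volume's deletion has completed), or False if the timeout
    expires and the snapshot should still be treated as busy.
    """
    deadline = time.time() + timeout
    while image.list_children():
        if time.time() >= deadline:
            return False
        time.sleep(interval)
    return True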

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote : Bug Assignee Expired

Unassigning due to no activity for > 6 months.

Changed in cinder:
assignee: Jon Bernard (jbernard) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by Sean McGinnis (<email address hidden>) on branch: master
Review: https://review.openstack.org/281550
Reason: No updates in quite awhile. Feel free to restore and update if this is still needed.

Revision history for this message
Jay Bryant (jsbryant) wrote :

We discussed this bug in our bug triage time during the Cinder meeting today: http://eavesdrop.openstack.org/meetings/cinder/2018/cinder.2018-09-19-16.00.log.html

This is an old bug that hasn't been addressed in quite some time, and it doesn't appear that we are still seeing this problem, so we have decided to close it out. It can be re-opened, or a new bug can be filed, if necessary.

Changed in cinder:
status: In Progress → Invalid
Changed in cinder:
assignee: nobody → Jon Bernard (jbernard)
status: Invalid → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Eric Harney <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/281550
Reason: This is superseded by https://review.opendev.org/c/openstack/cinder/+/754397
