Bug #1648885 “ceph: TestVolumeBootPattern.test_create_ebs_image_...” : Bugs : Cinder

Revision history for this message

Matt Riedemann (mriedem) wrote on 2016-12-09:

#1

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22TestVolumeBootPattern%5C%22%20AND%20message%3A%5C%22Delete%20snapshot%20failed%2C%20due%20to%20snapshot%20busy.%5C%22%20AND%20tags%3A%5C%22screen-c-vol.txt%5C%22&from=7d

5 hits in the last 7 days, check and gate, all failures and all in the ceph job.

tags:

added: rbd

Matt Riedemann (mriedem) on 2016-12-12

Changed in cinder:
status:	New → Confirmed

Revision history for this message

Ian Cordasco (icordasc) wrote on 2016-12-12:

#2

In case some extra logs come in handy: http://logs.openstack.org/98/358498/1/gate/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/9db3e69/console.html (Seen in glance on 2016-12-12)

Revision history for this message

Andrey Pavlov (apavlov-e) wrote on 2016-12-13:

#3

please do not exclude this test.
this is a base functionality for ec2-api project.

Jon Bernard (jbernard) on 2016-12-13

Changed in cinder:
assignee:	nobody → Jon Bernard (jbernard)

Revision history for this message

Jon Bernard (jbernard) wrote on 2016-12-13:

#4

Andrey, don't worry we'll find a solution without excluding.

From what I've gathered so far, the test fails in cleanup when trying to delete a volume, and that volume has one or more snapshots. Those shapshots have previously received delete requests, so I suspect they're in a 'deleteting' state, but since delete_snapshot() is async the volume delete is winning the race and thus we see a failure here. I need to dig a bit more to be sure. Comments and suggestions are welcome.

Revision history for this message

Jon Bernard (jbernard) wrote on 2016-12-13:

#5

There is also a chance that the snapshot is 'busy' because it has dependent volumes - so this patch [1] may address this issue.

1: https://review.openstack.org/#/c/281550/

Revision history for this message

Feodor Tersin (ftersin) wrote on 2016-12-13:

#6

Right. The volume is deleted by Nova since it has been attached with delete_on_termination flag. And Nova does not wait for the end of the volume deletion. So when Tempest's _run_cleanup come to delete the snapshot, it may be 'busy'.

OpenStack Infra (hudson-openstack) on 2016-12-13

Changed in cinder:
status:	Confirmed → In Progress

Revision history for this message

Feodor Tersin (ftersin) wrote on 2016-12-14:

#7

Download full text (5.6 KiB)

Even with waiting for 'active' state of the instance, the problem still reproduces. See logs on https://review.openstack.org/#/c/410338/6.

Look at c-vol trace fragment. Here req-b895f42e-870c-4d30-a98a-ce0efaf837e7 correponds to delete volume operation (called via instance termination), and req-c615ee35-7e0c-4490-8e57-1339bcc8a970 correponds to delete snapshot operation (called via image deletion during test cleanup).

--- volume deletion started ---
2016-12-14 11:40:20.931 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:26.194 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] volume has no backup snaps _delete_backup_snaps /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:528
2016-12-14 11:40:26.267 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] deleting rbd volume volume-e0e726f4-f7a2-4e89-8de5-307f02639a18 delete_volume /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:649
--- snapshot deletion started ---
2016-12-14 11:40:26.275 DEBUG cinder.volume.drivers.rbd [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:26.382 INFO cinder.volume.drivers.rbd [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] Image volumes/volume-e0e726f4-f7a2-4e89-8de5-307f02639a18 is dependent on the snapshot snapshot-af5114b1-1096-45df-b007-fec7558b7779.
2016-12-14 11:40:31.177 ERROR cinder.volume.manager [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] [snapshot-af5114b1-1096-45df-b007-fec7558b7779] Delete snapshot failed, due to snapshot busy.
--- ^^^ snapshot deletion failed ---
--- volume deletion continued ---
2016-12-14 11:40:31.772 DEBUG cinder.quota [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] Created reservations ['cc95a20b-d1a3-4663-9776-32196739384b', '781df0f0-a697-40ff-88c1-27c33a66b3d1', '05042d6a-15dc-42c7-b3d4-cd7004b7cac7', 'e7616d2e-1e5a-41a5-b796-fe964ea1c214'] reserve /opt/stack/new/cinder/cinder/quota.py:1025
2016-12-14 11:40:31.791 DEBUG cinder.volume.utils [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] Can not find volume e0e726f4-f7a2-4e89-8de5-307f02639a18 at notify usage _usage_from_volume /opt/stack/new/cinder/cinder/volume/utils.py:96
2016-12-14 11:40:31.806 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:31.834 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to...

Even with waiting for 'active' state of the instance, the problem still reproduces. See logs on https://review.openstack.org/#/c/410338/6.

Look at c-vol trace fragment. Here req-b895f42e-870c-4d30-a98a-ce0efaf837e7 correponds to delete volume operation (called via instance termination), and req-c615ee35-7e0c-4490-8e57-1339bcc8a970 correponds to delete snapshot operation (called via image deletion during test cleanup).

--- volume deletion started ---
2016-12-14 11:40:20.931 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:26.194 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] volume has no backup snaps _delete_backup_snaps /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:528
2016-12-14 11:40:26.267 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] deleting rbd volume volume-e0e726f4-f7a2-4e89-8de5-307f02639a18 delete_volume /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:649
--- snapshot deletion started ---
2016-12-14 11:40:26.275 DEBUG cinder.volume.drivers.rbd [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:26.382 INFO cinder.volume.drivers.rbd [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] Image volumes/volume-e0e726f4-f7a2-4e89-8de5-307f02639a18 is dependent on the snapshot snapshot-af5114b1-1096-45df-b007-fec7558b7779.
2016-12-14 11:40:31.177 ERROR cinder.volume.manager [req-c615ee35-7e0c-4490-8e57-1339bcc8a970 tempest-TestVolumeBootPattern-1496098983] [snapshot-af5114b1-1096-45df-b007-fec7558b7779] Delete snapshot failed, due to snapshot busy.
--- ^^^ snapshot deletion failed ---
--- volume deletion continued ---
2016-12-14 11:40:31.772 DEBUG cinder.quota [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] Created reservations ['cc95a20b-d1a3-4663-9776-32196739384b', '781df0f0-a697-40ff-88c1-27c33a66b3d1', '05042d6a-15dc-42c7-b3d4-cd7004b7cac7', 'e7616d2e-1e5a-41a5-b796-fe964ea1c214'] reserve /opt/stack/new/cinder/cinder/quota.py:1025
2016-12-14 11:40:31.791 DEBUG cinder.volume.utils [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] Can not find volume e0e726f4-f7a2-4e89-8de5-307f02639a18 at notify usage _usage_from_volume /opt/stack/new/cinder/cinder/volume/utils.py:96
2016-12-14 11:40:31.806 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:31.834 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:31.866 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:37.132 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:37.283 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:37.373 DEBUG cinder.volume.drivers.rbd [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] opening connection to ceph cluster (timeout=-1). _connect_to_rados /opt/stack/new/cinder/cinder/volume/drivers/rbd.py:220
2016-12-14 11:40:37.565 DEBUG cinder.manager [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] Notifying Schedulers of capabilities ... _publish_service_capabilities /opt/stack/new/cinder/cinder/manager.py:175
2016-12-14 11:40:37.566 DEBUG oslo_messaging._drivers.amqpdriver [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] CAST unique_id: 66c1896ea93846e392f02b6b7b0e7ad2 FANOUT topic 'cinder-scheduler' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:431
2016-12-14 11:40:37.567 3244 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 4aa5d1fd63814219a04616049abdbcae reply to reply_611c80901b90436ea64b638bdde71b34 __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:194
2016-12-14 11:40:37.596 DEBUG oslo_messaging._drivers.amqpdriver [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] CAST unique_id: 938160ea81f24ed8964076a0205d5e30 exchange 'openstack' topic 'cinder-scheduler' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:442
2016-12-14 11:40:37.609 INFO cinder.volume.manager [req-b895f42e-870c-4d30-a98a-ce0efaf837e7 tempest-TestVolumeBootPattern-1496098983] [volume-e0e726f4-f7a2-4e89-8de5-307f02639a18] Deleted volume successfully.
--- volume deletion finished successfully ---

Revision history for this message

Jon Bernard (jbernard) wrote on 2016-12-14:

#8

Here we see both volume and snapshot deletion operations begin, and then we see snapshot deletion fail before volume deletion completes. Given the current logic, this is expected behaviour. So we have two immediate choices:

1. Postpone the snapshot deletion in the driver until the final volume finishes deletion
2. Change the tempest test to not submit both operations concurrently.

The third options would be:

3. Change the cinder API to reject busy snapshot delete requests.

This one has a larger impact and will require project (cinder) agreement. I think it all depends on how many patches are being effected by this issue.

Revision history for this message

Lee Yarwood (lyarwood) wrote on 2017-01-04:

#9

Starting to see this against the stable branches for Nova for some reason with 6 hits in the last two days :

http://logstash.openstack.org/#dashboard/file/logstash.json?query=(message%3A%5C%22TestVolumeBootPatternV2%5C%22%20OR%20message%3A%5C%22TestVolumeBootPattern%5C%22)%20AND%20message%3A%5C%22Delete%20snapshot%20failed%2C%20due%20to%20snapshot%20busy.%5C%22%20AND%20tags%3A%5C%22screen-c-vol.txt%5C%22

For example :

http://logs.openstack.org/26/408826/2/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/51c28af/logs/screen-c-vol.txt.gz?level=ERROR#_2017-01-04_16_43_57_603

That appears to be due to the following calls racing as previously discussed in this bug :

req-0efd714f

http://logs.openstack.org/26/408826/2/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/51c28af/logs/screen-c-api.txt.gz#_2017-01-04_16_43_56_203

req-63845b69

http://logs.openstack.org/26/408826/2/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/51c28af/logs/screen-c-api.txt.gz#_2017-01-04_16_43_56_945

Any idea why we are not seeing this against master at the moment?

Revision history for this message

Sean McGinnis (sean-mcginnis) wrote on 2017-09-01: Bug Assignee Expired

#10

Unassigning due to no activity for > 6 months.

Changed in cinder:
assignee:	Jon Bernard (jbernard) → nobody

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-04-14: Change abandoned on cinder (master)

#11

Change abandoned by Sean McGinnis (<email address hidden>) on branch: master
Review: https://review.openstack.org/281550
Reason: No updates in quite awhile. Feel free to restore and update if this is still needed.

Revision history for this message

Jay Bryant (jsbryant) wrote on 2018-09-20:

#12

We discussed this bug in our bug triage time during the Cinder meeting today: http://eavesdrop.openstack.org/meetings/cinder/2018/cinder.2018-09-19-16.00.log.html

This is an old bug that hasn't been addressed in quite some time and it doesn't appear that we are still seeing this problem. So, we have decided to close this out. It can be re-opened if necessary or a new bug can be created if necessary.

Changed in cinder:
status:	In Progress → Invalid

OpenStack Infra (hudson-openstack) on 2019-05-20

Changed in cinder:
assignee:	nobody → Jon Bernard (jbernard)
status:	Invalid → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-08-19:

#13

Change abandoned by "Eric Harney <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/281550
Reason: This is superseded by https://review.opendev.org/c/openstack/cinder/+/754397

Cinder

ceph: TestVolumeBootPattern.test_create_ebs_image_and_check_boot times out failing to delete volume due to snapshot is busy

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches