Nova might orphan volumes when it's racing to delete a volume-backed instance

Bug #1527623 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: ChangBo Guo(gcb)

Bug Description

Discussed in the -dev mailing list here:

http://lists.openstack.org/pipermail/openstack-dev/2015-December/082596.html

When nova deletes a volume-backed instance, it detaches the volume first here:

https://github.com/openstack/nova/blob/5508e11cf873384a28dc7416168d34e85f2c06cf/nova/compute/manager.py#L2293

And then deletes the volume here (if the delete_on_termination flag was set to True):

https://github.com/openstack/nova/blob/5508e11cf873384a28dc7416168d34e85f2c06cf/nova/compute/manager.py#L2320

The problem is that this code races: the detach is asynchronous, so nova gets back a 202 and immediately goes on to delete the volume, and the delete can fail if the volume status is not 'available' yet, as seen here:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Failed%20to%20delete%20volume%5C%22%20AND%20message:%5C%22due%20to%5C%22%20AND%20tags:%5C%22screen-n-cpu.txt%5C%22

http://logs.openstack.org/36/231936/9/check/gate-tempest-dsvm-full-lio/31de861/logs/screen-n-cpu.txt.gz?level=TRACE#_2015-12-18_13_59_16_071

2015-12-18 13:59:16.071 WARNING nova.compute.manager [req-22431c70-78da-4fea-b132-170d27177a6f tempest-TestVolumeBootPattern-196984582 tempest-TestVolumeBootPattern-290257504] Failed to delete volume: 16f9252c-4036-463b-a053-60d4f46796c1 due to Invalid input received: Invalid volume: Volume status must be available or error or error_restoring or error_extending and must not be migrating, attached, belong to a consistency group or have snapshots. (HTTP 400) (Request-ID: req-260c7d2a-d0aa-4ee1-b5a0-9b0c45f1d695)

This doesn't surface as an error in nova, because the compute manager's _delete_instance method calls _cleanup_volumes with raise_exc=False, but it orphans volumes in cinder, which then require manual cleanup on the cinder side.
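
As a rough illustration of the flow described above, here is a minimal sketch (not the actual nova code; the cinderclient-style helpers and the bdm fields are assumptions) showing where the race happens:

def _delete_bdm_volumes(cinder, bdms):
    """Sketch: detach each volume, then delete it if delete_on_termination."""
    for bdm in bdms:
        # Returns as soon as cinder accepts the request (202 Accepted); the
        # volume may still be 'detaching' or 'in-use' at this point.
        cinder.volumes.detach(bdm.volume_id)

        if bdm.delete_on_termination:
            try:
                # Races with the detach above: fails with the HTTP 400 shown
                # in the log if the volume status isn't 'available' yet.
                cinder.volumes.delete(bdm.volume_id)
            except Exception:
                # Mirrors raise_exc=False in _cleanup_volumes: the failure is
                # logged and swallowed, leaving the volume orphaned in cinder.
                pass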

Revision history for this message
Matt Riedemann (mriedem) wrote :

We could wait for the detach to complete or time out, similar to what we do with boot from volume when creating the volume and attaching it to the instance:

https://github.com/openstack/nova/blob/5508e11cf873384a28dc7416168d34e85f2c06cf/nova/compute/manager.py#L1398
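
A minimal sketch of that idea, polling the volume status with a timeout before deleting (the helper name and the cinder client object are assumptions, not nova's actual API):

import time


def wait_for_volume_available(cinder, volume_id, timeout=60, interval=1):
    """Poll cinder until the volume is 'available' or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        volume = cinder.volumes.get(volume_id)
        if volume.status == 'available':
            return
        if volume.status == 'error':
            raise RuntimeError('volume %s went into error state' % volume_id)
        time.sleep(interval)
    raise RuntimeError('timed out waiting for volume %s to detach' % volume_id)


# Usage when cleaning up a delete_on_termination volume:
#   wait_for_volume_available(cinder, bdm.volume_id)
#   cinder.volumes.delete(bdm.volume_id)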

Zhihai Song (szhsong)
Changed in nova:
assignee: nobody → Zhihai Song (szhsong)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/260339

Changed in nova:
status: Triaged → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

Note that this might re-introduce the race seen in bug 1464259 where tempest is racing to delete the volume snapshot while nova is deleting the volume associated with the snapshot. If we start waiting for the volume to be detached before we delete it, that could add just enough time for the race in tempest to show up again.

Changed in nova:
assignee: Zhihai Song (szhsong) → Chris Friesen (cbf123)
Changed in nova:
assignee: Chris Friesen (cbf123) → ChangBo Guo(gcb) (glongwave)
Changed in nova:
assignee: ChangBo Guo(gcb) (glongwave) → Swami Reddy (swamireddy)
Changed in nova:
assignee: Swami Reddy (swamireddy) → ChangBo Guo(gcb) (glongwave)
Revision history for this message
Matt Riedemann (mriedem) wrote :

I think this might have been bad triage of the issue. The os-detach API in cinder is synchronous: the cinder-api service makes a synchronous RPC call to cinder-volume to detach the volume and change the status to 'available':

https://github.com/openstack/cinder/blob/cbee6066e4ff92addf3452114f6f6be355a6ac40/cinder/volume/rpcapi.py#L217

https://github.com/openstack/cinder/blob/cbee6066e4ff92addf3452114f6f6be355a6ac40/cinder/volume/manager.py#L1363
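
To illustrate the difference this makes, here is a pure-Python stand-in (not the real cinder or oslo.messaging code; all names are hypothetical): a synchronous call blocks until the detach has finished and the status is 'available', while a fire-and-forget cast returns before the status changes.

import threading
import time

VOLUMES = {'vol-1': 'in-use'}


def detach_volume(volume_id):
    # Stand-in for cinder-volume's detach handler.
    time.sleep(0.5)                 # simulate the detach work
    VOLUMES[volume_id] = 'available'


def rpc_call(method, **kwargs):
    # Synchronous RPC: block until the remote method returns.
    return method(**kwargs)


def rpc_cast(method, **kwargs):
    # Asynchronous RPC: return immediately, work happens in the background.
    threading.Thread(target=method, kwargs=kwargs).start()


# With a call, the status is guaranteed to be 'available' afterwards.
rpc_call(detach_volume, volume_id='vol-1')
assert VOLUMES['vol-1'] == 'available'

# With a cast, a delete issued right away could still see 'in-use' and fail.
VOLUMES['vol-1'] = 'in-use'
rpc_cast(detach_volume, volume_id='vol-1')
print(VOLUMES['vol-1'])             # most likely still 'in-use' here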

I believe the problem in Tempest for TestVolumeBootPattern is that the volume still has snapshots at teardown, which makes the volume delete in nova fail because the volume snapshots aren't deleted yet.

Revision history for this message
Matt Riedemann (mriedem) wrote :

See https://review.openstack.org/#/c/565601/5 for more context - that was changed because it failed the ceph job: apparently with rbd volumes you can't delete the volume snapshots until the original volume is deleted, but the cinder API normally won't let you delete a volume while it still has snapshots, so it's a weird catch-22.

Changed in nova:
status: In Progress → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by melanie witt (<email address hidden>) on branch: master
Review: https://review.openstack.org/260339
Reason: This patch hasn't been updated in 8 months and the bug associated with it has been marked Invalid.

We think the problem is with the tempest test, and I've proposed a tempest change here: https://review.openstack.org/589648
