OpenStack Compute (nova)

BDM is not deleted if an instance booted from volume and failed on schedule stage

Bug #1583999 reported by Jiajun Liu on 2016-05-20

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Expired	Undecided	Unassigned

Bug Description

Description
============

I ran some tests on boot-from-volume instances and found that the evacuate operation sometimes fails for such an instance. After some digging, I found that evacuate failed because the conductor service returned the wrong block device mapping (BDM), one with no connection info. Digging further, I found BDMs that should NOT exist because they belong to a deleted instance. After more testing, I found a way to reproduce the problem.

Steps to reproduce
====================
1, create a volume from an image (image-volume1)
2, stop or disable all nova-compute services
3, boot an instance (bfv1) from the volume (image-volume1)
4, wait until the instance enters the ERROR state
5, delete the instance that was just created
6, look at the block_device_mapping table of the nova database and find that the instance's block device mapping still exists
7, boot another instance (bfv2) from the volume (image-volume1)
8, execute the evacuate operation on bfv2
9, the evacuate operation fails and bfv2 enters the ERROR state.
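
The check in step 6 can be sketched as a query for BDM rows that are still live although their instance has been deleted. This is only an illustration and does not touch a real nova database: it builds an in-memory SQLite stand-in with a heavily reduced schema, and the helper names `find_orphaned_bdms` and `build_demo_db`, the UUID strings, and the connection info value are all assumptions made for the example. It also assumes soft deletion, where `deleted = 0` marks a live row.

```python
import sqlite3


def find_orphaned_bdms(conn):
    """Return (bdm_id, instance_uuid, volume_id) for live BDM rows
    whose owning instance is already marked as deleted."""
    query = """
        SELECT bdm.id, bdm.instance_uuid, bdm.volume_id
        FROM block_device_mapping AS bdm
        JOIN instances AS i ON i.uuid = bdm.instance_uuid
        WHERE bdm.deleted = 0 AND i.deleted != 0
        ORDER BY bdm.id
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Build a simplified, hypothetical stand-in for the two nova tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE instances (
            id INTEGER PRIMARY KEY, uuid TEXT, deleted INTEGER DEFAULT 0);
        CREATE TABLE block_device_mapping (
            id INTEGER PRIMARY KEY, instance_uuid TEXT, volume_id TEXT,
            connection_info TEXT, deleted INTEGER DEFAULT 0);
        """
    )
    # bfv1 was deleted after it failed to schedule; bfv2 is still live.
    conn.executemany(
        "INSERT INTO instances (id, uuid, deleted) VALUES (?, ?, ?)",
        [(1, "bfv1-uuid", 1), (2, "bfv2-uuid", 0)],
    )
    # The BDM row of bfv1 was left behind (deleted = 0, no connection info).
    conn.executemany(
        "INSERT INTO block_device_mapping "
        "(id, instance_uuid, volume_id, connection_info, deleted) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (1, "bfv1-uuid", "image-volume1", None, 0),
            (2, "bfv2-uuid", "image-volume1", '{"driver_volume_type": "iscsi"}', 0),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    for row in find_orphaned_bdms(build_demo_db()):
        print(row)
```

In this model only the row that belongs to bfv1 is reported, which matches the leftover mapping described in step 6.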

Environment
============
* CentOS 7
* OpenStack Liberty

I looked at the master branch code, and this bug still exists there.

Anusha Unnam (anusha-unnam) wrote on 2016-05-20:

@Jiajun Liu, I couldn't reproduce this bug.

I followed the above steps in a devstack multi-node environment:
* Ubuntu
* master

1. Created a bootable volume (v1) from an image.
2. Stopped all compute services.
3. Booted an instance (test1) from the volume created (v1); the instance changed to the ERROR state.
4. Deleted the instance.
5. Restarted the compute services and booted another instance (test2) from v1.
6. Executed evacuate on test2 and everything worked as expected. I didn't get the error.

Jiajun Liu (ljjjustin) wrote on 2016-05-23:

@Anusha, could you look at the database after step 4 to check whether test1's block device mappings are deleted? I think that's possible.

In the liberty branch, when nova-compute receives an evacuate operation, it calls get_by_volume_id to get the instance's block device mapping. However, this function returns just one BDM matching that volume_id. If we have multiple BDMs with the same volume_id and instance_uuid, this will be a problem and will cause the volume detach to fail. You can look at the source code: https://github.com/openstack/nova/blob/stable/liberty/nova/compute/manager.py#L4713

In the master branch, the implementation has changed a bit. nova-compute calls get_by_volume_and_instance, which matches both volume_id and instance_uuid. So, in your step 6, it can get the right BDM even if test1's BDM is not deleted. You can look at the source code: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4627
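
The difference between the two lookups can be shown with a small in-memory sketch. These functions are simplified stand-ins that only borrow the names mentioned above; they are not nova's real implementation. The `BDM` dataclass, the UUID strings, and the row order are assumptions made for illustration, in particular that the stale row of the deleted instance happens to be returned first.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BDM:
    instance_uuid: str
    volume_id: str
    connection_info: Optional[str]


def get_by_volume_id(bdms: List[BDM], volume_id: str) -> Optional[BDM]:
    """Liberty-style lookup (simplified): first row matching the volume only."""
    for bdm in bdms:
        if bdm.volume_id == volume_id:
            return bdm
    return None


def get_by_volume_and_instance(
    bdms: List[BDM], volume_id: str, instance_uuid: str
) -> Optional[BDM]:
    """Master-style lookup (simplified): match both volume and instance."""
    for bdm in bdms:
        if bdm.volume_id == volume_id and bdm.instance_uuid == instance_uuid:
            return bdm
    return None


# Hypothetical table contents: test1 was deleted but its BDM was left behind.
ROWS = [
    BDM("test1-uuid", "v1", None),
    BDM("test2-uuid", "v1", '{"driver_volume_type": "iscsi"}'),
]

if __name__ == "__main__":
    stale = get_by_volume_id(ROWS, "v1")
    right = get_by_volume_and_instance(ROWS, "v1", "test2-uuid")
    print("volume-only lookup:", stale)
    print("volume+instance lookup:", right)
```

Under these assumptions the volume-only lookup hands back test1's leftover mapping with no connection info, while the lookup keyed on both fields returns the mapping that belongs to test2.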

OpenStack Infra (hudson-openstack) wrote on 2016-05-23: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/319725

Changed in nova:
assignee:	nobody → Jiajun Liu (ljjjustin)
status:	New → In Progress

Wei Wang (damon-devops) on 2016-09-01

description:	updated

OpenStack Infra (hudson-openstack) wrote on 2016-10-25: Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/319725
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Anusha Unnam (anusha-unnam) wrote on 2016-11-17:

The patch submitted for this bug was abandoned, so I am removing the assignee and changing the status from In Progress to New.

Changed in nova:
assignee:	Jiajun Liu (ljjjustin) → nobody
status:	In Progress → New

Anusha Unnam (anusha-unnam) wrote on 2016-11-17:

@Jiajun Liu,
I looked at the block_device_mapping table in the database after step 4 and checked test1's block device mapping: it is deleted. But this is on master; I didn't check on liberty.
One question: do we need shared storage in a multi-node environment to do the evacuate operation?
Can you paste the logs if possible?

Sean Dague (sdague) wrote on 2017-07-25:

Open question from 6 months ago, marking as Incomplete

Changed in nova:
status:	New → Incomplete

Launchpad Janitor (janitor) wrote on 2017-09-24:

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status:	Incomplete → Expired
