BDM is not deleted if an instance booted from volume fails at the scheduling stage

Bug #1583999 reported by Jiajun Liu
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Description
============

I ran some tests on boot-from-volume instances and found that an instance booted from volume sometimes fails on the evacuate operation. After some digging, I found that the evacuate operation failed because the conductor service returned the wrong block device mapping (BDM), one with no connection info. Digging further, I found BDMs that should NOT exist because they belong to a deleted instance. After more testing, I found a way to reproduce the problem.

Steps to reproduce
====================
1. Create a volume from an image (image-volume1).
2. Stop or disable all nova-compute services.
3. Boot an instance (bfv1) from the volume (image-volume1).
4. Wait for the instance to enter the ERROR state.
5. Delete the instance that was just created.
6. Look at the block_device_mapping table in the nova database; the instance's block device mapping still exists.
7. Boot another instance (bfv2) from the same volume (image-volume1).
8. Execute the evacuate operation on bfv2.
9. The evacuate operation fails and bfv2 goes into the ERROR state.
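
The check in step 6 can be sketched as a query for BDM rows that are still live while their instance is already deleted. This is only a minimal sketch against an in-memory SQLite database with a heavily simplified schema: the column subset, the convention that `deleted = 0` means "live", and the `uuid-bfv1` / `uuid-bfv2` values are assumptions for illustration, and `find_orphaned_bdms` is a hypothetical helper, not part of nova.

```python
import sqlite3


def find_orphaned_bdms(conn):
    """Return (bdm_id, instance_uuid, volume_id) for live BDM rows
    whose owning instance has already been marked as deleted."""
    query = """
        SELECT bdm.id, bdm.instance_uuid, bdm.volume_id
        FROM block_device_mapping AS bdm
        JOIN instances AS inst ON inst.uuid = bdm.instance_uuid
        WHERE bdm.deleted = 0 AND inst.deleted != 0
        ORDER BY bdm.id
    """
    return conn.execute(query).fetchall()


# Simplified stand-in for the nova database (assumed schema, not the real one).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instances (
        id INTEGER PRIMARY KEY, uuid TEXT, deleted INTEGER DEFAULT 0);
    CREATE TABLE block_device_mapping (
        id INTEGER PRIMARY KEY, instance_uuid TEXT, volume_id TEXT,
        connection_info TEXT, deleted INTEGER DEFAULT 0);

    -- bfv1 was deleted, but its BDM row was left behind (the reported bug).
    INSERT INTO instances (id, uuid, deleted) VALUES (1, 'uuid-bfv1', 1);
    INSERT INTO block_device_mapping
        (id, instance_uuid, volume_id, connection_info, deleted)
        VALUES (1, 'uuid-bfv1', 'image-volume1', NULL, 0);

    -- bfv2 is a live instance booted from the same volume.
    INSERT INTO instances (id, uuid, deleted) VALUES (2, 'uuid-bfv2', 0);
    INSERT INTO block_device_mapping
        (id, instance_uuid, volume_id, connection_info, deleted)
        VALUES (2, 'uuid-bfv2', 'image-volume1', '{}', 0);
""")

orphans = find_orphaned_bdms(conn)
print(orphans)
```

In this toy data set only the row belonging to bfv1 is reported, which corresponds to the leftover mapping described in step 6.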

Environment
============
* CentOS 7
* OpenStack Liberty

I looked at the master branch code. This bug still exists.

Anusha Unnam (anusha-unnam) wrote :

@Jiajun Liu, I couldn't reproduce this bug.

I followed the above steps in a devstack multi-node environment:
* Ubuntu
* master

1. Created a bootable volume (v1) from an image.
2. Stopped all compute services.
3. Booted an instance (test1) with the volume created (v1); the instance changed to the error state.
4. Deleted the instance.
5. Restarted the compute services and booted another instance (test2) with v1.
6. Executed evacuate on test2 and everything worked as expected. I didn't get the error.

Jiajun Liu (ljjjustin) wrote :

@Anusha, could you have a look at the database after step 4 to check whether test1's block device mappings are deleted? I think that's possible.

In the liberty branch, when nova-compute receives an evacuate operation, it calls get_by_volume_id to get the instance's block device mapping. However, this function returns just one BDM matching that volume_id. If multiple BDMs share the same volume_id (for example, a stale BDM left over from a deleted instance), the wrong one can be returned, which causes the volume detach to fail. You can look at the source code: https://github.com/openstack/nova/blob/stable/liberty/nova/compute/manager.py#L4713

In the master branch, the implementation has changed a bit. nova-compute calls get_by_volume_and_instance, which matches both volume_id and instance_uuid. So in your step 6 it can get the right BDM even if test1's BDM is not deleted. You can look at the source code: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4627
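
The difference between the two lookups can be illustrated with a toy in-memory model. This is only a sketch: the `BDM` dataclass and the two helper functions are hypothetical stand-ins named after the nova methods mentioned above, not nova's actual implementation. The assumption that the stale row happens to be returned first, the `uuid-test1` / `uuid-test2` values, and the `connection_info` placeholder are all for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BDM:
    instance_uuid: str
    volume_id: str
    connection_info: Optional[str]  # None models a BDM that was never connected


def get_by_volume_id(bdms: List[BDM], volume_id: str) -> Optional[BDM]:
    """Toy lookup keyed on volume_id only: returns the first match."""
    for bdm in bdms:
        if bdm.volume_id == volume_id:
            return bdm
    return None


def get_by_volume_and_instance(
    bdms: List[BDM], volume_id: str, instance_uuid: str
) -> Optional[BDM]:
    """Toy lookup keyed on both volume_id and instance_uuid."""
    for bdm in bdms:
        if bdm.volume_id == volume_id and bdm.instance_uuid == instance_uuid:
            return bdm
    return None


# A stale BDM from the deleted instance (test1) sits next to the live one (test2).
table = [
    BDM("uuid-test1", "v1", None),
    BDM("uuid-test2", "v1", '{"placeholder": "connection data"}'),
]

by_volume = get_by_volume_id(table, "v1")
by_both = get_by_volume_and_instance(table, "v1", "uuid-test2")

print(by_volume.instance_uuid, by_volume.connection_info)
print(by_both.instance_uuid, by_both.connection_info)
```

In this model the volume-only lookup hands back test1's leftover mapping with no connection info when evacuating test2, while the lookup that also matches instance_uuid returns test2's own mapping.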

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/319725

Changed in nova:
assignee: nobody → Jiajun Liu (ljjjustin)
status: New → In Progress
Wei Wang (damon-devops)
description: updated
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/319725
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Anusha Unnam (anusha-unnam) wrote :

The patch submitted for this bug was abandoned, so I am removing the assignee and changing the status from In Progress to New.

Changed in nova:
assignee: Jiajun Liu (ljjjustin) → nobody
status: In Progress → New
Anusha Unnam (anusha-unnam) wrote :

@Jiajun Liu,
I looked at the block_device_mapping table in the database after step 4 and checked test1's block device mapping; it is deleted. But this is on master. I didn't check on liberty.
One question: do we need shared storage in a multi-node environment to do the evacuate operation?
Can you paste the logs, if possible?

Sean Dague (sdague) wrote :

Open question from 6 months ago, marking as Incomplete

Changed in nova:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired