VM re-scheduler mechanism will cause BDM-volumes conflict

Bug #1195947 reported by wingwj
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
wingwj
Havana
Fix Released
High
Nikola Đipanov

Bug Description

Due to the re-scheduler mechanism, when a user tries to
 create (in error) an instance using a volume
 which is already in use by another instance,
the error is correctly detected, but the recovery code
 will incorrectly affect the original instance.

The exception needs to be raised directly when this situation occurs.

------------------------
------------------------
We can create VM1 with BDM volumes (for example, one volume, which we will call “Vol-1”).

But when the already-attached volume (Vol-1) is passed in the BDM parameters to create a new VM2, the VM re-scheduler mechanism causes the volume to be re-attached to the new VM2 in both Nova & Cinder, instead of raising an “InvalidVolume” exception saying “Vol-1 is already attached to VM1”.

In fact, Vol-1 ends up attached to both VM1 and VM2 on the hypervisor. But when you make changes to Vol-1 from VM1, you cannot see any corresponding changes on VM2…

I reproduced it and documented the steps. Please check the attachment for details~

-------------------------
I checked the Nova code; the problem is caused by the VM re-scheduler mechanism:

Nova checks the state of BDM volumes in Cinder [def _setup_block_device_mapping() in manager.py]. If any volume is “in-use”, the request fails and triggers a VM re-schedule.

According to the existing process in Nova, before re-scheduling it will shut down the VM and detach all BDM volumes in Cinder as a rollback [def _shutdown_instance() in manager.py]. As a result, the state of Vol-1 changes from “in-use” to “available” in Cinder. However, no detach operation is performed on the Nova (hypervisor) side…

Therefore, after re-scheduling, the BDM-volume check passes on the second attempt to create VM2, and all of VM1’s BDM volumes (Vol-1) are taken over by VM2 and recorded in the Nova & Cinder DBs. But Vol-1 is still attached to VM1 on the hypervisor, and will also be attached to VM2 once the VM is created successfully…
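The broken flow above can be sketched as a tiny state simulation (all class, function, and variable names here are illustrative stand-ins, not actual Nova/Cinder code):

```python
# Simulate the Cinder-side volume state that the in-use check reads.
class FakeCinder:
    def __init__(self):
        self.state = {"vol-1": "in-use"}   # Vol-1 is attached to VM1

    def check_attach(self, vol):
        # The BDM check: refuse volumes that are already in use.
        if self.state[vol] == "in-use":
            raise Exception("InvalidVolume: %s is already attached" % vol)

    def detach(self, vol):
        self.state[vol] = "available"      # rollback touches Cinder only

hypervisor_attachments = {"vol-1": ["vm1"]}  # VM1 keeps the real device
cinder = FakeCinder()

# First boot attempt for VM2: the in-use check correctly fails ...
try:
    cinder.check_attach("vol-1")
    first_attempt_failed = False
except Exception:
    first_attempt_failed = True

# ... but the rollback before re-scheduling detaches Vol-1 in Cinder,
# with no matching detach on the hypervisor side.
cinder.detach("vol-1")

# Re-scheduled attempt: the check now passes, so VM2 takes the volume
# while VM1 still holds it on the hypervisor.
cinder.check_attach("vol-1")
hypervisor_attachments["vol-1"].append("vm2")
```

The end state matches the report: Cinder believes the volume moved to VM2, while the hypervisor has it attached to both instances.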

---------------

Moreover, the problem described above occurs when “delete_on_termination” of the BDMs is “False”. If the flag is “True”, all BDM volumes will instead be deleted in Cinder, because their states were already changed from “in-use” to “available” beforehand [def _cleanup_volumes() in manager.py].
(P.S. Whether the deletion succeeds depends on the specific Cinder driver implementation.)

Thanks~

Tags: bdm createvm
Revision history for this message
wingwj (wingwj) wrote :
  • Patch Test.docx (481.2 KiB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)

We can add a new “InvalidVolume” exception branch in _run_instance(). If it occurs, raise the exception directly instead of re-scheduling.
That’s the easiest way, in my opinion.
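A minimal sketch of that proposal (names and structure are illustrative, not the actual _run_instance() code in Nova's compute manager): the volume conflict is treated as fatal instead of being swallowed by the generic reschedule path.

```python
class InvalidVolume(Exception):
    """Stand-in for nova.exception.InvalidVolume."""

def run_instance(setup_block_device_mapping, reschedule):
    try:
        setup_block_device_mapping()
    except InvalidVolume:
        # A volume conflict is a user error, not a compute-host problem:
        # re-raise instead of rescheduling, so the rollback never detaches
        # the volume from the instance that legitimately owns it.
        raise
    except Exception:
        # Other build failures keep the existing reschedule behaviour.
        reschedule()
```

With this shape, a conflicting BDM surfaces the error to the user on the first attempt, and the reschedule path is only taken for failures a different host might actually fix.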

The new patch I made is based on the master branch as of Jun 29th. Please check the test doc~~

Thanks~

Revision history for this message
wingwj (wingwj) wrote :

Here is the patch I made. Please check it.

Thanks~

wingwj (wingwj)
Changed in nova:
assignee: nobody → wingwj (wingwj)
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Wow - this is pretty horrible!!!

Thanks for reporting this, and doing a fix. I will comment more there.

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Since the review wasn't picked up by LP - here it is: https://review.openstack.org/#/c/38073

Changed in nova:
status: New → Incomplete
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

I can't seem to reproduce this, actually.

Nova will block this in the API since 24fffd9d8b77e9b71e8013fc22c172f76bb4e84c on master, and this was backported to both Grizzly and Folsom stable branches.

Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Oooops - looks like I spoke too soon on this one - there is a race condition there.

If you fire off the two requests close to each other - but not like in the attached doc - the API will *NOT* see the volume as attached (depending on the race, of course), and the request will error out on the compute side and detach the volume from underneath the running instance.

To reproduce, try:

for x in 1 2; do nova boot --image 539b1a8a-f5f5-4f1b-afa0-f371337def9f --flavor 1 --block-device-mapping vdc=<VOLUME_ID>:None:1: testvm; done; watch nova list;

and see the volume become briefly attached and then unavailable as the other instance errors out and cleans it up in _reschedule_on_error.

We might need to come up with something different to avoid this completely.

Changed in nova:
importance: Undecided → High
tags: added: folsom-backport-potential grizzly-backport-potential
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

s/unavailable/available again/ in the previous comment

Revision history for this message
wingwj (wingwj) wrote :

I posted the patch to Gerrit, please check it~
https://review.openstack.org/#/c/38073/

Changed in nova:
status: Confirmed → In Progress
wingwj (wingwj)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/54916
Committed: http://github.com/openstack/nova/commit/a2487116d583e189dcbfe6f665ba360bf147163f
Submitter: Jenkins
Branch: stable/havana

commit a2487116d583e189dcbfe6f665ba360bf147163f
Author: Nikola Dipanov <email address hidden>
Date: Fri Nov 1 13:37:13 2013 +0100

    Prevent rescheduling on block device failure

    Due to a race condition - it is possible for more instances to race for
    the same volume. In such a scenario, the one that fails will get
    rescheduled, and in the process detach the volume of a successful
    instance.

    To prevent this, this patch makes nova not reschedule on block device
    failures. This is actually reasonable behaviour as block device failures
    are rarely related to the compute host itself and so rescheduling is not
    usually useful.

    This is a stable/havana only fix! This same issue is addressed on the
    master branch by Iefab71047996b7cc08107794d5bc628c11680a70.

    Closes-bug: 1195947

    Change-Id: I6b68965ac65cdb0e1da3b44e83428f056b1693aa
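The behaviour the commit describes can be sketched as follows (a hedged sketch under assumed names, not real Nova internals): a block device failure puts the instance into ERROR rather than rescheduling it.

```python
class BlockDeviceFailure(Exception):
    """Stand-in for a block device mapping failure during boot."""

def build_instance(instance, prep_block_device, reschedule):
    try:
        prep_block_device(instance)
    except BlockDeviceFailure:
        # Block device failures are rarely caused by the compute host,
        # so retrying elsewhere is unlikely to help, and the reschedule
        # rollback could detach a healthy instance's volume.
        instance["vm_state"] = "error"
        raise
    except Exception:
        reschedule(instance)
```

The design trade-off stated in the commit message is that a host-independent failure is surfaced immediately instead of being retried across hosts.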

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

We were hoping that https://blueprints.launchpad.net/nova/+spec/remove-cast-to-schedule-run-instance would be making Icehouse, but sadly that did not happen, so I believe it is reasonable to propose the Havana fix that is already merged to the Icehouse tree as well.

Changed in nova:
milestone: none → icehouse-rc1
Revision history for this message
Tracy Jones (tjones-i) wrote :

wingwj - can you propose this fix quickly, then? We are closing down on all but shipstoppers/regressions very soon.

Revision history for this message
Tracy Jones (tjones-i) wrote :

This bug could be pushed to icehouse-rc-potential if not merged by 2/24 12pm UTC.

Revision history for this message
wingwj (wingwj) wrote :

Hi Nikola & Tracy,

I got your message. I'll use Nikola's Havana patch to fix this issue ASAP.

Revision history for this message
wingwj (wingwj) wrote : Re: [Bug 1195947] Re: VM re-scheduler mechanism will cause BDM-volumes conflict

Hi Tracy,

First, sorry for my late reply.

I've already updated the patch for bug/1195947 on
https://review.openstack.org/#/c/38073/.
Please review it.

Thanks~


Revision history for this message
Nikola Đipanov (ndipanov) wrote :

I also posted a patch for this https://review.openstack.org/#/c/80945/

I have no idea why it did not get picked up by LP. At this moment we can use either.

Revision history for this message
Tracy Jones (tjones-i) wrote :

I think we can mark this as fix released since Nikola's patch got merged

https://review.openstack.org/#/c/80945/

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Tracy - I assume you meant "fix committed" (fix released is usually for once the release is actually cut).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/80945
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8f932311da19ea9de7ba1b344484ccdb748f5786
Submitter: Jenkins
Branch: master

commit 8f932311da19ea9de7ba1b344484ccdb748f5786
Author: Nikola Dipanov <email address hidden>
Date: Fri Nov 1 13:37:13 2013 +0100

    Prevent rescheduling on block device failure

    Due to a race condition - it is possible for more instances to race for
    the same volume. In such a scenario, the one that fails will get
    rescheduled, and in the process detach the volume of a successful
    instance.

    To prevent this, this patch makes nova not reschedule on block device
    failures. This is actually reasonable behaviour as block device failures
    are rarely related to the compute host itself and so rescheduling is not
    usually useful.

    This bug does not exist in the new boot code in the manager which will
    be used once remove-cast-to-schedule-run-instance bp lands (see
    Iefab71047996b7cc08107794d5bc628c11680a70). However, it is now clear
    that this will not be merged for Icehouse, so this patch is a
    "forward port" of a patch we already applied to stable/havana.

    Closes-bug: #1195947

    Change-Id: I6b68965ac65cdb0e1da3b44e83428f056b1693aa

Alan Pevec (apevec)
tags: removed: folsom-backport-potential grizzly-backport-potential
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-rc1 → 2014.1