After evacuation if instance is deleted then notification will be marked as failure

Bug #1693728 reported by Abhishek Kekane
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
masakari  Fix Released   Undecided    Dinesh Bhor

Bug Description

Masakari can hit a race condition: after an instance is evacuated to another host, a user might perform an action on that instance, which presents an unexpected vm_state to ConfirmEvacuationTask and causes the notification to fail.

Currently, masakari first evacuates all instances from the failed host to a new host and only afterwards confirms, one by one, whether each evacuation completed successfully. Between evacuation and confirmation, a user can delete any of those instances, which makes the confirmation fail. The outcome is ambiguous for masakari: it evacuated the instance successfully, but the user deleted it afterwards.
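The window described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `FakeNova`, `two_phase_recovery`, and the host names are hypothetical stand-ins invented for illustration, not masakari or novaclient code.

```python
# Toy model of the "evacuate everything, then confirm everything" flow.
# FakeNova and all of its methods are hypothetical, not the novaclient API.
class FakeNova:
    def __init__(self, instance_ids):
        # Every instance starts on the failed host.
        self.servers = {
            i: {"host": "failed-host", "vm_state": "active"}
            for i in instance_ids
        }

    def evacuate(self, instance_id, target):
        self.servers[instance_id]["host"] = target

    def delete(self, instance_id):
        del self.servers[instance_id]

    def get(self, instance_id):
        return self.servers.get(instance_id)


def two_phase_recovery(nova, instance_ids, between_phases=None):
    # Phase 1: evacuate all instances.
    for i in instance_ids:
        nova.evacuate(i, "new-host")

    # Race window: users can still act on the evacuated instances here.
    if between_phases is not None:
        between_phases()

    # Phase 2: confirm each evacuation one by one.
    unconfirmed = [
        i for i in instance_ids
        if nova.get(i) is None or nova.get(i)["host"] != "new-host"
    ]
    return "failed" if unconfirmed else "finished"


quiet = FakeNova(["a", "b"])
print(two_phase_recovery(quiet, ["a", "b"]))

raced = FakeNova(["a", "b"])
print(two_phase_recovery(raced, ["a", "b"],
                         between_phases=lambda: raced.delete("b")))
```

In the second call the evacuation itself succeeded, yet the confirmation step cannot find instance "b" and the whole notification is reported as failed.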

Reference:
http://eavesdrop.openstack.org/meetings/masakari/2017/masakari.2017-05-23-04.00.log.html#l-129

Changed in masakari:
assignee: nobody → Dinesh Bhor (dinesh-bhor)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari (master)

Fix proposed to branch: master
Review: https://review.openstack.org/468771

Changed in masakari:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari (master)

Reviewed: https://review.openstack.org/468771
Committed: https://git.openstack.org/cgit/openstack/masakari/commit/?id=25d33d2cb1eec271227309a34a45b0d3f0987d50
Submitter: Jenkins
Branch: master

commit 25d33d2cb1eec271227309a34a45b0d3f0987d50
Author: dineshbhor <email address hidden>
Date: Tue May 23 13:50:37 2017 +0530

    Fix race condition between evacuation and its confirmation

    Masakari can hit a race condition: after an instance is evacuated
    to another host, a user might perform an action on that instance,
    which presents an unexpected vm_state to ConfirmEvacuationTask and
    causes the notification to fail.

    To fix this, the patch locks each instance before evacuation and
    keeps it locked until the evacuation is confirmed, so that a normal
    user cannot perform any action on it in between. To achieve this,
    ConfirmEvacuationTask is removed completely and the confirmation is
    done per instance in EvacuateInstancesTask itself.
    Evacuating an instance and confirming its evacuation immediately
    can reduce performance, so the patch uses
    eventlet.greenpool.GreenPool, which runs the complete evacuation
    and confirmation of an instance in a separate thread.
    To check whether a server is already locked, novaclient's
    NOVA_API_VERSION is upgraded from 2.1 to 2.9, because the 'locked'
    property is available in Nova API version 2.9 and above.

    This patch introduces a new config option,
    'host_failure_recovery_threads', which sets the number of threads
    used for evacuating instances and confirming their evacuation.
    The default value for this config option is 3.

    Closes-Bug: #1693728
    Change-Id: Ib5145878633fd424bca5bcbd5cfed13d20362f94
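
The per-instance flow described in the commit message can be sketched roughly as follows. This is not the merged code: `FakeNova`, `evacuate_and_confirm`, and `recover_host` are hypothetical names, the standard library's `ThreadPoolExecutor` is swapped in for `eventlet.greenpool.GreenPool` so the sketch runs without extra dependencies, and restoring the original lock state afterwards is an assumption about why the 'locked' property is read.

```python
from concurrent.futures import ThreadPoolExecutor

# Mirrors the default of 'host_failure_recovery_threads' from the commit message.
HOST_FAILURE_RECOVERY_THREADS = 3


class FakeNova:
    """Hypothetical in-memory stand-in for the compute API."""

    def __init__(self, instance_ids):
        self.servers = {
            i: {"host": "failed-host", "locked": False} for i in instance_ids
        }

    def get(self, instance_id):
        return self.servers.get(instance_id)

    def lock(self, instance_id):
        self.servers[instance_id]["locked"] = True

    def unlock(self, instance_id):
        self.servers[instance_id]["locked"] = False

    def evacuate(self, instance_id, target):
        self.servers[instance_id]["host"] = target

    def delete_as_user(self, instance_id):
        # A normal user cannot act on a locked instance.
        if self.servers[instance_id]["locked"]:
            raise PermissionError("instance %s is locked" % instance_id)
        del self.servers[instance_id]


def evacuate_and_confirm(nova, instance_id, target):
    # Assumption: only lock (and later unlock) instances that were not
    # already locked, so a pre-existing lock is left untouched.
    was_locked = nova.get(instance_id)["locked"]
    if not was_locked:
        nova.lock(instance_id)
    try:
        nova.evacuate(instance_id, target)
        # Confirm immediately, while the lock is still held.
        return nova.get(instance_id)["host"] == target
    finally:
        if not was_locked:
            nova.unlock(instance_id)


def recover_host(nova, instance_ids, target):
    # Each worker handles the full lock/evacuate/confirm cycle of one instance.
    with ThreadPoolExecutor(max_workers=HOST_FAILURE_RECOVERY_THREADS) as pool:
        results = list(
            pool.map(lambda i: evacuate_and_confirm(nova, i, target),
                     instance_ids)
        )
    return "finished" if all(results) else "failed"


nova = FakeNova(["a", "b", "c", "d"])
print(recover_host(nova, ["a", "b", "c", "d"], "new-host"))
```

Because confirmation happens while the lock is held, a user deleting the instance afterwards no longer affects the notification result.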

Changed in masakari:
status: In Progress → Fix Released