Masakari failed to rescue PAUSED instances

Bug #1663513 reported by Rikimaru Honjo
This bug affects 1 person
Affects   Status        Importance  Assigned to  Milestone
masakari  Fix Released  High        Dinesh Bhor

Bug Description

[Actual]
1. An instance failure occurred for a PAUSED instance, and I sent a notification about it.
2. masakari-api received the notification, and masakari-engine called the stop API on the instance.
3. Nova returned "409" because a PAUSED instance cannot be stopped.
4. Masakari changed the notification status to "error".
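
The sequence above can be modelled with a small, self-contained sketch. Nothing here is Masakari or Nova code: `FakeNova`, `Conflict`, and `handle_notification` are hypothetical names, and the "finished" label is an assumption. The only behaviour taken from the report is that stopping a PAUSED instance yields a 409 and the notification ends up as "error".

```python
class Conflict(Exception):
    """Stands in for the HTTP 409 response described in the report."""
    status_code = 409


class FakeNova:
    """Hypothetical in-memory stand-in for the compute API."""

    def __init__(self, statuses):
        self.statuses = dict(statuses)  # instance id -> status string

    def stop(self, instance_id):
        # Per the report, a PAUSED instance cannot be stopped.
        if self.statuses[instance_id] == "PAUSED":
            raise Conflict(f"cannot stop {instance_id} while PAUSED")
        self.statuses[instance_id] = "SHUTOFF"

    def start(self, instance_id):
        self.statuses[instance_id] = "ACTIVE"


def handle_notification(nova, instance_id):
    """Stop then start the instance; return the notification status."""
    try:
        nova.stop(instance_id)
        nova.start(instance_id)
    except Conflict:
        # The stop call was rejected, so recovery is abandoned.
        return "error"
    return "finished"  # assumed label for a successful recovery
```

With a PAUSED instance the stop call raises `Conflict`, the start call is never reached, and the instance keeps its PAUSED status while the notification is marked "error".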

[Expected]
I expected the PAUSED instance to be rescued.
My idea is that masakari should call the reset-state API before the stop API.

[Reproduce steps]
1. Create a VM instance.
2. Call pause API for the instance.
3. Wait until pausing is completed.
4. Log in to the compute node and run "sudo kill -9 <PID=qemu process of the instance>".

Rikimaru Honjo (honjo-rikimaru-c6) wrote :

There may be other statuses that Masakari cannot rescue.
But I have not identified all of those statuses yet. Sorry.

Rikimaru Honjo (honjo-rikimaru-c6) wrote :

I added the reproduce steps to the bug description.

description: updated
Changed in masakari:
assignee: nobody → Dinesh Bhor (dinesh-bhor)
Rikimaru Honjo (honjo-rikimaru-c6) wrote :

Maybe RESCUED and SUSPENDED instances will have the same issue.
But, sorry, this is just a hypothesis.

Instance statuses:
https://docs.openstack.org/developer/nova/vmstates.html

Rikimaru Honjo (honjo-rikimaru-c6) wrote :

I tested the idea written in #3.

As a result...

Sorry, a SUSPENDED instance doesn't have this issue, because the kvm/qemu process does not exist while the instance is SUSPENDED.

But a RESCUED instance has the same issue.
IMO, Masakari should call the reset API if the instance is "RESCUE".

Dinesh Bhor (dinesh-bhor) wrote :

Hi all,

To fix this issue, masakari will reset the vm_state of the instance and then try to stop and start it.
As a result, the instance will end up in the 'active' state, which is not what the user expects.

For example, if a user has paused an instance on purpose and the qemu process is later killed for some reason, masakari will stop and start that instance, leaving it in the 'active' state. When the user comes back, they will find the instance active and will have no idea what happened.

The user will expect the vm_state after recovery to be consistent with the vm_state before recovery.

IMO masakari should maintain the consistency between the vm_state before recovery and the vm_state after recovery.

For example, if the instance was in the 'paused' vm_state, then after restoring the instance's qemu process by starting the instance again, masakari should pause it again.
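
A minimal sketch of this proposal, assuming the recovery sequence is reset-state, stop, start, and then a re-pause when the instance was originally paused. `FakeInstance` and `recover_preserving_state` are hypothetical names for illustration only; they are not Masakari or Nova code, and the state strings are simplified.

```python
class FakeInstance:
    """Hypothetical in-memory instance; not a Nova object."""

    def __init__(self, vm_state):
        self.vm_state = vm_state
        self.calls = []  # records the order of recovery actions

    def reset_state(self, state):
        self.calls.append(f"reset_state:{state}")
        self.vm_state = state

    def stop(self):
        self.calls.append("stop")
        self.vm_state = "stopped"

    def start(self):
        self.calls.append("start")
        self.vm_state = "active"

    def pause(self):
        self.calls.append("pause")
        self.vm_state = "paused"


def recover_preserving_state(instance):
    """Recover the instance, then re-apply the vm_state it had before."""
    original = instance.vm_state  # remember the state before recovery
    instance.reset_state("active")
    instance.stop()
    instance.start()
    if original == "paused":
        # Put the instance back into the state the user left it in.
        instance.pause()
    return instance.vm_state
```

The sketch only restores the 'paused' state; other vm_states would need their own handling, which the comment above does not spell out.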

Please suggest your opinion about this.

Tushar Patil (tpatil)
Changed in masakari:
status: New → Confirmed
Tushar Patil (tpatil)
Changed in masakari:
importance: Undecided → High
SamP (sampath-priyankara) wrote :

A little recap:
If the VM is in PAUSED state and the qemu process of that VM dies,
the status of the VM becomes SHUTOFF. In this state, we cannot execute
the stop API, because the VM has already stopped.

+-------------------+---------+-------------+
| Action            | Status  | Power State |
+-------------------+---------+-------------+
| Paused            | PAUSED  | Paused      |
| Qemu process dies | SHUTOFF | Shutdown    |
+-------------------+---------+-------------+

Now, if we reset the state to active, the VM status would be
Status=Active and Power State=Shutdown. We can only pause VMs
in Active status.
---recap end---

Therefore, if we want to put the VM into PAUSED, we must first start
the VM and make it Active. Instead of resetting the state,
we can start the VM from SHUTOFF status and make it PAUSED again.
The flow would be:
(1) Try to stop VM
(2) If (1) fails, then start VM and make it active
(3) Then make it PAUSED

However, if we do that, we might change the internal state of the VM,
because we start the VM for a short time.
If that VM is part of a cluster, bad things may happen.
IMHO, we have 2 options:
(A) Maintain the consistency between the vm_state before and after recovery
    (what Dinesh proposed above)
(B) Or do nothing
    If the VM is in PAUSED or SUSPENDED state, then skip the recovery for
    those VMs and leave a warning for the operator.

I think recovery method customization would be the best place to address this kind of issue.
In the meantime, I would prefer option (B), do nothing, for these cases.
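
Option (B) could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged patch: `plan_recovery`, `SKIPPED_STATUSES`, and the logger name are hypothetical, and the default set of skipped statuses simply mirrors the two named in option (B).

```python
import logging

LOG = logging.getLogger("recovery_sketch")  # hypothetical logger name

# Statuses named in option (B); treated here as an assumption.
SKIPPED_STATUSES = frozenset({"PAUSED", "SUSPENDED"})


def plan_recovery(instance_id, status, skipped=SKIPPED_STATUSES):
    """Return "skip" (and warn the operator) or "recover" for one instance."""
    if status.upper() in skipped:
        LOG.warning(
            "Skipping recovery of instance %s in status %s",
            instance_id,
            status,
        )
        return "skip"
    return "recover"
```

Passing the set of skipped statuses as a parameter keeps the decision in one place, so a different set (for example one that includes RESCUE) can be supplied without touching the recovery flow itself.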

OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari (master)

Fix proposed to branch: master
Review: https://review.openstack.org/454721

Changed in masakari:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari (master)

Reviewed: https://review.openstack.org/454721
Committed: https://git.openstack.org/cgit/openstack/masakari/commit/?id=7aef2966f5a8c0f097d9614fcece7657e80f20b4
Submitter: Jenkins
Branch: master

commit 7aef2966f5a8c0f097d9614fcece7657e80f20b4
Author: dineshbhor <email address hidden>
Date: Thu Apr 6 19:22:24 2017 +0530

    Ignore instance recovery for 'paused' or 'rescued' instance

    If masakari receives instance failure notification it fails to
    recover that instance if it is in 'paused' or 'rescued' state.
    As a recovery action masakari-engine gives call to nova to stop
    the instance but as nova doesn't allow this it returns 409 which
    result into instance recovery failure and masakari marks that
    notification status as "error".

    This can be solved by maintaning consistency between the vm_state
    before and after recovery but it requires to start the instance
    again to gain the qemu process of an instance back alive which
    might change the internal state of the instance which results into
    inconsistency between instance state before and after recovery.
    So as a solution this patch proposes to ignore the instance recovery
    and logs a warning if the instance is in 'paused' or 'rescued' state.

    Closes-Bug: #1663513
    Change-Id: Id1cce45aad253527bedb58ab32f3d89637e02582

Changed in masakari:
status: In Progress → Fix Released