VMs paused unbeknownst to nova compute are destroyed

Bug #1097806 reported by Andres Lagar-Cavilla on 2013-01-09
Affects                    Importance   Assigned to
OpenStack Compute (nova)   Medium       Yun Mao
Folsom                     Medium       Yun Mao

Bug Description

Libvirt-managed qemu/KVM VMs can be paused outside of nova compute's workflow through a variety of means.

* By issuing virsh suspend
* By issuing virsh qemu-monitor-command '{"execute" : "stop"}'
* By causing qemu to emit a STOP event, for example when attaching a GDB debugger and single-stepping
* By connecting through an additional qemu monitor and issuing any command that causes qemu to emit a STOP event

Starting in Folsom (specifically https://github.com/openstack/nova/commit/129b87e17d3333aeaa9e855a70dea51e6581ea63#L6R2502, i.e. commit 129b87e, diff line 2502), nova compute will destroy a VM if libvirt reports it as paused and this does not match the state nova compute has recorded for the VM.

I surmise the original rationale is to destroy VMs that are paused by IO errors or KVM emulation errors, which would also cause qemu to emit STOP events.

The problem is that this will also destroy VMs that are paused for any of the valid reasons outlined above.

The problem is exacerbated by a libvirt bug (https://bugzilla.redhat.com/show_bug.cgi?id=892791) which latches the state of a VM to paused even though the VM is running. The fix is already committed upstream (http://libvirt.org/git/?p=libvirt.git;a=commit;h=aedfcce33e4c2f266668a39fd655574fe34f1265) and we intend for it to make its way into distros through backports.

Even with libvirt's bug fixed, there are still points in time at which nova-compute will check a VM's state, find it paused for a valid reason, and erroneously decide to destroy it.

The fix is to either remove this behavior, or to further query libvirt for the paused reason, which will show conclusively whether the VM is effectively crashed, or just paused.
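The second option could look roughly like the sketch below. It is not nova code. The numeric constants mirror libvirt's virDomainState and virDomainPausedReason enums (libvirt-python exposes them as libvirt.VIR_DOMAIN_PAUSED, libvirt.VIR_DOMAIN_PAUSED_IOERROR, and so on). The function name and the choice of which reasons count as "effectively crashed" are assumptions made for illustration.

```python
# Values copied from libvirt's virDomainState / virDomainPausedReason enums so
# the sketch runs without libvirt installed.
VIR_DOMAIN_PAUSED = 3            # virDomainState: domain is paused
VIR_DOMAIN_PAUSED_UNKNOWN = 0    # reason not known to libvirt
VIR_DOMAIN_PAUSED_USER = 1       # paused on user request (e.g. virsh suspend)
VIR_DOMAIN_PAUSED_IOERROR = 5    # paused because of a disk I/O error
VIR_DOMAIN_PAUSED_WATCHDOG = 6   # paused because a watchdog fired

# Assumption: treating these reasons as fault-like is a policy choice made
# for this example, not something the bug report specifies.
FAULT_REASONS = frozenset({VIR_DOMAIN_PAUSED_IOERROR, VIR_DOMAIN_PAUSED_WATCHDOG})


def paused_by_fault(state, reason):
    """Return True only if the domain is paused for a fault-like reason.

    `state` and `reason` are the two integers that libvirt-python's
    virDomain.state() returns as a pair.
    """
    return state == VIR_DOMAIN_PAUSED and reason in FAULT_REASONS
```

With libvirt-python, the pair would come from `state, reason = dom.state()` on a domain object. Under this policy an unknown or user-requested pause is left alone, and only a fault-like pause is a candidate for cleanup.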

Yun Mao (yunmao) on 2013-01-11
Changed in nova:
status: New → Triaged
importance: Undecided → Medium

Fix proposed to branch: master
Review: https://review.openstack.org/19467

Changed in nova:
assignee: nobody → Yun Mao (yunmao)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/19467
Committed: http://github.com/openstack/nova/commit/f7fbdeb5672bae7d3bffd6fa76de1ce81fc132bf
Submitter: Jenkins
Branch: master

commit f7fbdeb5672bae7d3bffd6fa76de1ce81fc132bf
Author: Yun Mao <email address hidden>
Date: Fri Jan 11 11:59:23 2013 -0500

    Fix state sync logic related to the PAUSED VM state

    A VM may get into the paused state not only because the user request
    via API calls, but also due to (temporary) external instrumentations.
    Before the virt layer can reliably report the reason, we simply ignore
    the state discrepancy. In many cases, the VM state will go back to
    running after the external instrumentation is done.

    Fix bug 1097806.

    Change-Id: I8edef45d60fa79d6ddebf7d0438042a7b3986b55
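
The behavior described in this commit message can be illustrated with a small decision function. This is a simplified sketch and not the code from the commit: the PowerState enum, the sync_action name, and the returned action strings are all hypothetical, and the handling of shut-down or crashed VMs is an assumption included only for contrast with the paused case.

```python
from enum import Enum


class PowerState(Enum):
    """Hypothetical stand-in for the power states a virt driver reports."""
    RUNNING = "running"
    PAUSED = "paused"
    SHUTDOWN = "shutdown"
    CRASHED = "crashed"


def sync_action(recorded_active, observed):
    """Decide what a periodic state sync should do for one VM.

    recorded_active: True if the recorded state says the VM should be running.
    observed: the PowerState reported by the hypervisor.
    Returns "none", "ignore", or "stop" (hypothetical action names).
    """
    if not recorded_active:
        return "none"
    if observed is PowerState.PAUSED:
        # The pause may come from temporary external instrumentation, so the
        # discrepancy is ignored rather than acted on.
        return "ignore"
    if observed in (PowerState.SHUTDOWN, PowerState.CRASHED):
        # Assumption for this sketch: a VM that is really gone is cleaned up.
        return "stop"
    return "none"
```

For example, `sync_action(True, PowerState.PAUSED)` yields "ignore", so a VM paused by a debugger or by virsh suspend is left untouched until it resumes.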

Changed in nova:
status: In Progress → Fix Committed
tags: added: folsom-backport-potential
tags: removed: folsom-backport-potential
Thierry Carrez (ttx) on 2013-02-21
Changed in nova:
milestone: none → grizzly-3
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/20337
Committed: http://github.com/openstack/nova/commit/7ace55fcf9e1b7fea074f6c0331b6feafbbc4178
Submitter: Jenkins
Branch: stable/folsom

commit 7ace55fcf9e1b7fea074f6c0331b6feafbbc4178
Author: Yun Mao <email address hidden>
Date: Fri Jan 11 11:59:23 2013 -0500

    Fix state sync logic related to the PAUSED VM state

    A VM may get into the paused state not only because the user request
    via API calls, but also due to (temporary) external instrumentations.
    Before the virt layer can reliably report the reason, we simply ignore
    the state discrepancy. In many cases, the VM state will go back to
    running after the external instrumentation is done.

    Fix bug 1097806.

    Change-Id: I8edef45d60fa79d6ddebf7d0438042a7b3986b55
    (cherry picked from commit f7fbdeb5672bae7d3bffd6fa76de1ce81fc132bf)

Thierry Carrez (ttx) on 2013-04-04
Changed in nova:
milestone: grizzly-3 → 2013.1