libvirt-xen: Instance status in nova may be different than real status

Bug #1433049 reported by Anthony PERARD
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Anthony PERARD
Juno
Fix Released
Undecided
Unassigned

Bug Description

Tempest test ServerActionsTestJSON:test_resize_server_confirm_from_stopped and other similaire test from_stopped may fail with libvirt-xen driver due to the test timing out on waiting the instance to be SHUTOFF, but nova is reporting the instance to be ACTIVE.

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 230, in test_resize_server_confirm_from_stopped
    self._test_resize_server_confirm(stop=True)
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 209, in _test_resize_server_confirm
    self.client.wait_for_server_status(self.server_id, expected_status)
  File "/opt/stack/tempest/tempest/services/compute/json/servers_client.py", line 183, in wait_for_server_status
    ready_wait=ready_wait)
  File "/opt/stack/tempest/tempest/common/waiters.py", line 93, in wait_for_server_status
    raise exceptions.TimeoutException(message)
tempest.exceptions.TimeoutException: Request timed out
Details: (ServerActionsTestJSON:test_resize_server_confirm_from_stopped) Server a0f07187-4e08-4664-ad48-a03cffb87873 failed to reach SHUTOFF status and task state "None" within the required time (196 s). Current status: ACTIVE. Current task state: None.

From nova log, I could see "VM Started (Lifecycle Event)" being reported while the instance is shutdown and being resized.

After tracking done this bug, the issue may comes from the Change-Id I690d3d700ab4d057554350da143ff77d78b509c6, Delay STOPPED lifecycle event for Xen domains.

A way to reproduce would be to run this script on a Xen machine using a small Cirros instance:
nova boot --image 'cirros-0.3.2-x86_64-uec' --flavor 42 instance
nova stop instance
# wait sometime (around 20s) so we start with SHUTDOWN state
nova start instance
nova stop instance
nova resize instance 84
nova resize-confirm instance
# check new state, should be shutoff.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/166184

Changed in nova:
assignee: nobody → Anthony PERARD (anthony-perard)
status: New → In Progress
Changed in nova:
importance: Undecided → Low
Bob Ball (bob-ball)
Changed in nova:
milestone: none → kilo-rc1
Revision history for this message
John Garbutt (johngarbutt) wrote :

This seems to cause Xen+Libvirt serious problems, so raising the priority to medium.

Its blocking their testing, so adding rc tag

Changed in nova:
importance: Low → Medium
tags: added: kilo-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/166184
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b5a9c4e4d04d011c59fca5306be651906792f411
Submitter: Jenkins
Branch: master

commit b5a9c4e4d04d011c59fca5306be651906792f411
Author: Anthony PERARD <email address hidden>
Date: Thu Mar 19 17:46:25 2015 +0000

    libvirt: Delay only STOPPED event for Xen domain.

    This fix change bd8329b34098436d18441a8129f3f20af53c2b91 (Delay STOPPED
    lifecycle event for Xen domains)

    Without this patch, a STOPPED event could be ignore if it was following a
    STARTED event.

    A scenario that have the issue on tempest is
    ServerActionsTestJSON:test_resize_server_confirm_from_stopped, and it
    happens as follow:
    - instance is stopped
    nova start instance
    - libvirt STARTED event received and delayed
    nova stop instance
    - libvirt STOPPED event received and ignored as there is a delayed event
    nova resize instance 42
    - resize finished
    - the delayed STARTED event is emited
    nova confirme-resize instance
    nova show instance
    - instance is show as ACTIVE, but should be SHUTOFF

    Also fix unit tests.

    Change-Id: If340f9b849b930c34238c5681018a29bc826798d
    Closes-Bug: #1433049

Changed in nova:
status: In Progress → Fix Committed
tags: removed: kilo-rc-potential
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
Matt Riedemann (mriedem) wrote :

Backport proposed in a squash commit to stable/juno here: https://review.openstack.org/#/c/163378/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/163378
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f60214b4fdbfb56c8428808a29b9cf911c0b1aa2
Submitter: Jenkins
Branch: stable/juno

commit f60214b4fdbfb56c8428808a29b9cf911c0b1aa2
Author: Thomas Bechtold <email address hidden>
Date: Tue Aug 19 17:41:57 2014 +0200

    Delay STOPPED lifecycle event for Xen domains

    When using libvirt, a reboot from inside of a kvm VM doesn't trigger
    any libvirt lifecycle event. That's fine. But rebooting a Xen VM leads
    to the events VIR_DOMAIN_EVENT_STOPPED and VIR_DOMAIN_EVENT_STARTED.
    Nova compute manager catches these events and tries to sync the power
    state of the VM with the power state in the database. In the case the VM
    state is ACTIVE but the power state is something that doesn't fit, the
    stop API call is executed to trigger all stop hooks. This leads to the
    problem that a reboot of a Xen VM without using the API isn't possible.
    To fix it, delay the emission of the STOPPED lifecycle event a couple of
    seconds. If a VIR_DOMAIN_EVENT_STARTED event is received while the STOPPED
    event is pending, cancel the pending STOPPED lifecycle event so the VM
    can continue to run.

    Closes-Bug: #1293480
    (cherry picked from commit bd8329b34098436d18441a8129f3f20af53c2b91)

    ----
    NOTE(mriedem): The fix for bug 1293480 introduced bug 1433049 so we
    have to backport both together, hence the squashed commits.
    ----

    libvirt: Delay only STOPPED event for Xen domain.

    This fix change bd8329b34098436d18441a8129f3f20af53c2b91 (Delay STOPPED
    lifecycle event for Xen domains)

    Without this patch, a STOPPED event could be ignore if it was following a
    STARTED event.

    A scenario that have the issue on tempest is
    ServerActionsTestJSON:test_resize_server_confirm_from_stopped, and it
    happens as follow:
    - instance is stopped
    nova start instance
    - libvirt STARTED event received and delayed
    nova stop instance
    - libvirt STOPPED event received and ignored as there is a delayed event
    nova resize instance 42
    - resize finished
    - the delayed STARTED event is emited
    nova confirme-resize instance
    nova show instance
    - instance is show as ACTIVE, but should be SHUTOFF

    Also fix unit tests.

    Conflicts:
            nova/tests/unit/virt/libvirt/test_host.py
            nova/virt/libvirt/host.py

    NOTE(mriedem): The conflicts are due to the code being moved to the
    nova.virt.libvirt.host module in Kilo.

    Closes-Bug: #1433049
    Change-Id: If340f9b849b930c34238c5681018a29bc826798d
    (cherry picked from commit b5a9c4e4d04d011c59fca5306be651906792f411)

    --

    Change-Id: I690d3d700ab4d057554350da143ff77d78b509c6

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-rc1 → 2015.1.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.