Comment 10 for bug 1320628

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/108014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa1792eb4c1d10e9a192142ce7e20d37871d916a
Submitter: Jenkins
Branch: master

commit aa1792eb4c1d10e9a192142ce7e20d37871d916a
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 2 12:11:55 2014 -0700

    Stop stack tracing when trying to auto-stop a stopped instance

    Commit cc5388bbe81aba635fb757e202d860aeed98f3e8 added locks to
    stop_instance and the _sync_power_states periodic task to try to fix
    a race. Stopping the instance via the API sets the task_state to
    powering-off. The periodic task then sees the instance power_state as
    shutdown in _sync_instance_power_state and calls the stop API again.
    By that point the task_state is already None from the first stop API
    call, so we get an UnexpectedTaskStateError.

    The handle_lifecycle_event method is getting callbacks from the libvirt
    driver on state changes on the VM and calling the
    _sync_instance_power_state method which may try to stop the instance
    asynchronously, and lead to UnexpectedTaskStateError if the instance is
    already stopped by the time it gets the lock and the task_state has
    changed.

    Attempting to lock in handle_lifecycle_event just moves the race around
    so this change adds logic to stop_instance such that if the instance
    says it's active but the virt driver says it's not running, then we add
    None to the expected_task_state so we don't stacktrace on
    instance.save().

    An alternative or additional change would be to do a call rather than
    a cast when _sync_instance_power_state calls the stop API, but in
    previous testing that did not appear to make a significant difference
    to the race hit in the stop_instance method.

    Also adds debug logging, since this code is inherently racy and the
    logs are needed when looking at failures around these operations.

    Closes-Bug: #1339235
    Closes-Bug: #1266611
    Related-Bug: #1320628

    Change-Id: Ib495a5ab15de88051c5fa7abfb58a5445691dcad
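
The stop_instance change described in the commit message can be sketched roughly as below. This is a minimal illustration, not nova's actual code: FakeInstance, its save() signature, and the state constants are hypothetical stand-ins, and the real method also powers off the guest and sends notifications. It only shows why adding None to expected_task_state avoids the UnexpectedTaskStateError when the instance was already stopped by an earlier call.

```python
# Hypothetical state names, chosen to mirror the commit message.
POWERING_OFF = "powering-off"
ACTIVE = "active"
STOPPED = "stopped"
RUNNING = "running"
SHUTDOWN = "shutdown"


class UnexpectedTaskStateError(Exception):
    """Raised when the stored task_state is not one the caller expected."""


class FakeInstance:
    """Hypothetical stand-in for an instance record with a guarded save()."""

    def __init__(self, vm_state, task_state):
        self.vm_state = vm_state
        self.task_state = task_state

    def save(self, updates, expected_task_state):
        # Compare-and-swap style check: refuse the update if another
        # operation has already moved task_state to something unexpected.
        if self.task_state not in expected_task_state:
            raise UnexpectedTaskStateError(
                "expected %r, found %r"
                % (expected_task_state, self.task_state))
        for name, value in updates.items():
            setattr(self, name, value)


def stop_instance(instance, driver_power_state):
    """Sketch of the added logic; the power-off itself is omitted."""
    expected_task_state = [POWERING_OFF]
    # The instance record says active but the virt driver says it is not
    # running: an earlier stop may already have reset task_state to None,
    # so accept None as well instead of failing on save().
    if instance.vm_state == ACTIVE and driver_power_state != RUNNING:
        expected_task_state.append(None)
    instance.save({"vm_state": STOPPED, "task_state": None},
                  expected_task_state)
    return expected_task_state
```

With this sketch, stop_instance(FakeInstance(ACTIVE, None), SHUTDOWN) saves cleanly, while the same call with RUNNING as the driver power state still raises UnexpectedTaskStateError, so the strict check is kept for the normal case.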