Double powering-off state confuses the clients and causes gate failure

Bug #1320628 reported by Attila Fazekas
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
melanie witt

Bug Description

http://logs.openstack.org/52/73152/8/check/check-tempest-dsvm-full/9352c04/console.html.gz#_2014-05-13_18_12_38_547

At client side the only way to know an instance action is doable is to making sure the status is a permanent status like ACTIVE or SHUTOFF and no action in progress, so the task-state is None.

In the above linked case tempest stopped the instance and the instance reached the "SHUTOFF/None".
'State transition "ACTIVE/powering-off" ==> "SHUTOFF/None" after 10 second wait'

Cool, at this time we can start the instance right ? No, other attribute needs to be checked.

The start attempt was rewarded with 409 :
 u'Instance 7bc9de3b-1960-476f-b964-2ab2da986ec7 in task_state powering-off. Cannot start while the instance is in this state'

The below line indicates the task state, silently moved back to "SHUTOFF/powering-off" , before the 'start' attempt.

2014-05-13 18:09:13,610 State transition "SHUTOFF/powering-off" ==> "SHUTOFF/None" after 1 second wait

Please do not set the 'None' task-state when the 'powering-off' is not finished.

Tags: compute
description: updated
Tracy Jones (tjones-i)
tags: added: compute
melanie witt (melwitt)
Changed in nova:
assignee: nobody → Melanie Witt (melwitt)
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Attila Fazekas (afazekas) wrote :

Similar issue can happen at "/opt/stack/new/tempest/tempest/api/compute/images/test_list_image_filters.py", line 57, in setUpClass.

This test does not stops the server, but it creates multiple snapshot.

Revision history for this message
melanie witt (melwitt) wrote :

It seems the problem is a race with _sync_instance_power_state.

In rebuild_instance, after the rebuild is finished, if the original vm_state was STOPPED, it sets task_state to POWERING_OFF and calls stop_instance.

In the nova-compute log I see the message from _sync_instance_power_state, "Instance is not stopped. Calling the stop API." [1] It observed vm_state STOPPED and power state not off from the driver, and called compute api force_stop [2] which sets task_state to POWERING_OFF again and stops it again.

[1] http://logs.openstack.org/52/73152/8/check/check-tempest-dsvm-full/9352c04/logs/screen-n-cpu.txt.gz?level=DEBUG#_2014-05-13_18_09_11_515

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/103174

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
melanie witt (melwitt) wrote :

Increasing this to high because it affects the gate.

summary: - Double powering-off state confuses the clinets and causes gate failure
+ Double powering-off state confuses the clients and causes gate failure
Changed in nova:
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/103174
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cc5388bbe81aba635fb757e202d860aeed98f3e8
Submitter: Jenkins
Branch: master

commit cc5388bbe81aba635fb757e202d860aeed98f3e8
Author: melanie witt <email address hidden>
Date: Fri Jun 27 15:45:15 2014 +0000

    synchronize 'stop' and power state periodic task

    This change adds the synchronized decorator to the stop function and
    _sync_instance_power_state to address a race condition when powering
    off and updating the database for an instance. The power off and
    database update should be protected so the periodic task sees a
    consistent view when it checks power state from the driver and
    VM state from the database in order to take action.

    In the bug, a stopped instance is in the middle of being rebuilt,
    and while it is in the process of being returned to the original
    STOPPED state, _sync_power_states sees the combination of the
    power state from the driver being 'on' and the vm_state from
    the database as being STOPPED. This patch aims to ensure power
    'off' and vm_state set to STOPPED happen atomically in the
    stop_instance function. Thus, the read by _sync_power_states
    must also be synced on the instance UUID.

    Change-Id: I8aa83ab8dca35878cf792ae2d46feaa9912ffd38
    Closes-Bug: #1320628

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like this is contributing to bug 1266611 also.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/108014

Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
Revision history for this message
Brian Elliott (belliott) wrote :

This patch fixes the described locking condition, but introduces a locking bug:

https://bugs.launchpad.net/nova/+bug/1363231

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/108014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa1792eb4c1d10e9a192142ce7e20d37871d916a
Submitter: Jenkins
Branch: master

commit aa1792eb4c1d10e9a192142ce7e20d37871d916a
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 2 12:11:55 2014 -0700

    Stop stack tracing when trying to auto-stop a stopped instance

    Commit cc5388bbe81aba635fb757e202d860aeed98f3e8 added locks to
    stop_instance and the _sync_power_states periodic task to try and fix a
    race between stopping the instance via the API where the task_state is
    set to powering-off, and the periodic task seeing the instance
    power_state as shutdown in _sync_instance_power_state and calling the
    stop API again, at which point the task_state is already None from the
    first stop API call and we get an UnexpectedTaskStateError.

    The handle_lifecycle_event method is getting callbacks from the libvirt
    driver on state changes on the VM and calling the
    _sync_instance_power_state method which may try to stop the instance
    asynchronously, and lead to UnexpectedTaskStateError if the instance is
    already stopped by the time it gets the lock and the task_state has
    changed.

    Attempting to lock in handle_lifecycle_event just moves the race around
    so this change adds logic to stop_instance such that if the instance
    says it's active but the virt driver says it's not running, then we add
    None to the expected_task_state so we don't stacktrace on
    instance.save().

    An alternative and/or additional change to this would be doing a call
    rather than a cast when _sync_instance_power_state calls the stop API
    but in some previous testing it doesn't appear to make a significant
    difference in the race found when we hit the stop_instance method.

    Adds a bunch of debug logging since this code is inherently racey and
    is needed when looking at failures around these operations.

    Closes-Bug: #1339235
    Closes-Bug: #1266611
    Related-Bug: #1320628

    Change-Id: Ib495a5ab15de88051c5fa7abfb58a5445691dcad

Thierry Carrez (ttx)
Changed in nova:
milestone: juno-2 → 2014.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.