OpenStack Compute (nova)

Double powering-off state confuses the clients and causes gate failure

Bug #1320628 reported by Attila Fazekas on 2014-05-18

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Fix Released	High	melanie witt	OpenStack Compute (nova) 2014.2 "juno"

Bug Description

http://logs.openstack.org/52/73152/8/check/check-tempest-dsvm-full/9352c04/console.html.gz#_2014-05-13_18_12_38_547

At client side the only way to know an instance action is doable is to making sure the status is a permanent status like ACTIVE or SHUTOFF and no action in progress, so the task-state is None.

In the above linked case tempest stopped the instance and the instance reached the "SHUTOFF/None".
'State transition "ACTIVE/powering-off" ==> "SHUTOFF/None" after 10 second wait'

Cool, at this time we can start the instance right ? No, other attribute needs to be checked.

The start attempt was rewarded with 409 :
u'Instance 7bc9de3b-1960-476f-b964-2ab2da986ec7 in task_state powering-off. Cannot start while the instance is in this state'

The below line indicates the task state, silently moved back to "SHUTOFF/powering-off" , before the 'start' attempt.

2014-05-13 18:09:13,610 State transition "SHUTOFF/powering-off" ==> "SHUTOFF/None" after 1 second wait

Please do not set the 'None' task-state when the 'powering-off' is not finished.

See original description

Tags:

Attila Fazekas (afazekas) on 2014-05-18

description:

updated

Tracy Jones (tjones-i) on 2014-05-21

tags:

added: compute

melanie witt (melwitt) on 2014-06-13

Changed in nova:
assignee:	nobody → Melanie Witt (melwitt)
importance:	Undecided → Medium
status:	New → Confirmed

Revision history for this message

Attila Fazekas (afazekas) wrote on 2014-06-26:

Similar issue can happen at "/opt/stack/new/tempest/tempest/api/compute/images/test_list_image_filters.py", line 57, in setUpClass.

This test does not stops the server, but it creates multiple snapshot.

Revision history for this message

melanie witt (melwitt) wrote on 2014-06-27:

It seems the problem is a race with _sync_instance_power_state.

In rebuild_instance, after the rebuild is finished, if the original vm_state was STOPPED, it sets task_state to POWERING_OFF and calls stop_instance.

In the nova-compute log I see the message from _sync_instance_power_state, "Instance is not stopped. Calling the stop API." [1] It observed vm_state STOPPED and power state not off from the driver, and called compute api force_stop [2] which sets task_state to POWERING_OFF again and stops it again.

[1] http://logs.openstack.org/52/73152/8/check/check-tempest-dsvm-full/9352c04/logs/screen-n-cpu.txt.gz?level=DEBUG#_2014-05-13_18_09_11_515

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-27: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/103174

Changed in nova:
status:	Confirmed → In Progress

Revision history for this message

melanie witt (melwitt) wrote on 2014-07-04:

Increasing this to high because it affects the gate.

summary:	- Double powering-off state confuses the clinets and causes gate failure + Double powering-off state confuses the clients and causes gate failure
Changed in nova:
importance:	Medium → High

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-07-12: Fix merged to nova (master)

Reviewed: https://review.openstack.org/103174
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cc5388bbe81aba635fb757e202d860aeed98f3e8
Submitter: Jenkins
Branch: master

commit cc5388bbe81aba635fb757e202d860aeed98f3e8
Author: melanie witt <email address hidden>
Date: Fri Jun 27 15:45:15 2014 +0000

synchronize 'stop' and power state periodic task

    This change adds the synchronized decorator to the stop function and
    _sync_instance_power_state to address a race condition when powering
    off and updating the database for an instance. The power off and
    database update should be protected so the periodic task sees a
    consistent view when it checks power state from the driver and
    VM state from the database in order to take action.

    In the bug, a stopped instance is in the middle of being rebuilt,
    and while it is in the process of being returned to the original
    STOPPED state, _sync_power_states sees the combination of the
    power state from the driver being 'on' and the vm_state from
    the database as being STOPPED. This patch aims to ensure power
    'off' and vm_state set to STOPPED happen atomically in the
    stop_instance function. Thus, the read by _sync_power_states
    must also be synced on the instance UUID.

Change-Id: I8aa83ab8dca35878cf792ae2d46feaa9912ffd38
Closes-Bug: #1320628

Changed in nova:
status:	In Progress → Fix Committed

Revision history for this message

Matt Riedemann (mriedem) wrote on 2014-07-18:

Sadly it looks like we actually made the UnexpectedTaskStateError message in the logs worse after the sync patch merged on 7/14:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiVW5leHBlY3RlZFRhc2tTdGF0ZUVycm9yOiBVbmV4cGVjdGVkIHRhc2sgc3RhdGU6IGV4cGVjdGluZyAodSdwb3dlcmluZy1vZmYnLCkgYnV0IHRoZSBhY3R1YWwgc3RhdGUgaXMgTm9uZVwiIEFORCB0YWdzOlwic2NyZWVuLW4tY3B1LnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiJjdXN0b20iLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMDRUMTM6MDY6MzcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMThUMTM6MDY6MzcrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNTY4ODg4MTUwMn0=

Maybe the sync patch just syncs it in such a way that it makes the race less racey.

Revision history for this message

Matt Riedemann (mriedem) wrote on 2014-07-18:

Looks like this is contributing to bug 1266611 also.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-07-18: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/108014

Russell Bryant (russellb) on 2014-07-23

Changed in nova:
milestone:	none → juno-2
status:	Fix Committed → Fix Released

Revision history for this message

Brian Elliott (belliott) wrote on 2014-08-29:

This patch fixes the described locking condition, but introduces a locking bug:

https://bugs.launchpad.net/nova/+bug/1363231

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-09-12: Related fix merged to nova (master)

#10

Reviewed: https://review.openstack.org/108014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa1792eb4c1d10e9a192142ce7e20d37871d916a
Submitter: Jenkins
Branch: master

commit aa1792eb4c1d10e9a192142ce7e20d37871d916a
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 2 12:11:55 2014 -0700

Stop stack tracing when trying to auto-stop a stopped instance

    Commit cc5388bbe81aba635fb757e202d860aeed98f3e8 added locks to
    stop_instance and the _sync_power_states periodic task to try and fix a
    race between stopping the instance via the API where the task_state is
    set to powering-off, and the periodic task seeing the instance
    power_state as shutdown in _sync_instance_power_state and calling the
    stop API again, at which point the task_state is already None from the
    first stop API call and we get an UnexpectedTaskStateError.

    The handle_lifecycle_event method is getting callbacks from the libvirt
    driver on state changes on the VM and calling the
    _sync_instance_power_state method which may try to stop the instance
    asynchronously, and lead to UnexpectedTaskStateError if the instance is
    already stopped by the time it gets the lock and the task_state has
    changed.

    Attempting to lock in handle_lifecycle_event just moves the race around
    so this change adds logic to stop_instance such that if the instance
    says it's active but the virt driver says it's not running, then we add
    None to the expected_task_state so we don't stacktrace on
    instance.save().

    An alternative and/or additional change to this would be doing a call
    rather than a cast when _sync_instance_power_state calls the stop API
    but in some previous testing it doesn't appear to make a significant
    difference in the race found when we hit the stop_instance method.

Adds a bunch of debug logging since this code is inherently racey and
is needed when looking at failures around these operations.

    Closes-Bug: #1339235
    Closes-Bug: #1266611
    Related-Bug: #1320628

Change-Id: Ib495a5ab15de88051c5fa7abfb58a5445691dcad

Reviewed:  https://review.openstack.org/108014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa1792eb4c1d10e9a192142ce7e20d37871d916a
Submitter: Jenkins
Branch:    master

commit aa1792eb4c1d10e9a192142ce7e20d37871d916a
Author: Matt Riedemann <mriedem@us.ibm.com>
Date:   Tue Sep 2 12:11:55 2014 -0700

Stop stack tracing when trying to auto-stop a stopped instance
    
    Commit cc5388bbe81aba635fb757e202d860aeed98f3e8 added locks to
    stop_instance and the _sync_power_states periodic task to try and fix a
    race between stopping the instance via the API where the task_state is
    set to powering-off, and the periodic task seeing the instance
    power_state as shutdown in _sync_instance_power_state and calling the
    stop API again, at which point the task_state is already None from the
    first stop API call and we get an UnexpectedTaskStateError.
    
    The handle_lifecycle_event method is getting callbacks from the libvirt
    driver on state changes on the VM and calling the
    _sync_instance_power_state method which may try to stop the instance
    asynchronously, and lead to UnexpectedTaskStateError if the instance is
    already stopped by the time it gets the lock and the task_state has
    changed.
    
    Attempting to lock in handle_lifecycle_event just moves the race around
    so this change adds logic to stop_instance such that if the instance
    says it's active but the virt driver says it's not running, then we add
    None to the expected_task_state so we don't stacktrace on
    instance.save().
    
    An alternative and/or additional change to this would be doing a call
    rather than a cast when _sync_instance_power_state calls the stop API
    but in some previous testing it doesn't appear to make a significant
    difference in the race found when we hit the stop_instance method.
    
    Adds a bunch of debug logging since this code is inherently racey and
    is needed when looking at failures around these operations.
    
    Closes-Bug: #1339235
    Closes-Bug: #1266611
    Related-Bug: #1320628
    
    Change-Id: Ib495a5ab15de88051c5fa7abfb58a5445691dcad

Thierry Carrez (ttx) on 2014-10-16

Changed in nova:
milestone:	juno-2 → 2014.2

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.