UnexpectedTaskStateError: Unexpected task state: expecting (u'powering-off',) but the actual state is None

Bug #1339235 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Matt Riedemann
Milestone: 2014.2

Bug Description

This is showing up all over the n-cpu logs on teardown of tempest tests:

UnexpectedTaskStateError: Unexpected task state: expecting (u'powering-off',) but the actual state is None

For example:

http://logs.openstack.org/06/103206/4/check/check-tempest-dsvm-postgres-full/b5e8f3c/logs/screen-n-cpu.txt.gz?level=TRACE

We have nearly 40K hits on this in logstash in 7 days:

message:"UnexpectedTaskStateError: Unexpected task state: expecting (u'powering-off',) but the actual state is None" AND tags:"screen-n-cpu.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiVW5leHBlY3RlZFRhc2tTdGF0ZUVycm9yOiBVbmV4cGVjdGVkIHRhc2sgc3RhdGU6IGV4cGVjdGluZyAodSdwb3dlcmluZy1vZmYnLCkgYnV0IHRoZSBhY3R1YWwgc3RhdGUgaXMgTm9uZVwiIEFORCB0YWdzOlwic2NyZWVuLW4tY3B1LnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA0ODQxMjQ3MDk4fQ==

This is the interesting traceback from the compute manager:

2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 88, in wrapped
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher payload)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 71, in wrapped
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher return f(self, context, *args, **kw)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 272, in decorated_function
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher LOG.info(_("Task possibly preempted: %s") % e.format_message())
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 266, in decorated_function
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 330, in decorated_function
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 308, in decorated_function
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 296, in decorated_function
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 2356, in stop_instance
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher instance.save(expected_task_state=task_states.POWERING_OFF)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/objects/base.py", line 187, in wrapper
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher ctxt, self, fn.__name__, args, kwargs)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/conductor/rpcapi.py", line 349, in object_action
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher objmethod=objmethod, args=args, kwargs=kwargs)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher retry=self.retry)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher timeout=timeout, retry=retry)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 404, in send
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher retry=retry)
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 395, in _send
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher raise result
2014-07-08 00:46:47.922 18853 TRACE oslo.messaging.rpc.dispatcher UnexpectedTaskStateError_Remote: Unexpected task state: expecting (u'powering-off',) but the actual state is None
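
For context, the save() call at nova/compute/manager.py line 2356 in the traceback passes expected_task_state=task_states.POWERING_OFF, and Instance.save() treats that argument as a compare-and-swap guard against the task_state currently stored in the database: if the stored value is not one of the expected states, the save raises UnexpectedTaskStateError instead of writing. A minimal, hypothetical sketch of that guard (the function and dict below are illustrative stand-ins, not the real nova objects):

# Hypothetical sketch of the expected_task_state guard on save(); the real
# check lives in nova's object/DB layer, this only illustrates the behavior.

class UnexpectedTaskStateError(Exception):
    pass


def save(db_row, updates, expected_task_state=None):
    if expected_task_state is not None:
        # A single expected state is normalized to a tuple of allowed states,
        # which is why the error message shows a tuple like (u'powering-off',).
        if not isinstance(expected_task_state, (list, tuple)):
            expected_task_state = (expected_task_state,)
        if db_row['task_state'] not in expected_task_state:
            raise UnexpectedTaskStateError(
                'Unexpected task state: expecting %r but the actual state '
                'is %s' % (expected_task_state, db_row['task_state']))
    db_row.update(updates)


# By the time the stop in the traceback reaches its save(), something else
# has already reset task_state to None, so the guard fails:
row = {'task_state': None}
save(row, {'vm_state': 'stopped', 'task_state': None},
     expected_task_state='powering-off')   # raises UnexpectedTaskStateError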

Tags: compute
Revision history for this message
Matt Riedemann (mriedem) wrote :

I see this right before the error:

2014-07-08 00:46:47.916 INFO nova.compute.manager [req-f1763a6c-d8b1-4a7f-af14-cc5df9807611 None None] Task possibly preempted: Unexpected task state: expecting (u'powering-off',) but the actual state is None

That leads me to believe the periodic task that syncs instance power states between the database and the hypervisor (via the virt driver) is updating the database and causing the race.

So this might be a duplicate of bug 1320628.
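
If that is what is happening, the failure mode is that two stop paths race to finish the same power-off, and whichever stop_instance runs second finds that the first one already reset task_state to None. A hypothetical illustration of the interleaving (a plain dict stands in for the instance record; none of this is nova code):

# Hypothetical illustration of the suspected race.
row = {'vm_state': 'active', 'task_state': 'powering-off'}  # stop in flight

# One of the two stop paths (the user-requested stop or the one kicked off
# by the power-state sync / lifecycle event) completes first and clears the
# task state in the database:
row.update({'vm_state': 'stopped', 'task_state': None})

# The other path's stop_instance then reaches
#     instance.save(expected_task_state=task_states.POWERING_OFF)
# and its guard fails, because the stored task_state is no longer
# 'powering-off':
expected = ('powering-off',)
if row['task_state'] not in expected:
    raise RuntimeError('Unexpected task state: expecting %r but the actual '
                       'state is %s' % (expected, row['task_state']))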

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like this is contributing to bug 1266611 also.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/108014

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/108014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa1792eb4c1d10e9a192142ce7e20d37871d916a
Submitter: Jenkins
Branch: master

commit aa1792eb4c1d10e9a192142ce7e20d37871d916a
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 2 12:11:55 2014 -0700

    Stop stack tracing when trying to auto-stop a stopped instance

    Commit cc5388bbe81aba635fb757e202d860aeed98f3e8 added locks to
    stop_instance and the _sync_power_states periodic task to try and fix a
    race between stopping the instance via the API where the task_state is
    set to powering-off, and the periodic task seeing the instance
    power_state as shutdown in _sync_instance_power_state and calling the
    stop API again, at which point the task_state is already None from the
    first stop API call and we get an UnexpectedTaskStateError.

    The handle_lifecycle_event method is getting callbacks from the libvirt
    driver on state changes on the VM and calling the
    _sync_instance_power_state method which may try to stop the instance
    asynchronously, and lead to UnexpectedTaskStateError if the instance is
    already stopped by the time it gets the lock and the task_state has
    changed.

    Attempting to lock in handle_lifecycle_event just moves the race around
    so this change adds logic to stop_instance such that if the instance
    says it's active but the virt driver says it's not running, then we add
    None to the expected_task_state so we don't stacktrace on
    instance.save().

    An alternative and/or additional change to this would be doing a call
    rather than a cast when _sync_instance_power_state calls the stop API
    but in some previous testing it doesn't appear to make a significant
    difference in the race found when we hit the stop_instance method.

    Adds a bunch of debug logging, since this code is inherently racy and
    the logging is needed when looking at failures around these operations.

    Closes-Bug: #1339235
    Closes-Bug: #1266611
    Related-Bug: #1320628

    Change-Id: Ib495a5ab15de88051c5fa7abfb58a5445691dcad
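
The core of the approach described above, as a sketch (this paraphrases the idea rather than reproducing the merged patch; the helper calls approximate nova's compute manager of that era, and the locking, notifications, and extra debug logging in the real change are omitted):

from nova.compute import power_state
from nova.compute import task_states
from nova.compute import vm_states


def stop_instance(self, context, instance):
    # Normally the only task state we expect to race against is the
    # powering-off set by the stop API.
    expected_task_state = [task_states.POWERING_OFF]

    current_power_state = self._get_power_state(context, instance)
    if (instance.vm_state == vm_states.ACTIVE and
            current_power_state in (power_state.NOSTATE,
                                    power_state.SHUTDOWN,
                                    power_state.CRASHED)):
        # The instance record says active but the hypervisor says the guest
        # is not running: another stop has probably already finished, so
        # also tolerate task_state having been reset to None.
        expected_task_state.append(None)

    self.driver.power_off(instance)

    instance.power_state = self._get_power_state(context, instance)
    instance.vm_state = vm_states.STOPPED
    instance.task_state = None
    instance.save(expected_task_state=expected_task_state)

The effect is that a duplicate stop arriving after the guest is already down no longer trips the expected_task_state guard, while a genuinely unexpected transition still raises as before.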

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2