Unexpected task state: expecting [None] but the actual state is powering-off

Bug #1334345 reported by Morgan Fainberg
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Matt Riedemann

Bug Description

It appears that nova compute is expecting (in a number of cases) an instance to be in powering-off state, but the state is actually none. This appears to have a wide range of failure modes.

I'm seeing this ~2800 times in the last 7 days

http://logstash.openstack.org/#eyJzZWFyY2giOiJ0YWdzOlwic2NyZWVuLW4tY3B1LnR4dFwiIEFORCBtZXNzYWdlOlwib3Nsby5tZXNzYWdpbmcucnBjLmRpc3BhdGNoZXIgVW5leHBlY3RlZFRhc2tTdGF0ZUVycm9yOiBVbmV4cGVjdGVkIHRhc2sgc3RhdGU6IGV4cGVjdGluZyAodSdwb3dlcmluZy1vZmYnLCkgYnV0IHRoZSBhY3R1YWwgc3RhdGUgaXMgTm9uZVwiIEFORCBidWlsZF9zdGF0dXM6XCJGQUlMVVJFXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIiLCJzdGFtcCI6MTQwMzcxMTkxODcxNH0=

Example traceback:

UnexpectedTaskStateError: Unexpected task state: expecting (u'powering-off',) but the actual state is None
 to caller
2014-06-25 05:01:31.058 ERROR oslo.messaging._drivers.common [req-ae166761-731e-4ead-b9da-eb8e8b1cf6af None None]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
    incoming.message))
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
    result = getattr(endpoint, method)(ctxt, **new_args)
  File "/opt/stack/new/nova/nova/exception.py", line 88, in wrapped
    payload)
  File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/exception.py", line 71, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 278, in decorated_function
    LOG.info(_("Task possibly preempted: %s") % e.format_message())
  File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 272, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 336, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 314, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 302, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 2346, in stop_instance
    instance.save(expected_task_state=task_states.POWERING_OFF)
  File "/opt/stack/new/nova/nova/objects/base.py", line 187, in wrapper
    ctxt, self, fn.__name__, args, kwargs)
  File "/opt/stack/new/nova/nova/conductor/rpcapi.py", line 354, in object_action
    objmethod=objmethod, args=args, kwargs=kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
    retry=self.retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 401, in send
    retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 392, in _send
    raise result
UnexpectedTaskStateError_Remote: Unexpected task state: expecting (u'powering-off',) but the actual state is None
Traceback (most recent call last):

  File "/opt/stack/new/nova/nova/conductor/manager.py", line 404, in _object_dispatch
    return getattr(target, method)(context, *args, **kwargs)

  File "/opt/stack/new/nova/nova/objects/base.py", line 196, in wrapper
    return fn(self, ctxt, *args, **kwargs)

  File "/opt/stack/new/nova/nova/objects/instance.py", line 473, in save
    columns_to_join=_expected_cols(expected_attrs))

  File "/opt/stack/new/nova/nova/db/api.py", line 780, in instance_update_and_get_original
    columns_to_join=columns_to_join)

  File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 164, in wrapper
    return f(*args, **kwargs)

  File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 2229, in instance_update_and_get_original
    columns_to_join=columns_to_join)

  File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 2280, in _instance_update
    actual=actual_state, expected=expected)

UnexpectedTaskStateError: Unexpected task state: expecting (u'powering-off',) but the actual state is None

Revision history for this message
Tracy Jones (tjones-i) wrote :

can you give any more info on when/how you are seeing this? Also version #s would be helpful

https://wiki.openstack.org/wiki/BugFilingRecommendations

tags: added: api
Changed in nova:
importance: Undecided → Critical
importance: Critical → High
Changed in nova:
status: New → Triaged
Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

@tracy, I'm seeing it in the high list of unclassified bugs in Elastic Recheck, I am seeing it against Master - trying to make changes to devstack and I ran across this a few times. Further research led me to the Logstash link above.

I have not been able to duplicate this in a local environment

Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

s/been able to/had a chance to attempt to duplicate this.

Revision history for this message
Joe Gordon (jogo) wrote :

So yes, this is bad, but unfortunately the fingerprint you gave tags:"screen-n-cpu.txt" AND message:"oslo.messaging.rpc.dispatcher UnexpectedTaskStateError: Unexpected task state: expecting (u'powering-off',) but the actual state is None" hits successful jobs more then failed jobs. My guess is this is either not the root cause, or this bug only causes tempest to fail in specific cases and not every time.

Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

@Joe Gordon, Happy to have this bug shuffled / retargeted where it's needed. I did also tag build status failure specifically since this seemed to be (at least from what I could tell) one of the major common factors between the various failure modes.

Let me know what I can do to help.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I hit this in nova unit tests:

http://logs.openstack.org/94/102594/1/check/gate-nova-python27/014a5e6/console.html

Starting from this test:

nova.tests.api.ec2.test_cloud.CloudTestCase.test_create_image_with_reboot

2014-06-27 06:20:27.115 | Traceback (most recent call last):
2014-06-27 06:20:27.115 | File "nova/tests/api/ec2/test_cloud.py", line 2538, in test_create_image_with_reboot
2014-06-27 06:20:27.115 | self._do_test_create_image(False)
2014-06-27 06:20:27.116 | File "nova/tests/api/ec2/test_cloud.py", line 2511, in _do_test_create_image
2014-06-27 06:20:27.116 | no_reboot=no_reboot)
2014-06-27 06:20:27.116 | File "nova/api/ec2/cloud.py", line 1746, in create_image
2014-06-27 06:20:27.116 | self.compute_api.start(context, instance)
2014-06-27 06:20:27.116 | File "nova/compute/api.py", line 189, in inner
2014-06-27 06:20:27.116 | return function(self, context, instance, *args, **kwargs)
2014-06-27 06:20:27.116 | File "nova/compute/api.py", line 180, in wrapped
2014-06-27 06:20:27.116 | return function(self, context, instance, *args, **kwargs)
2014-06-27 06:20:27.116 | File "nova/compute/api.py", line 216, in _wrapped
2014-06-27 06:20:27.116 | return fn(self, context, instance, *args, **kwargs)
2014-06-27 06:20:27.116 | File "nova/compute/api.py", line 170, in inner
2014-06-27 06:20:27.117 | return f(self, context, instance, *args, **kw)
2014-06-27 06:20:27.117 | File "nova/compute/api.py", line 1775, in start
2014-06-27 06:20:27.117 | instance.save(expected_task_state=[None])
2014-06-27 06:20:27.117 | File "nova/objects/base.py", line 196, in wrapper
2014-06-27 06:20:27.117 | return fn(self, ctxt, *args, **kwargs)
2014-06-27 06:20:27.117 | File "nova/objects/instance.py", line 470, in save
2014-06-27 06:20:27.117 | columns_to_join=_expected_cols(expected_attrs))
2014-06-27 06:20:27.117 | File "nova/db/api.py", line 780, in instance_update_and_get_original
2014-06-27 06:20:27.117 | columns_to_join=columns_to_join)
2014-06-27 06:20:27.117 | File "nova/db/sqlalchemy/api.py", line 164, in wrapper
2014-06-27 06:20:27.117 | return f(*args, **kwargs)
2014-06-27 06:20:27.117 | File "nova/db/sqlalchemy/api.py", line 2229, in instance_update_and_get_original
2014-06-27 06:20:27.118 | columns_to_join=columns_to_join)
2014-06-27 06:20:27.118 | File "nova/db/sqlalchemy/api.py", line 2280, in _instance_update
2014-06-27 06:20:27.118 | actual=actual_state, expected=expected)
2014-06-27 06:20:27.118 | UnexpectedTaskStateError: Unexpected task state: expecting [None] but the actual state is powering-off

tags: added: ec2
Changed in nova:
status: Triaged → Confirmed
summary: - State expected powering off but is None
+ Unexpected task state: expecting [None] but the actual state is
+ powering-off
tags: added: testing
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is a better query for now since it restricts to the unit tests:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiVW5leHBlY3RlZFRhc2tTdGF0ZUVycm9yOiBVbmV4cGVjdGVkIHRhc2sgc3RhdGU6IGV4cGVjdGluZyBbTm9uZV0gYnV0IHRoZSBhY3R1YWwgc3RhdGUgaXMgcG93ZXJpbmctb2ZmXCIgQU5EIHRhZ3M6XCJjb25zb2xlXCIgQU5EIChidWlsZF9uYW1lOlwiZ2F0ZS1ub3ZhLXB5dGhvbjI2XCIgT1IgYnVpbGRfbmFtZTpcImdhdGUtbm92YS1weXRob24yN1wiKSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwMzg3Mzc3Nzc1M30=

message:"UnexpectedTaskStateError: Unexpected task state: expecting [None] but the actual state is powering-off" AND tags:"console" AND (build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")

6 hits in 7 days, check and gate, all failures.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Noticing this in the trace of that unit test failure:

libvirt: error : internal error: could not initialize domain event timer
...
WARNING [nova.compute.manager] Instance is not stopped. Calling the stop API.

It blows up when trying to (re)start the instance in cloud.py:

http://git.openstack.org/cgit/openstack/nova/tree/nova/api/ec2/cloud.py#n1746

Note there is a TODO comment in the ec2 create_image code:

    # TODO(yamahata): race condition
    # At the moment there is no way to prevent others from
    # manipulating instances/volumes/snapshots.
    # As other code doesn't take it into consideration, here we don't
    # care of it for now. Ostrich algorithm

By ostrich I'm assuming put our heads in the sand. :(

Looking at the code, it stops the instance and then waits for it to be stopped, there is a hard-coded 1 hour timeout in the code and then when it's vm_state is stopped it moves on to creating the snapshot. Once the snapshot is done it goes to start the instance and that's where we hit the invalid task_state failure.

e-r query here: https://review.openstack.org/#/c/103119/

Changed in nova:
status: Confirmed → Triaged
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/103161

Changed in nova:
status: Triaged → In Progress
Matt Riedemann (mriedem)
tags: added: icehouse-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/103161
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9a01e62693a28a73120544b27ee2104558e0250e
Submitter: Jenkins
Branch: master

commit 9a01e62693a28a73120544b27ee2104558e0250e
Author: Matt Riedemann <email address hidden>
Date: Fri Jun 27 06:46:10 2014 -0700

    Enforce task_state is None in ec2 create_image stop instance wait loop

    We're hitting races in the gate where the instance.vm_state is STOPPED
    but the task_state is POWERING_OFF so when the compute_api.start method
    is called, we're in an invalid task state and fail.

    The compute manager's stop_instance method is correctly setting the
    vm_state to STOPPED and the task_state to None when the instance is
    powered off via the virt driver, so we must be hitting this from races
    in the ec2 API as noted in the TODO above the method definition.

    This change simply checks the task_state in addition to the vm_state
    in the wait loop before continuing. The error message is also updated
    for context by including the instance uuid, vm_state and task_state,
    and removes the timeout value in the message since it was in
    milliseconds, not seconds, to begin with.

    There is already a unit test that covers this change (which was racing,
    hence the bug). There are no changes to that unit test since it's really
    an integration test that's running through the compute API and compute
    manager code, so the fix tests itself.

    Closes-Bug: #1334345

    Change-Id: I13f0c743cadda6439ae15607a9ef6e4e4985626d

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
Revision history for this message
David Stanek (dstanek) wrote :
Download full text (4.5 KiB)

It looks like I may have hit this today in https://review.openstack.org/#/c/112299/

The traceback actually appears as a log message in http://logs.openstack.org/99/112299/1/check/check-tempest-dsvm-full/45774bf/console.html

2014-08-06 14:24:51.299 | *** Not Whitelisted *** 2014-08-06 14:04:30.950 ERROR oslo.messaging._drivers.common [req-a85bc34d-eed9-478d-8dc3-c10a29153b0e None None] ['Traceback (most recent call last):\n', ' File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply\n incoming.message))\n', ' File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', ' File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch\n result = getattr(endpoint, method)(ctxt, **new_args)\n', ' File "/opt/stack/new/nova/nova/exception.py", line 88, in wrapped\n payload)\n', ' File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/opt/stack/new/nova/nova/exception.py", line 71, in wrapped\n return f(self, context, *args, **kw)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 284, in decorated_function\n LOG.info(_("Task possibly preempted: %s") % e.format_message())\n', ' File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 278, in decorated_function\n return function(self, context, *args, **kwargs)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 342, in decorated_function\n return function(self, context, *args, **kwargs)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 320, in decorated_function\n kwargs[\'instance\'], e, sys.exc_info())\n', ' File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 308, in decorated_function\n return function(self, context, *args, **kwargs)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 2396, in stop_instance\n do_stop_instance()\n', ' File "/opt/stack/new/nova/nova/openstack/common/lockutils.py", line 325, in inner\n return f(*args, **kwargs)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 2392, in do_stop_instance\n instance.save(expected_task_state=task_states.POWERING_OFF)\n', ' File "/opt/stack/new/nova/nova/objects/base.py", line 187, in wrapper\n ctxt, self, fn.__name__, args, kwargs)\n', ' File "/opt/stack/new/nova/nova/conductor/rpcapi.py", line 346, in object_action\n objmethod=objmethod, args=args, kwargs=kwargs)\n', ' File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call\n retry=self.retry)\n', ' File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send\n timeout=timeout, retry=retry)\n', ' File "/usr/local/lib/pyth...

Read more...

Thierry Carrez (ttx)
Changed in nova:
milestone: juno-2 → 2014.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.