Bug #1140324: Uncommon states from nova are not handled properly during instance creation (OpenStack Heat)

Steven Hardy (shardy) wrote on 2013-03-08:

#1

I'm not sure this is valid. Which states exactly, other than BUILD, do you expect us to encounter that do not mean we've hit some sort of error?

I suppose we could have a tuple containing every valid state other than BUILD or ACTIVE, and enter the error path if we see any of them, but then we have a problem if any new states are added in nova. It seems to me the only two valid states for our use case *are* BUILD followed by ACTIVE, and anything else is an error, so we are already doing the right thing?
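
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Heat's actual code: the function name `check_create_complete` and the exception `UnexpectedStatus` are hypothetical, and the status is passed in as a plain string.

```python
class UnexpectedStatus(Exception):
    """Raised when a server reports a status the check does not expect."""


def check_create_complete(status):
    """Strict check: BUILD means keep waiting, ACTIVE means done.

    Every other status is treated as a creation failure, including
    statuses that nova might add in the future.
    """
    if status == 'ACTIVE':
        return True
    if status == 'BUILD':
        return False
    raise UnexpectedStatus('Unexpected server status: %s' % status)
```

With this shape no list of error states has to be maintained, at the cost of failing on any status that is merely transient.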

Clint Byrum (clint-fewbar) wrote on 2013-03-08: Re: [Bug 1140324] Re: Uncommon states from nova are not handled properly during instance creation

#2

Excerpts from Steven Hardy's message of 2013-03-08 11:54:16 UTC:
> I'm not sure this is valid - which states exactly do you expect us to
> encounter other than BUILD, which do not mean we've hit some sort of
> error?
>
> I supposed we could have a tuple containing every valid state other than
> BUILD or ACTIVE, and enter the error path if we see any of them, but
> then we have a problem if any new states are added in nova - it seems to
> me the only two valid states for our use-case *are* BUILD followed by
> ACTIVE, anything else is an error, so we are already doing the right
> thing?
>

I only think we should care about statuses that we know how to handle. All
of the following are perfectly valid, and if we are separating "create"
from "active", then it's possible there could be maintenance going on by
administrators between when we create the server and when we seek to verify
it is ACTIVE. These should all be waited on, as they can all lead to ACTIVE:

HARD_REBOOT. The server is hard rebooting. This is equivalent to
pulling the power plug on a physical server, plugging it back in, and
rebooting it.

PASSWORD. The password is being reset on the server.

REBOOT. The server is in a soft reboot state. A reboot command was passed
to the operating system.

RESCUE. The server is in rescue mode.

RESIZE. Server is performing the differential copy of data that changed
during its initial copy. Server is down for this stage.

REVERT_RESIZE. The resize or migration of a server failed for some
reason. The destination server is being cleaned up and the original
source server is restarting.

SHUTOFF. The virtual machine (VM) was powered down by the user, but not
through the OpenStack Compute API. For example, the user issued a shutdown
-h command from within the server instance. If the OpenStack Compute
manager detects that the VM was powered down, it transitions the server
instance to the SHUTOFF status. If you use the OpenStack Compute API to
restart the instance, the instance might be deleted first, depending on
the value in the shutdown_terminate database field on the Instance model.

SUSPENDED. The server is suspended, either by request or necessity. This
status appears for only the following hypervisors: XenServer/XCP, KVM,
and ESXi. Review support tickets or contact Rackspace support to determine
why the server is in this state.

VERIFY_RESIZE. System is awaiting confirmation that the server is
operational after a move or resize.

None of these are permanent. Some may require manual intervention
using nova commands, but they required manual intervention to get to that
stage as well.

If more are added, they should be treated as ERROR until we discover
them and add support.
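
One way to express this proposal is an explicit allow-list of statuses to wait on. The sketch below is illustrative only: the names `WAITABLE_STATUSES` and `classify_status` are hypothetical, and the set simply mirrors the statuses listed above plus BUILD.

```python
# Statuses that are not ACTIVE yet but can still lead to ACTIVE.
# Taken from the list above, plus BUILD (hypothetical constant name).
WAITABLE_STATUSES = frozenset([
    'BUILD',
    'HARD_REBOOT',
    'PASSWORD',
    'REBOOT',
    'RESCUE',
    'RESIZE',
    'REVERT_RESIZE',
    'SHUTOFF',
    'SUSPENDED',
    'VERIFY_RESIZE',
])


def classify_status(status):
    """Map a server status to 'done', 'wait' or 'error'.

    Statuses that are not known here, including any added to nova
    later, fall through to 'error' until support is added.
    """
    if status == 'ACTIVE':
        return 'done'
    if status in WAITABLE_STATUSES:
        return 'wait'
    return 'error'
```

Because the list names what is waited on rather than what fails, a newly introduced status is rejected by default instead of being waited on indefinitely.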

Steven Hardy (shardy) wrote on 2013-03-21:

#3

> I only think we should care about statuses that we know how to handle

I agree, which is why I believe the current behaviour is correct. Anything other than BUILD followed by ACTIVE we do not know how to handle: it means something unexpected happened, either an error in nova or out-of-band actions via the nova CLI etc. In either case, we're in an unknown and unhandled state, so I believe we are correct to assert that an error has occurred in resource creation.

Clint Byrum (clint-fewbar) wrote on 2013-03-21:

#4

Excerpts from Steven Hardy's message of 2013-03-21 11:07:23 UTC:
> > I only think we should care about statuses that we know how to handle
>
> I agree, which is why I believe the current behaviour is correct -
> anything other than BUILD followed by ACTIVE we do not know how to
> handle, it means something unexpected happened, either an error in nova,
> or out-of-band actions via the nova CLI etc, in either case, we're in an
> unknown and unhandled state, so I believe we are correct to assert an
> error has occurred in resource creation.

My thinking is that those other statuses are legitimate, and can very well
lead to ACTIVE, and so should be waited on. It's not critical, since they are
uncommon, but it's a bug to give up on a server just because the operator
of the compute node migrates it. Also consider a case where the user
does something requiring a reboot of the instance right after boot up.

BUILD -> ACTIVE is just an implementation detail; it does not cover
all of the use cases. That said, I think this is low priority, as we do
handle *most* use cases.

Steven Hardy (shardy) wrote on 2013-04-22:

#5

Assigning to reporter so he can post a patch with a fix ;)

Changed in heat:
status:	New → Triaged
importance:	Undecided → Low
assignee:	nobody → Clint Byrum (clint-fewbar)
milestone:	none → havana-1

Clint Byrum (clint-fewbar) on 2013-05-01

Changed in heat:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2013-05-01: Fix proposed to heat (master)

#6

Fix proposed to branch: master
Review: https://review.openstack.org/27995

OpenStack Infra (hudson-openstack) on 2013-05-07

Changed in heat:
assignee:	Clint Byrum (clint-fewbar) → Jeremy Stanley (fungi)

Jeremy Stanley (fungi) on 2013-05-07

Changed in heat:
assignee:	Jeremy Stanley (fungi) → nobody

Steven Hardy (shardy) on 2013-05-07

Changed in heat:
assignee:	nobody → Clint Byrum (clint-fewbar)

OpenStack Infra (hudson-openstack) wrote on 2013-05-07: Fix merged to heat (master)

#7

Reviewed: https://review.openstack.org/27995
Committed: http://github.com/openstack/heat/commit/9eec986f82e76b899835a8e6a1fafaa4474a7ff4
Submitter: Jenkins
Branch: master

commit 9eec986f82e76b899835a8e6a1fafaa4474a7ff4
Author: Clint Byrum <email address hidden>
Date: Wed May 1 15:51:03 2013 -0700

    Wait for any nova server status that makes sense

    Nova may return some transient states based on operator actions that do
    not mean a resource has failed. Rather than report these as unexpected,
    wait on them just like BUILD.

    Fixes bug #1140324

    Change-Id: I757c073fb8a7da44f41e9a9cb9ae71dbc35d3c33
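
As a rough illustration of what "wait on them just like BUILD" means in a polling loop, here is a minimal sketch. It is not the code from the commit: `wait_for_active`, `ServerCreateFailed`, `TRANSIENT_STATUSES` and the `get_status` callable are all hypothetical names, and the transient statuses are the ones discussed earlier in this thread.

```python
import time

# Hypothetical tuple of statuses that are polled on rather than failed.
TRANSIENT_STATUSES = (
    'BUILD', 'HARD_REBOOT', 'PASSWORD', 'REBOOT', 'RESCUE', 'RESIZE',
    'REVERT_RESIZE', 'SHUTOFF', 'SUSPENDED', 'VERIFY_RESIZE',
)


class ServerCreateFailed(Exception):
    """Raised when the server cannot be confirmed as ACTIVE."""


def wait_for_active(get_status, max_polls=30, interval=1.0, sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE.

    Returns the number of polls made. Raises ServerCreateFailed on a
    status outside TRANSIENT_STATUSES, or when max_polls is exhausted.
    """
    for poll in range(1, max_polls + 1):
        status = get_status()
        if status == 'ACTIVE':
            return poll
        if status not in TRANSIENT_STATUSES:
            raise ServerCreateFailed('Unexpected status: %s' % status)
        sleep(interval)
    raise ServerCreateFailed('Gave up after %d polls' % max_polls)


# Usage: a server that an operator resizes shortly after it is created.
statuses = iter(['BUILD', 'RESIZE', 'VERIFY_RESIZE', 'ACTIVE'])
polls = wait_for_active(lambda: next(statuses), sleep=lambda seconds: None)
```

The `max_polls` bound is an assumption added here so that a server stuck in a transient status such as SHUTOFF still ends in a failure rather than an endless wait.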

Changed in heat:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2013-05-29

Changed in heat:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2013-10-17

Changed in heat:
milestone:	havana-1 → 2013.2
