New state machine handles deploy failures incorrectly

Bug #1405420 reported by Vladyslav Drok
This bug affects 3 people
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Ruby Loo

Bug Description

When an exception occurs during deployment, the state machine moves the node to the DEPLOYFAIL provision state without clearing target_provision_state.
For example, when this exception occurs:

  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 455, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 212, in main
    result = function(*args, **kwargs)
  File "/opt/stack/ironic/ironic/conductor/manager.py", line 712, in _do_node_deploy
    node.last_error = _("Failed to deploy. Error: %s") % e
  File "/usr/local/lib/python2.7/dist-packages/oslo/utils/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/ironic/ironic/conductor/manager.py", line 700, in _do_node_deploy
    task.driver.deploy.prepare(task)
  File "/opt/stack/ironic/ironic/drivers/modules/agent.py", line 256, in prepare
    node.instance_info = build_instance_info_for_deploy(task)
  File "/opt/stack/ironic/ironic/drivers/modules/agent.py", line 169, in build_instance_info_for_deploy
    reason=_("Agent deploy supports only HTTP URLs"))
ImageUnacceptable: Image file:///home/user/aaa is unacceptable: Agent deploy supports only HTTP URLs

After that, any node-update request fails with:

Node 5189df61-abd0-498d-b1c5-1bba22d33376 can not be updated while a state transition is in progress. (HTTP 409)

A node-set-provision-state request fails with:

Node 5189df61-abd0-498d-b1c5-1bba22d33376 is already being provisioned or decommissioned. (HTTP 409)

In the ironic.nodes DB table, target_provision_state for this node remains 'deploy done'.
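
To make the failure mode concrete, here is a minimal, self-contained Python sketch of the behaviour described above. It is not Ironic's code: the Node class, do_node_deploy, update_node, failing_prepare and the Conflict exception are hypothetical stand-ins, and the values of the state constants are assumptions.

```python
# Hypothetical, simplified model of the reported behaviour; not Ironic's code.
NOSTATE = None
DEPLOYING = "deploying"        # assumed constant values
DEPLOYFAIL = "deploy failed"
ACTIVE = "active"


class Conflict(Exception):
    """Stands in for an HTTP 409 response."""


class Node:
    def __init__(self):
        self.provision_state = NOSTATE
        self.target_provision_state = NOSTATE
        self.last_error = None
        self.extra = {}


def do_node_deploy(node, prepare):
    """Run a deploy; on failure, move to DEPLOYFAIL but keep the target."""
    node.provision_state = DEPLOYING
    node.target_provision_state = ACTIVE
    try:
        prepare(node)
    except Exception as e:
        node.last_error = "Failed to deploy. Error: %s" % e
        node.provision_state = DEPLOYFAIL
        # target_provision_state is NOT cleared here: this is the problem
        # described in the report.
        return
    node.provision_state = ACTIVE
    node.target_provision_state = NOSTATE


def update_node(node, **changes):
    """Reject updates while a target provision state is set."""
    if node.target_provision_state is not NOSTATE:
        raise Conflict("Node can not be updated while a state "
                       "transition is in progress.")
    node.extra.update(changes)


def failing_prepare(node):
    """Simulates the driver's prepare step raising during deployment."""
    raise ValueError("Agent deploy supports only HTTP URLs")
```

Calling do_node_deploy(node, failing_prepare) leaves the node in DEPLOYFAIL with its target still set, so every later update_node call raises Conflict even though no transition is actually running.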

Ruby Loo (rloo)
Changed in ironic:
status: New → Triaged
status: Triaged → Confirmed
Dmitry Tantsur (divius)
Changed in ironic:
importance: Undecided → High
Revision history for this message
Ruby Loo (rloo) wrote :

This is due to the fsm code/states that were added. It used to be that when a node was in DEPLOYFAIL provision_state, the target_provision_state was set to NOSTATE.

With the fsm changes (to move towards the new state machine [1]), when a node is put in DEPLOYFAIL provision_state, the target_provision_state is ACTIVE (the same target as for DEPLOYING provision_state). As described in the bug, this prevents API requests like node-update from succeeding.

For now, we could set the target_provision_state to NOSTATE, but we'll need to decide soon on the 'right' way to handle nodes that are in a *FAIL provision_state.

Alternatively, we could change the code to, e.g., allow node-updates if the provision_state is *FAIL?

[1] http://specs.openstack.org/openstack/ironic-specs/specs/kilo/new-ironic-state-machine.html
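
The two options in this comment can be contrasted with a rough Python sketch. Everything here is illustrative: nodes are plain dicts, the function names and FAIL_STATES are hypothetical, and the values of the state constants are assumptions rather than Ironic's definitions.

```python
# Illustrative sketch only; names and constant values are assumptions.
NOSTATE = None
ACTIVE = "active"
DEPLOYFAIL = "deploy failed"
FAIL_STATES = {DEPLOYFAIL}  # the only *FAIL state discussed in this report


def fail_deploy_clearing_target(node):
    """Option 1: on failure, reset the target to NOSTATE (pre-fsm behaviour)."""
    node["provision_state"] = DEPLOYFAIL
    node["target_provision_state"] = NOSTATE


def fail_deploy_keeping_target(node):
    """Behaviour after the fsm changes: the target stays ACTIVE on failure."""
    node["provision_state"] = DEPLOYFAIL
    node["target_provision_state"] = ACTIVE


def update_allowed_strict(node):
    """Existing style of check: any set target blocks updates."""
    return node["target_provision_state"] is NOSTATE


def update_allowed_relaxed(node):
    """Option 2: also allow updates when the node is in a *FAIL state."""
    return (node["target_provision_state"] is NOSTATE
            or node["provision_state"] in FAIL_STATES)
```

Option 1 keeps the API check unchanged but depends on NOSTATE. Option 2 leaves the target as the state machine set it and moves the decision into the API check.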

Ruby Loo (rloo)
Changed in ironic:
assignee: nobody → Ruby Loo (rloo)
Revision history for this message
Ruby Loo (rloo) wrote :

Prior to the fsm code/state changes, when a node's provision_state was DEPLOYFAIL, one could:
- PUT /v1/nodes/(node_uuid)/states/provision, with target being one of [DELETE, ACTIVE, REBUILD]
- PATCH /v1/nodes

If we changed the target_provision_state to NOSTATE when provision_state is DEPLOYFAIL, that would fix it for now. But the proposed new state machine doesn't have a NOSTATE state, so this would have to change again.

With the new state machine, if we really want the target_provision_state to be ACTIVE when provision_state is DEPLOYFAIL, then for now, we can make code changes that get us closer to that 'final solution'.

I'm going to make a change to ironic-api, so that operations mentioned above will be allowed for nodes in provision_state=DEPLOYFAIL. It may be the case that we want to do this for most or all nodes in a *FAIL provision state, but that is something we should think about for each individual *FAIL state and this is the only *FAIL state that we have now.
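
As a rough illustration of the kind of API-side check described here, the following Python sketch validates a provision-state request and lets DEPLOYFAIL nodes through. It is an assumption-laden sketch, not the merged change: check_provision_request, Conflict, BadRequest and ALLOWED_TARGETS are hypothetical names, and the string values of the targets are assumed.

```python
# Hypothetical sketch of an API-side check; not the actual ironic-api code.
NOSTATE = None
DEPLOYFAIL = "deploy failed"   # assumed constant values
ACTIVE = "active"
DELETED = "deleted"
REBUILD = "rebuild"
ALLOWED_TARGETS = {ACTIVE, DELETED, REBUILD}


class Conflict(Exception):
    """Stands in for an HTTP 409 response."""


class BadRequest(Exception):
    """Stands in for an HTTP 400 response."""


def check_provision_request(provision_state, target_provision_state,
                            requested_target):
    """Return the requested target if the request may proceed."""
    if requested_target not in ALLOWED_TARGETS:
        raise BadRequest("Unsupported target: %s" % requested_target)
    in_progress = target_provision_state is not NOSTATE
    # A set target normally means a transition is running. A node in
    # DEPLOYFAIL is treated as idle even though its target is still set.
    if in_progress and provision_state != DEPLOYFAIL:
        raise Conflict("Node is already being provisioned or "
                       "decommissioned.")
    return requested_target
```

With this shape, a node with provision_state DEPLOYFAIL and target ACTIVE accepts a new target, while a node that is genuinely mid-deploy is still rejected.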

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/149027

Changed in ironic:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/149027
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=7b16cb7401e72cce31f7fa53af11255283f792f4
Submitter: Jenkins
Branch: master

commit 7b16cb7401e72cce31f7fa53af11255283f792f4
Author: Ruby Loo <email address hidden>
Date: Wed Jan 21 19:04:59 2015 +0000

    Allow operations on DEPLOYFAIL'd nodes

    Prior to the code changes to use fsm, when a node's provision state
    was set to DEPLOYFAIL, its target_provision_state was set to NOSTATE.
    These operations could be performed on these failed nodes:
    - PUT /v1/nodes/(node_uuid)/states/provision, with target being one
      of [DELETE, ACTIVE, REBUILD]
    - PATCH /v1/nodes

    After those code changes, the node's target_provision_state was set
    to ACTIVE (in line with the new state machine proposal) and the
    above operations were no longer possible.

    With these changes (to check for nodes in DEPLOYFAIL provision state),
    the above operations can now be performed.

    Change-Id: I46d8e0e9e50cc8c35ccd7b03df18e34668781b64
    Closes-Bug: #1405420

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
milestone: none → kilo-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: kilo-2 → 2015.1.0