Nodes left in an inconsistent state due to lack of free conductor workers to start the deployment

Bug #1331494 reported by Lucas Alvares Gomes
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Lucas Alvares Gomes

Bug Description

Before spawning the greenthread that actually does the work of provisioning a node, Ironic first sets the node's provision_state to DEPLOYING and its target_provision_state to DEPLOYED to expose that the work is in progress [1]. If there are no free workers to deploy the machine, however, the deployment fails and the node is left in an inconsistent state in which it is impossible either to retry the deployment or to delete the node.

LOGS: http://paste.openstack.org/show/84394/

[1] https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L386-L389
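The ordering problem described above can be illustrated with a small, self-contained model. This is only a sketch and not Ironic's code: WorkerPool, Node and do_node_deploy are hypothetical stand-ins, the initial state of None is a placeholder, and the exception name NoFreeConductorWorker is borrowed from the commit message of the fix.

```python
class NoFreeConductorWorker(Exception):
    """Raised when the worker pool has no free slot."""


class WorkerPool:
    """Hypothetical stand-in for the conductor's greenthread pool."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def spawn(self, func, *args):
        if self.in_use >= self.size:
            raise NoFreeConductorWorker()
        self.in_use += 1
        # A real pool would now run func(*args) in a greenthread; omitted here.


class Node:
    """Hypothetical node record; None is a placeholder initial state."""

    def __init__(self):
        self.provision_state = None
        self.target_provision_state = None


def do_node_deploy(node, pool):
    """Buggy ordering: the states change before a worker is secured."""
    node.provision_state = "DEPLOYING"
    node.target_provision_state = "DEPLOYED"
    pool.spawn(lambda: None)  # raises if the pool is exhausted


pool = WorkerPool(size=1)
pool.spawn(lambda: None)  # the keepalive greenthread takes the only slot

node = Node()
try:
    do_node_deploy(node, pool)
    deploy_error = None
except NoFreeConductorWorker as exc:
    deploy_error = exc

# The node still claims a deployment is in progress although nothing is running.
print(node.provision_state, node.target_provision_state)
```

In this model nothing ever resets the two fields after the failed spawn, which corresponds to the stuck state reported here.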

== How to reproduce ==

Set the config option workers_pool_size to 1, so that the only worker is taken by the dedicated greenthread running the keepalive:

[conductor]

# The size of the workers greenthread pool. (integer value)
workers_pool_size=1

Now try to deploy a node.

== More ==

Something similar happens when you try to delete a node and no free workers are available: Ironic sets the provision_state to DELETING and the target_provision_state to DELETED, and the operation then fails. After that, the node is in an inconsistent state in which it can no longer be deleted, because Ironic thinks that an operation to delete that node is already running.

Changed in ironic:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
importance: Undecided → High
description: updated
Changed in ironic:
milestone: none → juno-2
Changed in ironic:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/100957

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/100958

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/100957
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=ebf09d96139531ef60843834beee4bb83b99b1de
Submitter: Jenkins
Branch: master

commit ebf09d96139531ef60843834beee4bb83b99b1de
Author: Lucas Alvares Gomes <email address hidden>
Date: Wed Jun 18 16:20:20 2014 +0100

    Add set_spawn_error_hook to TaskManager

    This patch is adding a way to create a hook on the TaskManager that gets
    called upon an exception being raised. The task.spawn_after() do not
    raise any exception making it impossible to add some logic around it in
    case something goes bad with the method being executed by the greenthread.

    Partial-Bug: #1331494
    Change-Id: I5755df359a6e8678b64d4d59c25a9192f575b13d
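For readers unfamiliar with the pattern, the idea of a spawn error hook can be sketched roughly as follows. This is a toy context manager written for illustration, not Ironic's actual TaskManager: it assumes the spawn is deferred until the context exits (which would explain why spawn_after() itself cannot raise), and the constructor argument `spawn` and the private attribute names are hypothetical.

```python
class TaskManager:
    """Toy context manager modelling a deferred spawn with an error hook."""

    def __init__(self, spawn):
        self._spawn = spawn              # callable that starts background work
        self._spawn_method = None
        self._on_error_method = None

    def spawn_after(self, method, *args, **kwargs):
        # Only records the request; nothing is started yet, so nothing can raise.
        self._spawn_method = (method, args, kwargs)

    def set_spawn_error_hook(self, method, *args, **kwargs):
        self._on_error_method = (method, args, kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None and self._spawn_method is not None:
            method, args, kwargs = self._spawn_method
            try:
                self._spawn(method, *args, **kwargs)
            except Exception as e:
                # Give the caller's hook a chance to clean up, then re-raise.
                if self._on_error_method is not None:
                    hook, hook_args, hook_kwargs = self._on_error_method
                    hook(e, *hook_args, **hook_kwargs)
                raise
        return False


def failing_spawn(method, *args, **kwargs):
    """Stand-in for a worker pool that has no free slot."""
    raise RuntimeError("no free worker")


seen = []
try:
    with TaskManager(failing_spawn) as task:
        task.set_spawn_error_hook(lambda e, tag: seen.append((tag, str(e))), "cleanup")
        task.spawn_after(print, "deploying")
except RuntimeError:
    pass
```

In this sketch the hook receives the exception as its first argument, followed by whatever arguments were registered with it, and the exception is still propagated to the caller afterwards.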

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/100958
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=8b554deed62d86d19d7459ab1ac36cf2dc27745a
Submitter: Jenkins
Branch: master

commit 8b554deed62d86d19d7459ab1ac36cf2dc27745a
Author: Lucas Alvares Gomes <email address hidden>
Date: Wed Jun 18 17:19:37 2014 +0100

    Fix nodes left in an incosistent state if no workers

    This patch is fixing the problem of leaving the nodes in an inconsistent
    state if there's no free conductor workers available to deploy or the
    tear down a node, the patch is using the set_spawn_error_hook() method
    of TaskManager to run some custom code that will rollback the nodes
    to the previous provision_state and target_provision_state in case
    NoFreeConductorWorker is raised.

    Closes-Bug: #1331494
    Change-Id: I5d6e8e2c69cbdf1f9abe169afe617aa79783e57d
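The rollback described in this commit message can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the merged code: Node, spawn_or_raise, provisioning_error_handler and do_node_deploy are hypothetical, a plain try/except stands in for the TaskManager hook, and None is a placeholder for the node's previous states.

```python
class NoFreeConductorWorker(Exception):
    """Raised when no conductor worker is free."""


class Node:
    """Hypothetical node record; None is a placeholder initial state."""

    def __init__(self):
        self.provision_state = None
        self.target_provision_state = None


def provisioning_error_handler(e, node, provision_state, target_provision_state):
    """Hypothetical hook: restore the saved states if no worker was free."""
    if isinstance(e, NoFreeConductorWorker):
        node.provision_state = provision_state
        node.target_provision_state = target_provision_state


def spawn_or_raise(free_workers):
    """Stand-in for handing the work to the worker pool."""
    if free_workers <= 0:
        raise NoFreeConductorWorker()


def do_node_deploy(node, free_workers):
    # Remember the states so that they can be restored on failure.
    previous = (node.provision_state, node.target_provision_state)
    node.provision_state = "DEPLOYING"
    node.target_provision_state = "DEPLOYED"
    try:
        spawn_or_raise(free_workers)
    except Exception as e:
        provisioning_error_handler(e, node, *previous)
        raise


node = Node()
try:
    do_node_deploy(node, free_workers=0)
    rolled_back = False
except NoFreeConductorWorker:
    rolled_back = True

# After the failed attempt the node is back in its previous states.
print(node.provision_state, node.target_provision_state)
```

Because the states are restored before the exception propagates, a later deploy or delete request in this model is not blocked by a stale DEPLOYING or DELETING state.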

Changed in ironic:
status: In Progress → Fix Committed
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-2 → 2014.2