Ironic

[RFE] Tooling for recovering from -WAIT and -ING state

Bug #1580931 reported by Tan Lin on 2016-05-12

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Ironic	Fix Released	Wishlist	Unassigned

Bug Description

Ironic nodes can get stuck in deploying/deploywait/cleaning/cleanwait/inspecting/deleting, which then requires an operator to touch the DB to recover them.
We already have the periodic tasks _check_cleanwait_timeouts()/_check_inspect_timeouts()/_check_deploy_timeouts()/_check_deploying_states to get rid of this situation, but that is not good enough.
So there is a need for a command-line tool to handle this, so that users are able to recover these nodes directly.

Another possible solution is to extend the abort action to the above states and move nodes into the corresponding failed states.

Tan Lin (tan-lin-good) on 2016-05-12

Changed in ironic:
assignee:	nobody → Tan Lin (tan-lin-good)

Dmitry Tantsur (divius) wrote on 2016-05-12:

Could you please elaborate on "not good enough"? All -WAIT states are supposed to be abort-able. Doesn't it work for you?

summary:	- failing nodes
	+ [RFE] Tooling for recovering from -WAIT and -ING state
Changed in ironic:
status:	New → Incomplete
importance:	Undecided → Wishlist

Tan Lin (tan-lin-good) wrote on 2016-05-13:

OK, first of all, there are some users who don't want to wait for the timeout (1800s by default), but want to change the state as soon as they find the node is stuck. So we need a way to support this.

For -WAIT states:
We have two wait states, deploy-wait and clean-wait.
clean-wait supports the ``abort`` action, which moves the node to clean-failed. That's really awesome.
But deploy-wait only supports the ``deleted`` action, which moves the node to the deleting state.
It would be better to move it to deploy-failed, from which the ``rebuild`` and ``active`` actions can be run.

For -ING states:
Ironic cannot ``abort`` them at all.

Jim Rollenhagen (jim-rollenhagen) wrote on 2016-05-13:

Operators have indicated they need a tool to force e.g. deploying or cleaning out, if something gets stuck there. Currently they need to touch the DB.

https://etherpad.openstack.org/p/ironic-newton-summit-ops

We decided that we don't really want an API for this (as something we'll need to maintain forever), but rather some sort of manage command.

I proposed https://review.openstack.org/#/c/311273/2 as a start, that we can enhance more as we go (deploying is the prevalent stuck state, from what I'm hearing).

Tan had some issues with the approach as far as security goes - this script would need an ironic config file with proper DB credentials to work, so I don't think there are any problems there.

Lucas suggested handling more than deploying, and moving to manageable instead. From what I've gathered, operators are currently hitting the DB to fail the deploy out, at which point Nova will indicate the failure, and the user (or Nova, or a script) may delete the instance (or retry the deployment). I chose deploying to start, we (or operators) can enhance it from there.

As to the comments already here, I think the current deploywait/abort case is fine. If the deployment fails, the nova user can just deploy another instance.

Dmitry Tantsur (divius) wrote on 2016-05-16:

Hmm, I kind of agree that it would be good to "abort" a deploy-wait and leave it in a failed state for operator's inspection (or any other reasons to avoid getting rescheduled immediately).

I'm not sure how we can safely abort -ING states, as during them Ironic is doing something with a lock held.

Changed in ironic:
status:	Incomplete → Confirmed

Lucas Alvares Gomes (lucasagomes) wrote on 2016-05-16:

I'm +1 as well for abort on deploy-wait, but it's not critical since "deleted" (as bad as it sounds) gets you out of there; plus, -wait states do have timeouts. So I kinda feel that this bug should only be for -ING states.

About the suggestion of moving stuck nodes to manageable: I understand the idea of starting small and fixing DEPLOYING first (baby steps), but I don't see much difference between a node stuck in DEPLOYING, CLEANING or INSPECTING; a node could potentially get stuck in any of these states. Since the tool was literally changing the database and setting provision_state to "deploy error", I suggested moving the node to "manageable" instead, because that state is generic, hidden from the nova scheduler, and basically the starting point of the state machine, which keeps the behavior the same for CLEANING, DEPLOYING and INSPECTING.

Dmitry Tantsur (divius) wrote on 2016-05-16:

I think an operator's forced action is a good reason to set maintenance mode

Jim Rollenhagen (jim-rollenhagen) wrote on 2016-05-16:

@Lucas, I think it's safer to follow the state machine flow here. If a thing gets stuck in "deploying", we should move it to "deploy failed" (well, because the deploy failed). Putting it in "manageable" is almost certainly going to confuse some tooling, maybe even Nova. If an instance is waiting to be deployed, and nova sees deployfail, it will either fail or reschedule the build. What would it do if it sees manageable? Do we want to be sure that in the future we code around someone possibly skipping across the state machine?

We could still make it one generic tool, and just a quick if-else block to choose the right "failed state" for the current state.
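
A minimal sketch of that if-else idea, written as a lookup table. The state strings and the function name `failed_state_for` are illustrative assumptions rather than Ironic's actual constants, and the target chosen for "deleting" in particular is a guess; the thread only settles on "deploying" going to "deploy failed".

```python
# Hypothetical mapping from a stuck -ING state to the failure state that
# follows the normal state machine flow. State names are assumptions.
FAILED_STATE_FOR = {
    "deploying": "deploy failed",
    "cleaning": "clean failed",
    "inspecting": "inspect failed",
    "deleting": "error",  # assumption: not discussed in the thread
}


def failed_state_for(current_state):
    """Return the failure state to force a stuck node into.

    Raises ValueError for states this sketch does not treat as recoverable,
    so the tool never skips across the state machine by accident.
    """
    try:
        return FAILED_STATE_FOR[current_state]
    except KeyError:
        raise ValueError(
            "refusing to force node out of state %r" % current_state
        ) from None
```

Refusing unknown states, rather than falling back to a default, keeps the tool from moving a healthy node (for example one in "active") into a failure state.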

OpenStack Infra (hudson-openstack) on 2016-05-17

Changed in ironic:
assignee:	Tan Lin (tan-lin-good) → Jim Rollenhagen (jim-rollenhagen)
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) on 2016-05-18

Changed in ironic:
assignee:	Jim Rollenhagen (jim-rollenhagen) → Tan Lin (tan-lin-good)

Jay Faulkner (jason-oldos) wrote on 2016-05-18:

Another note; if you're in the *ING states, you're often also locked. This tool should probably break the lock as well (or maybe even have 'break a lock' as a specific option?).
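
A rough sketch of what "break the lock" could look like as an explicit option, operating on a plain dict that stands in for a node's DB row. The `reservation` and `provision_state` keys and the `force_fail` helper with its `break_lock` flag are assumptions for illustration, not the interface of any patch referenced in this bug.

```python
def force_fail(node, failed_state, break_lock=False):
    """Force a stuck node record into `failed_state`.

    `node` is a dict standing in for a DB row (hypothetical shape). If the
    node is reserved by a conductor, the lock is only cleared when
    `break_lock` is True; otherwise the call refuses to touch the node.
    """
    holder = node.get("reservation")
    if holder is not None and not break_lock:
        raise RuntimeError(
            "node is locked by %s; pass break_lock=True" % holder
        )
    updated = dict(node)  # leave the caller's record untouched
    updated["reservation"] = None
    updated["provision_state"] = failed_state
    return updated
```

Making lock breaking opt-in reflects the concern raised earlier in the thread that Ironic may still be doing work on the node while the lock is held.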

Tan Lin (tan-lin-good) wrote on 2016-05-19:

Agree with JayF. I updated the PoC patch at https://review.openstack.org/#/c/311273/ and will write a spec so we can continue the discussion.

OpenStack Infra (hudson-openstack) wrote on 2016-05-23: Fix proposed to ironic-specs (master)

#10

Fix proposed to branch: master
Review: https://review.openstack.org/319812

OpenStack Infra (hudson-openstack) wrote on 2016-08-01: Change abandoned on ironic (master)

#11

Change abandoned by Jim Rollenhagen (<email address hidden>) on branch: master
Review: https://review.openstack.org/311273

Ruby Loo (rloo) wrote on 2016-11-21:

#12

Tan Lin is no longer working on this; feel free to pick it up.

tags:	added: needs-spec
Changed in ironic:
assignee:	Tan Lin (tan-lin-good) → nobody

Jay Faulkner (jason-oldos) on 2016-11-30

Changed in ironic:
status:	In Progress → Triaged

OpenStack Infra (hudson-openstack) wrote on 2020-02-07: Change abandoned on ironic-specs (master)

#13

Change abandoned by Julia Kreger (<email address hidden>) on branch: master
Review: https://review.opendev.org/319812
Reason: No revision in over three years, abandoning.

Julia Kreger (juliaashleykreger) wrote on 2024-02-14:

#14

We believe this is largely covered via the abort functionality. As such, marking fix released. In other cases where it is not possible, the project does not intend to provide such an interface due to lack of infrastructure operator demand.

Changed in ironic:
status:	Triaged → Fix Released
