[RFE] Tooling for recovering from -WAIT and -ING state

Bug #1580931 reported by Tan Lin
This bug affects 2 people
Affects: Ironic
Status: Fix Released
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Ironic nodes can get stuck in deploying/deploywait/cleaning/cleanwait/inspecting/deleting, which then requires an operator to touch the DB to recover them.
We already have the periodic tasks _check_cleanwait_timeouts()/_check_inspect_timeouts()/_check_deploy_timeouts()/_check_deploying_states to get out of this situation, but they are not good enough.
So there is a need for a command line tool to handle this, so that users can recover these nodes directly.

Another possible solution is to extend the abort action to the above states and move the nodes into the corresponding failed states.

Tags: needs-spec rfe
Tan Lin (tan-lin-good)
Changed in ironic:
assignee: nobody → Tan Lin (tan-lin-good)
Dmitry Tantsur (divius) wrote :

Could you please elaborate on "not good enough"? All -WAIT states are supposed to be abort-able. Doesn't it work for you?

summary: - failing nodes
+ [RFE] Tooling for recovering from -WAIT and -ING state
Changed in ironic:
status: New → Incomplete
importance: Undecided → Wishlist
Tan Lin (tan-lin-good) wrote :

OK, first of all, some users don't want to wait for the timeout (1800s by default); they want to change the state as soon as they find that the node is stuck. So we need a way to support this.

For wait states:
We have two wait states, deploy-wait and clean-wait.
clean-wait supports the ``abort`` action, which moves the node to clean-failed. That's really awesome.
But deploy-wait only supports the ``deleted`` action, which moves the node to the deleting state.
It would be better to move it to deploy-failed, from which the ``rebuild`` and ``active`` actions can be run.

For -ING states:
Ironic cannot ``abort`` them at all.
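
The transitions described in this comment can be written out as a small lookup table. This is only a sketch of what the comment states: the state and action names are the informal ones used in this thread rather than Ironic's internal constants, and `TRANSITIONS`, `PROPOSED` and `next_state` are hypothetical names, not Ironic code.

```python
# Transitions as described in the comment (informal names from this thread).
TRANSITIONS = {
    ("clean-wait", "abort"): "clean-failed",
    ("deploy-wait", "deleted"): "deleting",
}

# The additional transition the comment asks for (a proposal, not
# behaviour the comment says exists).
PROPOSED = {
    ("deploy-wait", "abort"): "deploy-failed",
}


def next_state(state, action, include_proposed=False):
    """Return the resulting state, or None if the action is not allowed."""
    table = dict(TRANSITIONS)
    if include_proposed:
        table.update(PROPOSED)
    return table.get((state, action))
```

With this table, `next_state("deploying", "abort")` returns `None`, which matches the point that the -ING states cannot be aborted.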

Jim Rollenhagen (jim-rollenhagen) wrote :

Operators have indicated they need a tool to force e.g. deploying or cleaning out, if something gets stuck there. Currently they need to touch the DB.

https://etherpad.openstack.org/p/ironic-newton-summit-ops

We decided that we don't really want an API for this (as something we'll need to maintain forever), but rather some sort of manage command.

I proposed https://review.openstack.org/#/c/311273/2 as a start, that we can enhance more as we go (deploying is the prevalent stuck state, from what I'm hearing).

Tan had some concerns about the security of this approach, but the script would need an ironic config file with proper DB credentials to work, so I don't think there are any problems there.

Lucas suggested handling more than deploying, and moving to manageable instead. From what I've gathered, operators are currently hitting the DB to fail the deploy out, at which point Nova will indicate the failure, and the user (or Nova, or a script) may delete the instance (or retry the deployment). I chose deploying to start, we (or operators) can enhance it from there.

As to the comments already here, I think the current deploywait/abort case is fine. If the deployment fails, the nova user can just deploy another instance.

Dmitry Tantsur (divius) wrote :

Hmm, I kind of agree that it would be good to "abort" a deploy-wait and leave it in a failed state for operator's inspection (or any other reasons to avoid getting rescheduled immediately).

I'm not sure how we can safely abort -ING states, as during them Ironic is doing something with a lock held.

Changed in ironic:
status: Incomplete → Confirmed
Lucas Alvares Gomes (lucasagomes) wrote :

I'm +1 as well for abort on deploy-wait, but it's not critical since "deleted" (as bad as it sounds) gets you out of there, plus -wait states do have timeouts. So I kinda feel that this bug should only be about -ING states.

About the suggestion of moving stuck nodes to manageable: I understand the idea of starting small and fixing DEPLOYING first (baby steps), but I don't see much difference between a node stuck in DEPLOYING, CLEANING or INSPECTING; a node could potentially get stuck in any of these states. Since the tool was literally changing the database and setting provision_state to "deploy error", I suggested moving the node to "manageable" instead, because that state is generic, is hidden from the nova scheduler and is basically the starting point of the state machine, so the behavior stays the same for CLEANING, DEPLOYING and INSPECTING.

Dmitry Tantsur (divius) wrote :

I think an operator's forced action is a good reason to set maintenance mode

Jim Rollenhagen (jim-rollenhagen) wrote :

@Lucas, I think it's safer to follow the state machine flow here. If a thing gets stuck in "deploying", we should move it to "deploy failed" (well, because the deploy failed). Putting it in "manageable" is almost certainly going to confuse some tooling, maybe even Nova. If an instance is waiting to be deployed, and nova sees deployfail, it will either fail or reschedule the build. What would it do if it sees manageable? Do we want to be sure that in the future we code around someone possibly skipping across the state machine?

We could still make it one generic tool, with just a quick if-else block to choose the right "failed state" for the current state.
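
A minimal sketch of that selection step follows, using a dict lookup in place of the if-else block. The state strings are assumptions about how the states are spelled and should be checked against the Ironic release in use; `FAILED_STATE_FOR` and `failed_state_for` are hypothetical names, not part of Ironic.

```python
# Assumed state strings, for illustration only.
FAILED_STATE_FOR = {
    "deploying": "deploy failed",
    "cleaning": "clean failed",
    "inspecting": "inspect failed",
}


def failed_state_for(current_state):
    """Pick the failed state that follows the state machine flow.

    Raises ValueError for any state the tool does not know how to fail
    out, so a node is never moved somewhere unexpected.
    """
    try:
        return FAILED_STATE_FOR[current_state]
    except KeyError:
        raise ValueError(
            "no failed state defined for %r" % (current_state,)
        ) from None
```

Refusing unknown states, rather than falling back to a default such as "manageable", keeps the tool from skipping across the state machine.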

Changed in ironic:
assignee: Tan Lin (tan-lin-good) → Jim Rollenhagen (jim-rollenhagen)
status: Confirmed → In Progress
Changed in ironic:
assignee: Jim Rollenhagen (jim-rollenhagen) → Tan Lin (tan-lin-good)
Jay Faulkner (jason-oldos) wrote :

Another note; if you're in the *ING states, you're often also locked. This tool should probably break the lock as well (or maybe even have 'break a lock' as a specific option?).
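
As a rough illustration of a separate "break a lock" option, the sketch below assumes the lock is recorded as a `reservation` field on the node record that names the holder. That field name, the `Node` class and `break_lock` are assumptions made for this example and do not describe the actual tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    uuid: str
    provision_state: str
    # Assumed: name of whoever holds the lock, None when unlocked.
    reservation: Optional[str] = None


def break_lock(node):
    """Clear the lock marker and return the previous holder, if any."""
    holder = node.reservation
    node.reservation = None
    return holder
```

Clearing the marker does not stop whatever work is still running under the lock, which is the safety concern raised earlier in this thread for -ING states.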

Tan Lin (tan-lin-good) wrote :

Agreed with JayF. I updated the PoC patch at https://review.openstack.org/#/c/311273/ and will write a spec so we can continue the discussion.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-specs (master)

Fix proposed to branch: master
Review: https://review.openstack.org/319812

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Jim Rollenhagen (<email address hidden>) on branch: master
Review: https://review.openstack.org/311273

Ruby Loo (rloo) wrote :

Tan Lin is no longer working on this; feel free to pick it up.

tags: added: needs-spec
Changed in ironic:
assignee: Tan Lin (tan-lin-good) → nobody
Changed in ironic:
status: In Progress → Triaged
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic-specs (master)

Change abandoned by Julia Kreger (<email address hidden>) on branch: master
Review: https://review.opendev.org/319812
Reason: No revision in over three years, abandoning.

Julia Kreger (juliaashleykreger) wrote :

We believe this is largely covered via the abort functionality. As such, marking fix released. In other cases where it is not possible, the project does not intend to provide such an interface due to lack of infrastructure operator demand.

Changed in ironic:
status: Triaged → Fix Released