Ironic

RFE: Readiness check before performing actions

Bug #2027688 reported by Jay Faulkner on 2023-07-13


Affects		Status	Importance	Assigned to	Milestone
	Ironic	Triaged	Wishlist	Jay Faulkner

Bug Description

Many actions, typically deployments, rebuilds, or cleaning, may fail for reasons that could have been predicted beforehand.

These issues may include:
- unresponsive BMC
- difficulties reaching network switch
- external services required for deployments to complete, such as glance (or an https host for images), and operator-configured situations that require external coordination (for instance, a custom cleaning or deploy step which checks in to an external CMDB)

A readiness check before performing these actions would improve behavior for many use cases:
- nova rebuild would be non-destructive in predictable failure cases, allowing an operator to keep the workload running (even if the nova instance is in an error state) while troubleshooting the failures
- rescheduling deployments would happen more quickly, especially in fast-track cases, because a deployment could fail before spending time on steps like writing an image, lowering latency on reschedules
- Operators running manual cleaning through Ironic could get a useful error, interactively, and troubleshoot without having to wait for cleaning to begin and then fail.

It's my belief this work will require a spec, because we'll likely need to:
- Add a more extensive validation method onto our driver interfaces for at least Power, Management, and Network interfaces (probably more)
- Decide if it should be opt-out or opt-in
- Figure out how far it's OK for upstream implementations of these methods to go (do we log in to the switch?)
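
As a rough illustration of the shape such a spec might discuss, here is a minimal sketch of a per-interface readiness check with an operator-controlled on/off switch. Every name in it (`ReadinessResult`, `ReadinessCheckMixin`, `check_readiness`, `run_readiness_checks`, `FakePower`, the `enabled` flag) is hypothetical and does not exist in Ironic today; it only assumes that each interface could actively probe its dependency and report a reason on failure.

```python
import abc
from dataclasses import dataclass


@dataclass
class ReadinessResult:
    """Outcome of one hypothetical readiness probe."""
    interface: str
    ready: bool
    reason: str = ""


class ReadinessCheckMixin(abc.ABC):
    """Hypothetical mixin; NOT part of Ironic's driver interfaces."""

    interface_type = "unknown"

    @abc.abstractmethod
    def check_readiness(self, node):
        """Actively probe the dependency and return a ReadinessResult."""


class FakePower(ReadinessCheckMixin):
    """Stand-in power interface used only to exercise the sketch."""

    interface_type = "power"

    def __init__(self, bmc_reachable):
        self.bmc_reachable = bmc_reachable

    def check_readiness(self, node):
        if self.bmc_reachable:
            return ReadinessResult(self.interface_type, True)
        return ReadinessResult(
            self.interface_type, False, "BMC did not respond")


def run_readiness_checks(node, interfaces, enabled=True):
    """Collect every failure before any destructive action starts.

    `enabled` models the opt-in/opt-out decision; when False the
    checks are skipped entirely and no failures are reported.
    """
    if not enabled:
        return []
    failures = []
    for iface in interfaces:
        try:
            result = iface.check_readiness(node)
        except Exception as exc:
            # A probe that raises is treated as "not ready".
            result = ReadinessResult(iface.interface_type, False, str(exc))
        if not result.ready:
            failures.append(result)
    return failures
```

Under these assumptions, a caller would run `run_readiness_checks(node, [power, management, network])` before starting a rebuild or cleaning and abort with the collected reasons if the list is non-empty. Returning all failures rather than stopping at the first one is a design choice for the interactive troubleshooting case described above.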

Tags:

Jay Faulkner (jason-oldos) on 2023-07-13

Changed in ironic:
assignee:	faulkner (faulkner) → Jay Faulkner (jason-oldos)


Julia Kreger (juliaashleykreger) wrote on 2023-07-17:

So validate has always been the ironic internal attempt at this, except validate drifted much more towards "do we have the configuration information", since power sync should be running and flag loss of BMC communication.

The challenge is we're talking about things outside of our control and even our sphere of influence on some level.

And I guess the "unresponsive BMC" is more of an edge case where the BMC still kind of sort of works, but when we request power actions, it doesn't *actually* change state. That is one of those "we need to be in the weeds to see the weeds before us" sort of things.

Glance wise, we do actually check that upfront to extract metadata. Issue is we can generate a tempurl, and swift could be down.

And then networking is a whole giant ball of wax I'd prefer a discussion around.

I think there is some value to the concepts, but again, how far do we dig, and do we already do the right thing or not? Or is it a "pre deployment please pre-populate and pre-check" sort of action verb?

I think our biggest issue, networking wise, is the lack of working callback handling and/or support for going back and checking port binding. That would help things a lot, at least in terms of telling the user "what failed" as opposed to "it failed".

Dmitry Tantsur (divius) on 2024-02-28

Changed in ironic:
status:	New → Triaged
tags:	added: needs-spec
