Comment 0 for bug 1611137

Mario Villaplana (mario-villaplana-j) wrote :

Currently, there's no way to automatically abort a clean step that's taking longer than expected.

This has a number of problems:

* Nodes can get stuck on a particular clean step for long periods of time if they never make it to the "CLEANWAIT" state (see this bug: https://bugs.launchpad.net/ironic/+bug/1611135)
* There's no way to automatically detect non-fatal issues with nodes. For example, if a disk takes N hours to erase but we only expect it to take N-3 hours, the clean step may still succeed, but the disk might have degraded performance that is unacceptable for a provisioned node.
* The clean_callback_timeout option, while useful for detecting problems with cleaning after a transition to CLEANWAIT, is too generic: it is a single value that applies to every clean step. An operator may want cleaning to fail on a fast clean step after 1 minute and on a slow clean step after 24 hours.

One potential way to address this is to let hardware manager creators specify a "timeout" field when defining a clean step, similar to how the "abortable" field was added. If a clean step runs longer than that value without finishing successfully, the node will be placed in CLEANFAIL.
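
To make the proposal concrete, here is a rough sketch of how it could look. It is only an illustration under assumptions, not existing Ironic code: the "timeout" key (in seconds) is the field proposed above, the "quick_check" and "legacy_step" step names and the step_timed_out() and next_state() helpers are made up for this example, and the state names are plain strings standing in for the real provision states.

```python
import time

# Clean steps in the dict shape a hardware manager returns from
# get_clean_steps(). "timeout" (seconds) is the PROPOSED field and does
# not exist today; "quick_check" and "legacy_step" are hypothetical steps.
CLEAN_STEPS = [
    {"step": "erase_devices", "priority": 10, "interface": "deploy",
     "abortable": True, "timeout": 24 * 60 * 60},   # slow step: 24 hours
    {"step": "quick_check", "priority": 20, "interface": "deploy",
     "abortable": True, "timeout": 60},             # fast step: 1 minute
    {"step": "legacy_step", "priority": 5, "interface": "deploy",
     "abortable": False},                           # no per-step timeout
]


def step_timed_out(step, started_at, now=None):
    """Return True if the step has run longer than its own timeout.

    Steps without a "timeout" key never time out here, so existing
    hardware managers would keep their current behaviour.
    """
    timeout = step.get("timeout")
    if timeout is None:
        return False
    if now is None:
        now = time.time()
    return (now - started_at) > timeout


def next_state(step, started_at, now=None):
    """Pick the state a periodic check would move the node to."""
    if step_timed_out(step, started_at, now):
        return "CLEANFAIL"
    return "CLEANING"
```

With something like this, a periodic task could call next_state() for the clean step each node is currently running and fail the node as soon as the per-step limit is exceeded, regardless of whether it ever reached CLEANWAIT.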