[RFE] Allow specifying a maximum time allowed per clean step

Bug #1611137 reported by Mario Villaplana
Affects: Ironic
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

Currently, there's no way to automatically abort a clean step that's taking longer than expected.

This has a number of problems:

* Nodes can get stuck on a particular clean step for long periods of time if they never make it to the "CLEANWAIT" state (see this bug: https://bugs.launchpad.net/ironic/+bug/1611135)
* There's no way to automatically detect potential issues with nodes that may be non-fatal. For example, if a disk takes N hours to erase but we only expect it to take N-3, the clean step may succeed, but the disk might have degraded performance that is unacceptable for a provisioned node to have.
* The clean_callback_timeout, while useful for detecting problems with cleaning after a transition to CLEANWAIT, is too generic. An operator may want cleaning to fail on a fast clean step after 1 minute and on a slow clean step after 24 hours.

One potential way to do this is to let hardware manager creators specify a "timeout" field when defining a clean step, similar to how the "abortable" field was added. If a clean step has not finished successfully within that time, the node will be placed in CLEANFAIL.

As described above, an optional "timeout" field will be added to each clean step specification, giving the number of seconds after which the clean step times out. For example, this would be the clean step definition for erase_devices if a timeout of 2400 seconds were added:

```
{
    'step': 'erase_devices',
    'priority': 10,
    'interface': 'deploy',
    'reboot_requested': False,
    'abortable': True,
    'timeout': 2400
}
```

The conductor will keep track of this timeout and place the node in a CLEANFAIL provision state if the timeout is reached.

If no timeout is specified for a clean step, no timeout will be enforced for that step.
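
A minimal sketch of how the conductor-side check could look, for illustration only. The node is modeled as a plain dict, and `step_timed_out`, `check_clean_step`, the `clean_step_started_at` key, and the `CLEANFAIL` placeholder string are all hypothetical names, not existing Ironic code:

```python
import time

# Placeholder label for the failed-cleaning provision state (assumption:
# the real conductor would use its own state constant here).
CLEANFAIL = 'CLEANFAIL'


def step_timed_out(step, started_at, now=None):
    """Return True if the step defines a 'timeout' and it has been reached."""
    timeout = step.get('timeout')
    if timeout is None:
        # No timeout specified: nothing is enforced for this step.
        return False
    if now is None:
        now = time.monotonic()
    return now - started_at >= timeout


def check_clean_step(node, now=None):
    """Hypothetical periodic check that fails cleaning on a step timeout."""
    step = node['clean_step']
    if step_timed_out(step, node['clean_step_started_at'], now):
        node['provision_state'] = CLEANFAIL
        node['last_error'] = (
            'Clean step %s exceeded its timeout of %s seconds'
            % (step['step'], step['timeout']))
    return node
```

A check like this would need to run periodically while the node is cleaning, with the start time recorded when the step begins.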

Another option is having the agent itself keep track of the timeout. This would allow easier differentiation between cases where the ramdisk doesn't boot and an actual clean step timeout.
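
For the agent-side option, a rough sketch could wrap the step in a worker thread and wait on it with the step's timeout. `run_step_with_timeout` and `CleanStepTimeout` are hypothetical names used only for this example, and the approach assumes the clean step is a plain Python callable:

```python
import concurrent.futures


class CleanStepTimeout(Exception):
    """Raised when a clean step does not finish within its timeout."""


def run_step_with_timeout(func, timeout=None):
    """Run func in a worker thread; timeout=None waits indefinitely."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise CleanStepTimeout(
            'clean step exceeded %s seconds' % timeout) from None
    finally:
        # Do not block on the worker; a timed-out step keeps running
        # in the background because Python threads cannot be killed.
        executor.shutdown(wait=False)
```

Because the timed-out step is not actually interrupted, a real agent would probably still need to abort it or reboot the node after reporting the failure.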

Tags: rfe-approved
Changed in ironic:
assignee: nobody → Mario Villaplana (mario-villaplana-j)
Jay Faulkner (jason-oldos) wrote :

Mario,

If you can be more specific about what you're proposing (such as the exact field, and the exact actions to take on a node when it happens) I think this could be done without a spec. Can you add these specifics to the description?

Thanks,
Jay

Mario Villaplana (mario-villaplana-j) wrote :

Jay - I updated the description. Thanks.

description: updated
Dmitry Tantsur (divius)
Changed in ironic:
status: New → Confirmed
importance: Undecided → Wishlist
description: updated
tags: added: rfe-approved
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/391554

Changed in ironic:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by "Dmitry Tantsur <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/ironic/+/391554

Dmitry Tantsur (divius)
Changed in ironic:
status: In Progress → Triaged
assignee: Mario Villaplana (mario-villaplana-j) → nobody