Ironic

[RFE] Allow specifying a maximum time allowed per clean step

Bug #1611137 reported by Mario Villaplana on 2016-08-08

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Ironic	Triaged	Wishlist	Unassigned

Bug Description

Currently, there's no way to automatically abort a clean step that's taking longer than expected.

This has a number of problems:

* Nodes can get stuck on a particular clean step for long periods of time if they never make it to the "CLEANWAIT" state (see this bug: https://bugs.launchpad.net/ironic/+bug/1611135)
* There's no way to automatically detect potential issues with nodes that may be non-fatal. For example, if a disk takes N hours to erase but we only expect it to take N-3, the clean step may succeed, but the disk might have degraded performance that is unacceptable for a provisioned node to have.
* The clean_callback_timeout, while useful for detecting problems with cleaning after a transition to CLEANWAIT, is too generic. An operator may want cleaning to fail on a fast clean step after 1 minute and on a slow clean step after 24 hours.

One potential way to do this is to let hardware manager creators specify a "timeout" field when defining a clean step, similar to how the "abortable" field was added. If a clean step has not finished successfully within that time, the node will be placed in CLEANFAIL.

The optional "timeout" field will be added to each clean step specification and give the number of seconds after which the clean step times out. For example, this would be the clean step definition for erase_devices if a timeout of 2400 seconds were added:

```
{
    'step': 'erase_devices',
    'priority': 10,
    'interface': 'deploy',
    'reboot_requested': False,
    'abortable': True,
    'timeout': 2400,  # seconds
}
```

The conductor will keep track of this timeout and place the node in a CLEANFAIL provision state if the timeout is reached.

If no timeout is specified, no timeout will be enforced.
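A minimal sketch of the conductor-side check, to make the proposed semantics concrete. The function name `clean_step_timed_out` and the idea of recording a start time per step are assumptions for illustration, not existing Ironic code; only the "timeout" field and its meaning in seconds come from the description above.

```python
import time


def clean_step_timed_out(step, started_at, now=None):
    """Return True if `step` has run longer than its optional 'timeout'.

    Hypothetical helper: `started_at` and `now` are timestamps in seconds
    taken from the same clock (time.monotonic() by default).
    """
    timeout = step.get('timeout')  # seconds; missing or None means no limit
    if timeout is None:
        return False
    if now is None:
        now = time.monotonic()
    return (now - started_at) > timeout


# The clean step definition from the description, with the proposed field.
step = {
    'step': 'erase_devices',
    'priority': 10,
    'interface': 'deploy',
    'reboot_requested': False,
    'abortable': True,
    'timeout': 2400,
}
```

A periodic task could call such a check for each node that is cleaning and move the node to CLEANFAIL when it returns True. A step without a "timeout" key never trips the check.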

Another option is having the agent itself keep track of the timeout. This would allow easier differentiation between cases where the ramdisk doesn't boot and an actual clean step timeout.
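As a rough illustration of the agent-side option, the sketch below runs a step in a worker thread and gives up waiting after the step's timeout. `CleanStepTimeout`, `run_clean_step`, and `slow_step` are hypothetical names, not part of ironic-python-agent. The sketch only stops waiting; it does not cancel the underlying operation, which a real agent would also have to handle.

```python
import concurrent.futures
import time


class CleanStepTimeout(Exception):
    """Hypothetical error: a clean step ran past its 'timeout'."""


def run_clean_step(func, timeout=None):
    """Run func() and raise CleanStepTimeout after `timeout` seconds.

    timeout=None waits indefinitely, i.e. no timeout is enforced.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise CleanStepTimeout(
            'clean step exceeded %s seconds' % timeout) from None
    finally:
        # Do not block on the worker thread; it keeps running in the
        # background until func() returns.
        executor.shutdown(wait=False)


def slow_step():
    time.sleep(0.3)  # stands in for a long-running erase
    return 'erased'
```

Because the agent raises a distinct error, the conductor could tell a step timeout apart from a ramdisk that never booted, which would only ever hit the callback timeout.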

Tags: rfe-approved

Mario Villaplana (mario-villaplana-j) on 2016-08-08

Changed in ironic:
assignee:	nobody → Mario Villaplana (mario-villaplana-j)

Jay Faulkner (jason-oldos) wrote on 2016-08-08:

Mario,

If you can be more specific about what you're proposing (such as the exact field, and the exact actions to take on a node when it happens) I think this could be done without a spec. Can you add these specifics to the description?

Thanks,
Jay

Mario Villaplana (mario-villaplana-j) wrote on 2016-08-22:

Jay - I updated the description. Thanks.

description:	updated

Dmitry Tantsur (divius) on 2016-08-23

Changed in ironic:
status:	New → Confirmed
importance:	Undecided → Wishlist

Mario Villaplana (mario-villaplana-j) on 2016-08-23

description:	updated

Jay Faulkner (jason-oldos) on 2016-08-23

tags:	added: rfe-approved

OpenStack Infra (hudson-openstack) wrote on 2016-10-28: Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/391554

Changed in ironic:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2023-11-19: Change abandoned on ironic (master)

Change abandoned by "Dmitry Tantsur <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/ironic/+/391554

Dmitry Tantsur (divius) on 2023-11-19

Changed in ironic:
status:	In Progress → Triaged
assignee:	Mario Villaplana (mario-villaplana-j) → nobody
