Cleaning can restart in infinite loop in some hardware failure cases

Bug #1526561 reported by Jay Faulkner
This bug affects 1 person
Affects: Ironic
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

In an extreme edge case, when using hardware manager dynamic loading with evaluate_hardware_support(), some machines can get stuck in an infinite cleaning loop.

Reproduction instructions:

1) Have a machine with an intermittently failing component, like a disk that sometimes shows up in the OS and sometimes doesn't.
2) Implement a custom hardware manager for which evaluate_hardware_support() returns 0 if the piece of hardware from step 1 doesn't exist, and returns a positive int otherwise.
3) As the hardware "flaps" in and out of the OS across reboots, IPA loads a different set of hardware managers each time the component appears or disappears. Each change triggers a cleaning restart due to the version change.
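
To make the reproduction steps concrete, here is a simplified, self-contained model of the behavior. It does not import ironic-python-agent; the class names, the device path, and the idea of comparing a name-to-version mapping between boots are assumptions made for illustration, not IPA's actual code.

```python
# Simplified model of the flapping behavior described in the steps above.
# Every name here is hypothetical; this is not ironic-python-agent code.

FLAKY_DISK = "/dev/sdb"  # assumed path of the intermittently failing disk


class GenericManager:
    name = "generic"
    version = "1.0"

    def evaluate_hardware_support(self, devices):
        # Always reports support, so it is loaded on every boot.
        return 1


class FlakyDiskManager:
    name = "flaky_disk"
    version = "1.0"

    def evaluate_hardware_support(self, devices):
        # Step 2: return 0 when the disk is missing, a positive int otherwise.
        return 3 if FLAKY_DISK in devices else 0


def loaded_manager_versions(devices):
    """Return {manager name: version} for managers that report support."""
    managers = [GenericManager(), FlakyDiskManager()]
    return {
        m.name: m.version
        for m in managers
        if m.evaluate_hardware_support(devices) > 0
    }


def count_cleaning_restarts(boots):
    """Count boots whose loaded manager set differs from the previous boot."""
    restarts = 0
    previous = None
    for devices in boots:
        current = loaded_manager_versions(devices)
        if previous is not None and current != previous:
            restarts += 1  # step 3: a version change restarts cleaning
        previous = current
    return restarts


# The disk is present, then missing, then present again.
flapping_boots = [
    {"/dev/sda", "/dev/sdb"},
    {"/dev/sda"},
    {"/dev/sda", "/dev/sdb"},
]
print(count_cleaning_restarts(flapping_boots))
```

In this model every appearance or disappearance of the disk changes the loaded manager set, so a node that reboots during cleaning can keep restarting for as long as the disk keeps flapping.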

In my testing, with custom hardware managers that have about 8 steps and 3 reboots, I saw machines restart cleaning several times in a thirty-minute period.

I've thought of a few potential solutions:

1) Update documentation for hardware managers to stop encouraging dynamically loading them based on present hardware.
- Pros: Reliable behavior for any booted agent, regardless of hardware.
- Cons: Depending on the complexity of the cleaning steps, this may require different agents for different hardware.

2) Have Ironic keep track of how many times cleaning has restarted, and move the node to CLEANFAIL once cleaning has restarted $clean_restart_max times.
- Pros: Would prevent similar bugs in this same vein. Allows deployers to decide how many cleaning restarts are reasonable.
- Cons: It's perfectly reasonable for someone to want to deploy multiple agents while a node is in a cleaning cycle. This would invalidate that use case.
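
A rough sketch of what option 2 could look like is below. The class, the method, and the counter attribute are hypothetical, and the limit is passed in directly rather than read from Ironic's configuration; this only illustrates the counting logic, not where it would live in the conductor.

```python
# Hypothetical sketch of option 2: count cleaning restarts per node and
# fail cleaning once a configured limit is exceeded. Not Ironic code.

CLEANING = "cleaning"
CLEANFAIL = "clean failed"  # label used here for the failed state


class NodeCleaningState:
    def __init__(self, clean_restart_max):
        # clean_restart_max stands in for the $clean_restart_max option
        # proposed in this bug; it is not an existing setting.
        self.clean_restart_max = clean_restart_max
        self.restarts = 0
        self.provision_state = CLEANING

    def record_restart(self):
        """Record one cleaning restart.

        Returns True if cleaning may restart, False if the node was
        moved to CLEANFAIL because the limit was exceeded.
        """
        self.restarts += 1
        if self.restarts > self.clean_restart_max:
            self.provision_state = CLEANFAIL
            return False
        return True


node = NodeCleaningState(clean_restart_max=2)
for _ in range(3):
    allowed = node.record_restart()
print(allowed, node.provision_state)
```

With a limit of 2, the first two restarts are allowed and the third moves the node to the failed state. A deployer who upgrades agents during a cleaning cycle would need to set the limit high enough to cover those legitimate restarts.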

I'm not sure what the right path is, but this is a recipe for badness -- Ironic should be able to deal reasonably with hardware issues and in this case it does not.

Tags: agent
Dmitry Tantsur (divius) wrote :

I would do the latter, with $clean_restart_max being pretty big (maybe even calculated based on the number of steps).
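
One way to read "calculated based on the number of steps" is a limit that scales with the step count but never drops below a generous floor. The function name, the per-step allowance, and the floor below are all assumptions chosen for illustration; nothing in this report specifies a formula.

```python
# Hypothetical heuristic for deriving a restart limit from the step count.
# The allowance and floor values are illustrative assumptions only.

def suggested_clean_restart_max(num_clean_steps, per_step_allowance=2, floor=10):
    """Return a restart limit that grows with the number of clean steps."""
    if num_clean_steps < 0:
        raise ValueError("num_clean_steps must be non-negative")
    return max(floor, num_clean_steps * per_step_allowance)


print(suggested_clean_restart_max(8))
```

For the 8-step case mentioned in the description this yields a limit of 16, while short step lists fall back to the floor of 10.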

Changed in ironic:
status: New → Triaged
importance: Undecided → Medium
Julia Kreger (juliaashleykreger) wrote :

I think it would make sense to add a check that moves a node to cleanfail if cleaning restarts more times than a configured threshold. Realistically, this would be a good medium-to-low-hanging-fruit item for someone learning cleaning.
