Nodes stuck on power state transitions
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Ironic | Fix Released | High | Dmitry Tantsur | | |
Bug Description
Reported internally at https:/
If the conductor managing a node dies mid power state transition, that node will have the "reservation" and "target_
Workaround(s)
=============
* While not ideal, operators can (re)start a conductor service with the same hostname as the one that was managing the node; it will then clean up the locks.
* Change the database manually.
Proposed solution
=================
Just like we do for certain provision states (*WAIT), we should have a periodic task that checks for a timeout on power state transitions.
In order to implement that we would need:
1. A "power_updated_at" field in the nodes (we do have a "provision_
2. A periodic task that queries nodes which are reserved by a conductor that is not currently online and which have the target_power_state field set. Based on the value of the "power_updated_at" field, we will know whether the transition has timed out.
The number of seconds/minutes to wait before declaring a timeout should be exposed as a config option.
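The selection logic of the proposed periodic task can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Ironic's actual code: the `Node` dataclass, the `find_stuck_nodes` helper, and the way online conductors and the timeout are passed in are all hypothetical. Only the field names "reservation", "target_power_state" and "power_updated_at" come from this report. The sketch covers detection only; how a conductor would safely release another conductor's reservation is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set


@dataclass
class Node:
    """Hypothetical stand-in for a node record (not Ironic's real model)."""
    uuid: str
    reservation: Optional[str] = None            # hostname of the conductor holding the lock
    target_power_state: Optional[str] = None     # set while a power transition is in flight
    power_updated_at: Optional[datetime] = None  # the new field proposed in this report


def find_stuck_nodes(nodes: Iterable[Node],
                     online_conductors: Set[str],
                     timeout: timedelta,
                     now: datetime) -> List[Node]:
    """Return nodes whose power transition appears to have timed out.

    A node qualifies when it is reserved by a conductor that is not
    online, has a target power state set, and its last power update
    is at least `timeout` old.
    """
    stuck = []
    for node in nodes:
        # Skip unreserved nodes and nodes held by a live conductor.
        if node.reservation is None or node.reservation in online_conductors:
            continue
        # Skip nodes with no power transition in progress.
        if node.target_power_state is None:
            continue
        # Without a timestamp there is no way to judge the timeout.
        if node.power_updated_at is None:
            continue
        if now - node.power_updated_at >= timeout:
            stuck.append(node)
    return stuck
```

In this sketch the `timeout` argument plays the role of the configurable option, and a periodic task would call `find_stuck_nodes` on each run with the current time and the set of conductors known to be online.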
Warning: there is a possible problem here. How does one conductor clean up the reservation held by another conductor? We may need *something* here.
Outputs showing the error
=========================
Changed in ironic: | |
assignee: | nobody → Lucas Alvares Gomes (lucasagomes) |
description: | updated |
Changed in ironic: | |
status: | In Progress → Triaged |
Changed in ironic: | |
assignee: | Lucas Alvares Gomes (lucasagomes) → Dmitry Tantsur (divius) |
Marking as high because the workarounds are not ideal; we should also get better at avoiding deadlocks such as this one.