Nodes fail to clean when providing
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
tripleo | New | Undecided | Unassigned | | |
Bug Description
Some recent commits to master have broken the node cleaning in tripleo:
- `openstack overcloud node clean` does not switch the node to `cleaning`
- when using `node_clean=true` in `undercloud.conf`, the node fails to switch from `power off` to `power on`, giving the error `Failed to prepare node xxxxx for cleaning: IPMI call failed: power status.`
The introspection step works, and so does sending the commands manually. However, there are some clues about the issue:
- the power status update is heavily delayed: there is a long gap between when the command is sent and when the status is updated
- if requests are too frequent, the IPMI might not be able to cope. I confirmed this by keeping an `ipmitool shell` open, where I get issues similar to the ones the ironic module reports, i.e.:
  - `Unable to get Chassis Power Status`
  - `Error: Unable to establish LAN session`
  - when I create a new session, I am able to get the power status and everything else back
- it could be helpful not to depend fully on the IPMI status to update the state of the node. E.g. if an appropriate PXE boot request arrives after an IPMI power on command was sent, but before the power status could be confirmed, then it should skip that check and move on.
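The manual check described above (ask for the power status, and start a new session when the current one fails) can be sketched in Python. This is only an illustration under stated assumptions, not what ironic does: `wait_for_power_state` and `run_ipmitool` are hypothetical helper names, the 90 s timeout and 10 s interval are example values, and the connection arguments are whatever you normally pass to `ipmitool` for your BMC. Each attempt runs `ipmitool` as a new process, so it opens a new LAN session.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_ipmitool(args: Sequence[str]) -> str:
    """Run one ipmitool command in a fresh process, i.e. a fresh LAN session."""
    result = subprocess.run(
        ["ipmitool", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def wait_for_power_state(
    base_args: Sequence[str],
    wanted: str,
    timeout: float = 90.0,
    interval: float = 10.0,
    runner: Callable[[Sequence[str]], str] = run_ipmitool,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `power status` until the chassis reports `wanted` ("on" or "off").

    Failed calls are treated as "not yet" rather than as fatal errors.
    Returns False when another attempt would exceed the timeout.
    """
    deadline = clock() + timeout
    expected = f"chassis power is {wanted}"
    while True:
        try:
            output = runner([*base_args, "power", "status"])
            if output.strip().lower() == expected:
                return True
        except (subprocess.CalledProcessError, OSError):
            # The BMC did not answer or the session could not be set up;
            # fall through and try again with a new session.
            pass
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

A call could look like `wait_for_power_state(["-I", "lanplus", "-H", "<bmc-address>", "-U", "<user>", "-P", "<password>"], "on")`, with the placeholders replaced by your own BMC details.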
I have managed to find a "solution" to this issue:
Edit the `/var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf` so that it contains
```
[ipmi]
command_retry_timeout = 90
min_command_interval = 10
use_ipmitool_retries = true
```
The issue here is that ironic triggers `ipmitool` too frequently, with one call on top of another. Therefore I am using `use_ipmitool_retries` to avoid that overlap. IPMI commands also take a rather long time on the current network and old hardware, i.e. a call of `ipmitool power status` can take more time than `min_command_interval`, so I increase `command_retry_timeout` to compensate for that. Finally, I increase `min_command_interval` so that it doesn't flood the IPMI as much and it can report the `power status`.
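To make the reasoning behind the interval and timeout values more concrete, here is a rough Python sketch of the general pattern: never start two commands to the same BMC closer together than a minimum interval, and keep retrying a failed command until a timeout is reached. This is not ironic's implementation; `CommandThrottle` and its parameters are hypothetical names chosen for illustration, and the values 10 and 90 simply mirror the settings above.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CommandThrottle:
    """Space out commands to one BMC and retry failures until a timeout."""

    def __init__(
        self,
        min_interval: float = 10.0,
        retry_timeout: float = 90.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self.retry_timeout = retry_timeout
        self._clock = clock
        self._sleep = sleep
        self._last_start: Optional[float] = None

    def _wait_for_slot(self) -> None:
        # Sleep until at least min_interval has passed since the last start.
        if self._last_start is not None:
            remaining = self._last_start + self.min_interval - self._clock()
            if remaining > 0:
                self._sleep(remaining)
        self._last_start = self._clock()

    def call(self, command: Callable[[], T]) -> T:
        """Run `command`, retrying on any exception until retry_timeout."""
        deadline = self._clock() + self.retry_timeout
        while True:
            self._wait_for_slot()
            try:
                return command()
            except Exception:
                # Give up once the next attempt could not start in time.
                if self._clock() + self.min_interval > deadline:
                    raise
```

With a larger interval the BMC sees fewer overlapping requests, and with a larger timeout a slow `power status` reply still has a chance to arrive before the caller gives up.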