Nodes fail to clean when providing

Bug #1987364 reported by Cristian Le
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---------|--------|------------|-------------|-----------|
| tripleo | New | Undecided | Unassigned | |

Bug Description

Some recent commits to master have broken node cleaning in tripleo:
- `openstack overcloud node clean` does not switch the node to `cleaning`.
- With `node_clean=true` in `undercloud.conf`, the node fails to switch from `power off` to `power on` and reports the error `Failed to prepare node xxxxx for cleaning: IPMI call failed: power status.`

The introspection step works, and so does sending the commands manually. However, there are some clues about the issue:
- The reported power status updates long after the power command is sent.
- If requests are too frequent, the IPMI interface might not be able to cope. I confirmed this from an `ipmitool shell` session, where I get errors similar to those the ironic module reports, i.e.:
  - `Unable to get Chassis Power Status`
  - `Error: Unable to establish LAN session`
- When I create a new session, I can get the power status again and everything works.
- It could be helpful not to depend fully on the IPMI status when updating the state of the node. I.e. if the node makes a valid PXE boot request after an IPMI power on command was sent, but before the power status could be confirmed, then ironic should skip that check and move on.
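The flooding observation above can be reproduced outside ironic with a small probe. The following is a minimal sketch, not part of the original report: the helper names `ipmi_power_status` and `count_failures` are hypothetical, and the `lanplus` interface, the 30 second timeout and the credentials are assumptions to adapt to the actual BMC.

```python
import subprocess
import time
from typing import Callable


def ipmi_power_status(host: str, user: str, password: str) -> bool:
    """Return True if one `ipmitool power status` call succeeds.

    Assumes the BMC speaks IPMI v2.0 over LAN (`-I lanplus`).
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
           "-P", password, "power", "status"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=30)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung call or a missing ipmitool binary counts as a failure.
        return False
    return result.returncode == 0


def count_failures(check: Callable[[], bool], attempts: int,
                   interval: float,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `check` `attempts` times, pausing `interval` seconds between
    calls, and return how many calls failed."""
    failures = 0
    for i in range(attempts):
        if not check():
            failures += 1
        if i < attempts - 1:  # no pause needed after the last call
            sleep(interval)
    return failures
```

Running `count_failures` once with a short interval (for example 0.5 seconds) and once with a long one (for example 10 seconds) against the same node should show whether the failure rate depends on how often the BMC is queried.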

Cristian Le (lecris) wrote :

I have found a "solution" to this issue:
Edit `/var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf` so that it contains:
```
[ipmi]
command_retry_timeout = 90
min_command_interval = 10
use_ipmitool_retries = true
```

The issue is that ironic triggers `ipmitool` too frequently, with calls piling on top of one another, so I set `use_ipmitool_retries` to avoid that overlap. IPMI commands also take rather long on the current network and old hardware, i.e. a single call of `ipmitool power status` can take longer than `min_command_interval`, so I increase `command_retry_timeout` to compensate. Finally, I increase `min_command_interval` so that ironic does not flood the IPMI interface as much and it can report the `power status`.
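To illustrate how the two timing values interact, here is a minimal sketch of a retry loop bounded by them. It is not ironic's implementation: the function `call_with_retries` is hypothetical, and it only borrows the option names `min_command_interval` and `command_retry_timeout` as parameter names. The assumed semantics are that consecutive attempts start at least `min_command_interval` seconds apart and that retrying stops once `command_retry_timeout` seconds have elapsed.

```python
import time
from typing import Callable, Optional


def call_with_retries(command: Callable[[], Optional[str]],
                      min_command_interval: float,
                      command_retry_timeout: float,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `command` until it returns a value or the timeout expires.

    `command` returns None on failure. Attempts are started at least
    `min_command_interval` seconds apart.
    """
    deadline = clock() + command_retry_timeout
    last_start = None
    while True:
        now = clock()
        if last_start is not None:
            # Space attempts out so the BMC is not flooded.
            wait = min_command_interval - (now - last_start)
            if wait > 0:
                if now + wait > deadline:
                    raise TimeoutError("no time left for another attempt")
                sleep(wait)
        last_start = clock()
        result = command()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError("command kept failing until the timeout")
```

With an interval of 10 seconds and a timeout of 90 seconds, a slow but working BMC gets several well-spaced attempts before the call is declared failed, which matches the intent of the settings above.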
