validate hangs if ipmitool is unable to reach BMC

Bug #1314954 reported by aeva black
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: aeva black

Bug Description

The ipmitool driver has a configurable "retry_timeout" today -- this determines how long the set_power_state() method will wait for confirmation of a power state change.

However, if ipmitool is simply unable to reach the BMC -- for example, the ipmi_address is incorrect or unroutable -- then all commands, including even calling driver.power.validate(), will hang indefinitely, leaving dangling ipmitool processes and stuck python greenthreads.

Furthermore, this results in an RPC timeout in the API service, and does not provide the user with any valuable feedback, e.g. that the IPMI settings were incorrect.

aeva black (tenbrae)
Changed in ironic:
status: New → Triaged
importance: Undecided → High
milestone: none → juno-1
aeva black (tenbrae) wrote :

Changing to 'critical' since there is no way for the user to understand the cause of these failures. Even reading the conductor log files does not give any helpful information about this today.

Changed in ironic:
importance: High → Critical
importance: Critical → High
aeva black (tenbrae) wrote :

Correcting myself -- that's actually just the definition of "High".

Changed in ironic:
assignee: nobody → Ramakrishnan G (rameshg87)
Dmitry Tantsur (divius) wrote :

Hi Ramakrishnan G,

This bug has been assigned to you for more than a month without any progress, and its status is still "Triaged". Could you give a status update on it and change the status or assignee accordingly?

Ramakrishnan G (rameshg87) wrote :

Unassigning myself for now because I am not working on it.

In my initial triage, I couldn't see any way of making validate recognise quickly whether the node is reachable or not. There is no option in ipmitool to quickly check whether the address is reachable, nor can we find that out in another way. It might come down to the network setup, which determines the timeout for the packets sent for the ipmitool connection request.

Alternatively, this might get fixed if the bug below is fixed:
https://bugs.launchpad.net/ironic/+bug/1314961

Changed in ironic:
assignee: Ramakrishnan G (rameshg87) → nobody
Changed in ironic:
assignee: nobody → Mikhail Durnosvistov (mdurnosvistov)
aeva black (tenbrae) wrote :

Hi Mikhail,

This bug was targeted to the J1 milestone when you assigned this bug to yourself, though there hasn't been any visible progress or discussion about how to resolve it. With the milestone approaching, I'm going to reassign this bug.

Changed in ironic:
assignee: Mikhail Durnosvistov (mdurnosvistov) → Devananda van der Veen (devananda)
status: Triaged → In Progress
aeva black (tenbrae)
Changed in ironic:
milestone: juno-1 → juno-2
Mike Durnosvistov (glacierrdev) wrote :

Hi! :) I have made no progress on this bug, but I think we could add a `timeout` argument (flag) to the `execute` method [1] and, if it is passed, kill the process once the timeout expires.

[1] https://github.com/openstack/oslo-incubator/blob/master/openstack/common/processutils.py#L84
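
The timeout idea in the comment above could look roughly like the sketch below. It is only an illustration, assuming Python 3's standard `subprocess` module rather than the `processutils.execute` code the comment links to; the helper name `execute_with_timeout` and the sleeping child command are hypothetical stand-ins, and nothing here is the fix that was eventually merged.

```python
import subprocess
import sys


def execute_with_timeout(cmd, timeout):
    """Run cmd and return (returncode, stdout, stderr).

    Hypothetical helper, not the processutils API. If the command does not
    finish within `timeout` seconds, the child is killed and reaped, and
    TimeoutError is raised.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()         # stop the hung child
        proc.communicate()  # reap it so no dangling process is left behind
        raise TimeoutError(
            "command timed out after %s seconds: %r" % (timeout, cmd))
    return proc.returncode, out, err


if __name__ == "__main__":
    # Stand-in for a command that cannot reach its target and never returns.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    try:
        execute_with_timeout(hung, timeout=1)
    except TimeoutError as exc:
        print("gave up:", exc)
```

Killing and then reaping the child addresses the dangling-process symptom from the bug description, and raising an exception gives the caller something concrete to report instead of waiting for an RPC timeout.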

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/99121
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=6318ee1dd1d758c799a3cf09d0736de5f07bdd72
Submitter: Jenkins
Branch: master

commit 6318ee1dd1d758c799a3cf09d0736de5f07bdd72
Author: Devananda van der Veen <email address hidden>
Date: Tue Jun 10 07:46:44 2014 -0700

    Stop ipmitool.validate from touching the BMC

    Stop the IPMITool driver from calling 'mc guid' in validate().

    Validate is currently called synchronously when API requests are sent to
      GET /v1/node/NNN/validate

    While work is ongoing to make the API more asynchronous, this presents
    a particular issue in that a user can spam this URL and overwhelm the
    hardware node's BMC.

    Furthermore, validate() is called internally in several places, which is
    further contributing to BMC instability as reported in the related
    bug 1320513.

    Change-Id: I2414d2b07e2ab86c85ca18bc033368ddf43f7f43
    Closes-bug: #1314954
    Related-bug: #1320513
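
As a rough illustration of what the commit message describes, a validate() that never touches the BMC can restrict itself to checking the node's stored settings locally. The sketch below is an assumption for illustration, not the merged Ironic code: `REQUIRED_KEYS`, the `InvalidParameterValue` class defined here, and the plain-dict `driver_info` argument are all hypothetical, and only `ipmi_address` is taken from the bug description.

```python
class InvalidParameterValue(Exception):
    """Hypothetical error type for bad or missing driver settings."""


# Assumed list of settings to check; only ipmi_address comes from the report.
REQUIRED_KEYS = ("ipmi_address",)


def validate(driver_info):
    """Check driver settings locally, without running ipmitool.

    No network traffic is generated, so the call cannot hang on an
    unreachable BMC and cannot be used to flood the BMC with requests.
    """
    missing = [key for key in REQUIRED_KEYS if not driver_info.get(key)]
    if missing:
        raise InvalidParameterValue(
            "Missing IPMI parameters: " + ", ".join(missing))
```

The trade-off is that a check like this cannot detect an incorrect or unroutable address; it only confirms that the setting is present.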

Changed in ironic:
status: In Progress → Fix Committed
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-2 → 2014.2