default IPMI retry timeout is too long

Bug #1383432 reported by aeva black
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: Low
Assigned to: John Trowbridge

Bug Description

If a node is enrolled with an IPMI IP address that does not respond, the sync_power_state periodic task hangs for more than 10 minutes with the default settings. The timeout is configurable, but the default is much too high.
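The log excerpts below suggest the periodic task checks nodes one at a time: node 1 is not reserved until after node 4's IPMI call has failed. Under that assumption, one unreachable BMC delays every node queued behind it. A minimal sketch with a simulated clock; `Node`, `simulate_serial_sync` and the cost parameters are hypothetical illustrations, not Ironic code:

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    bmc_reachable: bool


def simulate_serial_sync(nodes, failure_cost, success_cost=1.0):
    """Return the simulated time (seconds) at which each node's check starts.

    Hypothetical model: nodes are checked sequentially, a reachable BMC
    answers in `success_cost` seconds, and an unreachable one blocks the
    loop for `failure_cost` seconds before the error is logged.
    """
    clock = 0.0
    start_times = {}
    for node in nodes:
        start_times[node.name] = clock
        # Nothing else runs until this node's check finishes.
        clock += success_cost if node.bmc_reachable else failure_cost
    return start_times


nodes = [Node("node-4", bmc_reachable=False), Node("node-1", bmc_reachable=True)]
print(simulate_serial_sync(nodes, failure_cost=757.0))
# {'node-4': 0.0, 'node-1': 757.0}
```

Here 757 seconds stands in for the delay between reserving node 4 and logging its failure in the first log below.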

Here are logs with the default [ipmi] retry_timeout of 60 seconds. Note the timestamps: 12 minutes pass before the failure is logged, there is no other periodic task activity during that time, and the node remains locked throughout.

2014-10-20 11:02:14.544 DEBUG ironic.conductor.task_manager [-] Attempting to reserve node 4 from (pid=13345) reserve_node /opt/stack/ironic/ironic/conductor/task_manager.py:189
2014-10-20 11:14:51.653 WARNING ironic.drivers.modules.ipmitool [-] IPMI power status failed for node a8cb6624-0d9f-4882-affc-046ebb96ec92 with error: Failed to create the password file. Unexpected error while running command.
Command: ipmitool -I lanplus -H 1.2.3.4 -L ADMINISTRATOR -U admin -R 12 -N 5 -f /tmp/tmpPuuGU4 power status
Exit code: 1
Stdout: ''
Stderr: 'Error: Unable to establish IPMI v2 / RMCP+ session\nError: Unable to establish IPMI v2 / RMCP+ session\nError: Unable to establish IPMI v2 / RMCP+ session\nUnable to get Chassis Power Status\n'.
2014-10-20 11:14:51.654 WARNING ironic.conductor.manager [-] During sync_power_state, could not get power state for node a8cb6624-0d9f-4882-affc-046ebb96ec92. Error: IPMI call failed: power status..
2014-10-20 11:15:51.694 DEBUG ironic.conductor.task_manager [-] Attempting to reserve node 1 from (pid=13345) reserve_node /opt/stack/ironic/ironic/conductor/task_manager.py:189
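The delay can be read directly off the log timestamps. A small sketch that parses the leading date and time of two quoted log lines and returns the gap; `elapsed_seconds` is a hypothetical helper, and the format string simply matches the lines above:

```python
from datetime import datetime

# Timestamp format used by the Ironic log lines quoted above.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps that begin two log lines."""
    def ts(line):
        # The first two whitespace-separated fields are the date and time.
        date, time = line.split()[:2]
        return datetime.strptime(f"{date} {time}", LOG_TS_FORMAT)
    return (ts(end_line) - ts(start_line)).total_seconds()


reserve = "2014-10-20 11:02:14.544 DEBUG ironic.conductor.task_manager [-] Attempting to reserve node 4"
failure = "2014-10-20 11:14:51.653 WARNING ironic.drivers.modules.ipmitool [-] IPMI power status failed"
print(elapsed_seconds(reserve, failure))  # 757.109
```

That is about 12.6 minutes between reserving node 4 and logging the IPMI failure.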

Here are logs with [ipmi] retry_timeout set to 5 seconds. Again, note the timestamps: the failure is logged after 30 seconds. I suggest changing the default value to 5 seconds.

2014-10-20 11:23:56.798 DEBUG ironic.conductor.task_manager [-] Attempting to reserve node 4 from (pid=23862) reserve_node /opt/stack/ironic/ironic/conductor/task_manager.py:189
2014-10-20 11:24:26.954 WARNING ironic.drivers.modules.ipmitool [-] IPMI power status failed for node a8cb6624-0d9f-4882-affc-046ebb96ec92 with error: Failed to create the password file. Unexpected error while running command.
Command: ipmitool -I lanplus -H 1.2.3.4 -L ADMINISTRATOR -U admin -R 1 -N 5 -f /tmp/tmp2tg79p power status
Exit code: 1
Stdout: ''
Stderr: 'Error: Unable to establish IPMI v2 / RMCP+ session\nError: Unable to establish IPMI v2 / RMCP+ session\nError: Unable to establish IPMI v2 / RMCP+ session\nUnable to get Chassis Power Status\n'.
2014-10-20 11:24:26.958 WARNING ironic.conductor.manager [-] During sync_power_state, could not get power state for node a8cb6624-0d9f-4882-affc-046ebb96ec92. Error: IPMI call failed: power status..
2014-10-20 11:25:26.983 DEBUG ironic.conductor.task_manager [-] Attempting to reserve node 1 from (pid=23862) reserve_node /opt/stack/ironic/ironic/conductor/task_manager.py:189
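Comparing the two ipmitool commands, only -R (retry count) changes: 12 with the 60-second timeout, 1 with 5 seconds, while -N (seconds between retransmissions) stays at 5. That is consistent with the retry count being the timeout divided by a fixed 5-second interval. A sketch of that arithmetic; the function name and the `min_command_interval` parameter are assumptions inferred from the logged commands, not taken from this report:

```python
def ipmitool_retry_args(retry_timeout, min_command_interval=5):
    """Build ipmitool -R/-N arguments from a total retry budget.

    Assumed model (inferred from the two logged commands): the number of
    retries is the timeout divided by the per-attempt interval, with at
    least one attempt.
    """
    retries = max(retry_timeout // min_command_interval, 1)
    return ["-R", str(retries), "-N", str(min_command_interval)]


print(ipmitool_retry_args(60))  # ['-R', '12', '-N', '5']
print(ipmitool_retry_args(5))   # ['-R', '1', '-N', '5']
```

The arithmetic alone (12 × 5 s = 60 s) does not account for the 12-minute delay in the first log, so the wall-clock cost of a failed call is evidently larger than this budget suggests.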

For reference, version information:

$ ipmitool -V
ipmitool version 1.8.13

$ dpkg-query --list 'ipmi*' | grep ipmitool | awk '{print $3}'
1.8.13-1ubuntu0.

$ lsb_release -a | grep Desc
Description: Ubuntu 14.04.1 LTS

Ironic version is commit SHA 4589ba37077687bff3dee2ee4e0a4f340282dc48

Tags: ipmi
aeva black (tenbrae)
Changed in ironic:
status: New → Confirmed
tags: added: ipmi
Dmitry Tantsur (divius)
Changed in ironic:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote: Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/131296

Changed in ironic:
assignee: nobody → John Trowbridge (trown)
status: Confirmed → In Progress
Ruby Loo (rloo) wrote:

It seems the tricky part is deciding on a default value. Devananda/Dan Prince changed it from 10 to 60 [1], but the way that option is used has changed since then. Whatever we change it to, we should make sure it works for both ipminative and ipmitool.

[1] https://review.openstack.org/#/c/82668/

OpenStack Infra (hudson-openstack) wrote: Fix merged to ironic (master)

Reviewed: https://review.openstack.org/131296
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=0b10b18bf1ff01ef8dfdd1e65ea0e9ac60f87315
Submitter: Jenkins
Branch: master

commit 0b10b18bf1ff01ef8dfdd1e65ea0e9ac60f87315
Author: John Trowbridge <email address hidden>
Date: Wed Feb 18 11:10:35 2015 -0500

    Add documentation for the IPMI retry timeout option

    It is difficult to have a universal default for the retry_timeout
    option in the [ipmi] configuration section. This patch adds
    documentation of the tradeoffs involved, while leaving the
    conservative default in place.

    Closes-Bug: 1383432
    Change-Id: Ic7973cf7fee60cc817e5c0f5f9bfc84b3bca91c7

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
milestone: none → kilo-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: kilo-3 → 2015.1.0