ipmi cmds run too fast, cause BMC to run out of resources

Bug #1291991 reported by Dan Prince
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
High
aeva black
OpenStack Compute (nova)
Won't Fix
High
Dan Prince

Bug Description

When using Nova baremetal the IPMI power commands are still proving to be too fast. I routinely get stack traces that look like this when deleting baremetal instances:

2 10:39:33.351 5112 TRACE nova.compute.manager [instance: 032250f7-3255-47f5-b866-35687dcd14ce] ProcessExecutionError: Unexpected error while running command.
Mar 12 10:39:33 undercloud-undercloud-7be7u2y6y5cz nova-compute[5112]: 2014-03-12 10:39:33.351 5112 TRACE nova.compute.manager [instance: 032250f7-3255-47f5-b866-35687dcd14ce] Command: ipmitool -I lanplus -H 10.1.8.23 -U ooo-dev -f /tmp/tmpMa8D4u power status
Mar 12 10:39:33 undercloud-undercloud-7be7u2y6y5cz nova-compute[5112]: 2014-03-12 10:39:33.351 5112 TRACE nova.compute.manager [instance: 032250f7-3255-47f5-b866-35687dcd14ce] Exit code: 1
Mar 12 10:39:33 undercloud-undercloud-7be7u2y6y5cz nova-compute[5112]: 2014-03-12 10:39:33.351 5112 TRACE nova.compute.manager [instance: 032250f7-3255-47f5-b866-35687dcd14ce] Stdout: ''
Mar 12 10:39:33 undercloud-undercloud-7be7u2y6y5cz nova-compute[5112]: 2014-03-12 10:39:33.351 5112 TRACE nova.compute.manager [instance: 032250f7-3255-47f5-b866-35687dcd14ce] Stderr: 'Error in open session response message : insufficient resources for session\n\nError: Unable to establish IPMI v2 / RMCP+ session\nUnable to get Chassis Power Status\n'
Mar 12 10:39:33 undercloud-undercloud-7be7u2y6y5cz nova-compute[5112]: 2014-03-12 10:39:33.351 5112 TRACE nova.compute.manager [instance: 032250f7-3255-47f5-b866-35687dcd14ce]
Mar 12 10:39:33 undercloud-undercloud-7be7u2y6y5cz nova-compute[5112]: 2014-03-12 10:39:33.931 5112 ERROR oslo.messaging.rpc.dispatcher [-] Exception during message handling: Unexpected error while running command.

----

The root cause seems to be in the _power_off routine which repeatedly calls "power status" to determine if the instance has properly powered down after issuing the "power off". Once this fails simply resetting the instance state and retrying the delete again usually fixes the issue.

On the CLI the same commands always seem to work as well.

It does seem like our retry code is still too aggressive and we need to wait longer for each IPMI retry.

Dan Prince (dan-prince)
Changed in nova:
assignee: nobody → Dan Prince (dan-prince)
importance: Undecided → High
status: New → In Progress
Changed in ironic:
assignee: nobody → Dan Prince (dan-prince)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/80397

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/80400

aeva black (tenbrae)
summary: - ipmi cmds run to fast
+ ipmi cmds run too fast, cause BMC to run out of resources
Changed in ironic:
importance: Undecided → High
milestone: none → icehouse-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/82668

Changed in ironic:
assignee: Dan Prince (dan-prince) → Devananda van der Veen (devananda)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/82668
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=a6146ed69159bef7eec917f5efa71bc6dcbc7f7c
Submitter: Jenkins
Branch: master

commit a6146ed69159bef7eec917f5efa71bc6dcbc7f7c
Author: Dan Prince <email address hidden>
Date: Thu Mar 13 16:52:19 2014 -0400

    Run ipmi power status less aggressively

    Ironic runs IPMI 'power status' in a loop until the desired state is
    reached. Running repeated IPMI commands in a tight loop (1 second sleep)
    seems to occasionally cause 'insufficient resources' errors with some
    hardware types.

    In this commit we update the IPMI retry timer to use a
    DynamicLoopingCall which can wait longer for each retry of the power
    status command. We increase the wait time exponentially
    between each call, until the total time would exceed the max retry time
    on the next iteration.

    This patch also removes the spurious initial 'power status' call which
    preceeded the 'power on' or 'power off' call without any gap. Since
    Ironic has a periodic task to sync the power state, it is safe to assume
    that the hardware's current power state is already known and does not
    need to be verified inside the set_power_state method.

    As part of this change we also increase the default IPMI retry_timeout
    config to 60 seconds.

    Co-authored-by: Dan Prince <email address hidden>

    Change-Id: Ie0e29239592bb64c7a10ecedae299df2c5053359
    Closes-bug: #1291991

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: icehouse-rc1 → 2014.1
Dan Prince (dan-prince)
Changed in nova:
status: In Progress → Won't Fix
Revision history for this message
Dan Sneddon (dsneddon) wrote :

I attempted to apply the proposed patch by Dan Prince into the Icehouse codebase that still uses Nova Baremetal, and it did not solve the issue. I was still running into the issue when using DynamicLoopingCall, but only against certain DRAC firmware.

DRAC 7 1.40.40 - Not affected
DRAC 7 1.45.45 - Not affected
DRAC 7 1.57.57 - Affected, even after applying patch

I have attached a shell script that triggers the bug. This is an attempt to duplicate what happens with DynamicLoopingCall, although my understanding of DynamicLoopingCall may be incorrect (does it double the backoff time, or just increase it?) It is called with:
$ ./power.sh <username> <password> <ipaddr> off

I only ran into this bug when turning the hosts off, they were less sensitive when powering on. When powering off, this shell script (which introduces a 1 second longer delay each time) will trigger the bug about 2 out of 3 runs.

My concern is that we may not have fixed the issue by changing from FixedIntervalLoopingCall to DynamicLoopingCall in Ironic, if the bug is still triggered sometimes. Perhaps we need logic to accept the "insufficient resources" error as transient, and retry rather than error out.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.