ipmi power probing is too fast

Bug #1234479 reported by Robert Collins
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
High
aeva black
OpenStack Compute (nova)
Fix Released
High
Joe Gordon

Bug Description

baremetal nodes are failing to boot for me:

2013-10-03 01:55:15,392.392 7354 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.10.16.27 -U Administrator -f /tmp/tmpMxP1Xr power on execute /opt/stack/venvs/nova/local/lib/pyt
hon2.7/site-packages/nova/openstack/common/processutils.py:147
2013-10-03 01:55:15,573.573 7354 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power Control: Up/On
', stderr: '' _exec_ipmitool /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/ipmi.py:136
2013-10-03 01:55:15,707.707 7354 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.10.16.27 -U Administrator -f /tmp/tmpRYI8dn power status execute /opt/stack/venvs/nova/local/lib
/python2.7/site-packages/nova/openstack/common/processutils.py:147
2013-10-03 01:55:15,890.890 7354 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power is off
', stderr: '' _exec_ipmitool /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/ipmi.py:136

Note the gap between up and status checking is only 200ms (from IPMI return to status check) but in that interval this hardware doesn't toggle to on.

After the deploy fails, a manual check with
ipmitool -I lanplus -H 10.10.16.40 -U Administrator -P <password> chassis power status
returns
Chassis Power is on

Tags: baremetal
Revision history for this message
Robert Collins (lifeless) wrote :
Download full text (5.9 KiB)

Full trace:
2013-10-03 01:55:15,114.114 7354 DEBUG nova.openstack.common.rpc.amqp [-] UNIQUE_ID is 165698139cd7474598b27a01932974c4. _add_unique_id /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/openstack/common/rpc/amqp.py:
341
2013-10-03 01:55:15,207.207 7354 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.10.16.27 -U Administrator -f /tmp/tmppgqG40 power status execute /opt/stack/venvs/nova/local/lib
/python2.7/site-packages/nova/openstack/common/processutils.py:147
2013-10-03 01:55:15,391.391 7354 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power is off
', stderr: '' _exec_ipmitool /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/ipmi.py:136
2013-10-03 01:55:15,392.392 7354 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.10.16.27 -U Administrator -f /tmp/tmpMxP1Xr power on execute /opt/stack/venvs/nova/local/lib/pyt
hon2.7/site-packages/nova/openstack/common/processutils.py:147
2013-10-03 01:55:15,573.573 7354 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power Control: Up/On
', stderr: '' _exec_ipmitool /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/ipmi.py:136
2013-10-03 01:55:15,707.707 7354 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.10.16.27 -U Administrator -f /tmp/tmpRYI8dn power status execute /opt/stack/venvs/nova/local/lib
/python2.7/site-packages/nova/openstack/common/processutils.py:147
2013-10-03 01:55:15,890.890 7354 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power is off
', stderr: '' _exec_ipmitool /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/ipmi.py:136
2013-10-03 01:55:15,891.891 7354 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.10.16.27 -U Administrator -f /tmp/tmp6FRuex power on execute /opt/stack/venvs/nova/local/lib/pyt
hon2.7/site-packages/nova/openstack/common/processutils.py:147
2013-10-03 01:55:16,069.069 7354 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power Control: Up/On
', stderr: '' _exec_ipmitool /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/ipmi.py:136
2013-10-03 01:55:16,207.207 7354 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.10.16.27 -U Administrator -f /tmp/tmp5k3q5r power status execute /opt/stack/venvs/nova/local/lib
/python2.7/site-packages/nova/openstack/common/processutils.py:147
2013-10-03 01:55:16,432.432 7354 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power is off
', stderr: '' _exec_ipmitool /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/ipmi.py:136
2013-10-03 01:55:16,433.433 7354 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.10.16.27 -U Administrator -f /tmp/tmpFCSyJk power on execute /opt/stack/venvs/nova/local/lib/pyt
hon2.7/site-packages/nova/openstack/common/processutils.py:147
2013-10-03 01:55:16,613.613 7354 DEBUG nova.virt.baremetal.ipmi [-] i...

Read more...

Revision history for this message
aeva black (tenbrae) wrote :

Here is the (abbreviated) code in nova/virt/baremetal/ipmi.py:

141 def _power_on(self):
144 def _wait_for_power_on():
147 if self.is_power_on():
148 self.state = baremetal_states.ACTIVE
149 raise loopingcall.LoopingCallDone()
...
155 try:
156 self.retries += 1
157 self._exec_ipmitool("power on")
...
162 timer = loopingcall.FixedIntervalLoopingCall(_wait_for_power_on)
163 timer.start(interval=0.5).wait()

I'm wondering whether FixedIntervalLoopingCall is able to do sub-second timing. Perhaps changing the interval to 1 would alleviate this issue.

summary: - ipmi power probing is to fast
+ ipmi power probing is too fast
Revision history for this message
Joe Gordon (jogo) wrote :

So the 0.2 second difference is another issue.

the _wait_for_power_on call, which is called in a loop. Checks if the power is on each time. And if not re-issues a power on command (https://github.com/openstack/nova/blob/master/nova/virt/baremetal/ipmi.py#L157). So you need to measure the time between power on calls not between power status and power on. It turns out that the difference is a bout 0.5 seconds if not longer.

 I don't know enough about IPMI to say what happens when we keep issuing 'ipmi power on every 0.5 seconds' but that *may* not be good?

Also I don't know what the average time for IPMI to respond is, but I think slowing down the polling interval to 1 second can't hurt. Also we may want to bump up the default value for CONF.baremetal.ipmi_power_retry

Revision history for this message
Joe Gordon (jogo) wrote :

After some testing it looks like ipmi doesn't like being bombarded with 'power on/off' every 0.5 seconds.

It looks like power on takes <0.5 seconds and power off takes around 4 seconds so I propose a multi prong fix:

* move power on/off to befor the loop
* bump interval to 1 second and loop for 10 seconds, as why not play things a little safer (for the 4 second power off especially).

Changed in ironic:
assignee: nobody → Joe Gordon (jogo)
assignee: Joe Gordon (jogo) → nobody
Changed in nova:
assignee: nobody → Joe Gordon (jogo)
Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 1234479] Re: ipmi power probing is too fast

I think power on and off can both be slow - my log shows 5+seconds for
power on; agree on the changes.

On 4 October 2013 13:36, Joe Gordon <email address hidden> wrote:
> After some testing it looks like ipmi doesn't like being bombarded with
> 'power on/off' every 0.5 seconds.
>
> It looks like power on takes <0.5 seconds and power off takes around 4
> seconds so I propose a multi prong fix:
>
> * move power on/off to befor the loop
> * bump interval to 1 second and loop for 10 seconds, as why not play things a little safer (for the 4 second power off especially).
>
>
> ** Changed in: nova
> Assignee: (unassigned) => Joe Gordon (jogo)
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1234479
>
> Title:
> ipmi power probing is too fast
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ironic/+bug/1234479/+subscriptions

--
Robert Collins <email address hidden>
Distinguished Technologist
HP Converged Cloud

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/49658

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/49658
Committed: http://github.com/openstack/nova/commit/6e30bbc126a1ad4eb87e735c812832f4fcc0054e
Submitter: Jenkins
Branch: master

commit 6e30bbc126a1ad4eb87e735c812832f4fcc0054e
Author: Joe Gordon <email address hidden>
Date: Thu Oct 3 17:48:13 2013 -0700

    Baremetal: Be more patient with IPMI and BMC

    Before we called 'power status; power on' in a loop which made the
    IPMI/BMCs not behave well. Also the total time we would wait (2.5
    seconds wasn't always enough). So make sure power on/off is only called
    once and wait up to 10 seconds for the power state change to go into
    effect.

    This patch has been tested on real baremetal using
    https://wiki.openstack.org/wiki/TripleO/TripleOCloud

    This bug is also linked to ironic so the change will be made there as
    well.

    Change-Id: I5a4d7c84ebdf9c1f7d8d0570dbc31764c31f1fc6
    Closes-Bug: #1234479

Changed in nova:
status: In Progress → Fix Committed
tags: added: havana-rc-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-rc2
tags: removed: havana-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (milestone-proposed)

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/50935

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (milestone-proposed)

Reviewed: https://review.openstack.org/50935
Committed: http://github.com/openstack/nova/commit/c965c2b590a877994604d9163cd52446e0bd4c8a
Submitter: Jenkins
Branch: milestone-proposed

commit c965c2b590a877994604d9163cd52446e0bd4c8a
Author: Joe Gordon <email address hidden>
Date: Thu Oct 3 17:48:13 2013 -0700

    Baremetal: Be more patient with IPMI and BMC

    Before we called 'power status; power on' in a loop which made the
    IPMI/BMCs not behave well. Also the total time we would wait (2.5
    seconds wasn't always enough). So make sure power on/off is only called
    once and wait up to 10 seconds for the power state change to go into
    effect.

    This patch has been tested on real baremetal using
    https://wiki.openstack.org/wiki/TripleO/TripleOCloud

    This bug is also linked to ironic so the change will be made there as
    well.

    Change-Id: I5a4d7c84ebdf9c1f7d8d0570dbc31764c31f1fc6
    Closes-Bug: #1234479
    (cherry picked from commit 6e30bbc126a1ad4eb87e735c812832f4fcc0054e)

Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-rc2 → 2013.2
Revision history for this message
aeva black (tenbrae) wrote :

This bug also affects ironic/drivers/modules/ipmitool.py. The port should be fairly straight forward as the code is only slightly modified from Nova's baremetal ipmi driver.

Changed in ironic:
status: New → Triaged
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/52492

Changed in ironic:
assignee: nobody → Devananda van der Veen (devananda)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/52492
Committed: http://github.com/openstack/ironic/commit/d5c72d122504a8d6684c9d464c1b7e3d3b4a192d
Submitter: Jenkins
Branch: master

commit d5c72d122504a8d6684c9d464c1b7e3d3b4a192d
Author: Devananda van der Veen <email address hidden>
Date: Thu Oct 17 12:45:55 2013 -0700

    Be more patient with IPMI and BMC

    Before the ipmitool driver called 'power status; power on' in a loop
    which made the IPMI/BMCs not behave well. Also the total time we would
    wait (5 seconds) wasn't always enough. So make sure power on/off is
    only called once and wait up to 10 seconds for the power state change to
    go into effect.

    This is a port of change-id I5a4d7c84ebdf9c1f7d8d0570dbc31764c31f1fc6

    Change-Id: I50982b24e5443b34d5597581ea2947a0301d35c9
    Closes-Bug: #1234479

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
milestone: none → icehouse-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: icehouse-1 → 2014.1
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.