[2.2, 2.1, 1.9] Unable to power manage IPMI if BMC does not support changing boot order

Bug #1516065 reported by Claude Durocher on 2015-11-13
This bug affects 6 people
Affects   Importance   Assigned to
MAAS      High         Newell Jensen
1.9       High         Newell Jensen
2.1       High         Newell Jensen
Trunk     High         Newell Jensen

Bug Description

I'm testing maas 1.9.0~rc1+bzr4488-0ubuntu1 on trusty. Commissioning a server with IPMI LAN_2_0 fails. The same server commissions fine in version 1.8, and a manual test with ipmipower also works fine.

/var/log/maas/clusterd.log:

2015-11-12 16:56:33-0500 [ClusterClient,client] little-stop: Power could not be turned on.
 Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 382, in callback
     self._startRunCallbacks(result)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1155, in gotResult
     _inlineCallbacks(r, g, deferred)
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/power/change.py", line 287, in change_power_state
     system_id, hostname, power_type, power_change, context)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/power/__init__.py", line 276, in perform_power
     power_func, system_id, **kwargs)
   File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
     result = context.call(ctx, function, *args, **kwargs)
   File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/power/ipmi.py", line 163, in power_on
     self._issue_ipmi_command('on', **kwargs)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/power/ipmi.py", line 144, in _issue_ipmi_command
     ipmi_chassis_config_command, power_change, power_address, env)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/drivers/power/ipmi.py", line 82, in _issue_ipmi_chassis_config_command
     "Failed to power %s %s: %s" % (change, address, stderr))
 provisioningserver.drivers.power.PowerFatalError: Failed to power on 192.168.100.4:

ipmipower -h 192.168.100.4 -D LAN_2_0 -u maas -p password -s
192.168.100.4: off

ipmitool -H 192.168.100.4 -I lanplus -U maas -P 'password' mc info

Device ID : 17
Device Revision : 1
Firmware Revision : 2.25
IPMI Version : 2.0
Manufacturer ID : 11
Manufacturer Name : Hewlett-Packard
Product ID : 8192 (0x2000)
Product Name : Unknown (0x2000)
Device Available : yes
Provides Device SDRs : yes
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device

Changed in maas:
assignee: nobody → Newell Jensen (newell-jensen)
Andres Rodriguez (andreserl) wrote :

Hi Claude,

We have tested various machines against IPMI without any issues. Can you please let us know what type of machine you are using?

Thanks

Andres Rodriguez (andreserl) wrote :

Also,

MAAS RC release is 1.9.0~rc1+bzr4496-0ubuntu1~wily1 and it seems you are using bzr4488. Any chance you can upgrade?

ppa:maas/next.

Thanks.

Andres Rodriguez (andreserl) wrote :

I think I know what your issue is. Can you do the following:

vim ipmi.cfg

and copy:

Section Chassis_Boot_Flags
        Boot_Flags_Persistent No
        Boot_Device PXE
EndSection

Then do:

ipmi-chassis-config -D LAN_2_0 -h 192.168.100.4 -u maas -p password --commit --filename ipmi.cfg

And post the result?

Thanks
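
The manual test above can also be scripted so that the exit code and stderr are captured along with any output, which is useful if the command fails without printing anything. This is a minimal sketch, assuming FreeIPMI's ipmi-chassis-config is on PATH. The helper names (build_chassis_config_command, run_boot_order_test) are made up for illustration and are not part of MAAS.

```python
import subprocess
import tempfile

# Same configuration as in the comment above: one-time PXE boot override.
PXE_BOOT_CONFIG = """\
Section Chassis_Boot_Flags
        Boot_Flags_Persistent No
        Boot_Device PXE
EndSection
"""


def build_chassis_config_command(host, user, password, filename,
                                 driver="LAN_2_0"):
    """Return the ipmi-chassis-config argv used in the manual test."""
    return [
        "ipmi-chassis-config",
        "-D", driver,
        "-h", host,
        "-u", user,
        "-p", password,
        "--commit",
        "--filename", filename,
    ]


def run_boot_order_test(host, user, password):
    """Write the config to a temp file, run the command, return the result.

    Returns (returncode, stdout, stderr) so that a silent failure
    (non-zero exit code with no output) is still visible.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".cfg") as cfg:
        cfg.write(PXE_BOOT_CONFIG)
        cfg.flush()
        command = build_chassis_config_command(
            host, user, password, cfg.name)
        proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr
```

Calling run_boot_order_test("192.168.100.4", "maas", "password") and posting all three returned values would show whether the command exits non-zero even when it prints nothing.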

Claude Durocher (claude-durocher) wrote :

reply to #1: HP Proliant Gen5 servers
reply to #2: I used ppa:maas/next-proposed, I will try with ppa:maas/next
reply to #3: I got no output from the command (and no errors)

Changed in maas:
status: New → Triaged
importance: Undecided → High
Claude Durocher (claude-durocher) wrote :

Same error with ppa:maas/next :

||/ Name                       Version                   Architecture Description
+++-==========================-=========================-============-======================================
ii  maas                       1.9.0~rc1+bzr4496-0ubuntu all          MAAS server all-in-one metapackage
ii  maas-cli                   1.9.0~rc1+bzr4496-0ubuntu all          MAAS command line API tool
ii  maas-cluster-controller    1.9.0~rc1+bzr4496-0ubuntu all          MAAS server cluster controller
ii  maas-common                1.9.0~rc1+bzr4496-0ubuntu all          MAAS server common files
ii  maas-dhcp                  1.9.0~rc1+bzr4496-0ubuntu all          MAAS DHCP server
ii  maas-dns                   1.9.0~rc1+bzr4496-0ubuntu all          MAAS DNS server
ii  maas-proxy                 1.9.0~rc1+bzr4496-0ubuntu all          MAAS Caching Proxy
ii  maas-region-controller     1.9.0~rc1+bzr4496-0ubuntu all          MAAS server complete region controller
ii  maas-region-controller-min 1.9.0~rc1+bzr4496-0ubuntu all          MAAS Server minimum region controller

Claude Durocher (claude-durocher) wrote :

I did a little bit more digging: it seems ipmi-chassis-config always exits with return code 1 and no output.

I also tried to print the chassis config and here's the result (notice there's no EndSection for Chassis_Boot_Flags):

#
# Section Chassis_Front_Panel_Buttons Comments
#
# The following configuration options are for enabling or disabling button
# functionality on the chassis. Button may refer to a pushbutton, switch, or
# other front panel control built into the system chassis.
#
# The value of the below may not be able to be checked out. Therefore we
# recommend the user configure all four fields rather than a subset of them,
# otherwise some assumptions on configure may be made.
#
Section Chassis_Front_Panel_Buttons
 ## Possible values: Yes/No
 Enable_Standby_Button_For_Entering_Standby Yes
 ## Possible values: Yes/No
 Enable_Diagnostic_Interrupt_Button Yes
 ## Possible values: Yes/No
 Enable_Reset_Button Yes
 ## Possible values: Yes/No
 Enable_Power_Off_Button_For_Power_Off_Only Yes
EndSection
#
# Section Chassis_Power_Conf Comments
#
# The following configuration options are for configuring chassis power
# behavior.
#
# The "Power_Restore_Policy" determines the behavior of the machine when AC
# power returns after a power loss. The behavior can be set to always power on
# the machine ("On_State_AC_Apply"), power off the machine
# ("Off_State_AC_Apply"), or return the power to the state that existed before
# the power loss ("Restore_State_AC_Apply").
#
# The "Power_Cycle_Interval" determines the time the system will be powered down
# following a power cycle command.
#
Section Chassis_Power_Conf
 ## Possible values: Off_State_AC_Apply/Restore_State_AC_Apply/On_State_AC_Apply
 Power_Restore_Policy Restore_State_AC_Apply
 ## Give value in seconds
 ## Power_Cycle_Interval
EndSection
#
# Section Chassis_Boot_Flags Comments
#
# The following configuration options are for configuring chassis boot behavior.
# Please note that some fields may apply to all future boots while some may only
# apply to the next system boot.
#
# "Boot_Flags_Persistent" determines if flags apply to the next boot only or all
# future boots.
#
# "Boot_Device" allows the user to configure which device the BIOS should boot
# off of. Most users may wish to select NO-OVERRIDE to select the configuration
# currently determined by the BIOS. Note that the configuration value BIOS-SETUP
# refers to booting *into* the BIOS Setup, not from it. FLOPPY may refer to any
# type of removeable media. "Device_Instance_Selector" may be be used to select
# a specific device instance for booting.
#
Section Chassis_Boot_Flags

Claude Durocher (claude-durocher) wrote :

Here's a debug log of ipmi-chassis-config: http://pastebin.com/01HyKgfp

Notice the ipmi_cmd_get_system_boot_options_boot_flags: session timeout

Claude Durocher (claude-durocher) wrote :

I've disabled the return code check in _issue_ipmi_chassis_config_command, restarted maas-clusterd and now the nodes are commissioning fine:

        # if process.returncode != 0:
        #     raise PowerFatalError(
        #         "Failed to power %s %s: %s" % (change, address, stderr))

As a side note, our current firmware on the iLO2 is: 2.25 04/14/2014
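
Rather than commenting the check out entirely, the failure could be downgraded to a warning so that it is still recorded in the log. The sketch below only illustrates that idea and is not the change that was eventually committed to MAAS. set_boot_order_best_effort and its runner parameter are hypothetical names, and the real driver method has a different signature.

```python
import logging
import subprocess

log = logging.getLogger("ipmi-sketch")


def set_boot_order_best_effort(command, address, runner=subprocess.run):
    """Run the chassis config command; warn instead of raising on failure.

    Returns True if the command exited with 0, False otherwise. The caller
    can then go on to issue the power command in either case.
    """
    proc = runner(command, capture_output=True, text=True)
    if proc.returncode != 0:
        log.warning(
            "Failed to change the boot order to PXE on %s: %s",
            address, proc.stderr.strip())
        return False
    return True
```

The runner parameter exists only so the function can be exercised without a BMC; in normal use the default subprocess.run is what executes the command.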

Ryan Beisner (1chb1n) wrote :

After upgrade from 1.8 to 1.9RC2, our ppc64el machine won't power on or commission. We observe:

http://paste.ubuntu.com/13351412/

Ryan Beisner (1chb1n) wrote :

FYI: Installed: 1.9.0~rc2+bzr4504-0ubuntu1~trusty1

Also, the amd64 (Dell PowerEdge R610) machines power on/off and commission without issue. AFAICT, this is only affecting our ppc64el machine.

tags: added: uosci
summary: - Unable to commision nodes in maas 1.9 with IPMI
+ [1.9] Unable to power manage IPMI node if BMC does not support changing
+ boot order
summary: - [1.9] Unable to power manage IPMI node if BMC does not support changing
- boot order
+ [1.9] Unable to power manage IPMI if BMC does not support changing boot
+ order
Ryan Beisner (1chb1n) wrote : Re: [1.9] Unable to power manage IPMI if BMC does not support changing boot order

This workaround allowed me to commission the ppc64el node:
1.8.2+bzr4041-0ubuntu1~trusty1 + patch from http://paste.ubuntu.com/13351548/

clusterd log shows this:
Nov 19 19:47:22 lescina maas.node: [INFO] gregory-ppc64.dellstack: Commissioning started
Nov 19 19:47:22 lescina maas.drivers.power.ipmi: [WARNING] Failed to change the boot order to PXE on: ipmi_ctx_open_outofband_2_0: BMC busy
Nov 19 19:47:23 lescina maas.drivers.power.ipmi: [WARNING] Failed to change the boot order to PXE on: ERROR: Section post-commit `Chassis_Boot_Flags'
Nov 19 19:47:27 lescina maas.drivers.power.ipmi: message repeated 2 times: [ [WARNING] Failed to change the boot order to PXE on: ERROR: Section post-commit `Chassis_Boot_Flags']
Nov 19 19:47:32 lescina maas.power: [INFO] Changed power state (on) of node: gregory-ppc64.dellstack (node-11c03686-9d7f-11e4-91da-d4bed9a84493)

Christian Reis (kiko) on 2015-11-19
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
milestone: none → 1.9.0
Changed in maas:
status: Fix Committed → Fix Released
Andrew (amoss-6) wrote :

Hi Andres,
I get the same issue against a DL360g5 in MAAS 2.0. Is there any way to turn off boot order setup?

The output of ipmi-chassis-config is a session timeout:

amoss@maastest1-stl:~$ ipmi-chassis-config -D LAN_2_0 -h 10.3.7.53 -u maas -p 48FblJpEy2NB --commit --filename ipmi.cfg
ipmi_cmd_get_system_boot_options_boot_flags: session timeout

Keesjan Karsten (keesjank) wrote :

The patch in comment #11 doesn't seem to work for me. I'm using MAAS Version 2.0.0+bzr5189-0ubuntu1 (16.04.1) which seems to have the patch applied.

I have a couple of ProLiant BL460c G6, iLO 2 Firmware Version: 2.29 07/16/2015, which won't commission. I get the error in /var/log/maas/rackd.log ("provisioningserver.drivers.power.PowerConnError: The IPMI session has timed out. MAAS performed several retries. Check BMC configuration and connectivity and try again.").

The /var/log/maas/maas.log part:
[...]
Oct 6 08:07:44 maas maas.node: [INFO] blade10: Status transition from FAILED_COMMISSIONING to COMMISSIONING
Oct 6 08:07:44 maas maas.power: [INFO] Changing power state (on) of node: blade10 (4y3h8c)
Oct 6 08:07:44 maas maas.node: [INFO] blade10: Commissioning started
Oct 6 08:08:04 maas maas.drivers.power.ipmi: [WARNING] Failed to change the boot order to PXE 10.10.1.20: ipmi_cmd_get_system_boot_options_boot_flags: session timeout
Oct 6 08:08:29 maas maas.drivers.power.ipmi: [WARNING] Failed to change the boot order to PXE 10.10.1.20: ipmi_cmd_get_system_boot_options_boot_flags: session timeout
Oct 6 08:08:57 maas maas.drivers.power.ipmi: [WARNING] Failed to change the boot order to PXE 10.10.1.20: ipmi_cmd_get_system_boot_options_boot_flags: session timeout
Oct 6 08:09:09 maas maas.power: [ERROR] Error changing power state (on) of node: blade10 (4y3h8c)
Oct 6 08:09:09 maas maas.node: [INFO] blade10: Status transition from COMMISSIONING to FAILED_COMMISSIONING
Oct 6 08:09:09 maas maas.node: [ERROR] blade10: Marking node failed: Node could not be powered on: Could not contact node's BMC: The IPMI session has timed out. MAAS performed several retries. Check BMC configuration and connectivity and try again.
[...]

When testing with ipmi-chassis-config, I get the same result as in comment #12, and the debug output looks similar to that in previous comments.

The G7's I have commission ok though (iLO 3, firmware version 1.85 May 13 2015).

The BMC in the G6 seems to be flaky and doesn't respond well to the request to change the boot order to boot from PXE.

Since I manually set the boot order to PXE boot in all machines via the iLO web administration, I figured I don't really need this ipmi-chassis-config command anyway. I commented it out from /usr/lib/python3/dist-packages/provisioningserver/drivers/power/ipmi.py (see below), restarted rackd ("sudo systemctl restart maas-rackd.service") and the commissioning works fine now for all machines.

ubuntu@maas:~$ sudo vi /usr/lib/python3/dist-packages/provisioningserver/drivers/power/ipmi.py
[...]
        # Update the chassis config and power commands.
        ipmi_chassis_config_command.extend(common_args)
        ipmi_chassis_config_command.append('--commit')
        ipmipower_command.extend(common_args)

        # Before changing state run the chassis config command.
        # if power_change in ("on", "off"):
        #     self._issue_ipmi_chassis_config_command(
        #         ipmi_chassis_config_command, power_change, power_address)

        # Additional arguments for the power command.
        if power_change == 'on':
            ipmipower_command.append('--cycle')
            ipmipowe...
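
An edit to the installed driver file is likely to be overwritten by the next package upgrade, so the same effect could instead be expressed as an explicit option. The following is a standalone sketch of that idea. The skip_boot_order flag and plan_power_commands function are assumptions made for this example and do not exist in MAAS. Only the `--commit` and `--cycle` arguments and the "on"/"off" check come from the snippet above; starting each argument list with the program name is also an assumption.

```python
def plan_power_commands(power_change, common_args, skip_boot_order=False):
    """Return the argv lists to run, in order, for a power change.

    With skip_boot_order=True the chassis config step is left out, which
    suits machines whose boot order was already set to PXE by hand.
    """
    chassis_config = ["ipmi-chassis-config", *common_args, "--commit"]
    ipmipower = ["ipmipower", *common_args]

    commands = []
    # Before changing state, optionally run the chassis config command.
    if power_change in ("on", "off") and not skip_boot_order:
        commands.append(chassis_config)

    # Additional arguments for the power command. Only the 'on' case is
    # visible in the quoted snippet, so other cases are not modelled here.
    if power_change == "on":
        ipmipower.append("--cycle")
    commands.append(ipmipower)
    return commands


if __name__ == "__main__":
    for argv in plan_power_commands(
            "on", ["-h", "10.10.1.20"], skip_boot_order=True):
        print(" ".join(argv))
```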

summary: - [1.9] Unable to power manage IPMI if BMC does not support changing boot
- order
+ [2.2, 2.1, 1.9] Unable to power manage IPMI if BMC does not support
+ changing boot order
Brian Morton (rokclimb15) wrote :

This appears to be due to an iLO2 firmware bug. On a DL360G6 with iLO2 I see the same behavior when trying to get/set the boot flags, even though they're supported. If you add --debug to the ipmi-config command, you'll notice the data gets retransmitted many times as it tries, and fails, to compute the correct checksum. Adding -W nochecksumcheck to the command skips this check and works around the firmware bug. With that flag, the command works as expected.

Since the ipmi-config manpage suggests this workaround is safe in the majority of cases, you could simply add this up front to the command arguments to avoid the session timeout. Thoughts? I can add a corrected patch if anyone confirms/agrees.
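
To make the proposal concrete, here is a rough sketch of placing the workaround flag at the front of the argument list. The function name with_checksum_workaround is made up for this example and the surrounding MAAS code is not reproduced. Only the `-W nochecksumcheck` flag itself comes from the comment above.

```python
def with_checksum_workaround(command):
    """Return a copy of argv with '-W nochecksumcheck' right after argv[0].

    The input list is not modified. If the flag is already present, an
    unchanged copy is returned.
    """
    if "nochecksumcheck" in command:
        return list(command)
    return [command[0], "-W", "nochecksumcheck", *command[1:]]


if __name__ == "__main__":
    base = [
        "ipmi-chassis-config", "-D", "LAN_2_0", "-h", "10.3.7.53",
        "--commit", "--filename", "ipmi.cfg",
    ]
    print(" ".join(with_checksum_workaround(base)))
```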

Andres Rodriguez (andreserl) wrote :

Hi Brian,

I think that using '-W nochecksumcheck' would be a good workaround. However, I'm wondering: in what cases would it not be safe to use?

Brian Morton (rokclimb15) wrote :

Hi Andres,

Since RMCP+ is a UDP-based protocol, and UDP has support for checksumming, I can't think of a problematic situation for the layer 7 checksum computation. I verified with tcpdump that UDP checksums are present and OK for the return traffic from iLO2. So unless a network was corrupting data just enough that basic UDP checksums wouldn't catch it and RMCP+ checksums would (perhaps a stronger algorithm), it's not a problem. That seems like a rare edge case for a LAN configuration intended for MAAS.
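
For comparison, the two checksum styles can be sketched side by side. This assumes that the check skipped by nochecksumcheck is the 8-bit two's-complement checksum carried inside IPMI messages, and it computes the UDP-style checksum over the payload only (a real UDP checksum also covers a pseudo-header). The function names are illustrative.

```python
def ipmi_checksum(data: bytes) -> int:
    """8-bit two's-complement checksum: data plus checksum sums to 0 mod 256."""
    return (-sum(data)) & 0xFF


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum (RFC 1071), the style used by UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    header = bytes([0x20, 0x18])  # example two-byte message header
    print(hex(ipmi_checksum(header)))
    print(hex(internet_checksum(header)))
```

Under that assumption the message-level check has only 256 possible values against 65,536 for the 16-bit sum, so it would not be the stronger of the two. This says nothing about any RMCP+ session integrity algorithm that may be negotiated separately.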

Brian Morton (rokclimb15) wrote :

Hi Andres,

Since this issue has been marked resolved by the recent commit, should I file a new issue for my suggested improvement above? The commit made does resolve this issue, but it just skips the PXE boot configuration after a timeout, whereas my suggested fix actually sets PXE boot and doesn't have to wait for a timeout to occur.

Leonardo Borda (lborda) wrote :

Which version of 2.1 was this released in?

Andres Rodriguez (andreserl) wrote :

@Leo,

This is in 2.1.2+. See the "Milestone" column.
