IPMI power template performs very minimal error checking which can lead to silent failures

Bug #1454810 reported by Jason Hobbs
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
MAAS | Fix Released | High | Raphaël Badin |

Bug Description

When powering on, the IPMI power template performs two steps:

1) Sets the boot device to PXE
2) Issues the power on/power cycle command

Step 1 has only minimal error checking: the script treats it as a failure only if "password invalid" appears in the response, but many other error messages are possible. For every other error, the template continues on to step 2.

This can cause a system to boot straight to disk instead of booting from PXE, which can lead to failed commissioning and deployments.

I've seen behavior that matches this problem on a few nodes in OIL - failed deployments due to booting from disk instead of PXE.

The same problem applies to step 2 - the only error caught is "invalid password".

It should be easy enough to fix this - just check $? after issuing the command to PXE boot and if it's not 0 then fail.
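
The fix suggested above could look roughly like the following shell sketch. It is an illustration only, not the actual MAAS template: `run_checked`, `set_pxe_boot`, `issue_power_on`, and `power_on` are hypothetical names, and the two step functions are stand-ins for whatever IPMI commands the template really runs.

```shell
#!/bin/sh
# Sketch only: run_checked, set_pxe_boot, issue_power_on and power_on are
# hypothetical names, not part of the real MAAS IPMI power template.

# Run a command; if its exit status is non-zero, report it and pass it on.
run_checked() {
    step="$1"
    shift
    status=0
    output=$("$@" 2>&1) || status=$?
    if [ "$status" -ne 0 ]; then
        echo "IPMI power: ${step} failed (exit ${status}): ${output}" >&2
        return "$status"
    fi
}

# Stand-ins for the real commands; replace with the template's IPMI calls.
set_pxe_boot() { true; }
issue_power_on() { true; }

# Stop at the first failing step instead of continuing on to power on.
power_on() {
    run_checked "set boot device to PXE" set_pxe_boot || return 1
    run_checked "power on" issue_power_on || return 1
}
```

With this shape, a failure in step 1 prevents step 2 from running, and any non-zero exit status is treated as an error rather than only a specific password message.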

Tags: oil

summary: - IPMI power template can silently fail to enable PXE boot
+ IPMI power template can silently fail
description: updated
Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 1.8.0
description: updated
summary: - IPMI power template can silently fail
+ IPMI power template performs very minimal error checking which can lead
+ to silent failures
Changed in maas:
assignee: nobody → ubuntudotcom1 (ubuntudotcom1)
Andres Rodriguez (andreserl) wrote :

So the complexity I see with this is that even if the script fails to set PXE for booting, that doesn't necessarily mean that the machine will not PXE boot. For example, the BMC may not allow the user to set the boot device to PXE, yet the machine may always PXE boot by default.

If we error out when it fails to tell the machine to PXE, but the machine PXE boots by default, then we would be marking the machine as failed when it shouldn't be.

That being said, the description above is not entirely accurate. When using IPMI, MAAS does not *only* tell the machine to PXE boot; it also applies the following configuration:

Section Chassis_Power_Conf
        Power_Restore_Policy Off_State_AC_Apply
EndSection
Section Chassis_Boot_Flags
        Boot_Flags_Persistent No
        Boot_Device PXE

The real issue here was that the command would fail to commit the above config because of the Chassis_Power_Conf section, which meant the Chassis_Boot_Flags section was not committed either. As a result, the Boot_Device setting was never applied, and machines would boot from disk (unless they had been manually set to PXE).

However, in the latest iteration we have removed the Chassis_Power_Conf section so that it no longer causes this failure:

https://code.launchpad.net/~andreserl/maas/disable_chassis_power_conf

@Jason,

Can you please confirm that you are still experiencing this issue? Thanks.

Changed in maas:
status: Triaged → Incomplete
Andres Rodriguez (andreserl) wrote :

Marking incomplete until Jason confirms.

Jason Hobbs (jason-hobbs) wrote :

I don't know for sure if I've ever seen it - I've only seen behavior that could possibly match it.

Have we actually seen BMCs that PXE boot but return an error when we ask them to PXE boot? If not, it seems like a poor excuse for not doing error checking.

Also, the same issue applies with the power on command itself - it's not error checked except for "incorrect password".

Changed in maas:
status: Incomplete → New
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1454810] Re: IPMI power template performs very minimal error checking which can lead to silent failures

I think the right approach here would be to add error checking, then if we
hit BMCs that cause ipmitool to raise an error when there really isn't one,
look at why and see if we can add changes to deal with them in particular.
I can say none of the IPMI BMCs I've tested raise errors when configuring
them to PXE boot.
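
One way to picture this approach is the shell sketch below: every failure is fatal by default, and an exception is added only after a specific BMC has been investigated. The function name `run_tolerant` and the `BENIGN_PATTERNS` variable are hypothetical, and no real BMC error strings are assumed here.

```shell
#!/bin/sh
# Sketch only: run_tolerant and BENIGN_PATTERNS are hypothetical names.

# Whitespace-separated, single-word patterns for errors known to be harmless
# on specific BMCs. Empty by default, so every failure is fatal until a real
# case has been investigated and added here.
BENIGN_PATTERNS=""

# Return 0 if the command succeeded, or if it failed with output that
# contains one of the known-benign patterns; otherwise return its status.
run_tolerant() {
    status=0
    output=$("$@" 2>&1) || status=$?
    [ "$status" -eq 0 ] && return 0
    for pattern in $BENIGN_PATTERNS; do
        case "$output" in
            *"$pattern"*)
                echo "warning: ignoring known BMC quirk: ${output}" >&2
                return 0
                ;;
        esac
    done
    echo "error: command failed (exit ${status}): ${output}" >&2
    return "$status"
}
```

Keeping the exception list empty by default means a tolerated error is always a deliberate, documented decision.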

ubuntudotcom1 (ubuntudotcom1) wrote :

Yeah Jason, you're right. There isn't much error checking in there at all right now. To improve the observability and reliability of MAAS, we should add more checks that determine what went wrong in a case like this, and give the user a meaningful, human-readable message describing the failure and the current state of the system. That way a user or operator can decide whether a particular node should be marked as broken and investigate the failure further. I think it would be fairly quick to add more error checking here.
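
As a rough illustration of such a message, the shell sketch below builds a one-line report that names the failed step, the exit status, the error detail, and the current power state. `describe_failure` and `query_power_state` are hypothetical names, and `query_power_state` is a stand-in for a real power-state query against the BMC.

```shell
#!/bin/sh
# Sketch only: describe_failure and query_power_state are hypothetical names.

# Stand-in for a real power-state query against the BMC.
query_power_state() { echo "off"; }

# Build a one-line, human-readable failure report for the operator.
# Arguments: step name, exit status, error detail.
describe_failure() {
    step="$1"
    status="$2"
    detail="$3"
    # Fall back to "unknown" if the state query itself fails.
    state=$(query_power_state 2>/dev/null) || state="unknown"
    echo "IPMI ${step} failed (exit ${status}): ${detail}. Current power state: ${state}."
}
```

Reporting the state alongside the error lets an operator tell a node that never powered on apart from one that powered on with the wrong boot device.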

Raphaël Badin (rvb)
Changed in maas:
assignee: ubuntudotcom1 (ubuntudotcom1) → nobody
Changed in maas:
importance: Critical → High
status: New → Triaged
Raphaël Badin (rvb)
Changed in maas:
assignee: nobody → Raphaël Badin (rvb)
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Jason Hobbs (jason-hobbs) wrote :

The branch that landed doesn't address the issue completely. It does not add error checking around the ipmi_chassis_config operation.

Changed in maas:
status: Fix Committed → Triaged
Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released