AMT NUC stuck at boot prompt instead of powering down (no ACPI support in syslinux poweroff)

Bug #1376716 reported by Raphaël Badin on 2014-10-02
This bug affects 2 people
Affects    Importance    Assigned to
MAAS       High          Raphaël Badin
1.7        High          Raphaël Badin

Bug Description

A 'READY' NUC got powered up by mistake; as expected it got a "poweroff" PXE config (i.e. in this state a node shouldn't be powered up, and thus MAAS asks the node to power off).

The problem is that instead of being powered off, the node got stuck with this on the console:

TFTP prefix:
Trying to load: pxelinux.cfg/01-2c-59-e5-55-ff-90 ok
APM not present.
boot:

Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
status: Triaged → In Progress
Blake Rouse (blake-rouse) wrote :

After looking into this, the poweroff.com module that is used with pxelinux does not support powering off machines without APM. I looked into booting a full Ubuntu to power off the node, and it's just too much work to get complete in time for 1.7. We need to look into coming up with a fix for this, or just leave it up to the power management of MAAS to turn the node off once it notices that it's on and should be off.

Changed in maas:
status: In Progress → Confirmed
status: Confirmed → Triaged
assignee: Blake Rouse (blake-rouse) → nobody
Raphaël Badin (rvb) wrote :

This wouldn't be so bad if this could only result from powering up the node by mistake in a state where it should not be powered on. It's unfortunately much worse than that:
a) I've seen my NUCs reboot instead of being powered off at the end of the commissioning cycle. The nodes were then ready and I hit this bug.
b) once the node is stuck with the "APM not present" message it's not responding to "amttool poweroff"; you need to either manually power off the node or use 'amttool reset' before using 'amttool poweroff'. This means that getting out of this situation is not trivial.

Changed in maas:
importance: High → Low
tags: added: power
Christian Reis (kiko) wrote :

In a way I am incredibly happy that we've finally found the root cause of this problem, which has manifested itself with the NUCs millions of times in the field.

Changed in maas:
milestone: none → next
Christian Reis (kiko) on 2014-10-08
Changed in maas:
importance: Low → High
Christian Reis (kiko) wrote :

I am pretty sure the problem is that the NUCs don't have APM. Here are a few relevant facts:

There's a new poweroff module included in Syslinux 5, which seems to still be APM-only but worth testing on the NUCs:
   http://www.syslinux.org/wiki/index.php/Syslinux_5_Changelog

The submission of that change triggered a discussion about ACPI support:
  http://www.zytor.com/pipermail/syslinux/2013-February/019524.html

Turns out someone did provide an ACPI-based shutdown module:
  http://www.syslinux.org/archives/2012-March/017658.html
  http://www.syslinux.org/archives/2012-March/017661.html

There is also this very simple NASM implementation of a COM module that was reportedly submitted to FreeDOS:
  http://h30499.www3.hp.com/t5/Business-PCs-Compaq-Elite-Pro/FreeDOS-FDAPM-PowerOFF-Workaround-on-HP-8000-Elite/td-p/1140783#.VDVxlq0sfOs

It's somewhat annoying that there's not a module that will just DTRT and use APM or ACPI depending on what's there, but it should be doable.

There is of course the option of stopping the use of the poweroff module entirely, and using an ephemeral image that shuts down correctly regardless of whether we're on an ACPI or APM host.

Julian Edwards (julian-edwards) wrote :

I don't think this should be High. We need to address the underlying problem, because a machine booting to the "poweroff" PXE profile is an *error* and should never be happening in the first place. It was only added as a convenience to paper over cracks and to stop stupid people from shooting themselves in the foot.

On Wednesday 08 Oct 2014 17:42:04 you wrote:
> There is of course the option of stopping the use of the poweroff module
> entirely, and using an ephemeral image that shuts down correctly
> regardless of whether we're on an ACPI or APM host.

+1

Julian Edwards (julian-edwards) wrote :

On Wednesday 08 Oct 2014 17:42:04 you wrote:
> There is of course the option of stopping the use of the poweroff module
> entirely, and using an ephemeral image that shuts down correctly
> regardless of whether we're on an ACPI or APM host.

In addition, it can log to the event log and say something like "why am I
booting when I should not be!"

> I don't think this should be High.

I think this should be high because it doesn't only happen when a user messes up; like I said in a previous comment, sometimes the NUCs reboot instead of powering off at the end of the commissioning step. I've seen this happen quite often.

> There is of course the option of stopping the use of the poweroff module
> entirely, and using an ephemeral image that shuts down correctly
> regardless of whether we're on an ACPI or APM host.

Agreed. It will take a bit longer than using poweroff.com on platforms that support APM, because it means booting into the OS, but the guarantee that it will work for all supported nodes is a huge plus.

Christian Reis (kiko) wrote :

It's absolutely high given the amount of pain it causes people with NUCs in the field, including our own SE team -- the node needs to be manually power cycled at that point.

I agree that the ephemeral image is the most likely fix, but I'm checking to see if a contractor is available for making a comprehensive poweroff.com loader upstream.

Christian Reis (kiko) wrote :

I've just got a MAAS 1.7 deployment going on an OB, and I'm getting stuck on boot prompts a lot as I get the nodes set up with their power parameters. There are apparently quite a few corner cases which will cause the node to try and boot up when it's supposed to shut down, and until we nail this it'll continue to affect NUC users.

As a workaround:

  - Using xvnc4server or remmina, at the boot prompt you can type "chain"
  - Disconnect
  - Wait a few seconds and then amttool powerdown

It works.

summary: - NUC stuck at boot prompt instead of being powered down
+ AMT NUC stuck at boot prompt instead of powering down (no ACPI support
+ in syslinux poweroff)
Christian Reis (kiko) wrote :

I managed to build a version of acpioff available from the branch at https://github.com/awalls-cx18/syslinux/tree/acpi_off/com32/acpioff

Guess what? It works. Just install it as acpioff.c32 in your boot-resources directory and when you get stuck on the boot: prompt, just run it and it turns the machine off.

I am attaching here the built binary (syslinux is GPLv2, which AIUI implies his publicly posted code would be as well).

Christian Reis (kiko) wrote :

To have this used by default on your installation, you can hack it in:

cd
wget https://launchpadlibrarian.net/187530745/acpioff.c32
cd /var/lib/maas/boot-resources/current/syslinux
sudo mv poweroff.com poweroff-APM.com
sudo mv ~/acpioff.c32 poweroff.c32

Enjoy!

Dustin Kirkland  (kirkland) wrote :

Sure, I can hack that into the orange-box.deb, even, @kiko.

Aha! I worked out a way to do this using AMT:

ubuntu@maas:~$ yes|AMT_PASSWORD=Password1+ amttool 10.0.0.150 powerdown
host ., powerdown [y/N] ? execute: powerdown
result: pt_status: not permitted
ubuntu@maas:~$ yes|AMT_PASSWORD=Password1+ amttool 10.0.0.150 reset cd
host ., reset [y/N] ? execute: reset
result: pt_status: success
ubuntu@maas:~$ yes|AMT_PASSWORD=Password1+ amttool 10.0.0.150 powerdown
host ., powerdown [y/N] ? execute: powerdown
result: pt_status: success
ubuntu@maas:~$ AMT_PASSWORD=Password1+ amttool 10.0.0.150 info
### AMT info on machine '10.0.0.150' ###
AMT version: 8.1.30
Hostname: .
Powerstate: S5 (soft-off)
Remote Control Capabilities:
    IanaOemNumber 157
    OemDefinedCapabilities IDER SOL BiosSetup
    SpecialCommandsSupported PXE-boot HD-boot cd-boot
    SystemCapabilitiesSupported powercycle powerdown powerup reset
    SystemFirmwareCapabilities 7821

I'll throw this into the template as a special override if we get the "pt_status: not_permitted" response if everyone agrees?

Specifically, you seem to need to reset to "cd" mode before it'll accept the powerdown command next time. Any other mode still results in the not_permitted response.
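
The sequence in the transcript above (powerdown refused, reset to "cd", powerdown accepted) could be scripted roughly as follows. This is only a minimal sketch, not the MAAS power template: it assumes amttool is on the PATH, that it reads its password from AMT_PASSWORD as in the transcript, and that a refusal shows up as the text "not permitted" in its output. The function names and the injectable `run` parameter are hypothetical.

```python
import os
import subprocess


def run_amttool(host, *args, password=None):
    """Run amttool against host, answering 'y' to its confirmation prompt."""
    env = dict(os.environ)
    if password is not None:
        env["AMT_PASSWORD"] = password
    proc = subprocess.run(
        ["amttool", host, *args],
        input="y\n",
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        env=env,
    )
    return proc.stdout


def amt_powerdown(host, run=run_amttool):
    """Power a node down, retrying once after 'reset cd' if AMT refuses."""
    output = run(host, "powerdown")
    if "not permitted" not in output:
        return output
    # Assumption taken from the transcript: a reset into "cd" mode makes
    # AMT accept the next powerdown request.
    run(host, "reset", "cd")
    return run(host, "powerdown")
```

Passing a fake `run` callable makes the retry logic checkable without real hardware. The sketch does not cover the case where a refusal has some other cause.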

As luck would have it (or not in my case) I cannot make this work today. I also cannot find anywhere why AMT refuses to let you power down sometimes. At the moment I have one at the grub prompt waiting for an OS selection, and AMT still refuses to power down. It's pretty Mickey Mouse stuff. :/

OK here we go: https://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide/default.htm?turl=WordDocuments%2Fconfiguringintelamttogeneratepostures.htm

Basically we need to configure SOL and IDER to be off before the powerdown is going to be reliable.

And I just worked out that my powerdown commands were failing because I had an active VNC session. Once I killed VNC it all worked OK.

Christian Reis (kiko) wrote :

I just realized that ARM systems will never have APM, and ACPI is not going to be here before 16.04, so the right solution is for us to move away from using the PXELINUX poweroff file and use instead the ephemeral image to shut down.

On Wednesday 29 Oct 2014 18:45:10 you wrote:
> I just realized that ARM systems will never have APM, and ACPI is not
> going to be here before 16.04, so the right solution is for us to move
> away from using the PXELINUX poweroff file and use instead the ephemeral
> image to shut down.

I don't think we need to even do that. If MAAS knows that the machine needs
to be powered off when we get a rogue boot, let's just issue a power job to
turn it off.

Christian Reis (kiko) on 2014-10-30
Changed in maas:
milestone: next → 1.7.1
Christian Reis (kiko) wrote :

Incidentally, this fails on the Atom-based Supermicro X7-SPA line, which is also ACPI-only.

Christian Reis (kiko) wrote :

I don't think Julian's suggestion works very well because it will race with the machine startup; the ephemeral poweroff image is likely the safest approach.

On Wednesday 05 November 2014 17:23:54 you wrote:
> I don't think Julian's suggestion works very well because it will race
> with the machine startup; the ephemeral poweroff image is likely the
> safest approach.

That's not true.

The poweroff would get issued as soon as the first PXE request is made, and if
we don't send any PXE files back it won't boot at all (since we force all
boots, even local, to go via PXE for this exact reason).
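
The idea in the reply above can be sketched as follows. This is purely illustrative and not MAAS code: `handle_pxe_request`, the `SHOULD_BE_OFF` set, and the two callables are hypothetical names standing in for whatever the real PXE and power layers provide.

```python
# Hypothetical set of states in which a node is expected to be powered off.
SHOULD_BE_OFF = {"READY"}


def handle_pxe_request(node_status, queue_power_off, render_config):
    """Return a PXE config, or None to withhold it from a rogue boot.

    queue_power_off and render_config are caller-supplied callables; both
    are placeholders for whatever the real system would use.
    """
    if node_status in SHOULD_BE_OFF:
        # Rogue boot: ask the power driver to turn the node off and send
        # nothing back, so the node has nothing to boot.
        queue_power_off()
        return None
    return render_config()
```

Whether the power job reliably wins against the machine's startup, which is the concern raised earlier in the thread, is not something this sketch settles.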

Mark W Wenning (mwenning) wrote :

I am seeing the "APM not present" error when _enlisting_ a node.

This is on a MAAS hw cluster - all the servers use IPMI rather than AMT. (Dell Poweredge M610, M710, M915 blades)

This seemed to start after I updated to maas1.7.0+bzr3299-0ubuntu1~trusty .

Could this be the same bug? Any workarounds?

I've also tried a couple of rack mount machines (r430). One thing I noticed is that if I pull the plug and then plug it back in, the machine will enlist but then fail in the same way on commission; i.e. completely powering off the machine causes it to work correctly once.

Christian Reis (kiko) wrote :

IPMI vs. AMT are actually not the issue here; the issue is that the NUCs, like other systems, are ACPI-only.

Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Mark W Wenning (mwenning) wrote :

Looks like I have several problems here depending on what systems I'm trying to bring up.
I applied the workaround in #12 above and it looks like the "older" systems (M610, M710, M915) are working now (at least most of the time).

r430 and t430 are still having problems which may be firmware related. We are updating the firmware and will post the results here.

kiko, thanks for your help last night!

Raphaël Badin (rvb) on 2014-12-04
Changed in maas:
assignee: Blake Rouse (blake-rouse) → nobody
assignee: nobody → Raphaël Badin (rvb)
Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released

Hello Raphaël, or anyone else affected,

Accepted maas into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/maas/1.7.5+bzr3369-0ubuntu1~14.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Andres Rodriguez (andreserl) wrote :

This issue has been verified to work both on upgrade and fresh install, and has been QA'd. Marking verification-done.

tags: added: verification-done
removed: verification-needed