1.8b1 Failed deployment/release timeout

Bug #1442059 reported by Adam Collard
This bug affects 4 people
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Raphaël Badin

Bug Description

Every so often MAAS incorrectly marks a node as failing to deploy (or to release).

The node event log claims that it failed to power on (via AMT), even though the PXE curtin request was received and recorded.

Log from a failed deployment:

PXE Request - power off
Failed to power on node - Timed out Thu, 09 Apr. 2015 04:53:33
Node changed status - From 'Deploying' to 'Failed deployment' Thu, 09 Apr. 2015 04:53:33
Installation complete - Node disabled netboot Thu, 09 Apr. 2015 04:53:28
PXE Request - curtin install Thu, 09 Apr. 2015 04:52:00
PXE Request - curtin install Thu, 09 Apr. 2015 04:51:58
Powering node on Thu, 09 Apr. 2015 04:51:33

Log from a successful deployment:
Node changed status - From 'Deploying' to 'Deployed' Wed, 08 Apr. 2015 13:20:54
Installation complete - Node disabled netboot Wed, 08 Apr. 2015 13:20:19
PXE Request - curtin install Wed, 08 Apr. 2015 13:18:46
Node powered on Wed, 08 Apr. 2015 13:18:43
Powering node on Wed, 08 Apr. 2015 13:18:20

Log from a failed release:
 Node powered off Thu, 09 Apr. 2015 09:21:03
 Failed to power off node - Timed out Thu, 09 Apr. 2015 09:20:41
 Node changed status - From 'Releasing' to 'Releasing failed' Thu, 09 Apr. 2015 09:19:40
 Powering node off Thu, 09 Apr. 2015 09:18:41

Shouldn't a node that issues a PXE request, completes installation, etc., be marked as powered on?
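The idea in the question above can be sketched as a check over the node event log: if a recent event (a PXE request, a completed installation) could only have come from a running machine, the node is evidently powered on even if the AMT power query timed out. This is a minimal illustrative sketch with hypothetical names, not MAAS's actual state machine:

```python
from datetime import datetime, timedelta

# Event types that could only be emitted by a machine that actually booted.
# Hypothetical set for illustration; real MAAS event types differ.
POWER_ON_EVIDENCE = {"PXE Request", "Installation complete"}

def node_seems_powered_on(events, now, window=timedelta(minutes=5)):
    """Return True if any recent event implies the node is running.

    `events` is a list of (timestamp, event_type) tuples, newest first.
    """
    for ts, event_type in events:
        if now - ts > window:
            break  # events are newest first; everything older is stale too
        if event_type in POWER_ON_EVIDENCE:
            return True
    return False
```

A power monitor could consult such a check before marking a node "Failed deployment" on a power-on timeout, instead of trusting only the BMC query.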


tags: added: landscape power
Revision history for this message
Adam Collard (adam-collard) wrote :

Maybe the retries are succeeding but MAAS doesn't notice? It's odd.

Changed in maas:
importance: Undecided → Critical
status: New → Triaged
milestone: none → 1.8.0
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Adam, can you attach logs please? /var/log/maas/*.log

Revision history for this message
Jeff Lane  (bladernr) wrote :

I have only seen this happen locally one time. However, early on I did have a devil of a time getting successful deployments, but I can't really remember if this is exactly what I saw. What it boiled down to for me was boot order.

My hardware has three ethernet devices, 2 are dedicated data, and one is shared data/AMT.

When I was PXE booting from the AMT port, I ran into all manner of issues that I think are because the AMT port appears to the kernel as eth2, not eth0 (or something relating to how the devices are addressed).

In any case, the resolution to all my problems was to reconfigure the boot order in EFI/BIOS so that the AMT port was not a valid PXE boot option, leaving only the two dedicated ports for PXE.

I believe it would also work if I just moved the shared AMT port to a lower position than the dedicated ports in boot sequence, but for my use, just removing it was sufficient.

Revision history for this message
Jeff Lane  (bladernr) wrote :

I am, however, now running into issues with AMT becoming unresponsive after sitting idle for a number of hours. I discovered that by deploying the system, doing some work, and letting it sit, sometimes overnight, AMT would eventually stop responding to power state queries. Sometimes AMT comes back on its own, sometimes I have to unplug and replug the AMT ethernet cable, and once or twice (early on) I had to hard power cycle the system (unplug/replug the power cord).

I initially thought something related to the MAAS AMT power template work was broken, but as I discovered, it's actually because AMT is flaky and unreliable.
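Given a BMC this flaky, one mitigation is to retry the power query with backoff before declaring the node unreachable. A sketch under that assumption; `query_power_state` is a placeholder for whatever actually talks to AMT, not a real MAAS function:

```python
import time

def query_with_retries(query_power_state, attempts=4, base_delay=1.0,
                       sleep=time.sleep):
    """Retry a flaky power-state query with exponential backoff.

    Re-raises the last TimeoutError if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return query_power_state()
        except TimeoutError as err:
            last_error = err
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_error
```

This wouldn't fix AMT itself, but it would smooth over the transient unresponsiveness described above.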

Revision history for this message
Adam Collard (adam-collard) wrote :

Note that similar issues appear when powering off nodes while releasing them:

 Node powered off Thu, 09 Apr. 2015 09:21:03
 Failed to power off node - Timed out Thu, 09 Apr. 2015 09:20:41
 Node changed status - From 'Releasing' to 'Releasing failed' Thu, 09 Apr. 2015 09:19:40
 Powering node off Thu, 09 Apr. 2015 09:18:41

Revision history for this message
Adam Collard (adam-collard) wrote :

Truncated logs showing failed deployment of born.beretstack

Revision history for this message
Raphaël Badin (rvb) wrote :

Now, what I see in the logs you posted in comment #9 is different: powering up the machine worked as expected ([1]), but the installation failed, so MAAS marked the machine "Failed Deployment" after 40 minutes ([2]).

[1]
Apr 17 06:37:14 virtue maas.power: [INFO] Changing power state (on) of node: born.local (node-d6072d54-4b4c-11e4-ad24-a0b3cce4ecca)
Apr 17 06:37:37 virtue maas.power: [INFO] Changed power state (on) of node: born.local (node-d6072d54-4b4c-11e4-ad24-a0b3cce4ecca)

[2]
Apr 17 07:17:14 virtue maas.node: [ERROR] born.local: Marking node failed: Node operation 'Deploying' timed out after 0:40:00.
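The behaviour in [2] amounts to a simple deadline check on the in-progress operation. A sketch that reproduces the log message format; the function name and signature are hypothetical, not MAAS's internal code:

```python
from datetime import datetime, timedelta

def check_deadline(hostname, operation, started_at, now,
                   timeout=timedelta(minutes=40)):
    """Return the failure message if `operation` has exceeded `timeout`,
    else None. Mirrors the log line quoted above."""
    if now - started_at >= timeout:
        return ("%s: Marking node failed: Node operation '%s' timed out "
                "after %s." % (hostname, operation, timeout))
    return None
```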

summary: - 1.8b1 Failed deployment timeout powering on AMT
+ 1.8b1 Failed deployment/release timeout
description: updated
Raphaël Badin (rvb)
Changed in maas:
assignee: nobody → Raphaël Badin (rvb)
status: Triaged → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Joseph Motacek (cleanshooter) wrote :

Rather than increasing the timeouts, might it be a good idea to add a setting that lets administrators manually increase them if their systems require more time? I too am experiencing this issue in my setup, and I have 1.8 installed already... the problem is I am using some really slow HP G5 servers which take well over 2 minutes to get past POST and then wait for PXE to actually start scanning for a DHCP server... I've noticed that in the UI I'll get 'Commissioning Failed' about a minute before commissioning is finished on the server and it shuts down.
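The suggestion above (admin-configurable timeouts with sane defaults) could be sketched as a settings lookup; the setting keys and defaults here are invented for illustration, not actual MAAS configuration names:

```python
from datetime import timedelta

# Hypothetical defaults, per operation, used when no override is configured.
DEFAULT_TIMEOUTS = {
    "commissioning": timedelta(minutes=20),
    "deploying": timedelta(minutes=40),
}

def get_operation_timeout(operation, settings):
    """Look up an admin-configured timeout (in minutes) for `operation`,
    falling back to the default when the setting is absent."""
    override = settings.get("%s-timeout-minutes" % operation)
    if override is not None:
        return timedelta(minutes=int(override))
    return DEFAULT_TIMEOUTS[operation]
```

Slow hardware like the HP G5s described above could then simply be accommodated by raising the per-operation value.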
