MAAS deployment failures on server with Redfish

Bug #2004661 reported by Rod Smith
Affects     Status       Importance   Assigned to   Milestone
MAAS        Incomplete   High         Unassigned
MAAS 3.3    Triaged      High         Unassigned
MAAS 3.4    Triaged      High         Unassigned

Bug Description

On an Ampere Altra-based server (flavio) in the Server Certification lab, deployments are failing when the node is configured to use Redfish. The following appears in the node's deployment log in the web UI:

 Fri, 03 Feb. 2023 15:19:33 Failed to power on node - Power on for the node failed: Failed to complete power action: Redfish request failed with response status code: 400.
 Fri, 03 Feb. 2023 15:19:33 Node changed status - From 'Deploying' to 'Failed deployment'
 Fri, 03 Feb. 2023 15:19:33 Marking node failed - Power on for the node failed: Failed to complete power action: Redfish request failed with response status code: 400.
 Fri, 03 Feb. 2023 15:18:28 Powering on
 Fri, 03 Feb. 2023 15:18:27 Node - Started deploying 'flavio'.
 Fri, 03 Feb. 2023 15:18:27 Deploying

I'm attaching additional MAAS log files to this bug report.

The deployment fails after the node has powered on but while it's still in POST; it doesn't even get to the point where it PXE-boots for the first time.

Two other nodes on our network also use Redfish and appear to be unaffected. The affected server has an ARM64 CPU and is running the latest firmware. I'm 95% certain that it worked when it was first installed a few months ago. We first noticed this problem on 17 January, 2023. Our current MAAS version is 3.2.6-12016-g.19812b4da, installed via snap.

As a workaround, we can set the server to use IPMI rather than Redfish.
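
For reference, a minimal sketch of the standard Redfish power-on action involved here, issued directly with the requests library so the BMC's 400 error body can be read. The BMC address, credentials, and system path below are placeholders and vary by deployment and vendor:

import requests

BMC = "https://10.245.129.101"          # placeholder BMC address
AUTH = ("maas", "secret")               # placeholder credentials
SYSTEM = BMC + "/redfish/v1/Systems/1"  # system ID varies by vendor

# Power the system on via the standard Redfish ComputerSystem.Reset action.
resp = requests.post(
    SYSTEM + "/Actions/ComputerSystem.Reset",
    json={"ResetType": "On"},
    auth=AUTH,
    verify=False,   # many BMCs ship self-signed certificates
)
print(resp.status_code)
print(resp.text)    # a 400 body normally carries an ExtendedInfo message naming the problem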

Tags: power
Revision history for this message
Rod Smith (rodsmith) wrote :
Jeff Lane  (bladernr)
description: updated
Revision history for this message
Igor Brovtsin (igor-brovtsin) wrote :

For some reason, `set_pxe_boot` returned 400, even though we send valid static JSON for all machines. This looks like a vendor-specific issue, although further investigation is required.
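
For context, the static JSON in question is a boot-override PATCH along these lines (a rough sketch, not the exact MAAS code; endpoint and credentials are placeholders). Replaying it by hand should surface the BMC's 400 error body:

import requests

SYSTEM = "https://10.245.129.101/redfish/v1/Systems/1"   # placeholder endpoint
payload = {
    "Boot": {
        "BootSourceOverrideEnabled": "Once",
        "BootSourceOverrideTarget": "Pxe",
    }
}

# Some firmwares reject this otherwise-valid PATCH, e.g. if they also expect
# BootSourceOverrideMode or disallow patching the Boot object directly.
resp = requests.patch(SYSTEM, json=payload, auth=("maas", "secret"), verify=False)
print(resp.status_code, resp.text)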

Changed in maas:
status: New → Triaged
Revision history for this message
Adam Collard (adam-collard) wrote :

We should keep track of all the Redfish quirks in our tests to ensure coverage and minimise regressions. Is there a library for interacting with Redfish that abstracts away the details?

tags: added: power
Changed in maas:
importance: Undecided → High
milestone: none → 3.4.0
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Could we get a tcpdump of the communication between MAAS and the Redfish server that contains the error? Also, when that server used to work (with 95% likelihood), was it on the same MAAS version as when it started failing? The logs show hints of possible proxy involvement; has anything changed in that configuration recently?

Changed in maas:
status: Triaged → Incomplete
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x
Revision history for this message
Rod Smith (rodsmith) wrote :

Here's the tcpdump; the command I used was:

sudo tcpdump -i eno1 -n -w flavio-tcpdump.pcap host 10.245.129.101

Sorry for the delay.

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

All Redfish communication is encrypted, so unfortunately the pcap wasn't very insightful.

> Is there a library for interacting with Redfish that abstracts away from the details?

Yes, OpenStack has a Python module for this: https://opendev.org/openstack/sushy
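
For illustration, the same set-PXE-then-power-on flow through sushy would look roughly like this (endpoint, credentials, and system path are placeholders; this is not what MAAS currently does):

import sushy

root = sushy.Sushy("https://10.245.129.101/redfish/v1",
                   username="maas", password="secret", verify=False)
system = root.get_system("/redfish/v1/Systems/1")

# Request a one-time PXE boot, then power the machine on.
system.set_system_boot_source(sushy.BOOT_SOURCE_TARGET_PXE,
                              enabled=sushy.BOOT_SOURCE_ENABLED_ONCE)
system.reset_system(sushy.RESET_ON)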

Changed in maas:
status: Incomplete → Triaged
milestone: 3.4.x → 3.5.0
Changed in maas:
milestone: 3.5.0 → 3.5.x
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Could we get access to the server where this happens, or an unencrypted dump of the communication between MAAS and the Redfish endpoint?
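
If server access isn't possible, one way to get an unencrypted view is to replay the request from the MAAS host with HTTP wire logging enabled, so the decrypted request and response are printed client-side. A rough sketch, with placeholder endpoint and credentials:

import http.client
import logging

import requests

# Print the raw request/response headers that urllib3 exchanges under the hood.
http.client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)

resp = requests.patch(
    "https://10.245.129.101/redfish/v1/Systems/1",   # placeholder endpoint
    json={"Boot": {"BootSourceOverrideEnabled": "Once",
                   "BootSourceOverrideTarget": "Pxe"}},
    auth=("maas", "secret"),                          # placeholder credentials
    verify=False,
)
print(resp.status_code)
print(resp.text)   # the decrypted error body the pcap couldn't show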

Changed in maas:
status: Triaged → Incomplete