MAAS deployment failures on server with Redfish

Bug #2004661 reported by Rod Smith
This bug affects 6 people
Affects    Status    Importance    Assigned to    Milestone
MAAS       Triaged   High          Unassigned
3.3        Triaged   High          Unassigned
3.4        Triaged   High          Unassigned
3.5        Triaged   High          Unassigned

Bug Description

On an Ampere Altra-based server (flavio) in the Server Certification lab, deployments are failing when the node is configured to use Redfish. The following appears in the node's deployment log in the web UI:

 Fri, 03 Feb. 2023 15:19:33 Failed to power on node - Power on for the node failed: Failed to complete power action: Redfish request failed with response status code: 400.
 Fri, 03 Feb. 2023 15:19:33 Node changed status - From 'Deploying' to 'Failed deployment'
 Fri, 03 Feb. 2023 15:19:33 Marking node failed - Power on for the node failed: Failed to complete power action: Redfish request failed with response status code: 400.
 Fri, 03 Feb. 2023 15:18:28 Powering on
 Fri, 03 Feb. 2023 15:18:27 Node - Started deploying 'flavio'.
 Fri, 03 Feb. 2023 15:18:27 Deploying

I'm attaching additional MAAS log files to this bug report.

The deployment fails after the node has powered on but while it's still in POST; it doesn't even get to the point where it PXE-boots for the first time.

Two other nodes on our network also use Redfish and appear to be unaffected. The affected server has an ARM64 CPU and is running the latest firmware. I'm 95% certain that it worked when it was first installed a few months ago. We first noticed this problem on 17 January, 2023. Our current MAAS version is 3.2.6-12016-g.19812b4da, installed via snap.

As a workaround, we can set the server to use IPMI rather than Redfish.

Tags: power
Revision history for this message
Rod Smith (rodsmith) wrote :
Jeff Lane  (bladernr)
description: updated
Revision history for this message
Igor Brovtsin (igor-brovtsin) wrote :

For some reason, `set_pxe_boot` returned 400, even though we send valid static JSON for all machines. This looks like a vendor-specific issue, although further investigation is required.
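
For reference, the body being PATCHed is the standard Redfish boot-override payload, roughly the following (the exact construction in redfish.py may differ slightly):

{"Boot": {"BootSourceOverrideEnabled": "Once", "BootSourceOverrideTarget": "Pxe"}}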

Changed in maas:
status: New → Triaged
Revision history for this message
Adam Collard (adam-collard) wrote :

We should keep track of all the quirks of Redfish in our tests to ensure coverage and minimise regressions. Is there a library for interacting with Redfish that abstracts away the details?

tags: added: power
Changed in maas:
importance: Undecided → High
milestone: none → 3.4.0
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Could we get a tcpdump of the communication between MAAS and the Redfish server that contains the error? Also, when that server used to work (with 95% likelihood), was it on the same MAAS version as when it started failing? The logs show hints of possible proxy involvement - has anything changed in that configuration recently?

Changed in maas:
status: Triaged → Incomplete
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x
Revision history for this message
Rod Smith (rodsmith) wrote :

Here's the tcpdump; the command I used was:

sudo tcpdump -i eno1 -n -w flavio-tcpdump.pcap host 10.245.129.101

Sorry for the delay.

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

All Redfish communication is encrypted, so unfortunately the pcap wasn't very insightful.

> Is there a library for interacting with Redfish that abstracts away from the details?

Yes, OpenStack has a Python module for this: https://opendev.org/openstack/sushy
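
For illustration, driving the same sequence through sushy would look roughly like this (a sketch only: constants and method names vary between sushy releases, and the BMC address, credentials and system path below are placeholders):

import sushy

root = sushy.Sushy("https://10.245.129.101/redfish/v1",
                   username="maas", password="secret", verify=False)
system = root.get_system("/redfish/v1/Systems/1")  # system path differs per vendor
print(system.power_state)

# Ask for PXE on the next boot only, then power the machine on.
system.set_system_boot_source(sushy.BOOT_SOURCE_TARGET_PXE,
                              enabled=sushy.BOOT_SOURCE_ENABLED_ONCE)
system.reset_system(sushy.RESET_ON)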

Changed in maas:
status: Incomplete → Triaged
milestone: 3.4.x → 3.5.0
Changed in maas:
milestone: 3.5.0 → 3.5.x
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Could we get access to the server where this happens, or get an unencrypted dump of the communication between MAAS and the Redfish endpoint?

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Igor Gnip (igorgnip) wrote :

I can confirm seeing the same issue with HP DL360 Gen11 (amd64 CPU) servers.

Looking at what happens during the on/off/reboot process:

- Power off, then polling for power status: the query returned 'On' several times and finally 'Off' (which I consider normal).
- Power on, then polling for power status: the query first returned 'Off' several times, then an intermediate "Reset" power state for a couple of seconds, then 'On'. But because the 'set PXE boot' option had been called, the server rebooted a couple of seconds later (watching the console) and went back to "Reset" for another few seconds before settling on 'On'. I have not yet seen any status code 400 errors from my reproducer script.

My gut feeling is that in a specific power state (possibly while transitioning from on to off or from off to on) the set-PXE-boot request can be refused. I see the same thing from the iLO UI: during server POST it is not possible to change the boot order or set a Boot Once entry.

Could it be that MAAS needs to make sure the server is fully off or on before setting the PXE boot?

Disclaimer: I am not the original poster of this issue, and my intention is not to hijack the thread, just to provide more details. I will also work on reproducing this by deliberately calling set PXE boot during the transitions from on to off and from off to on (a stripped-down sketch follows below).
I may also be able to provide the output of the communication that Jerzy requested, since I can print it.
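
Something along these lines (hypothetical BMC address, credentials and system path; the PATCH body is the standard Redfish boot-override payload):

import time
import requests

BMC = "https://10.245.129.101"            # placeholder BMC address
SYSTEM = BMC + "/redfish/v1/Systems/1"    # system path differs per vendor
AUTH = ("maas", "secret")                 # placeholder credentials
PXE_ONCE = {"Boot": {"BootSourceOverrideEnabled": "Once",
                     "BootSourceOverrideTarget": "Pxe"}}

# Poll the power state while repeatedly attempting the PXE boot override,
# to see which states make the BMC answer 400.
for _ in range(60):
    state = requests.get(SYSTEM, auth=AUTH, verify=False).json().get("PowerState")
    resp = requests.patch(SYSTEM, json=PXE_ONCE, auth=AUTH, verify=False)
    print("PowerState=%s -> boot override PATCH returned %s"
          % (state, resp.status_code))
    time.sleep(2)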

Revision history for this message
Alan Baghumian (alanbach) wrote :

@Rod Was your Ampere server also made by HP, by any chance?

Revision history for this message
Anton Troyanov (troyanov) wrote :

I wonder if doing the same steps as MAAS does, but with Redfishtool [0], would lead to the same result.

[0]: https://github.com/DMTF/Redfishtool
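
For example, from memory (the exact subcommand syntax may differ between Redfishtool versions, and the host and credentials below are placeholders), something along these lines:

redfishtool -r 10.245.129.101 -u maas -p secret Systems setBootOverride Once Pxe
redfishtool -r 10.245.129.101 -u maas -p secret Systems reset On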

Revision history for this message
Rod Smith (rodsmith) wrote :

@Alan, yes, flavio (the system on which we found this) is an HPE ProLiant RL300 Gen11.

@Jerzy, sorry for the delay, and yes, you can access it. You should be able to deploy it through Testflinger using the "flavio" queue name. If you need access to the MAAS server, that can be arranged, too.

tags: added: bug-council
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Let's have a look at the failing server in the Cert lab and make MAAS's Redfish handling more robust against deviating vendor implementations.

Changed in maas:
milestone: 3.5.x → 3.6.0
status: Incomplete → Triaged
tags: removed: bug-council
Revision history for this message
Igor Gnip (igorgnip) wrote :

Hello,

I modified part of the redfish.py driver in our lab to include some logging and to wait in an infinite loop for the power state to change. This makes the Redfish driver work much better, but I am sure the implementation is not ideal given the blocking nature of the infinite loop (what if the status never changes, etc.). Please take it as a POC of what could eventually lead to a proper fix.

@asynchronous
@inlineCallbacks
def power_on(self, node_id, context):
    """Power on machine."""
    url, node_id, headers = yield self.process_redfish_context(context)
    power_state = yield self.power_query(node_id, context)
    maaslog.warning("power on machine called")

    # Power off the machine if currently on.
    if power_state == "on":
        maaslog.warning(
            "since current power state is %s - ForceOff needs to be called first"
            % power_state)
        yield self.power("ForceOff", url, node_id, headers)

    # Wait until the BMC reports the machine as off.
    maaslog.warning("entering infinite loop until server is off")
    while True:
        power_state = yield self.power_query(node_id, context)
        if power_state == "off":
            maaslog.warning("finally, server is off")
            break
        maaslog.warning("still waiting for server to be off")
        time.sleep(1)  # there is likely a better way to delay the next attempt

    # Set to PXE boot; this only succeeds while the server is off.
    maaslog.warning("set pxe boot can be done only on OFF server")
    yield self.set_pxe_boot(url, node_id, headers)

    # Power on the machine.
    maaslog.warning("now we can power on")
    yield self.power("On", url, node_id, headers)

    # Wait until the BMC reports the machine as on.
    maaslog.warning("entering infinite loop until server is on")
    while True:
        power_state = yield self.power_query(node_id, context)
        if power_state == "on":
            maaslog.warning("server is finally on")
            break
        maaslog.warning("still waiting for server to be on")
        time.sleep(1)
If the code always waits for the power status to actually change, only tries to alter the next-boot entry while the server is off, then powers the server on and makes sure it is on before continuing, everything works perfectly. There is a caveat, though: while transitioning from off to on, the status sometimes becomes "Reset", which is not a documented power state, and querying it returns HTTP 400, which gets logged and raises an exception. The power state can also alternate from off to on, then to "Reset", and back to on within 10-45 seconds, which makes the 'wait for on' logic a bit more complex. However, I did not notice any issues with the intermediate "Reset" state, since we only care that the state change actually happened.

HTTP 400 is returned by the HPE Redfish implementation when the server is in transition somewhere between off and on, so in such cases making another request immediately does not make much sense. 400 is probably a poor choice of status code, because the condition is not fatal: the same API request succeeds a few seconds later. Still, I believe MAAS needs to handle this odd case and just retry the request after some time.
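
A minimal sketch of the kind of retry I have in mind, assuming a synchronous request helper with a hypothetical name (the real MAAS driver is asynchronous and would delay via Twisted rather than sleeping):

import time

def redfish_request_with_retry(do_request, retries=5, delay=3):
    # do_request is any callable that performs one Redfish request and
    # returns (status_code, body). Retry while the BMC answers 400 during
    # a power-state transition; the same request tends to succeed a few
    # seconds later.
    for _ in range(retries):
        status, body = do_request()
        if status != 400:
            return status, body
        time.sleep(delay)
    return status, body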

Revision history for this message
Igor Gnip (igorgnip) wrote :

To simplify: I believe what impacts stability the most is that we try to set the next-boot (PXE) option before the server state is off. Everything else is optional hardening of the driver, but 'set next boot' failing because the server is still on is what causes MAAS to fail (looking at MAAS 3.2). I am not sure how the latest MAAS behaves with its hardcoded power statuses, but I see there is no 'Reset' status, so that can't be good.
