MAAS deployment failures on server with Redfish

Bug #2004661 reported by Rod Smith
Affects     Status       Importance   Assigned to   Milestone
MAAS        Incomplete   High         Unassigned
MAAS 3.3    Triaged      High         Unassigned
MAAS 3.4    Triaged      High         Unassigned

Bug Description

On an Ampere Altra-based server (flavio) in the Server Certification lab, deployments are failing when the node is configured to use Redfish. The following appears in the node's deployment log in the web UI:

 Fri, 03 Feb. 2023 15:19:33 Failed to power on node - Power on for the node failed: Failed to complete power action: Redfish request failed with response status code: 400.
 Fri, 03 Feb. 2023 15:19:33 Node changed status - From 'Deploying' to 'Failed deployment'
 Fri, 03 Feb. 2023 15:19:33 Marking node failed - Power on for the node failed: Failed to complete power action: Redfish request failed with response status code: 400.
 Fri, 03 Feb. 2023 15:18:28 Powering on
 Fri, 03 Feb. 2023 15:18:27 Node - Started deploying 'flavio'.
 Fri, 03 Feb. 2023 15:18:27 Deploying

I'm attaching additional MAAS log files to this bug report.

The deployment fails after the node has powered on but while it's still in POST; it doesn't even get to the point where it PXE-boots for the first time.

Two other nodes on our network also use Redfish and appear to be unaffected. The affected server has an ARM64 CPU and is running the latest firmware. I'm 95% certain that it worked when it was first installed a few months ago. We first noticed this problem on 17 January, 2023. Our current MAAS version is 3.2.6-12016-g.19812b4da, installed via snap.

As a workaround, we can set the server to use IPMI rather than Redfish.
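
For reference, a minimal sketch of the standard Redfish power-on action involved here, issued directly with the requests library so the BMC's 400 error body can be read. The BMC address, credentials, and system path below are placeholders and vary by deployment and vendor:

import requests

BMC = "https://10.245.129.101"          # placeholder BMC address
AUTH = ("maas", "secret")               # placeholder credentials
SYSTEM = BMC + "/redfish/v1/Systems/1"  # system ID varies by vendor

# Power the system on via the standard Redfish ComputerSystem.Reset action.
resp = requests.post(
    SYSTEM + "/Actions/ComputerSystem.Reset",
    json={"ResetType": "On"},
    auth=AUTH,
    verify=False,   # many BMCs ship self-signed certificates
)
print(resp.status_code)
print(resp.text)    # a 400 body normally carries an ExtendedInfo message naming the problem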

Tags: power
Revision history for this message
Rod Smith (rodsmith) wrote :
Jeff Lane  (bladernr)
description: updated
Revision history for this message
Igor Brovtsin (igor-brovtsin) wrote :

For some reason, `set_pxe_boot` returned 400, even though we send valid static JSON for all machines. This looks like a vendor-specific issue, although further investigation is required.
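
For context, the static JSON in question is a boot-override PATCH along these lines (a rough sketch, not the exact MAAS code; endpoint and credentials are placeholders). Replaying it by hand should surface the BMC's 400 error body:

import requests

SYSTEM = "https://10.245.129.101/redfish/v1/Systems/1"   # placeholder endpoint
payload = {
    "Boot": {
        "BootSourceOverrideEnabled": "Once",
        "BootSourceOverrideTarget": "Pxe",
    }
}

# Some firmwares reject this otherwise-valid PATCH, e.g. if they also expect
# BootSourceOverrideMode or disallow patching the Boot object directly.
resp = requests.patch(SYSTEM, json=payload, auth=("maas", "secret"), verify=False)
print(resp.status_code, resp.text)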

Changed in maas:
status: New → Triaged
Revision history for this message
Adam Collard (adam-collard) wrote :

We should keep track of all the Redfish quirks in our tests to ensure coverage and minimise regressions. Is there a library for interacting with Redfish that abstracts away the details?

tags: added: power
Changed in maas:
importance: Undecided → High
milestone: none → 3.4.0
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Could we get a tcpdump of the communication between MAAS and the Redfish server that contains the error? Also, when that server used to work (with 95% likelihood), was it on the same MAAS version as when it started failing? The logs show hints of possible proxy involvement; has anything changed in that configuration recently?

Changed in maas:
status: Triaged → Incomplete
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x
Revision history for this message
Rod Smith (rodsmith) wrote :

Here's the tcpdump; the command I used was:

sudo tcpdump -i eno1 -n -w flavio-tcpdump.pcap host 10.245.129.101

Sorry for the delay.

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

All Redfish communication is encrypted, so unfortunately the pcap wasn't very insightful.

> Is there a library for interacting with Redfish that abstracts away from the details?

Yes, OpenStack has a Python module for this: https://opendev.org/openstack/sushy
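
For illustration, the same set-PXE-then-power-on flow through sushy would look roughly like this (endpoint, credentials, and system path are placeholders; this is not what MAAS currently does):

import sushy

root = sushy.Sushy("https://10.245.129.101/redfish/v1",
                   username="maas", password="secret", verify=False)
system = root.get_system("/redfish/v1/Systems/1")

# Request a one-time PXE boot, then power the machine on.
system.set_system_boot_source(sushy.BOOT_SOURCE_TARGET_PXE,
                              enabled=sushy.BOOT_SOURCE_ENABLED_ONCE)
system.reset_system(sushy.RESET_ON)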

Changed in maas:
status: Incomplete → Triaged
milestone: 3.4.x → 3.5.0
Changed in maas:
milestone: 3.5.0 → 3.5.x
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Could we get access to the server where this happens, or an unencrypted dump of the communication between MAAS and the Redfish endpoint?
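
If server access isn't possible, one way to get an unencrypted view is to replay the request from the MAAS host with HTTP wire logging enabled, so the decrypted request and response are printed client-side. A rough sketch, with placeholder endpoint and credentials:

import http.client
import logging

import requests

# Print the raw request/response headers that urllib3 exchanges under the hood.
http.client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)

resp = requests.patch(
    "https://10.245.129.101/redfish/v1/Systems/1",   # placeholder endpoint
    json={"Boot": {"BootSourceOverrideEnabled": "Once",
                   "BootSourceOverrideTarget": "Pxe"}},
    auth=("maas", "secret"),                          # placeholder credentials
    verify=False,
)
print(resp.status_code)
print(resp.text)   # the decrypted error body the pcap couldn't show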

Changed in maas:
status: Triaged → Incomplete