Comment 13 for bug 2004661

Igor Gnip (igorgnip) wrote :

Hello,

I modified part of the redfish.py driver in our lab to add some logging and to wait in an infinite loop for the power state to change. This makes the redfish driver work much better, but I am sure the implementation is not ideal given the blocking nature of an infinite loop (what if the status never changes, etc.). Please take it as a POC of what could eventually lead to a proper fix:

@asynchronous
@inlineCallbacks
def power_on(self, node_id, context):
  """Power on machine."""
  url, node_id, headers = yield self.process_redfish_context(context)
  power_state = yield self.power_query(node_id, context)
  maaslog.warning("power on machine called")

  # Power off the machine if currently on.
  if power_state == "on":
    maaslog.warning("since current power state is %s - ForceOff needs to be called first" % power_state)
    yield self.power("ForceOff", url, node_id, headers)

  maaslog.warning("entering infinite loop until server is off")
  while True:
    power_state = yield self.power_query(node_id, context)
    if power_state == "off":
      maaslog.warning("finally, server is off")
      break
    maaslog.warning("still waiting for server to be off")
    time.sleep(1)  # there is likely a better way to delay the next attempt

  # Set to PXE boot.
  maaslog.warning("set pxe boot can be done only on OFF server")
  yield self.set_pxe_boot(url, node_id, headers)
  maaslog.warning("now we can power on")
  # Power on the machine.
  yield self.power("On", url, node_id, headers)

  maaslog.warning("entering infinite loop until server is on")
  while True:
    power_state = yield self.power_query(node_id, context)
    if power_state == "on":
      maaslog.warning("server is finally on")
      break
    maaslog.warning("still waiting for server to be on")
    time.sleep(1)
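
One way to address the blocking concern is to bound the wait and to yield to the event loop between polls. The sketch below is not MAAS code: it uses the standard library's asyncio instead of Twisted so that it runs on its own, and wait_for_power_state, PowerStateTimeout and the query callable are hypothetical names standing in for the driver's power_query. In the Twisted-based driver the same shape would need a Deferred-based delay (for example twisted.internet.task.deferLater) in place of asyncio.sleep.

```python
import asyncio


class PowerStateTimeout(Exception):
    """Raised when the machine does not reach the wanted state in time."""


async def wait_for_power_state(query, wanted, timeout=60.0, interval=1.0):
    """Poll `query()` until it returns `wanted`, or raise after `timeout` seconds.

    `query` is a hypothetical stand-in for the driver's power_query: an
    async callable returning a state string such as "on" or "off".
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        state = await query()
        if state == wanted:
            return state
        if loop.time() >= deadline:
            raise PowerStateTimeout(
                "still %r after %.0fs, wanted %r" % (state, timeout, wanted)
            )
        # Non-blocking delay: other tasks keep running while we wait.
        await asyncio.sleep(interval)
```

With a helper like this, each "while True" loop above would collapse into a single call such as wait_for_power_state(query, "off", timeout=120), and a status that never changes would surface as an error instead of hanging forever.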

If the code always waits for the power status to change, alters bootnext only while the server is off, then powers on the server and confirms it is on before continuing, everything works perfectly. There is a caveat though: while transitioning from off to on, the status sometimes becomes "reset", which is not a documented power status, and HTTP code 400 is returned, which gets logged and raises an exception. The power status can also alternate from off to on, then to reset, and then back to on within 10-45 seconds, which makes the 'wait for on' a bit more complex. However, I did not notice any issues with the subsequent "reset" state, since we only care that the state change actually happened.

HTTP code 400 is returned by the HPE redfish implementation when the server is in transition, somewhere between off and on, so in such cases making another request right away does not make much sense. But 400 is probably a bad choice of HTTP code, as this is not fatal: the same API request succeeds a few seconds later.
Still, I believe maas needs to handle this odd case and just retry the request after some time.
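
The retry suggested here could look roughly like the following. This is a minimal sketch, not MAAS code: TransientRedfishError, retry_transient and the request callable are hypothetical names, and it assumes the caller has some way to turn an HTTP 400 received mid-transition into that exception. It uses asyncio only so that it runs standalone. The attempt limit is there so that a 400 caused by a genuinely bad request is not retried forever.

```python
import asyncio


class TransientRedfishError(Exception):
    """Hypothetical: raised by `request` when the BMC answers HTTP 400."""


async def retry_transient(request, attempts=5, delay=3.0):
    """Call `request()` and retry it on TransientRedfishError.

    Gives up and re-raises after `attempts` failed calls, so a 400 that
    never goes away still surfaces to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await request()
        except TransientRedfishError:
            if attempt == attempts:
                raise
            # Wait without blocking the event loop, then try again.
            await asyncio.sleep(delay)
```

The delay of a few seconds matches the observation that the same request succeeds shortly afterwards; the exact values would need tuning against real hardware.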