Improve maintenance power/reset control command retry handling

Bug #2031945 reported by Eric MacDonald
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
Platform management BMC power on/off and reset command retry handling is not working as expected.

Issue 1: host-reset skips the graceful reset command altogether, always issues the immediate command, and performs only 4 retries rather than 5.

Issue 2: The host-power-off algorithm's timing causes 2 retries to be consumed for each actual retry, so only 5 retries are effectively performed. Even a successful first try is logged as a retry.

The maintenance power on/off and reset handling needs to be improved and made
consistent in terms of retries and the use of graceful and immediate commands.

Severity
--------
Minor. The issue exists only in the BMC command failure handling cases, which still work but do not behave ideally.

Steps to Reproduce
------------------
system host-power-off <host>
system host-reset <host>

Expected Behavior
------------------
Retry counts and command types (graceful vs. immediate) match the intended behavior.

Actual Behavior
----------------
Retry counts and command types (graceful vs. immediate) do not match the intended behavior; see Issue 1 and Issue 2 above.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Any BMC provisioned host

Branch/Pull Time/Commit
-----------------------
Any load prior to the close of this bug report

Last Pass
---------
test escape

Timestamp/Logs
--------------
Not necessary; the issue is understood.

Test Activity
-------------
Normal Use

Workaround
----------
None

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/892051

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/895728

OpenStack Infra (hudson-openstack) wrote : Change abandoned on metal (master)

Change abandoned by "Eric MacDonald <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/895728
Reason: Uploaded by accident

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/906748

OpenStack Infra (hudson-openstack) wrote : Change abandoned on metal (master)

Change abandoned by "Eric MacDonald <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/892051

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/906748
Committed: https://opendev.org/starlingx/metal/commit/50dc29f6c025de0b9dfea3196cf3bedff8c36908
Submitter: "Zuul (22348)"
Branch: master

commit 50dc29f6c025de0b9dfea3196cf3bedff8c36908
Author: Eric Macdonald <email address hidden>
Date: Mon Sep 18 18:48:56 2023 +0000

    Improve maintenance power/reset control command retry handling

    This update improves on and drives consistency into the
    maintenance power on/off and reset handling in terms of
    retries and use of graceful and immediate commands.

    This update maintains the 10 retries for both power-on
    and power-off commands and increases the number of retries
    for the reset command from 5 to 10 to line up with the
    power operation commands.

    This update also ensures that the first 5 retries are done
    with the graceful action command while the last 5 are with
    the immediate.

    This update also removes a power-on handling case that could
    have led to a stuck state. This case was virtually impossible
    to hit based on the required sequence of intermittent command
    failures, but that scenario handling was fixed up anyway.

    Issues have been seen with the power-off handling on some servers.
    It is suspected that those servers need more time to power off, so
    this update introduces a 30 second delay following a power-off
    command before issuing the power status query. This gives the server
    some time to power off before the power-off command is retried.

    Test Plan: Both IPMI and Redfish

    PASS: Verify power on/off and reset handling support up to 10 retries
    PASS: Verify graceful command is used for the first power on/off
          or reset try and the first 5 retries
    PASS: Verify immediate command is used for the final 5 retries
    PASS: Verify reset handling with/without retries (none/mid/max)
    PASS: Verify power-on handling with/without retries (none/mid/max)
    PASS: Verify power-off handling with/without retries (none/mid/max)
    PASS: Verify power status command failure handling for power on/off
    NOTE: FIT (fault insertion testing) was used to create retry scenarios

    PASS: Verify power-off inter retry delay feature
    PASS: Verify 30 second power-off to power query delay
    PASS: Verify redfish power/reset commands used are logged by default
    PASS: Verify power-off/on and reset logging

    Regression:

    PASS: Verify power-on/off and reset handling without retries
    PASS: Verify power-off handling when power is already off
    PASS: Verify power-on handling when power is already on

    Closes-Bug: 2031945
    Signed-off-by: Eric Macdonald <email address hidden>
    Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36
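
For readers who want the retry policy from the commit message above in concrete form, here is a minimal Python sketch. It is not the mtce implementation (the actual fix is in the starlingx/metal repository). The names command_for_retry, power_off, send_command, power_is_off, and the constants are hypothetical and chosen only for illustration. The sketch assumes retry 0 is the initial try, that a failed command send is retried immediately, and that the 30 second delay applies only after a power-off command was sent successfully.

```python
import time
from typing import Callable

# Values taken from the commit message; the names are hypothetical.
MAX_RETRIES = 10             # retries allowed after the initial try
GRACEFUL_RETRIES = 5         # the first 5 retries use the graceful command
POWER_OFF_SETTLE_SECS = 30   # wait after power-off before querying power status


def command_for_retry(retry: int) -> str:
    """Return the command type for a given retry number (0 = initial try)."""
    if not 0 <= retry <= MAX_RETRIES:
        raise ValueError(f"retry {retry} is outside 0..{MAX_RETRIES}")
    # Initial try and retries 1..5 are graceful; retries 6..10 are immediate.
    return "graceful" if retry <= GRACEFUL_RETRIES else "immediate"


def power_off(
    send_command: Callable[[str], bool],
    power_is_off: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try to power off a host; return the number of retries that were needed.

    send_command(kind) stands in for the BMC request and returns True when the
    command was accepted. power_is_off() stands in for the power status query.
    """
    for retry in range(MAX_RETRIES + 1):
        if send_command(command_for_retry(retry)):
            # Give the server time to power off before checking its state,
            # so that a slow server is not hit with an early retry.
            sleep(POWER_OFF_SETTLE_SECS)
            if power_is_off():
                return retry
    raise RuntimeError(f"power-off failed after {MAX_RETRIES} retries")
```

With this policy a host that never powers off receives 11 commands in total: 6 graceful (the initial try plus 5 retries) followed by 5 immediate. Passing a stub for the sleep parameter lets the sequence be exercised without waiting 30 seconds per attempt.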

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)