Power off host operation reports completed even if host remains powered on

Bug #1865087 reported by Eric MacDonald
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Low         Eric MacDonald

Bug Description

The maintenance Power Off FSM does not produce a failed response if a BMC accepts the power off request but does not actually power off the host. This was seen on system WP 8-12 compute-1 (WP11) when the BMC got into a bad state: it accepted the power off request but then did not proceed with the power off operation.

The FSM needs to be enhanced with the following case states immediately following the MTC_POWEROFF__RESP_WAIT state to detect and fail the power off operation if the host never actually powers off or never goes offline.

   MTC_POWEROFF__POWER_STATUS
   MTC_POWEROFF__POWER_STATUS_WAIT
   MTC_POWEROFF__VERIFY
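
To illustrate the intent of these states, here is a minimal, self-contained sketch. It is not the mtcAgent implementation. The four MTC_POWEROFF__ stage names above come from this report; the terminal stages MTC_POWEROFF__DONE and MTC_POWEROFF__FAIL, the HostProbe struct, the retry count, and run_power_off() are all assumptions made up for the example.

```cpp
#include <functional>

// Stages named in this report, plus two terminal stages assumed for the sketch.
enum PowerOffStage {
    MTC_POWEROFF__RESP_WAIT,         // existing: wait for the BMC reply to the request
    MTC_POWEROFF__POWER_STATUS,      // new: query the BMC for the current power state
    MTC_POWEROFF__POWER_STATUS_WAIT, // new: wait for the power status reply
    MTC_POWEROFF__VERIFY,            // new: confirm the host is off and offline
    MTC_POWEROFF__DONE,              // assumed terminal stage: power off verified
    MTC_POWEROFF__FAIL               // assumed terminal stage: power off failed
};

// Hypothetical view of the BMC and host, injected so faults can be simulated.
struct HostProbe {
    std::function<bool()> request_accepted; // BMC accepted the power off request
    std::function<bool()> power_is_on;      // BMC power status query result
    std::function<bool()> mtc_alive_seen;   // host is still sending mtcAlive
};

struct PowerOffFsm {
    PowerOffStage stage = MTC_POWEROFF__RESP_WAIT;
    int retries_left;     // bounded verification attempts (assumed policy)
    bool power_on = true; // last power state reported by the BMC

    explicit PowerOffFsm(int retries) : retries_left(retries) {}

    // Advance one stage; call repeatedly until a terminal stage is reached.
    void step(const HostProbe& probe) {
        switch (stage) {
        case MTC_POWEROFF__RESP_WAIT:
            // Acceptance alone is no longer treated as success.
            stage = probe.request_accepted() ? MTC_POWEROFF__POWER_STATUS
                                             : MTC_POWEROFF__FAIL;
            break;
        case MTC_POWEROFF__POWER_STATUS:
            // Real code would send an asynchronous query here.
            stage = MTC_POWEROFF__POWER_STATUS_WAIT;
            break;
        case MTC_POWEROFF__POWER_STATUS_WAIT:
            power_on = probe.power_is_on();
            stage = MTC_POWEROFF__VERIFY;
            break;
        case MTC_POWEROFF__VERIFY:
            if (!power_on && !probe.mtc_alive_seen()) {
                stage = MTC_POWEROFF__DONE;         // off and offline
            } else if (--retries_left > 0) {
                stage = MTC_POWEROFF__POWER_STATUS; // poll again
            } else {
                stage = MTC_POWEROFF__FAIL;         // never powered off
            }
            break;
        case MTC_POWEROFF__DONE:
        case MTC_POWEROFF__FAIL:
            break; // terminal stages: nothing to do
        }
    }
};

// Drive the FSM to a terminal stage and return it.
PowerOffStage run_power_off(const HostProbe& probe, int retries) {
    PowerOffFsm fsm(retries);
    while (fsm.stage != MTC_POWEROFF__DONE && fsm.stage != MTC_POWEROFF__FAIL) {
        fsm.step(probe);
    }
    return fsm.stage;
}
```

With a probe that models the faulty BMC seen here (request accepted, power stays on, mtcAlive still arriving), run_power_off() ends in MTC_POWEROFF__FAIL rather than reporting completion. A real handler would pace the retries with timers instead of looping immediately.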

This is day-one behavior that was never observed and never covered by fault insertion testing.

Severity
--------
Minor, for the following reasons:
1. The server is already out of service if it's being powered off, so there is no immediate service-affecting impact.
2. Could be considered a double fault scenario.
3. The host does not mistakenly appear powered off while it's not. Instead it just bounces back as online.

Steps to Reproduce
------------------
Difficult without tricking the code. One way is to provision host 'A' with the BMC info from host 'B' and then execute a power off for host 'A'. Host 'B' is powered off while host 'A' remains powered on and online.

Expected Behavior
-----------------
Power off command reports completed only when the host is powered off, and reports failed when the host remains powered on.

Actual Behavior
---------------
Power off command reports completed and host is not powered off.

Reproducibility
---------------
Reproducible 100%

System Configuration
--------------------
Any host with BMC support/provisioned.

Branch/Pull Time/Commit
-----------------------
Any

Last Pass
---------
Never. Requires faulty BMC that does not power off but accepts power off command.

Timestamp/Logs
--------------

2020-02-27T19:36:31.845 [822616.01412] controller-1 mtcAgent hdl mtcNodeHdlrs.cpp (3319) offline_handler : Info : compute-1 still seeing mtcAlive (Y:Y)
2020-02-27T19:36:32.947 [822616.01413] controller-1 mtcAgent hdl mtcNodeHdlrs.cpp (3319) offline_handler : Info : compute-1 still seeing mtcAlive (Y:Y)
2020-02-27T19:36:33.314 [822616.01414] controller-1 mtcAgent |-| mtcNodeHdlrs.cpp (4818) power_handler : Info : compute-1 Power-Off Completed

Test Activity
-------------
Observed while debugging why a BMC did not power off its host while it accepted the power off command.

Workaround
----------
Not required.

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
summary: - Power off host reports completed even if host remains powered on
+ Power off host operation reports completed even if host remains powered on
Ghada Khalil (gkhalil) wrote :

Low / not gating - it doesn't seem worth trying to work around a faulty BMC by adding complexity to the stx software.

Suggest you close this as Won't Fix

tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Eric MacDonald (rocksolidmtce) wrote :

I'm fine with that but please keep in mind that this is maintenance code.
Maintenance code is, by nature, designed to deal with and report faults against faulty hardware.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/763421

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil) wrote :

As recommended by Eric, marking for stx.5.0 to add software robustness to deal with this h/w fault

tags: added: stx.5.0
Changed in starlingx:
status: In Progress → Fix Released