Mtce reinstall fails for some servers that are powered off
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Eric MacDonald |
Bug Description
The BMC of some servers, such as HPE servers, silently rejects 'set next boot device' and possibly other board management commands while the server is in what is called 'POST Mode', i.e. executing BIOS. This period begins immediately after 'power-on' or 'reset' and continues until the server executes Linux (from disk or ISO image).
If the Maintenance Reinstall Handler detects that a server's power is off, it issues a 'power-on' board management command, which immediately puts the server into 'POST Mode' (BIOS). In that mode the server silently rejects the next board management command, 'set next boot device to pxe', which is then followed by 'reset'.
Since the 'set next boot device' command did not take effect (it was silently rejected), the server boots a valid image on disk if one exists, thereby failing the intended reinstall operation.
A similar issue was detected during the development of the 'rvmc' (Redfish Virtual Media Controller) container for subcloud install.
That issue was corrected by switching to an algorithm that always powers the server off first, then issues the other board management commands, and finally powers the server on.
The Maintenance Reinstall Handler will also need to make a similar change.
New Algorithm:
Step 1. Power Off Host
Step 2. Wait for Shutdown
Step 3. Verify Power Off
Step 4. Set next boot device to pxe
Step 5. Power on Host
With this algorithm the host will always Network Boot regardless of the initial power state of the host.
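The difference between the two command orderings can be sketched with a small simulation. This is only an illustration under stated assumptions: `FakeBmc`, its methods, and both `reinstall_*` functions are hypothetical names invented for this sketch, not the Maintenance (mtce) code or any real BMC/Redfish/IPMI API. The fake BMC models only the behavior described above, namely that 'set next boot device' is silently ignored while the host is in POST Mode, and that a one-time boot override is consumed at the next boot.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeBmc:
    """Hypothetical BMC model: ignores 'set next boot' while in POST Mode."""
    powered_on: bool = False
    in_post: bool = False
    next_boot: str = "disk"
    booted_from: str = ""

    def is_powered_on(self) -> bool:
        return self.powered_on

    def power_off(self) -> None:
        self.powered_on = False
        self.in_post = False

    def power_on(self) -> None:
        self.powered_on = True
        self._start_boot()

    def reset(self) -> None:
        if self.powered_on:
            self._start_boot()

    def set_next_boot(self, device: str) -> None:
        if self.in_post:
            return  # silently rejected, no error reported to the caller
        self.next_boot = device

    def _start_boot(self) -> None:
        # Entering POST: the boot device is taken from the current setting,
        # and the one-time override falls back to disk afterwards.
        self.in_post = True
        self.booted_from = self.next_boot
        self.next_boot = "disk"


def reinstall_power_on_first(bmc: FakeBmc) -> None:
    """Old ordering: power on if off, set pxe, then reset."""
    if not bmc.is_powered_on():
        bmc.power_on()
    bmc.set_next_boot("pxe")
    bmc.reset()


def reinstall_power_off_first(bmc: FakeBmc, timeout_s: float = 5.0,
                              poll_s: float = 0.0) -> None:
    """New ordering, following Steps 1 to 5 above."""
    bmc.power_off()                                   # Step 1
    deadline = time.monotonic() + timeout_s
    while bmc.is_powered_on() and time.monotonic() < deadline:
        time.sleep(poll_s)                            # Step 2
    if bmc.is_powered_on():                           # Step 3
        raise RuntimeError("host did not power off; aborting reinstall")
    bmc.set_next_boot("pxe")                          # Step 4
    bmc.power_on()                                    # Step 5


if __name__ == "__main__":
    old = FakeBmc(powered_on=False)
    reinstall_power_on_first(old)
    print("old ordering, host initially off:", old.booted_from)
    for initially_on in (False, True):
        new = FakeBmc(powered_on=initially_on)
        reinstall_power_off_first(new)
        print("new ordering, initially on =", initially_on, ":", new.booted_from)
```

In this model the old ordering boots from disk when the host starts powered off, while the power-off-first ordering boots from pxe for either initial power state, because the boot device is always set while the host is off and therefore not in POST Mode.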
Severity:
---------
Major (has work around)
Work Around:
------------
Power on the server, wait for BIOS to complete, then issue the Reinstall command.
Steps to Reproduce:
-------
Power off server in advance of issuing Maintenance Host Reinstall command/operation.
Expected Behavior:
------------------
Host boots from network even if a valid image exists on disk.
Actual Behavior
----------------
Host boots from local disk instead of from network.
Reproducibility:
----------------
100% of the time if the host to be reinstalled has a valid image on disk and is powered off in advance of the reinstall operation.
System Configuration
-------
Any system with HPE Servers.
Branch/Pull Time/Commit
-------
Lab/Server Specific.
Present since the development of the deployment manager, which added forced network boot to the Maintenance Reinstall Handler.
Last Pass
---------
Newly observed behavior on the HP380 lab using the new auto install with advanced power off algorithm.
Timestamp/Logs
--------------
Not Required.
Test Activity
-------------
Auto Installation with advanced host power-off changes.
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
stx.4.0 / medium priority - there is a workaround