mtcAgent node failure mode handling one-time reset did not happen
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Eric MacDonald |
Bug Description
Brief Description
-----------------
Failure recovery handling for an unlocked-enabled node issues a one-time reset if that node's BMC is provisioned. Persistent loss of management and cluster-host network connectivity can leave a host isolated.
This case is normally handled if the inventory profile for that host has the BMC provisioned properly and maintenance has verified and established connectivity to that BMC.
However, the one-time reset sometimes does not occur when the active controller spontaneously reboots, causing Service Management (SM) to swact to the standby controller. When SM starts mtcAgent on the newly activated controller, establishing access to the BMCs of all provisioned nodes takes some time.
If maintenance detects heartbeat loss against the failed controller BEFORE it has established connectivity with that controller's BMC, the one-time reset is skipped. In severe node isolation cases, a node whose reset is skipped may not be recovered.
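The race described above can be illustrated with a small model. This is only a sketch of the behavior as reported here, not mtcAgent code: the Host class, its bmc_accessible_at field, and reset_delay_after_failure are hypothetical names, and the 20 minute figure is the graceful recovery timeout quoted in this report. It also assumes the BMC has become accessible by the time the reset is retried.

```python
from dataclasses import dataclass
from typing import Optional

# 20 minute graceful recovery timeout, as quoted in this bug report.
GRACEFUL_RECOVERY_TIMEOUT_SECS = 20 * 60


@dataclass
class Host:
    """Hypothetical stand-in for a host's inventory and BMC state."""
    name: str
    bmc_provisioned: bool
    # Seconds after mtcAgent start at which BMC connectivity is established
    # (illustrative field, not a real mtcAgent attribute).
    bmc_accessible_at: float


def reset_delay_after_failure(host: Host, heartbeat_loss_at: float) -> Optional[float]:
    """Seconds between heartbeat loss and the BMC reset, or None if no reset.

    heartbeat_loss_at is measured in seconds after mtcAgent start.
    """
    if not host.bmc_provisioned:
        # No BMC provisioned: no reset can be issued through it.
        return None
    if heartbeat_loss_at >= host.bmc_accessible_at:
        # BMC connectivity already established: one-time reset is issued.
        return 0.0
    # Heartbeat loss seen BEFORE BMC connectivity: the one-time reset is
    # skipped and only retried after the graceful recovery timeout
    # (assumes the BMC is accessible by then).
    return float(GRACEFUL_RECOVERY_TIMEOUT_SECS)


controller = Host("controller-0", bmc_provisioned=True, bmc_accessible_at=90.0)
print(reset_delay_after_failure(controller, heartbeat_loss_at=120.0))
print(reset_delay_after_failure(controller, heartbeat_loss_at=30.0))
```

In this model the first call represents the normal case (reset issued immediately) and the second represents the failure mode in this report, where the reset is delayed by the full timeout.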
Severity
--------
Major: Failed node is not recovered in rare (but possible) node failure modes. See description above.
Steps to Reproduce
------------------
Provision the BMC for all hosts
Force down all the links on the active controller
Expected Behavior
------------------
Maintenance is able to recover the node via BMC reset
Actual Behavior
----------------
Sometimes the BMC reset is skipped and the node is not recovered until after the 20-minute graceful recovery timeout. The reset is then retried and succeeds, but this extends the controller outage for longer than it should.
Reproducibility
---------------
Intermittent
System Configuration
-------
Multi-node system
Branch/Pull Time/Commit
-------
Any build prior to the closing of this issue.
Last Pass
---------
Unknown
Timestamp/Logs
--------------
bmc not accessible
Test Activity
-------------
Normal Use
Workaround
----------
None
Changed in starlingx: | |
importance: | Undecided → Medium |
tags: | added: stx.9.0 stx.metal |
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/899971