Mtce not detecting unresponsive hosts after LO DOR
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Invalid | Medium | Eric MacDonald | |
Bug Description
Brief Description
-----------------
Dead Office Recovery (DOR) enhancement changes made to Mtce in R5 introduced an issue: over a mtcAgent process restart, heartbeat is not started for powered-off hosts if the DOR timer does not expire before the add handler completes. This case is more likely in larger systems because the DOR timer length increases with the number of hosts in the system.
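The timing condition described above can be illustrated with a toy model. This is only a sketch and not the mtcAgent implementation: the function names, the linear timer scaling, and every constant (`base_secs`, `per_host_secs`, the fixed 200 second add-handler completion time) are hypothetical values chosen for illustration, since the report does not give the real formula.

```python
def dor_timer_secs(num_hosts, base_secs=120, per_host_secs=10):
    """Hypothetical scaling: the DOR timer gets longer with the host count.

    The real formula and constants are not given in this report.
    """
    return base_secs + per_host_secs * num_hosts


def heartbeat_started(host_powered_on, add_done_secs, dor_timer_len_secs):
    """Toy model of whether heartbeat ends up running for a host
    after a mtcAgent process restart."""
    if host_powered_on:
        # Hosts that come back normally are not affected by this issue.
        return True
    # Reported failure mode: a powered-off host only gets heartbeat if the
    # DOR timer has already expired when the add handler completes.
    return dor_timer_len_secs <= add_done_secs


# Assumed for illustration: the add handler completes 200 s after restart.
ADD_DONE_SECS = 200

small = heartbeat_started(False, ADD_DONE_SECS, dor_timer_secs(4))    # timer 160 s
large = heartbeat_started(False, ADD_DONE_SECS, dor_timer_secs(12))   # timer 240 s
print(small, large)  # True False
```

Under these assumed numbers, the powered-off host in the 4-host system still gets heartbeat, while the same host in the 12-host system does not, so its failure goes undetected.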
Severity
--------
<Major: System/Feature is usable but degraded>
Steps to Reproduce
------------------
DOR a large system with 12 or more hosts, all unlocked-enabled, but prevent one or more of those hosts from powering back on.
Expected Behavior
------------------
Hosts that were prevented from being powered on should be detected as failing heartbeat by maintenance on DOR recovery.
Actual Behavior
----------------
No detection because heartbeat is not started on those hosts.
Reproducibility
---------------
Intermittent, depending on the size of the system. Only occurs against hosts that were fine before the DOR or mtcAgent process restart but are completely unresponsive afterwards.
System Configuration
-------
Large system with 12 or more hosts
Branch/Pull Time/Commit
-------
Issue is suspected to exist in R5.
Will be proven during issue investigation and fix.
summary: |
- Mtce not detecting unresponsive hosts in LO DOR
+ Mtce not detecting unresponsive hosts after LO DOR |
tags: | added: stx.metal |
tags: | added: stx.2019.05 removed: stx.2019.03 |
tags: | added: stx.2.0 removed: stx.2019.05 |
Targeting for stx.2019.03 - double fault scenario (DOR + nodes not recovering after)