Mtce not detecting unresponsive hosts after LO DOR

Bug #1799753 reported by Eric MacDonald
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Eric MacDonald
Milestone: none listed

Bug Description

Brief Description
-----------------
Dead Office Recovery (DOR) enhancement changes made to Mtce in R5 introduced an issue: after a mtcAgent process restart, heartbeat is not started for powered-off hosts if the DOR timer has not expired by the time the add handler completes. This case is more likely on larger systems because the DOR timer length increases with the number of hosts in the system.
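
The timing condition described above can be illustrated with a minimal sketch. This is not the Mtce code: the constants, the linear scaling of the timer, the `Host` type, and the function names are all hypothetical, chosen only to show why a longer DOR timer makes the missed heartbeat start more likely.

```python
from dataclasses import dataclass

# Hypothetical values; the report gives no actual timer constants.
DOR_BASE_SECS = 60
DOR_PER_HOST_SECS = 10


def dor_timeout(num_hosts: int) -> int:
    """Assumed linear scaling: the DOR timer grows with the host count."""
    return DOR_BASE_SECS + DOR_PER_HOST_SECS * num_hosts


@dataclass
class Host:
    name: str
    powered_on: bool
    heartbeat_started: bool = False


def on_add_handler_complete(host: Host, now: int, dor_deadline: int) -> bool:
    """Model of the reported behavior, not the real add handler.

    Heartbeat starts if the host is up (assumed here to be triggered by the
    host itself reporting in) or if the DOR timer has already expired. A
    powered-off host handled before the deadline is left without heartbeat.
    """
    if host.powered_on or now >= dor_deadline:
        host.heartbeat_started = True
    return host.heartbeat_started


if __name__ == "__main__":
    add_done_at = 100  # hypothetical time (secs) at which the add handler completes
    for num_hosts in (2, 12):
        host = Host("compute-0", powered_on=False)
        started = on_add_handler_complete(host, add_done_at, dor_timeout(num_hosts))
        print(f"{num_hosts} hosts: DOR timer {dor_timeout(num_hosts)}s, "
              f"heartbeat started: {started}")
```

With these assumed numbers, the add handler finishes after the DOR timer on the small system but before it on the 12-host system, so only the larger system leaves the powered-off host unmonitored.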

Severity
--------
<Major: System/Feature is usable but degraded>

Steps to Reproduce
------------------
DOR a large system with 12 or more hosts, all unlocked-enabled, but prevent one or more of those hosts from powering back on.

Expected Behavior
------------------
Hosts that were prevented from being powered on should be detected as failing heartbeat by maintenance on DOR recovery.

Actual Behavior
----------------
No detection because heartbeat is not started on those hosts.

Reproducibility
---------------
Intermittent, depending on the size of the system. Only occurs against hosts that were fine before the DOR or mtcAgent process restart but are completely unresponsive afterwards.

System Configuration
--------------------
Large system with 12 or more hosts

Branch/Pull Time/Commit
-----------------------
Issue is suspected to exist in R5.
Will be proven during issue investigation and fix.

summary: - Mtce not detecting unresponsive hosts in LO DOR
+ Mtce not detecting unresponsive hosts after LO DOR
Ghada Khalil (gkhalil)
tags: added: stx.metal
Ghada Khalil (gkhalil) wrote :

Targeting for stx.2019.03 - double fault scenario (DOR + nodes not recovering after)

Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Eric MacDonald (rocksolidmtce)
status: New → Triaged
tags: added: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis) wrote :

emacdona MacDonald, Eric added a comment - 12/Dec/18 2:18 PM
Cannot reproduce.
Tried with single and multi node failures.
Hosts were detected as failed and put into graceful recovery, which eventually failed and timed out, followed by the full disabled-failed and reboot recovery loop until the host(s) got powered on.

Issue not seen in this load.

[wrsroot@controller-1 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Release 18.10
###

SW_VERSION="18.10"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2018-12-10_20-18-00"
SRC_BUILD_ID="167"

JOB="StarlingX_Upstream_build"
BUILD_BY="jenkins"
BUILD_NUMBER="167"
BUILD_HOST="yow-cgts1-lx"
BUILD_DATE="2018-12-10 20:20:40 -0500"

Will monitor over the next few weeks.

Changed in starlingx:
status: Triaged → Invalid
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05