StarlingX

Mtce not detecting unresponsive hosts after LO DOR

Bug #1799753 reported by Eric MacDonald on 2018-10-24

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Invalid	Medium	Eric MacDonald

Bug Description

Brief Description
-----------------
Dead Office Recovery (DOR) enhancement changes in Mtce in R5 introduced an issue where, over a mtcAgent process restart, heartbeat is not started for powered-off hosts if the DOR timer does not expire before the add handler completes. This case is more likely in larger systems because the DOR timer length grows with the number of hosts in the system.
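
The timing condition described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the Mtce implementation: the constants, the linear timer formula, and all names (Host, dor_timeout, run_add_handler, undetected) are hypothetical and chosen only to show why larger systems are more exposed. It also assumes that powered-on hosts always get heartbeat started.

```python
from dataclasses import dataclass

# Hypothetical values; the real Mtce timer values are not given in this report.
DOR_BASE_SECS = 60        # assumed fixed part of the DOR timer
DOR_SECS_PER_HOST = 10    # assumed per-host growth of the DOR timer
ADD_HANDLER_SECS = 120    # assumed time at which the add handler completes


@dataclass
class Host:
    name: str
    powered_on: bool
    heartbeat_started: bool = False


def dor_timeout(num_hosts):
    """Assumed model: the DOR timer grows linearly with the host count."""
    return DOR_BASE_SECS + DOR_SECS_PER_HOST * num_hosts


def run_add_handler(hosts, add_done_at):
    """Model of the reported behavior: if the DOR timer is still running
    when the add handler completes, powered-off hosts get no heartbeat."""
    dor_still_running = add_done_at < dor_timeout(len(hosts))
    for host in hosts:
        if host.powered_on or not dor_still_running:
            host.heartbeat_started = True
    return hosts


def undetected(hosts):
    """Hosts that are down but have no heartbeat monitoring them."""
    return [h.name for h in hosts
            if not h.powered_on and not h.heartbeat_started]


def make_system(num_up):
    """Build a system with num_up powered-on hosts plus one powered-off host."""
    return [Host(f"host-{i}", True) for i in range(num_up)] + [Host("dead", False)]
```

In this model a 4-host system has a shorter DOR timer than the add handler time, so heartbeat is started on every host and the powered-off host is detected. A 12-host system has a longer DOR timer, so the powered-off host is left without heartbeat and goes undetected.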

Severity
--------
<Major: System/Feature is usable but degraded>

Steps to Reproduce
------------------
DOR a large system of 12 or more hosts, all unlocked-enabled, but prevent one or more of those hosts from powering back on.

Expected Behavior
------------------
Hosts that were prevented from being powered on should be detected as failing heartbeat by maintenance on DOR recovery.

Actual Behavior
----------------
No detection because heartbeat is not started on those hosts.

Reproducibility
---------------
Intermittent, depending on the size of the system. Only occurs against hosts that were fine before the DOR or mtcAgent process restart but are completely unresponsive afterwards.

System Configuration
--------------------
Large system with 12 or more hosts

Branch/Pull Time/Commit
-----------------------
Issue is suspected to exist in R5.
Will be proven during issue investigation and fix.

Tags:

Eric MacDonald (rocksolidmtce) on 2018-10-24

summary:

- Mtce not detecting unresponsive hosts in LO DOR
+ Mtce not detecting unresponsive hosts after LO DOR

Ghada Khalil (gkhalil) on 2018-10-24

tags:

added: stx.metal

Ghada Khalil (gkhalil) wrote on 2018-10-26:

Targeting for stx.2019.03 - double fault scenario (DOR + nodes not recovering after)

Changed in starlingx:
importance:	Undecided → Medium
assignee:	nobody → Eric MacDonald (rocksolidmtce)
status:	New → Triaged
tags:	added: stx.2019.03

Ken Young (kenyis) on 2019-01-18

tags:

added: stx.2019.05
removed: stx.2019.03

Ken Young (kenyis) wrote on 2019-03-18:

emacdona MacDonald, Eric added a comment - 12/Dec/18 2:18 PM
Cannot reproduce.
Tried with single and multi node failures.
Hosts were detected as failed and put into graceful recovery, which eventually failed and timed out, followed by the full disabled-failed and reboot recovery loop until the host(s) were powered on.

Issue not seen in this load.

[wrsroot@controller-1 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Release 18.10
###

SW_VERSION="18.10"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2018-12-10_20-18-00"
SRC_BUILD_ID="167"

JOB="StarlingX_Upstream_build"
BUILD_BY="jenkins"
BUILD_NUMBER="167"
BUILD_HOST="yow-cgts1-lx"
BUILD_DATE="2018-12-10 20:20:40 -0500"

Will monitor over the next few weeks.

Changed in starlingx:
status:	Triaged → Invalid

Ken Young (kenyis) on 2019-04-05

tags:

added: stx.2.0
removed: stx.2019.05
