StarlingX

Worker node is stuck in degraded yet there is no alarm

Bug #1845344 reported by Tee Ngo on 2019-09-25

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Low	Eric MacDonald

Bug Description

Brief Description
-----------------
While the system was idle, a worker node became degraded and stayed in this state for a while without any alarms.

Severity
--------
Major

Steps to Reproduce
------------------
Soak a standard system with a large number of app pods.

Note: this system exhibits periodic activation of MNFA (https://bugs.launchpad.net/starlingx/+bug/1845342)

Expected Behavior
------------------
Worker node should regain "available" state as soon as its heartbeat becomes regular.

Actual Behavior
----------------
Most nodes show the expected behavior, but occasionally a node gets stuck in the degraded state until the next MNFA is activated.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
IPv6 standard system

Branch/Pull Time/Commit
-----------------------
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-09-17_20-00-00"
SRC_BUILD_ID="15"

Last Pass
---------
N/A - this appears to be a race condition

Timestamp/Logs
--------------
See logs attached

Test Activity
-------------
System Test

Tee Ngo (teewrs) wrote on 2019-09-25:

#1

mnfa_issue.tgz (8.4 MiB, application/x-tar)

Ghada Khalil (gkhalil) wrote on 2019-09-25:

#2

minor (missing alarm), but would be nice to fix if not risky

summary:	- Worker node is stucked in degraded yet there is no alarm
	+ Worker node is stuck in degraded yet there is no alarm
tags:	added: stx.fm stx.metal
tags:	added: stx.fault removed: stx.fm
Changed in starlingx:
status:	New → Triaged
assignee:	nobody → Eric MacDonald (rocksolidmtce)
importance:	Undecided → Low

Eric MacDonald (rocksolidmtce) wrote on 2019-09-25:

#3

Issue is understood (day 1 race condition relating to MNFA).
Fix is simple, implemented and soaking in large lab.
Issue has not been seen since test fix applied to lab.

Fix will be posted for review after 12 hrs of soak time.

Eric MacDonald (rocksolidmtce) wrote on 2019-09-25:

#4

Issue is not missing alarm.
Issue is stuck degrade.

OpenStack Infra (hudson-openstack) wrote on 2019-09-29: Fix proposed to metal (master)

#5

Fix proposed to branch: master
Review: https://review.opendev.org/685577

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-10-07: Fix merged to metal (master)

#6

Reviewed: https://review.opendev.org/685577
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=f01fd854702d0c8654b9e441315551b28a0a66bc
Submitter: Zuul
Branch: master

commit f01fd854702d0c8654b9e441315551b28a0a66bc
Author: Eric MacDonald <email address hidden>
Date: Sat Sep 28 19:41:53 2019 -0400

Fix MNFA recovery race condition that leads to stuck degrade

Seeing from 0 to 10% of hosts get stuck in the degrade state
after MNFA recovery.

    Clearing host degrade on Multi-Node Failure Avoidance (MNFA)
    recovery does not send degrade clear but does clear the hbs
    control states. Instead relies on explicit events from
    hbsAgent per host/network to do so.

    If MNFA Recovery (exit) event occurs before all hbsAgent
    clear messages arrive then the hbs control clear tricks
    the mtcAgent into thinking that there was no degrade event
    active when it actually may still be.

    This fix enables the clear option in the mon_host MNFA Recovery
    call so that the host's degrade condition is cleared.
    It also removes the unnecessary heartbeat disable call.

Test Plan:

    PASS: soak MNFA in large system over and over to verify
          a 0-10% stuck degrade occurrence rate drops to 0
          after many (more than 20) occurrences.

Regression:

    PASS: Verify heartbeat.
    PASS: Verify single node graceful recovery.

    Change-Id: I699a376af5a95cc8dcc6ea5cc8266dc14fbacd09
    Closes-Bug: 1845344
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released
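
Illustration of the race described in the commit message above. This is a minimal sketch in Python, not the actual mtcAgent/hbsAgent code from the metal repository. The class, method and network names (Host, on_hbs_degrade, on_hbs_clear, on_mnfa_recovery, "mgmt", "cluster-host") are hypothetical and only model the ordering problem: if MNFA recovery resets the heartbeat control state before the per-network clear message arrives, the late clear is ignored and the host stays degraded unless recovery also clears the degrade condition.

```python
from dataclasses import dataclass, field

# Hypothetical network names, used only for this illustration.
NETWORKS = ("mgmt", "cluster-host")


@dataclass
class Host:
    """Toy model of the per-host state tracked by a maintenance agent."""
    name: str
    # Heartbeat control state per network: True while a degrade event
    # is believed to be active on that network.
    hbs_degrade: dict = field(
        default_factory=lambda: {net: False for net in NETWORKS})
    degraded: bool = False

    def on_hbs_degrade(self, network):
        """Heartbeat degrade event reported for one network."""
        self.hbs_degrade[network] = True
        self.degraded = True

    def on_hbs_clear(self, network):
        """Heartbeat clear event reported for one network."""
        if not self.hbs_degrade[network]:
            # Control state says nothing was active, so the clear is ignored.
            return
        self.hbs_degrade[network] = False
        if not any(self.hbs_degrade.values()):
            self.degraded = False

    def on_mnfa_recovery(self, clear_degrade):
        """MNFA exit: reset control states, optionally clear the degrade."""
        for network in self.hbs_degrade:
            self.hbs_degrade[network] = False
        if clear_degrade:
            self.degraded = False


def run(clear_degrade, clear_arrives_first):
    """Return True if the host is still degraded after both events."""
    host = Host("worker-0")
    host.on_hbs_degrade("mgmt")
    if clear_arrives_first:
        host.on_hbs_clear("mgmt")
        host.on_mnfa_recovery(clear_degrade)
    else:
        # Race: MNFA recovery is processed before the clear message.
        host.on_mnfa_recovery(clear_degrade)
        host.on_hbs_clear("mgmt")
    return host.degraded


if __name__ == "__main__":
    print("no clear option, late clear:", run(False, False))
    print("clear option,    late clear:", run(True, False))
    print("no clear option, early clear:", run(False, True))
```

In this model only the combination "recovery without the clear option" plus "clear message arrives late" leaves the host degraded, which is consistent with the intermittent behavior reported in this bug.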
