Worker node is stuck in degraded yet there is no alarm

Bug #1845344 reported by Tee Ngo
This bug affects 1 person
Affects     Status         Importance   Assigned to      Milestone
StarlingX   Fix Released   Low          Eric MacDonald

Bug Description

Brief Description
-----------------
While the system was idle, a worker node became degraded and stayed in this state for a while without any alarms.

Severity
--------
Major

Steps to Reproduce
------------------
Soak a standard system with a large number of app pods

Note: this system exhibits periodic activation of MNFA (https://bugs.launchpad.net/starlingx/+bug/1845342)

Expected Behavior
------------------
Worker node should regain "available" state as soon as its heartbeat becomes regular.

Actual Behavior
----------------
Most nodes show the expected behavior, but occasionally a node gets stuck in the degraded state until the next MNFA activation.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
IPv6 standard system

Branch/Pull Time/Commit
-----------------------
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-09-17_20-00-00"
SRC_BUILD_ID="15"

Last Pass
---------
N/A - this appears to be a race condition

Timestamp/Logs
--------------
See logs attached

Test Activity
-------------
System Test

Tee Ngo (teewrs) wrote :
Ghada Khalil (gkhalil) wrote :

minor (missing alarm), but would be nice to fix if not risky

summary: - Worker node is stucked in degraded yet there is no alarm
+ Worker node is stuck in degraded yet there is no alarm
tags: added: stx.fm stx.metal
tags: added: stx.fault
removed: stx.fm
Changed in starlingx:
status: New → Triaged
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Low
Eric MacDonald (rocksolidmtce) wrote :

Issue is understood (day 1 race condition relating to MNFA).
Fix is simple, implemented and soaking in large lab.
Issue has not been seen since test fix applied to lab.

Fix will be posted for review after 12 hrs of soak time.

Eric MacDonald (rocksolidmtce) wrote :

Issue is not missing alarm.
Issue is stuck degrade.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/685577

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/685577
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=f01fd854702d0c8654b9e441315551b28a0a66bc
Submitter: Zuul
Branch: master

commit f01fd854702d0c8654b9e441315551b28a0a66bc
Author: Eric MacDonald <email address hidden>
Date: Sat Sep 28 19:41:53 2019 -0400

    Fix MNFA recovery race condition that leads to stuck degrade

    Seeing from 0 to 10% of hosts get stuck in the degrade state
    after MNFA recovery.

    Clearing host degrade on Multi-Node Failure Avoidance (MNFA)
    recovery does not send degrade clear but does clear the hbs
    control states. Instead, it relies on explicit events from
    hbsAgent per host/network to do so.

    If MNFA Recovery (exit) event occurs before all hbsAgent
    clear messages arrive then the hbs control clear tricks
    the mtcAgent into thinking that there was no degrade event
    active when it actually may still be.

    This fix enables the clear option in the mon_host MNFA Recovery
    call so that the host's degrade condition is cleared.
    It also removes the unnecessary heartbeat disable call.

    Test Plan:

    PASS: soak MNFA in large system over and over to verify
          a 0-10% stuck degrade occurrence rate drops to 0
          after many (more than 20) occurrences.

    Regression:

    PASS: Verify heartbeat.
    PASS: Verify single node graceful recovery.

    Change-Id: I699a376af5a95cc8dcc6ea5cc8266dc14fbacd09
    Closes-Bug: 1845344
    Signed-off-by: Eric MacDonald <email address hidden>
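
The ordering problem described in the commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the mtcAgent code: the Host class, the handler names, the clear_degrade flag and the network names "mgmt" and "cluster-host" are all hypothetical stand-ins chosen for illustration. It assumes a per-network clear from hbsAgent is acted on only while a degrade is still tracked for that network, which is how the commit message describes the mtcAgent being "tricked".

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy stand-in for a host record; not the real mtcAgent structure."""
    name: str
    degraded: bool = False
    # Networks with a tracked heartbeat degrade ("hbs control state" here).
    hbs_degrade: set = field(default_factory=set)


def on_hbs_degrade(host, network):
    """Per-network degrade event from the heartbeat agent."""
    host.hbs_degrade.add(network)
    host.degraded = True


def on_hbs_clear(host, network):
    """Per-network clear event; ignored if no degrade is tracked."""
    if network not in host.hbs_degrade:
        return  # looks like "no degrade was active", so nothing is cleared
    host.hbs_degrade.discard(network)
    if not host.hbs_degrade:
        host.degraded = False


def on_mnfa_exit(host, clear_degrade):
    """MNFA recovery: always drops the tracking state.

    clear_degrade=False models the behavior before the fix,
    clear_degrade=True models the behavior after it.
    """
    host.hbs_degrade.clear()
    if clear_degrade:
        host.degraded = False


def replay(clear_degrade):
    """Replay the racy ordering and report whether the host stays degraded."""
    host = Host("worker-0")
    on_hbs_degrade(host, "mgmt")
    on_hbs_degrade(host, "cluster-host")
    on_hbs_clear(host, "mgmt")           # first clear arrives in time
    on_mnfa_exit(host, clear_degrade)    # MNFA exit wins the race
    on_hbs_clear(host, "cluster-host")   # late clear is now a no-op
    return host.degraded


print("before fix, stuck degraded:", replay(clear_degrade=False))
print("after fix, stuck degraded: ", replay(clear_degrade=True))
```

In this model the host stays degraded only when the MNFA exit arrives before the last per-network clear and does not itself clear the degrade, which is consistent with the intermittent rate reported in the test plan.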

Changed in starlingx:
status: In Progress → Fix Released