Heartbeat always fails on nodes that reboot with reconfigured heartbeat action handling

Bug #2067917 reported by Eric MacDonald
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Eric MacDonald

Bug Description

Brief Description
-----------------

If the system administrator reconfigures the maintenance heartbeat fault handling action from the default 'fail' to any other setting [degrade,alarm,none] and a node reboots outside of maintenance control, then upon reboot recovery, the 'start host services' and, if the node is an AIO controller, the required subfunction 'goenabled' scripts are not executed. In such a case, the missing subfunction 'goenabled' flag file (/var/run/goenabled_subf) prevents the hbsAgent and hbsClient on that node from entering their in-service mode of operation. Instead, they keep waiting for the node's In-Test phase to complete, which never happens.

The /var/run/goenabled_subf flag file is the AIO In-Test complete gate. It is set if the subfunction 'goenabled' tests pass. However, because this flag file is in /var/run (a volatile directory) it is lost/cleared over a reboot.
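
The gate described above can be pictured with a small sketch. This is illustrative only and is not the hbsAgent/hbsClient implementation; the function names are hypothetical, and only the flag file path comes from this report. A scratch directory stands in for the volatile /var/run.

```python
import os
import tempfile

# Flag path taken from this report. The function names below are hypothetical.
GOENABLED_SUBF = "/var/run/goenabled_subf"


def subfunction_gate_open(flag_path=GOENABLED_SUBF):
    """Return True if the subfunction 'goenabled' flag file exists."""
    return os.path.exists(flag_path)


def simulate_reboot(run_dir):
    """Model a reboot by emptying a stand-in for the volatile /var/run."""
    for name in os.listdir(run_dir):
        os.remove(os.path.join(run_dir, name))


def demo():
    """Walk the gate through: no flag, tests pass, reboot."""
    with tempfile.TemporaryDirectory() as run_dir:
        flag = os.path.join(run_dir, "goenabled_subf")
        states = [subfunction_gate_open(flag)]      # before any tests have run
        open(flag, "w").close()                     # subfunction 'goenabled' tests pass
        states.append(subfunction_gate_open(flag))
        simulate_reboot(run_dir)                    # flag is lost along with /var/run
        states.append(subfunction_gate_open(flag))  # closed until the scripts rerun
        return states
```

Calling `demo()` returns `[False, True, False]`: the gate opens once the flag is written and closes again after the simulated reboot, which is the state the heartbeat processes remain in when nothing reruns the subfunction 'goenabled' scripts.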

Severity
--------
Major for customers that reconfigure maintenance heartbeat fault action handling.

Steps to Reproduce
------------------
system service-parameter-modify platform maintenance heartbeat_failure_action=alarm
system service-parameter-apply platform
Log in to the standby controller and reboot it

Expected Behavior
------------------
Node recovers in-service with heartbeat working

Actual Behavior
----------------
Node recovers but heartbeat is not working

Reproducibility
---------------
100% reproducible with heartbeat reconfigured to alarm, degrade or none

System Configuration
--------------------
AIO

Branch/Pull Time/Commit
-----------------------
All loads built prior to this issue being fixed
Loads prior to June 3, 2024

Last Pass
---------
Test escape

Timestamp/Logs
--------------

for the hbsClient

2024-05-28T22:26:04.967 [13537.00020] controller-x hbsClient --- daemon_files.cpp (1081) daemon_wait_for_file : Warn : Waiting for /var/run/goenabled_subf

or for the hbsAgent

2024-06-03T13:33:09.079 [2826.00008] localhost hbsAgent hbs hbsAgent.cpp (1706) daemon_service_run : Info : GOENABLE wait ...
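
The two signatures above can be searched for mechanically. The following is a minimal sketch, assuming the log lines are already available as an iterable of strings; the function name and the patterns are hypothetical helpers derived only from the two lines quoted in this report. A match shows that a heartbeat daemon was waiting on the gate at that moment, so a single hit during normal startup is not by itself proof of this bug.

```python
import re

# Patterns derived from the two log lines quoted in this report.
GATED_SIGNATURES = (
    re.compile(r"hbsClient .*Waiting for /var/run/goenabled_subf"),
    re.compile(r"hbsAgent .*GOENABLE wait"),
)


def find_gated_heartbeat_lines(lines):
    """Return the lines showing hbsClient or hbsAgent waiting on the gate."""
    return [
        line
        for line in lines
        if any(pattern.search(line) for pattern in GATED_SIGNATURES)
    ]
```

If the same signatures keep repeating long after the node has recovered, that is consistent with the In-Test phase never completing as described in this report.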

Test Activity
-------------
Normal use in lossy networking environment

Workaround
----------
Lock and unlock affected nodes

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/921332

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/921332
Committed: https://opendev.org/starlingx/metal/commit/1335bc484df331771e995ae822df3af84cc5739d
Submitter: "Zuul (22348)"
Branch: master

commit 1335bc484df331771e995ae822df3af84cc5739d
Author: Eric MacDonald <email address hidden>
Date: Tue Jun 4 19:42:54 2024 +0000

    Add auto run goenabled and start hosts services to mtcClient

    The 'mtcClient' currently automatically runs the main function's
    'goenabled' scripts on process startup for all nodes if and when
    their run preconditions are met.

    However, that is not true for 'start host services' and, in the AIO
    system type case, the subfunction 'goenabled' scripts.

    Typically, this is acceptable because the 'mtcAgent' will request
    these scripts to be run during unlock and failure recovery scenarios.

    However, if the system administrator reconfigures the maintenance
    heartbeat fault handling action from the default 'fail' to any other
    setting [degrade,alarm,none] and a node reboots outside of maintenance
    control, then upon reboot recovery, the 'start host services' and,
    if the node is an AIO controller, the required subfunction 'goenabled'
    scripts are not executed. In such a case, the missing subfunction
    'goenabled' flag file (/var/run/goenabled_subf) prevents the hbsAgent
    and hbsClient on that node from entering their in-service mode of
    operation. Instead, they keep waiting for the node's In-Test phase to
    complete, which never happens.

    This can lead to what appear to be stuck maintenance heartbeat alarms.
    However, it's really caused by the maintenance heartbeat processes on
    that node being gated from performing their mission mode function.

    The /var/run/goenabled_subf flag file is the AIO In-Test complete gate.
    It is set if the subfunction 'goenabled' tests pass. However, because
    this flag file is in /var/run (a volatile directory) it is lost/cleared
    over a reboot.

    This update adds the automatic execution of the AIO controller's
    subfunction 'goenabled' scripts and the 'start host services' for
    all nodes. Once all the required preconditions are met the scripts
    are run and that node is ready for service, regardless of how and
    under which conditions it rebooted.

    Testing of this update is focused on
    - Verifying the originating issue is resolved.
    - Verify the changed behavior over the install of all system types.
    - Verify the changed behavior with an uncontrolled reboot of each
      node type for all the supported maintenance heartbeat failure
      action modes.

    Test Plan:

    PASS: Verify install of the following system types
    PASS: - AIO SX
    PASS: - AIO DX and AIO DX Plus
    PASS: - Standard DX with worker and storage nodes (vbox)
    PASS: - System Controller with 1 subcloud (dc-libvirt)

    PASS: Verify spontaneous reboot of unlocked active AIO controller with
    PASS: - heartbeat_failure_action=fail
    PASS: - heartbeat_failure_action=degrade
    PASS: - heartbeat_failure_action=alarm
    PASS: - heartbeat_failure_action=none

 ...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)