Comment 2 for bug 2067917

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/921332
Committed: https://opendev.org/starlingx/metal/commit/1335bc484df331771e995ae822df3af84cc5739d
Submitter: "Zuul (22348)"
Branch: master

commit 1335bc484df331771e995ae822df3af84cc5739d
Author: Eric MacDonald <email address hidden>
Date: Tue Jun 4 19:42:54 2024 +0000

    Add auto run goenabled and start hosts services to mtcClient

    The 'mtcClient' currently automatically runs the main function's
    'goenabled' scripts on process startup for all nodes if and when
    their run preconditions are met.

    However, that is not true for 'start host services' and, in the AIO
    system type case, the subfunction 'goenabled' scripts.

    Typically, this is acceptable because the 'mtcAgent' will request
    these scripts to be run during unlock and failure recovery scenarios.

    However, if the system administrator reconfigures the maintenance
    heartbeat fault handling action from the default 'fail' to any other
    setting [degrade,alarm,none] and a node reboots outside of maintenance
    control, then upon reboot recovery, the 'start host services' and,
    if the node is an AIO controller, the required subfunction 'goenabled'
    scripts are not executed. In such a case, the missing subfunction
    'goenabled' flag file (/var/run/goenabled_subf) prevents the hbsAgent
    and hbsClient on that node from entering their in-service mode of
    operation. Instead, they run waiting for the node's In-Test phase to
    complete, which never happens.

    This can lead to what appear to be stuck maintenance heartbeat alarms.
    However, the real cause is that the maintenance heartbeat processes on
    that node are gated from performing their mission mode function.

    The /var/run/goenabled_subf flag file is the AIO In-Test complete gate.
    It is set if the subfunction 'goenabled' tests pass. However, because
    this flag file is in /var/run (a volatile directory) it is lost/cleared
    over a reboot.
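
The gating effect of the flag file can be illustrated with a minimal
sketch. It is not the hbsAgent/hbsClient implementation: the helper name
heartbeat_mode and the mode strings are hypothetical, only the flag path
comes from the text above, and only the subfunction gate is modeled.

```python
import os

# Path taken from the commit text; everything else here is illustrative.
GOENABLED_SUBF_FLAG = "/var/run/goenabled_subf"


def heartbeat_mode(is_aio: bool, subf_flag: str = GOENABLED_SUBF_FLAG) -> str:
    """Hypothetical helper: report the mode a heartbeat process would run in.

    On an AIO node the subfunction flag file acts as the In-Test complete
    gate. If it is missing (for example, cleared by a reboot because it
    lives in a volatile directory), the process stays in its In-Test phase.
    """
    if is_aio and not os.path.exists(subf_flag):
        return "in-test"
    return "in-service"
```

In this sketch, an AIO node whose flag file was lost over a reboot stays
in "in-test" until something re-creates the file.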

    This update adds the automatic execution of the AIO controller's
    subfunction 'goenabled' scripts and the 'start host services' for
    all nodes. Once all the required preconditions are met, the scripts
    are run and that node is ready for service, regardless of how, and
    under what conditions, it rebooted.
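
The startup behavior described above can be sketched roughly as follows.
This is an assumption-laden illustration, not the mtcClient code: the
NodeState type, the auto_run_on_startup function and the step names are
hypothetical, and the run preconditions are collapsed into one boolean.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeState:
    """Hypothetical view of a node as seen at mtcClient process startup."""
    is_aio_controller: bool
    preconditions_met: bool  # stand-in for the real run preconditions
    ran: List[str] = field(default_factory=list)


def auto_run_on_startup(node: NodeState) -> List[str]:
    """Record which steps would be run automatically, in order."""
    if not node.preconditions_met:
        return node.ran  # nothing runs until the preconditions are met
    node.ran.append("goenabled_main")
    if node.is_aio_controller:
        # AIO controllers also need the subfunction goenabled scripts.
        node.ran.append("goenabled_subf")
    # Start host services follows goenabled completion on every node type.
    node.ran.append("start_host_services")
    return node.ran
```

The point of the sketch is that no request from the mtcAgent is needed
for these steps to run after a reboot.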

    Testing of this update is focused on:
    - Verifying the originating issue is resolved.
    - Verifying the changed behavior over the install of all system types.
    - Verifying the changed behavior with an uncontrolled reboot of each
      node type for all the supported maintenance heartbeat failure
      action modes.

    Test Plan:

    PASS: Verify install of the following system types
    PASS: - AIO SX
    PASS: - AIO DX and AIO DX Plus
    PASS: - Standard DX with worker and storage nodes (vbox)
    PASS: - System Controller with 1 subcloud (dc-libvirt)

    PASS: Verify spontaneous reboot of unlocked active AIO controller with
    PASS: - heartbeat_failure_action=fail
    PASS: - heartbeat_failure_action=degrade
    PASS: - heartbeat_failure_action=alarm
    PASS: - heartbeat_failure_action=none

    PASS: Verify spontaneous reboot of unlocked standby AIO controller with
    PASS: - heartbeat_failure_action=fail
    PASS: - heartbeat_failure_action=degrade
    PASS: - heartbeat_failure_action=alarm
    PASS: - heartbeat_failure_action=none

    PASS: Verify reboot recovery after spontaneous reboot of worker
    PASS: Verify reboot recovery after spontaneous reboot of storage
    PASS: Verify start host services is run on mtcClient process startup.
    PASS: Verify start host services is run on worker and storage nodes
          when rebooted with all heartbeat failure recovery action modes.

    Regression:

    PASS: Verify degrade and alarm management over in-service heartbeat
          failure when heartbeat_failure_action=fail
    PASS: Verify degrade and alarm management over in-service heartbeat
          failure when heartbeat_failure_action=degrade
    PASS: Verify degrade and alarm management over in-service heartbeat
          failure when heartbeat_failure_action=alarm
    PASS: Verify no alarm or degrade over in-service heartbeat
          failure when heartbeat_failure_action=none
    PASS: Verify mtcClient over AIO standby controller lock/unlock
    PASS: Verify start host services is run on mtcClient on every node
          by command from mtcAgent process startup.
    PASS: Verify start host services is run on mtcClient over an unlock or
          graceful recovery by command from mtcAgent.
    PASS: Verify start host services check follows goenabled test
          completion on process startup.
    PASS: Verify stop host services is run over a node lock.
    PASS: Verify goenabled main and subfunction failure handling
    PASS: Verify start host services failure handling
    PASS: Verify no coredumps or crashdumps
    PASS: Verify no stuck alarms

    Closes-Bug: 2067917
    Change-Id: Ie8aaf5da20b092267f637ad3df125019c244991b
    Signed-off-by: Eric MacDonald <email address hidden>