mtcAgent node failure mode handling: one-time-reset did not happen

Bug #2042571 reported by Eric MacDonald
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
Unlocked-enabled node failure recovery handling issues a one-time reset if that node's BMC is provisioned. Node failures caused by persistent loss of management and cluster-host network connectivity can leave the host isolated.

This case is normally handled if the inventory profile for that host has the BMC provisioned properly and maintenance has verified and established connectivity to that BMC.

However, the one-time reset sometimes does not occur in cases where the active controller spontaneously reboots, causing Service Management (SM) to swact to the standby controller. When SM starts mtcAgent on the newly activated standby controller, establishing access to all the provisioned nodes' BMCs takes some time.

If maintenance detects heartbeat loss against the failed controller BEFORE it has established connectivity with that controller's BMC, the one-time reset is skipped. Severe node isolation cases where the reset is skipped may not get recovered.
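
For illustration only, a minimal sketch of the skip condition described above, using hypothetical field names (bmc_provisioned, bmc_accessible, heartbeat_failed); this is not the actual mtcAgent code, only the shape of the pre-fix behaviour.

#include <cstdio>

// Hypothetical node state; the field names are illustrative, not the real
// mtcAgent data structures.
struct NodeState
{
    bool bmc_provisioned;   // BMC is configured in the host's inventory profile
    bool bmc_accessible;    // maintenance has verified connectivity to that BMC
    bool heartbeat_failed;  // management/cluster-host heartbeat loss detected
};

// Pre-fix behaviour: if heartbeat loss is detected before BMC access has been
// established (e.g. right after a swact), the one-time reset is simply skipped.
void handle_failed_node(const NodeState &node)
{
    if (node.heartbeat_failed && node.bmc_provisioned)
    {
        if (node.bmc_accessible)
            printf("issuing one-time BMC reset\n");
        else
            printf("BMC not yet accessible ; one-time reset skipped\n"); // the race
    }
}

int main()
{
    // Newly activated controller: heartbeat loss is seen before BMC access is ready.
    handle_failed_node({true, false, true});
    return 0;
}

The fix described in the merged review below replaces this unconditional skip with a delayed, cancellable reset.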

Severity
--------
Major: Failed node is not recovered in rare (but possible) node failure modes. See description above.

Steps to Reproduce
------------------
Provision the BMC for all hosts
Force down all the links on the active controller

Expected Behavior
------------------
Maintenance is able to recover the node via BMC reset

Actual Behavior
----------------
Sometimes the BMC reset is skipped and the node is not recovered until after the 20-minute graceful recovery timeout. The reset is retried and succeeds after 20 minutes, but this extends the controller outage longer than it should.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
Any build prior to the closing of this issue.

Last Pass
---------
Unknown

Timestamp/Logs
--------------
bmc not accessible

Test Activity
-------------
Normal Use

Workaround
----------
None

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/899971

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/899971
Committed: https://opendev.org/starlingx/metal/commit/79d8644b1e070c464a2ea19d40fc7fe9ae56df9e
Submitter: "Zuul (22348)"
Branch: master

commit 79d8644b1e070c464a2ea19d40fc7fe9ae56df9e
Author: Eric MacDonald <email address hidden>
Date: Thu Nov 2 15:12:21 2023 +0000

    Add bmc reset delay in the reset progression command handler

    This update solves two issues involving bmc reset.

    Issue #1: A race condition can occur if the mtcAgent finds an
              unlocked-disabled or heartbeat failing node early in
              its startup sequence, say over a swact or an SM service
              restart, and needs to issue a one-time-reset. If at that
              point it has not yet established access to the BMC then
              the one-time-reset request is skipped.

    Issue #2: When the issue #1 race condition does not occur, i.e.
              BMC access is established in time, the mtcAgent will
              issue its one-time reset to the node. If the node failed
              as a result of a crash then this one-time reset can
              interrupt the collection of the vmcore crashdump file.

    This update solves both of these issues by introducing a bmc reset
    delay following the detection and in the handling of a failed node
    that 'may' need to be reset to recover from being network isolated.

    The delay prevents the crashdump from being interrupted and removes
    the race condition by giving maintenance more time to establish bmc
    access required to send the reset command.

    To handle significantly long bmc reset delay values this update
    cancels the posted 'in waiting' reset if the target recovers online
    before the delay expires.

    It is recommended to use a bmc reset delay that is longer than a
    typical node reboot time. This is so that in the typical case, where
    there is no crashdump happening, we don't reset the node late in its
    almost-done recovery. The number of seconds remaining until the
    pending reset is logged periodically as a countdown.
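
As a rough sketch of the delay-and-cancel behaviour this commit describes, assuming a one-second tick model and hypothetical names (PendingReset, post_reset, tick); the 300 second value matches the 5 minute delay discussed below, and none of this is the actual mtcAgent implementation.

#include <cstdio>

// 300 seconds corresponds to the 5-minute delay mentioned in the commit text;
// the names and structure here are assumptions for illustration only.
constexpr int BMC_RESET_DELAY_SECS = 300;

struct PendingReset
{
    bool posted = false;
    int  countdown = 0;
};

// Post a delayed reset when a failed node 'may' need one to recover from isolation.
void post_reset(PendingReset &pr)
{
    pr.posted = true;
    pr.countdown = BMC_RESET_DELAY_SECS;
}

// Called on every one-second tick for the failed node.
void tick(PendingReset &pr, bool node_online)
{
    if (!pr.posted)
        return;

    if (node_online)
    {
        // The node recovered before the delay expired, so the reset is not needed.
        printf("node recovered online ; cancelling pending reset\n");
        pr.posted = false;
        return;
    }

    if (--pr.countdown > 0)
    {
        // Periodically log the seconds remaining before the pending reset.
        if (pr.countdown % 60 == 0)
            printf("pending BMC reset in %d seconds\n", pr.countdown);
        return;
    }

    // Delay expired: the crashdump (if any) should be complete and BMC access
    // should be established by now.
    printf("issuing BMC reset\n");
    pr.posted = false;
}

int main()
{
    PendingReset pr;
    post_reset(pr);
    for (int sec = 0; sec < BMC_RESET_DELAY_SECS; ++sec)
        tick(pr, /*node_online=*/false);
    return 0;
}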

    It can take upwards of 2-3 minutes for a crashdump to complete.
    To avoid the double reboot, in the typical case, the bmc reset delay
    is set to 5 minutes which is longer than a typical boot time.
    This means that if the node recovers online before the delay expires
    then great, the reset wasn't needed and is cancelled.

    However, if the node is truly isolated or the shutdown sequence
    hangs, then although the recovery is delayed a bit to accommodate
    the crashdump case, the node is still recovered after the bmc reset
    delay period. This could lead to a double reboot if the node's
    recovery-to-online time is longer than the bmc reset delay.

    This update implements this change by adding a new 'reset send wait'
    phase to the existing reset progression command handler.
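
A rough sketch of how such a 'reset send wait' phase could slot into a reset progression state machine, with hypothetical phase names (ResetSendWait, ResetSend); the real handler's phases and structure differ, this only illustrates the added wait step.

#include <cstdio>

// Hypothetical phase list; the actual reset progression command handler uses
// different names, this only shows where a 'reset send wait' phase fits in.
enum class ResetPhase
{
    Start,
    ResetSendWait,  // new phase: wait out the bmc reset delay, cancel if the node recovers
    ResetSend,
    Done
};

// One step of the handler; returns the next phase.
ResetPhase step(ResetPhase phase, bool delay_expired, bool node_online)
{
    switch (phase)
    {
        case ResetPhase::Start:
            return ResetPhase::ResetSendWait;

        case ResetPhase::ResetSendWait:
            if (node_online)
                return ResetPhase::Done;       // cancel the posted 'in waiting' reset
            if (delay_expired)
                return ResetPhase::ResetSend;  // proceed with the reset
            return ResetPhase::ResetSendWait;  // keep waiting

        case ResetPhase::ResetSend:
            printf("sending BMC reset\n");
            return ResetPhase::Done;

        default:
            return ResetPhase::Done;
    }
}

int main()
{
    ResetPhase p = ResetPhase::Start;
    p = step(p, false, false);  // Start -> ResetSendWait
    p = step(p, false, false);  // delay not expired, node not online: keep waiting
    p = step(p, true, false);   // delay expired: ResetSendWait -> ResetSend
    p = step(p, false, false);  // ResetSend issues the reset and completes
    return 0;
}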

    Some consistency-driven logging improvements were also implemented.

    Test Plan:

    PASS: Verify failed node crashdump is not interrupted by bmc reset.
    PASS: Verify bmc is accessible after the bmc reset delay.
    PASS: ...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)