mtcAgent node failure mode handling: one-time-reset did not happen

Bug #2042571 reported by Eric MacDonald
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
Unlocked-enabled node failure recovery handling issues a one-time reset if that node's BMC is provisioned. Node failures caused by persistent loss of management and cluster-host network connectivity can leave the host isolated.

This case is normally handled if the inventory profile for that host has the BMC provisioned properly and maintenance has verified and established connectivity to that BMC.

However, the one-time reset sometimes does not occur in cases where the active controller spontaneously reboots, causing Service Management (SM) to swact to the standby controller. When SM starts mtcAgent on the newly activated standby controller, establishing access to all the provisioned nodes' BMCs takes some time.

If maintenance detects heartbeat loss against the failed controller BEFORE it has established connectivity with that controller's BMC, the one-time reset is skipped. Severe node isolation cases where the reset is skipped may not get recovered.
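
For illustration only, a minimal sketch of the skip condition described above, using hypothetical field names (bmc_provisioned, bmc_accessible, heartbeat_failed); this is not the actual mtcAgent code, only the shape of the pre-fix behaviour.

#include <cstdio>

// Hypothetical node state; the field names are illustrative, not the real
// mtcAgent data structures.
struct NodeState
{
    bool bmc_provisioned;   // BMC is configured in the host's inventory profile
    bool bmc_accessible;    // maintenance has verified connectivity to that BMC
    bool heartbeat_failed;  // management/cluster-host heartbeat loss detected
};

// Pre-fix behaviour: if heartbeat loss is detected before BMC access has been
// established (e.g. right after a swact), the one-time reset is simply skipped.
void handle_failed_node(const NodeState &node)
{
    if (node.heartbeat_failed && node.bmc_provisioned)
    {
        if (node.bmc_accessible)
            printf("issuing one-time BMC reset\n");
        else
            printf("BMC not yet accessible ; one-time reset skipped\n"); // the race
    }
}

int main()
{
    // Newly activated controller: heartbeat loss is seen before BMC access is ready.
    handle_failed_node({true, false, true});
    return 0;
}

The fix described in the merged review below replaces this unconditional skip with a delayed, cancellable reset.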

Severity
--------
Major: Failed node is not recovered in rare (but possible) node failure modes. See description above.

Steps to Reproduce
------------------
Provision the BMC for all hosts
Force down all the links on the active controller

Expected Behavior
------------------
Maintenance is able to recover the node via BMC reset

Actual Behavior
----------------
Sometimes the BMC reset is skipped and the node is not recovered until after the 20-minute graceful recovery timeout. The reset is retried and succeeds after 20 minutes, but this extends the controller outage longer than it should.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
Any build prior to the closing of this issue.

Last Pass
---------
Unknown

Timestamp/Logs
--------------
bmc not accessible

Test Activity
-------------
Normal Use

Workaround
----------
None

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/899971

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/899971
Committed: https://opendev.org/starlingx/metal/commit/79d8644b1e070c464a2ea19d40fc7fe9ae56df9e
Submitter: "Zuul (22348)"
Branch: master

commit 79d8644b1e070c464a2ea19d40fc7fe9ae56df9e
Author: Eric MacDonald <email address hidden>
Date: Thu Nov 2 15:12:21 2023 +0000

    Add bmc reset delay in the reset progression command handler

    This update solves two issues involving bmc reset.

    Issue #1: A race condition can occur if the mtcAgent finds an
              unlocked-disabled or heartbeat failing node early in
              its startup sequence, say over a swact or an SM service
              restart, and needs to issue a one-time-reset. If at that
              point it has not yet established access to the BMC then
              the one-time-reset request is skipped.

    Issue #2: When the issue #1 race condition does not occur, i.e.
              BMC access is established in time, the mtcAgent will
              issue its one-time reset to the node. If the node failed
              as a result of a crash then this one-time reset can
              interrupt the collection of the vmcore crashdump file.

    This update solves both of these issues by introducing a bmc reset
    delay following the detection and in the handling of a failed node
    that 'may' need to be reset to recover from being network isolated.

    The delay prevents the crashdump from being interrupted and removes
    the race condition by giving maintenance more time to establish bmc
    access required to send the reset command.

    To handle significantly long bmc reset delay values this update
    cancels the posted 'in waiting' reset if the target recovers online
    before the delay expires.

    It is recommended to use a bmc reset delay that is longer than a
    typical node reboot time. This is so that in the typical case, where
    there is no crashdump happening, we don't reset the node late in its
    almost-done recovery. The number of seconds remaining until the
    pending reset is logged periodically as a countdown.
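
As a rough sketch of the delay-and-cancel behaviour this commit describes, assuming a one-second tick model and hypothetical names (PendingReset, post_reset, tick); the 300 second value matches the 5 minute delay discussed below, and none of this is the actual mtcAgent implementation.

#include <cstdio>

// 300 seconds corresponds to the 5-minute delay mentioned in the commit text;
// the names and structure here are assumptions for illustration only.
constexpr int BMC_RESET_DELAY_SECS = 300;

struct PendingReset
{
    bool posted = false;
    int  countdown = 0;
};

// Post a delayed reset when a failed node 'may' need one to recover from isolation.
void post_reset(PendingReset &pr)
{
    pr.posted = true;
    pr.countdown = BMC_RESET_DELAY_SECS;
}

// Called on every one-second tick for the failed node.
void tick(PendingReset &pr, bool node_online)
{
    if (!pr.posted)
        return;

    if (node_online)
    {
        // The node recovered before the delay expired, so the reset is not needed.
        printf("node recovered online ; cancelling pending reset\n");
        pr.posted = false;
        return;
    }

    if (--pr.countdown > 0)
    {
        // Periodically log the seconds remaining before the pending reset.
        if (pr.countdown % 60 == 0)
            printf("pending BMC reset in %d seconds\n", pr.countdown);
        return;
    }

    // Delay expired: the crashdump (if any) should be complete and BMC access
    // should be established by now.
    printf("issuing BMC reset\n");
    pr.posted = false;
}

int main()
{
    PendingReset pr;
    post_reset(pr);
    for (int sec = 0; sec < BMC_RESET_DELAY_SECS; ++sec)
        tick(pr, /*node_online=*/false);
    return 0;
}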

    It can take upwards of 2-3 minutes for a crashdump to complete.
    To avoid the double reboot, in the typical case, the bmc reset delay
    is set to 5 minutes which is longer than a typical boot time.
    This means that if the node recovers online before the delay expires
    then great, the reset wasn't needed and is cancelled.

    However, if the node is truly isolated or the shutdown sequence
    hangs, then although the recovery is delayed a bit to accommodate
    the crashdump case, the node is still recovered after the bmc reset
    delay period. This could lead to a double reboot if the node's
    recovery-to-online time is longer than the bmc reset delay.

    This update implements this change by adding a new 'reset send wait'
    phase to the existing reset progression command handler.
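
A rough sketch of how such a 'reset send wait' phase could slot into a reset progression state machine, with hypothetical phase names (ResetSendWait, ResetSend); the real handler's phases and structure differ, this only illustrates the added wait step.

#include <cstdio>

// Hypothetical phase list; the actual reset progression command handler uses
// different names, this only shows where a 'reset send wait' phase fits in.
enum class ResetPhase
{
    Start,
    ResetSendWait,  // new phase: wait out the bmc reset delay, cancel if the node recovers
    ResetSend,
    Done
};

// One step of the handler; returns the next phase.
ResetPhase step(ResetPhase phase, bool delay_expired, bool node_online)
{
    switch (phase)
    {
        case ResetPhase::Start:
            return ResetPhase::ResetSendWait;

        case ResetPhase::ResetSendWait:
            if (node_online)
                return ResetPhase::Done;       // cancel the posted 'in waiting' reset
            if (delay_expired)
                return ResetPhase::ResetSend;  // proceed with the reset
            return ResetPhase::ResetSendWait;  // keep waiting

        case ResetPhase::ResetSend:
            printf("sending BMC reset\n");
            return ResetPhase::Done;

        default:
            return ResetPhase::Done;
    }
}

int main()
{
    ResetPhase p = ResetPhase::Start;
    p = step(p, false, false);  // Start -> ResetSendWait
    p = step(p, false, false);  // delay not expired, node not online: keep waiting
    p = step(p, true, false);   // delay expired: ResetSendWait -> ResetSend
    p = step(p, false, false);  // ResetSend issues the reset and completes
    return 0;
}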

    Some consistency-driven logging improvements were also implemented.

    Test Plan:

    PASS: Verify failed node crashdump is not interrupted by bmc reset.
    PASS: Verify bmc is accessible after the bmc reset delay.
    PASS: ...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)