mtcAgent node failure mode handling one-time-reset is interrupting crashdump handling

Bug #2042567 reported by Eric MacDonald
This bug affects 1 person
Affects:     StarlingX
Status:      Fix Released
Importance:  Medium
Assigned to: Eric MacDonald
Milestone:   none

Bug Description

Brief Description
-----------------

Unlocked-enabled node failure recovery handling issues a one-time reset if the failed node's BMC is provisioned. In cases where the node failure is due to a sysrq crashdump, this reset has been seen to interrupt crashdump handling, which leads to no crashdump being produced.

Crashdumps are important to collect as they provide debug information about the cause of the failure event.

This issue requests a change to hold off the BMC reset request long enough to allow the crashdump handler to complete. Crashdump handling can take upwards of 3 minutes or longer on some servers.
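
Such a hold-off could look roughly like the following C++ sketch. The NodeInfo type, handle_node_failure function, and BMC_RESET_DELAY constant are illustrative assumptions, not the actual mtcAgent code:

    #include <chrono>

    // Hypothetical node record; the real mtcAgent keeps far more state.
    struct NodeInfo
    {
        bool bmc_provisioned = false;
        bool reset_pending   = false;
        std::chrono::steady_clock::time_point reset_due;
    };

    // Crashdump handling can take upwards of 3 minutes on some servers,
    // so the hold-off must be at least that long.
    constexpr auto BMC_RESET_DELAY = std::chrono::minutes(5);

    void handle_node_failure(NodeInfo& node)
    {
        if (node.bmc_provisioned)
        {
            // Post the reset rather than sending it immediately; the
            // pending reset is serviced (or cancelled) later.
            node.reset_pending = true;
            node.reset_due = std::chrono::steady_clock::now() + BMC_RESET_DELAY;
        }
    }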

Severity
--------
Major: The system remains completely usable; however, a crash will not always produce a crashdump report.

Steps to Reproduce
------------------
Cause a crashdump by stopping the hostwd or pmond processes.

Expected Behavior
------------------
Node reboots within 3 minutes, producing a crashdump in /var/log/crash over the reboot.

Actual Behavior
----------------
Sometimes there is no crashdump.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi node systems

Branch/Pull Time/Commit
-----------------------
Any prior to closing date of this issue

Last Pass
---------
Unknown

Timestamp/Logs
--------------
2023-09-16T15:03:19.326 [1929239.05379] controller-1 mtcAgent |-| mtcCmdHdlr.cpp ( 476) cmd_handler : Info : controller-0 Performing RESET over Board Management Interface

Test Activity
-------------
Developer Testing

Workaround
----------
Deprovision BMC

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/899971

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/899971
Committed: https://opendev.org/starlingx/metal/commit/79d8644b1e070c464a2ea19d40fc7fe9ae56df9e
Submitter: "Zuul (22348)"
Branch: master

commit 79d8644b1e070c464a2ea19d40fc7fe9ae56df9e
Author: Eric MacDonald <email address hidden>
Date: Thu Nov 2 15:12:21 2023 +0000

    Add bmc reset delay in the reset progression command handler

    This update solves two issues involving bmc reset.

    Issue #1: A race condition can occur if the mtcAgent finds an
              unlocked-disabled or heartbeat failing node early in
              its startup sequence, say over a swact or an SM service
              restart and needs to issue a one-time-reset. If at that
              point it has not yet established access to the BMC then
              the one-time-reset request is skipped.

    Issue #2: When the issue #1 race condition does not occur before
              BMC access is established, the mtcAgent will issue its
              one-time reset to the node. If this occurs as a result
              of a crashdump, then the reset can interrupt the
              collection of the vmcore crashdump file.

    This update solves both of these issues by introducing a bmc reset
    delay following the detection and in the handling of a failed node
    that 'may' need to be reset to recover from being network isolated.

    The delay prevents the crashdump from being interrupted and removes
    the race condition by giving maintenance more time to establish bmc
    access required to send the reset command.

    To handle significantly long bmc reset delay values this update
    cancels the posted 'in waiting' reset if the target recovers online
    before the delay expires.
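
    A minimal sketch of this cancel-or-send behaviour, continuing the
    hypothetical NodeInfo example above (illustrative names, not the
    actual mtcAgent code):

        // Issues the one-time reset over the board management
        // interface; stubbed here for the sketch.
        void send_bmc_reset(NodeInfo&) { /* ... */ }

        // Called on each pass of the failed-node handler.
        void service_pending_reset(NodeInfo& node, bool node_is_online)
        {
            if (!node.reset_pending)
                return;

            if (node_is_online)
            {
                // Node recovered before the delay expired: cancel the
                // posted 'in waiting' reset; no BMC reset is needed.
                node.reset_pending = false;
                return;
            }

            if (std::chrono::steady_clock::now() >= node.reset_due)
            {
                node.reset_pending = false;
                send_bmc_reset(node);
            }
        }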

    It is recommended to use a bmc reset delay that is longer than a
    typical node reboot time. This is so that in the typical case, where
    there is no crashdump happening, we don't reset the node late in its
    almost-done recovery. The number of seconds until the pending reset
    is issued is logged periodically as a countdown.
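
    The periodic countdown log could be produced along these lines
    (again a sketch with hypothetical names, reusing NodeInfo from
    the example above):

        #include <cstdio>

        // Log the time remaining before the posted reset is sent.
        void log_reset_countdown(const NodeInfo& node)
        {
            using namespace std::chrono;
            if (!node.reset_pending)
                return;
            auto remaining =
                duration_cast<seconds>(node.reset_due - steady_clock::now());
            if (remaining.count() > 0)
                std::printf("pending BMC reset in %lld secs\n",
                            static_cast<long long>(remaining.count()));
        }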

    It can take upwards of 2-3 minutes for a crashdump to complete.
    To avoid the double reboot, in the typical case, the bmc reset delay
    is set to 5 minutes, which is longer than a typical boot time.
    This means that if the node recovers online before the delay
    expires, the reset was not needed and is cancelled.

    However, if the node is truly isolated or the shutdown sequence
    hangs, then although the recovery is delayed a bit to accommodate
    the crashdump case, the node is still recovered after the bmc reset
    delay period. This could lead to a double reboot if the node
    recovery-to-online time is longer than the bmc reset delay.

    This update implements this change by adding a new 'reset send wait'
    phase to the existing reset progression command handler.
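
    Such a phase could slot into a progression enum roughly as follows
    (sketch only; the real handler phases live in mtcCmdHdlr.cpp and
    these names are assumptions):

        // Hypothetical reset progression phases; the new phase defers
        // the send until the delay expires or the node recovers.
        enum class ResetProg
        {
            START,
            RESET_SEND_WAIT,  // new: hold off while the reset delay runs
            RESET_SEND,       // issue the reset over the BMC
            OFFLINE_WAIT,     // confirm the node went offline
            DONE
        };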

    Some consistency driven logging improvements were also implemented.

    Test Plan:

    PASS: Verify failed node crashdump is not interrupted by bmc reset.
    PASS: Verify bmc is accessible after the bmc reset delay.
    PASS: ...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
