mtcAgent node failure mode handling one-time-reset is interrupting crashdump handling

Bug #2042567 reported by Eric MacDonald
This bug affects 1 person
Affects:     StarlingX
Status:      Fix Released
Importance:  Medium
Assigned to: Eric MacDonald
Milestone:   none

Bug Description

Brief Description
-----------------

Unlocked-enabled node failure recovery handling issues a one-time reset if the failed node's BMC is provisioned. In cases where the node failure is due to a sysrq crashdump, this reset has been seen to interrupt crashdump handling, which leads to no crashdump being produced.

Crashdumps are important to collect as they provide debug information about the cause of the failure event.

This issue requests a change to hold off the BMC reset request long enough to allow the crashdump handler to complete. Crashdump handling can take upwards of 3 minutes or longer on some servers.
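
Such a hold-off could look roughly like the following C++ sketch. The NodeInfo type, handle_node_failure function, and BMC_RESET_DELAY constant are illustrative assumptions, not the actual mtcAgent code:

    #include <chrono>

    // Hypothetical node record; the real mtcAgent keeps far more state.
    struct NodeInfo
    {
        bool bmc_provisioned = false;
        bool reset_pending   = false;
        std::chrono::steady_clock::time_point reset_due;
    };

    // Crashdump handling can take upwards of 3 minutes on some servers,
    // so the hold-off must be at least that long.
    constexpr auto BMC_RESET_DELAY = std::chrono::minutes(5);

    void handle_node_failure(NodeInfo& node)
    {
        if (node.bmc_provisioned)
        {
            // Post the reset rather than sending it immediately; the
            // pending reset is serviced (or cancelled) later.
            node.reset_pending = true;
            node.reset_due = std::chrono::steady_clock::now() + BMC_RESET_DELAY;
        }
    }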

Severity
--------
Major: The system remains completely usable; however, a crash will not always produce a crashdump report.

Steps to Reproduce
------------------
Cause a crashdump by stopping the hostwd or pmond processes.

Expected Behavior
------------------
Node reboots within 3 minutes, producing a crashdump in /var/log/crash over the reboot.

Actual Behavior
----------------
Sometimes there is no crashdump.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi node systems

Branch/Pull Time/Commit
-----------------------
Any prior to closing date of this issue

Last Pass
---------
Unknown

Timestamp/Logs
--------------
2023-09-16T15:03:19.326 [1929239.05379] controller-1 mtcAgent |-| mtcCmdHdlr.cpp ( 476) cmd_handler : Info : controller-0 Performing RESET over Board Management Interface

Test Activity
-------------
Developer Testing

Workaround
----------
Deprovision BMC

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/899971

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/899971
Committed: https://opendev.org/starlingx/metal/commit/79d8644b1e070c464a2ea19d40fc7fe9ae56df9e
Submitter: "Zuul (22348)"
Branch: master

commit 79d8644b1e070c464a2ea19d40fc7fe9ae56df9e
Author: Eric MacDonald <email address hidden>
Date: Thu Nov 2 15:12:21 2023 +0000

    Add bmc reset delay in the reset progression command handler

    This update solves two issues involving bmc reset.

    Issue #1: A race condition can occur if the mtcAgent finds an
              unlocked-disabled or heartbeat failing node early in
              its startup sequence, say over a swact or an SM service
              restart and needs to issue a one-time-reset. If at that
              point it has not yet established access to the BMC then
              the one-time-reset request is skipped.

    Issue #2: When the issue #1 race condition does not occur before
              BMC access is established, the mtcAgent will issue its
              one-time reset to the node. If this occurs as a result
              of a crashdump, then the reset can interrupt the
              collection of the vmcore crashdump file.

    This update solves both of these issues by introducing a bmc reset
    delay following the detection and in the handling of a failed node
    that 'may' need to be reset to recover from being network isolated.

    The delay prevents the crashdump from being interrupted and removes
    the race condition by giving maintenance more time to establish bmc
    access required to send the reset command.

    To handle significantly long bmc reset delay values this update
    cancels the posted 'in waiting' reset if the target recovers online
    before the delay expires.
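
    A minimal sketch of this cancel-or-send behaviour, continuing the
    hypothetical NodeInfo example above (illustrative names, not the
    actual mtcAgent code):

        // Issues the one-time reset over the board management
        // interface; stubbed here for the sketch.
        void send_bmc_reset(NodeInfo&) { /* ... */ }

        // Called on each pass of the failed-node handler.
        void service_pending_reset(NodeInfo& node, bool node_is_online)
        {
            if (!node.reset_pending)
                return;

            if (node_is_online)
            {
                // Node recovered before the delay expired: cancel the
                // posted 'in waiting' reset; no BMC reset is needed.
                node.reset_pending = false;
                return;
            }

            if (std::chrono::steady_clock::now() >= node.reset_due)
            {
                node.reset_pending = false;
                send_bmc_reset(node);
            }
        }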

    It is recommended to use a bmc reset delay that is longer than a
    typical node reboot time. This is so that in the typical case, where
    there is no crashdump happening, we don't reset the node late in its
    almost-done recovery. The number of seconds until the pending reset
    is issued is logged periodically as a countdown.
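
    The periodic countdown log could be produced along these lines
    (again a sketch with hypothetical names, reusing NodeInfo from
    the example above):

        #include <cstdio>

        // Log the time remaining before the posted reset is sent.
        void log_reset_countdown(const NodeInfo& node)
        {
            using namespace std::chrono;
            if (!node.reset_pending)
                return;
            auto remaining =
                duration_cast<seconds>(node.reset_due - steady_clock::now());
            if (remaining.count() > 0)
                std::printf("pending BMC reset in %lld secs\n",
                            static_cast<long long>(remaining.count()));
        }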

    It can take upwards of 2-3 minutes for a crashdump to complete.
    To avoid the double reboot, in the typical case, the bmc reset delay
    is set to 5 minutes, which is longer than a typical boot time.
    This means that if the node recovers online before the delay
    expires, the reset was not needed and is cancelled.

    However, if the node is truly isolated or the shutdown sequence
    hangs, then although the recovery is delayed a bit to accommodate
    the crashdump case, the node is still recovered after the bmc reset
    delay period. This could lead to a double reboot if the node
    recovery-to-online time is longer than the bmc reset delay.

    This update implements this change by adding a new 'reset send wait'
    phase to the existing reset progression command handler.
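
    Such a phase could slot into a progression enum roughly as follows
    (sketch only; the real handler phases live in mtcCmdHdlr.cpp and
    these names are assumptions):

        // Hypothetical reset progression phases; the new phase defers
        // the send until the delay expires or the node recovers.
        enum class ResetProg
        {
            START,
            RESET_SEND_WAIT,  // new: hold off while the reset delay runs
            RESET_SEND,       // issue the reset over the BMC
            OFFLINE_WAIT,     // confirm the node went offline
            DONE
        };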

    Some consistency driven logging improvements were also implemented.

    Test Plan:

    PASS: Verify failed node crashdump is not interrupted by bmc reset.
    PASS: Verify bmc is accessible after the bmc reset delay.
    PASS: ...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
