mtcAgent node failure mode handling one-time-reset is interrupting crashdump handling
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Eric MacDonald | |
Bug Description
Brief Description
-----------------
Failure recovery handling for an unlocked-enabled node issues a one-time reset if that node's BMC is provisioned. When the node failure is due to a sysrq crashdump, this reset has been seen to interrupt crashdump handling, so no crashdump is produced.
Crashdumps are important to collect as they provide event cause debug information.
This issue requests a change to hold off the BMC reset request long enough to allow the crashdump handler to complete. Crashdump handling can take 3 minutes or longer on some servers.
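The requested hold-off can be sketched as a small piece of timer logic. This is only an illustration of the idea, not the mtcAgent implementation: the class name `NodeRecovery`, its fields, and the 300 second value of `CRASHDUMP_HOLDOFF_SECS` are all hypothetical and chosen only to exceed the "3 minutes or longer" noted above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical hold-off value; the report only states that crashdump
# handling can take 3 minutes or longer on some servers.
CRASHDUMP_HOLDOFF_SECS = 300.0


@dataclass
class NodeRecovery:
    """Hypothetical per-node recovery state (not the real mtcAgent type)."""
    bmc_provisioned: bool
    failed_at: Optional[float] = None  # time the failure was first detected
    reset_sent: bool = False           # the reset is issued at most once

    def on_failure(self, now: float) -> None:
        # Record only the first detection so the hold-off is not restarted.
        if self.failed_at is None:
            self.failed_at = now

    def on_recovered(self) -> None:
        # Node came back (for example after the crashdump reboot); re-arm.
        self.failed_at = None
        self.reset_sent = False

    def should_reset(self, now: float,
                     holdoff: float = CRASHDUMP_HOLDOFF_SECS) -> bool:
        """Return True exactly once, after the hold-off has elapsed."""
        if not self.bmc_provisioned or self.failed_at is None or self.reset_sent:
            return False
        if now - self.failed_at < holdoff:
            return False  # still inside the crashdump window; keep waiting
        self.reset_sent = True
        return True
```

A caller would poll `should_reset()` from its periodic failure handler and send the BMC reset only when it returns True. If the node recovers on its own before the hold-off expires, `on_recovered()` cancels the pending reset.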
Severity
--------
Major: System is usable. However, a node crash does not always produce a crashdump report.
Steps to Reproduce
------------------
Cause a crashdump by stopping the hostwd or pmond process.
Expected Behavior
------------------
Node reboots in 3 minutes, producing a crashdump in /var/log/crash over the reboot.
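One way to check this expectation after the node comes back is to look for entries in the crash directory that are newer than the time the failure was triggered. The helper below is a hedged sketch: the function name `new_crash_entries` is hypothetical, the directory is passed in by the caller (assumed here to be /var/log/crash), and no assumption is made about the names or format of the dump files.

```python
import os
from typing import List


def new_crash_entries(crash_dir: str, since_epoch: float) -> List[str]:
    """List entries in crash_dir modified at or after since_epoch.

    Hypothetical helper for test verification; it does not inspect the
    contents of the dumps, only their modification times.
    """
    try:
        names = os.listdir(crash_dir)
    except FileNotFoundError:
        return []  # no crash directory means no crashdump was written

    fresh = []
    for name in sorted(names):
        path = os.path.join(crash_dir, name)
        if os.path.getmtime(path) >= since_epoch:
            fresh.append(name)
    return fresh
```

An empty result for a timestamp taken just before stopping hostwd or pmond would correspond to the "no crashdump" case described under Actual Behavior.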
Actual Behavior
----------------
Sometimes no crashdump is produced.
Reproducibility
---------------
Intermittent
System Configuration
-------
Multi node systems
Branch/Pull Time/Commit
-------
Any prior to closing date of this issue
Last Pass
---------
Unknown
Timestamp/Logs
--------------
2023-09-
Test Activity
-------------
Developer Testing
Workaround
----------
Deprovision BMC
Changed in starlingx: | |
importance: | Undecided → Medium |
tags: | added: stx.9.0 stx.metal |
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/899971