Enable 'soft_panic' kernel config option

Bug #1898602 reported by Eric MacDonald
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Eric MacDonald

Bug Description

This issue requests enabling the 'soft_panic' option of the softdog kernel module for two reasons:

1. Assist in debug of system stalls that lead to a softdog (kernel watchdog) timeout.
   - a panic will show what the processor was doing during the stall.

2. To know that the softdog was the reset cause.
   - there is no kernel or dmesg log that reaches disk over a softdog reset event.

https://github.com/torvalds/linux/blob/master/drivers/watchdog/softdog.c

   static int soft_panic;
   module_param(soft_panic, int, 0);
   MODULE_PARM_DESC(soft_panic,
       "Softdog action, set to 1 to panic, 0 to reboot (default=0)");

Want 'soft_panic=1'
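
A minimal sketch of how the requested setting could be checked. A module parameter like this is normally set either with an "options softdog soft_panic=1" line in a modprobe.d file or with a "softdog.soft_panic=1" token on the kernel command line; which of the two StarlingX uses is not stated in this report. Because the parameter is declared with permission 0, it has no entry under /sys/module/softdog/parameters, so the sketch inspects configuration text rather than the running module. The function names are illustrative, not part of any existing tool.

```python
from typing import Optional


def soft_panic_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the soft_panic value found on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("softdog.soft_panic="):
            # A later token overrides an earlier one.
            value = token.split("=", 1)[1]
    return value


def soft_panic_from_modprobe(conf_text: str) -> Optional[str]:
    """Return the soft_panic value from modprobe.d style text, or None."""
    value = None
    for line in conf_text.splitlines():
        # Drop comments, then look for "options softdog ...".
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == "softdog":
            for opt in fields[2:]:
                if opt.startswith("soft_panic="):
                    value = opt.split("=", 1)[1]
    return value


if __name__ == "__main__":
    # Example input only; on a live system the text would come from /proc/cmdline.
    print(soft_panic_from_cmdline("ro quiet softdog.soft_panic=1"))
```

A result of "1" from either source means a softdog timeout is expected to panic instead of rebooting silently.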

Severity
--------
Major: Not a defect as such, but an important change that assists in debugging system stalls and in identifying a softdog timeout as the cause of a reset.

Steps to Reproduce
------------------
N/A

Expected Behavior
------------------
Panic or kern log on softdog timeout

Actual Behavior
----------------
No panic or kern log on softdog timeout

Reproducibility
---------------
100%

System Configuration
--------------------
All

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
N/A

Timestamp/Logs
--------------
A softdog timeout is indicated only by a console log, which does not persist:
  "Initiating system reboot"

Need kernel log or panic logs.
The panic logs will provide a whole lot more information as well.

Test Activity
-------------
Issue Investigation

Workaround
----------
Monitor and record console output for softdog timeout log.

Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - would be nice to help w/ issue debugging

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.5.0 stx.config stx.distro.other
Eric MacDonald (rocksolidmtce) wrote :

Add commands that log the BMC time and the Linux time when collect is run.

controller-1:~$ sudo ipmitool sel time get
10/08/2020 16:47:28
controller-1:~$ date
Thu Oct 8 17:47:50 UTC 2020

That way we can correlate logs between the BMC and Linux.
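
A minimal sketch of the correlation step, using the two sample outputs above. It assumes the BMC SEL clock reports UTC, which this report does not confirm; if the BMC runs in another time zone, the computed offset includes that difference. The function name is illustrative.

```python
from datetime import datetime, timedelta, timezone


def bmc_linux_offset(sel_time: str, linux_date: str) -> timedelta:
    """Return Linux time minus BMC time for one pair of samples."""
    # 'ipmitool sel time get' format as shown above: MM/DD/YYYY HH:MM:SS
    bmc = datetime.strptime(sel_time, "%m/%d/%Y %H:%M:%S")
    # 'date' format as shown above, with a literal UTC zone name
    linux = datetime.strptime(linux_date, "%a %b %d %H:%M:%S UTC %Y")
    # Assumption: both clocks are expressed in UTC.
    bmc = bmc.replace(tzinfo=timezone.utc)
    linux = linux.replace(tzinfo=timezone.utc)
    return linux - bmc


offset = bmc_linux_offset("10/08/2020 16:47:28", "Thu Oct 8 17:47:50 UTC 2020")
print(offset.total_seconds())
```

Adding this offset to a BMC SEL timestamp gives the approximate Linux time of the same event, accurate only to within the delay between running the two commands.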

Eric MacDonald (rocksolidmtce) wrote :
Changed in starlingx:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/759936

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/759936
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=bd24e32af21d2d053b2719b5eab021875d0fc227
Submitter: Zuul
Branch: master

commit bd24e32af21d2d053b2719b5eab021875d0fc227
Author: Eric MacDonald <email address hidden>
Date: Tue Oct 27 15:48:13 2020 -0400

    Enhance crashDumpMgr with oversized crash dump protection

    This update adds some file system protection measures on
    crash dump management. The robustness improvements include

    1. adds a crash dump upper size limit. Only the crash
       dump summary file is preserved in cases where the
       vmcore file exceeds the max-size allowed. The
       max-size is specified as a run option in the
       crashDumpMgr service file.

    2. delete any /var/crash files that are not vmcore
       or summary files. This handles the case where
       the kernel generates an incomplete crash dump with
       a filename other than 'vmcore'.

    3. delete any new crash dump if copying it to /var/log/crash
       would leave that filesystem with less than 1G of remaining
       space.

    Change-Id: I50ca2bf3c608f333ff5a468b1813fa86e70ab766
    Partial-Bug: 1898602
    Signed-off-by: Eric MacDonald <email address hidden>
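
The three protections in the commit message can be sketched as a small decision function. This is an illustration of the described policy only, not the crashDumpMgr implementation: the function names, the order of the checks, the return labels, and the way summary files are identified are all assumptions.

```python
GIB = 1024 ** 3


def dump_action(vmcore_bytes: int, max_bytes: int, free_bytes: int,
                min_free_bytes: int = GIB) -> str:
    """Decide what to do with a new crash dump.

    vmcore_bytes: size of the new vmcore file
    max_bytes:    configured upper size limit (hypothetical parameter)
    free_bytes:   space currently free on the destination filesystem
    """
    # Protection 1: oversized dumps keep only their summary file.
    if vmcore_bytes > max_bytes:
        return "summary_only"
    # Protection 3: drop the dump if the copy would leave under 1G free.
    if free_bytes - vmcore_bytes < min_free_bytes:
        return "delete"
    return "keep"


def stray_files(names, summary_names):
    """Protection 2: list files that are neither vmcore nor summary files."""
    return [n for n in names if n != "vmcore" and n not in summary_names]


if __name__ == "__main__":
    print(dump_action(5 * GIB, 4 * GIB, 100 * GIB))
```

Keeping the size and free-space checks in one pure function makes the policy easy to exercise with sample sizes before touching a real filesystem.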
