Improve Mtce Host Watchdog's emergency logging

Bug #2003801 reported by Christopher de Oliveira Souza
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
New
Low
Unassigned

Bug Description

Brief Description
-----------------------------
The Host Watchdog emergency log messages are not logged and therefore not captured/included in the collect dump.

This likely occurred because emergency logs are printed out to syslog and optionally to the console, and syslog-ng (or rsyslog?) did not have a chance to get scheduled and flush the output to disk before the system was crashed by hostwd when it wrote 'c' to /proc/sysrq-trigger.

In the light of this, could we avoid this issue altogether by writing hostwd's emergency logs to /dev/kmsg in addition to syslog? This will ensure that the emergency logs are captured in the kernel's log buffer (i.e., observed via dmesg) and also in the "vmcore" file for post-mortem analysis. This would help with understanding /why/ hostwd rebooted the system. We could also add a "hostwd:" prefix to the emitted logs to make it obvious that the messages originate from hostwd.

Severity
---------------------------
Minor: affects root cause debug of some host watchdog reboot issues ; quorum process failure and quorum report missing failure mode cases.

Steps to Reproduce
---------------------------
controller-1:~# tail -f /var/log/hostwd.log &
controller-1:~# pkill -stop pmond

Expected Behavior
--------------------------
Rather 'desired behavior' would be that the host watchdog emergency logging is actually logged and included in the collect bundle.

Actual Behavior
-------------------------
Not

Reproducibility
-------------------------
Very reproducible, possibly 100%

System Configuration
-------------------------
Any node

Load info (eg: 2022-03-10_20-00-07)
--------------------------
Any load

Last Pass
---------------------------
No

Timestamp/Logs
---------------------------
sysadmin@controller-1:~$ sudo pkill -stop pmond
Password:
sysadmin@controller-1:~$ tail -f /var/log/hostwd.log
2023-01-17T14:13:58.889 [63001.00009] localhost hostwd --- hostwInit.cpp ( 137) hostw_process_config : Info : Quorum Rate : 30 secs
2023-01-17T14:13:58.889 [63001.00010] localhost hostwd --- hostwInit.cpp ( 206) socket_init : Info : setting up host socket
2023-01-17T14:13:58.889 [63001.00011] localhost hostwd --- hostwMsg.cpp ( 52) hostw_socket_init : Info : hostwd listening for status updates
2023-01-17T14:13:58.889 [63001.00012] localhost hostwd --- daemon_main.cpp ( 434) main : Info : Build Date : Thu Jan 5 17:26:41 UTC 202 3
2023-01-17T14:13:58.889 [63001.00013] localhost hostwd --- daemon_main.cpp ( 435) main : Info : ----------------------------------------- -------------
2023-01-17T14:13:58.890 [63001.00014] localhost hostwd --- hostwInit.cpp ( 279) daemon_service_run : Info : System is up and running, hostwd ready fo r action
2023-01-17T14:13:58.890 [63001.00015] localhost hostwd --- hostwInit.cpp ( 328) kernel_watchdog_init : Info : Kernel watchdog enabled with 180 second t imeout
2023-01-17T14:13:58.890 [63001.00016] localhost hostwd com nodeUtil.cpp ( 704) get_hostname : Info : controller-1
2023-01-17T14:13:58.903 [63001.00017] controller-1 hostwd --- hostwHdlr.cpp ( 109) get_kdump_support : Info : kdump is inactive (inactive)
2023-01-17T14:13:58.903 [63001.00018] controller-1 hostwd --- hostwHdlr.cpp ( 237) hostw_service : Info : Host Watchdog Service running
^C
sysadmin@controller-1:~$ sudo pkill -stop pmond
sysadmin@controller-1:~$ 2023-01-17T15:19:29.564 [63001.00019] controller-1 hostwd --- hostwHdlr.cpp ( 280) hostw_service : Warn : Process Quorum Health not received from PMON ; 1 more misses allowed before self-reboot
2023-01-17T15:20:29.565 [63001.00020] controller-1 hostwd --- hostwHdlr.cpp ( 280) hostw_service : Warn : Process Quorum Health not received from PMON ; 0 more misses allowed before self-reboot
*** Host watchdog (hostwd) not receiving messages from PMON ***
*** Host Watchdog declaring system unhealthy ***
*** Initiating reboot ***
2023-01-17T15:21:29.564 [63001.00021] controller-1 hostwd --- hostwHdlr.cpp ( 326) hostw_service :Error : *** Host watchdog (hostwd) not receiving messages from PMON ***
2023-01-17T15:21:29.564 [63001.00022] controller-1 hostwd --- hostwHdlr.cpp ( 409) hostw_log_and_reboot :Error : *** Host Watchdog declaring system unhealthy ***
2023-01-17T15:21:29.564 [63001.00023] controller-1 hostwd --- hostwHdlr.cpp ( 418) hostw_log_and_reboot :Error : *** Initiating reboot ***
2023-01-17T15:21:29.566 [63001.00024] controller-1 hostwd com nodeUtil.cpp (1087) fork_sysreq_reboot : Info : Forked Fail-Safe (Backup) Reboot Action
2023-01-17T15:21:29.568 [230261.00025] controller-1 hostwd com nodeUtil.cpp (1116) fork_graceful_reboot : Info : *** Graceful Reset Thread ***
[ OK ] Removed slice system-modprobe.slice.

Alarms
----------------------------------
| 200.004 | controller-1 experienced a service-affecting failure.

Test Activity
----------------------------------
Loaded system

Workaround
----------------------------------
log console

Tags: stx.metal
Ghada Khalil (gkhalil)
tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Low
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.