StarlingX

Improve Mtce Host Watchdog's emergency logging

Bug #2003801 reported by Christopher de Oliveira Souza on 2023-01-24

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	New	Low	Unassigned

Bug Description

Brief Description
-----------------------------
The Host Watchdog emergency log messages are not logged and therefore not captured/included in the collect dump.

This likely occurred because emergency logs are printed out to syslog and optionally to the console, and syslog-ng (or rsyslog?) did not have a chance to get scheduled and flush the output to disk before the system was crashed by hostwd when it wrote 'c' to /proc/sysrq-trigger.

In the light of this, could we avoid this issue altogether by writing hostwd's emergency logs to /dev/kmsg in addition to syslog? This will ensure that the emergency logs are captured in the kernel's log buffer (i.e., observed via dmesg) and also in the "vmcore" file for post-mortem analysis. This would help with understanding /why/ hostwd rebooted the system. We could also add a "hostwd:" prefix to the emitted logs to make it obvious that the messages originate from hostwd.

Severity
---------------------------
Minor: affects root cause debug of some host watchdog reboot issues ; quorum process failure and quorum report missing failure mode cases.

Steps to Reproduce
---------------------------
controller-1:~# tail -f /var/log/hostwd.log &
controller-1:~# pkill -stop pmond

Expected Behavior
--------------------------
Rather 'desired behavior' would be that the host watchdog emergency logging is actually logged and included in the collect bundle.

Actual Behavior
-------------------------
Not

Reproducibility
-------------------------
Very reproducible, possibly 100%

System Configuration
-------------------------
Any node

Load info (eg: 2022-03-10_20-00-07)
--------------------------
Any load

Last Pass
---------------------------
No

Timestamp/Logs
---------------------------
sysadmin@controller-1:~$ sudo pkill -stop pmond
Password:
sysadmin@controller-1:~$ tail -f /var/log/hostwd.log
2023-01-17T14:13:58.889 [63001.00009] localhost hostwd --- hostwInit.cpp 2023-01-17T14:13:58.889 [63001.00010] localhost hostwd --- hostwInit.cpp 2023-01-17T14:13:58.889 [63001.00011] localhost hostwd --- hostwMsg.cpp 2023-01-17T14:13:58.889 [63001.00012] localhost hostwd --- daemon_main.cpp 2023-01-17T14:13:58.889 [63001.00013] localhost hostwd --- daemon_main.cpp 2023-01-17T14:13:58.890 [63001.00014] localhost hostwd --- hostwInit.cpp 2023-01-17T14:13:58.890 [63001.00015] localhost hostwd --- hostwInit.cpp 2023-01-17T14:13:58.890 [63001.00016] localhost hostwd com nodeUtil.cpp 2023-01-17T14:13:58.903 [63001.00017] controller-1 hostwd --- hostwHdlr.cpp 2023-01-17T14:13:58.903 [63001.00018] controller-1 hostwd --- hostwHdlr.cpp ^C
sysadmin@controller-1:~$ sudo pkill -stop pmond
sysadmin@controller-1:~$ 2023-01-17T15:19:29.564 [63001.00019] 2023-01-17T15:20:29.565 [63001.00020] controller-1 hostwd --- hostwHdlr.cpp *** Host watchdog (hostwd) not receiving messages from PMON ***
*** Host Watchdog declaring system unhealthy ***
*** Initiating reboot ***
2023-01-17T15:21:29.564 [63001.00021] controller-1 hostwd --- hostwHdlr.cpp 2023-01-17T15:21:29.564 [63001.00022] controller-1 hostwd --- hostwHdlr.cpp 2023-01-17T15:21:29.564 [63001.00023] controller-1 hostwd --- hostwHdlr.cpp 2023-01-17T15:21:29.566 [63001.00024] controller-1 hostwd com nodeUtil.cpp 2023-01-17T15:21:29.568 [230261.00025] controller-1 hostwd com nodeUtil.cpp [ OK ] Removed slice system-modprobe.slice. ( 137) hostw_process_config : Info : Quorum Rate : 30 secs
( 206) socket_init : Info : setting up host socket
( 52) hostw_socket_init : Info : hostwd listening for status updates
( 434) main : Info : Build Date : Thu Jan 5 17:26:41 UTC 202 3
( 435) main : Info : ----------------------------------------- -------------
( 279) daemon_service_run : Info : System is up and running, hostwd ready fo r action
( 328) kernel_watchdog_init : Info : Kernel watchdog enabled with 180 second t imeout
( 704) get_hostname : Info : controller-1
( 109) get_kdump_support : Info : kdump is inactive (inactive)
( 237) hostw_service : Info : Host Watchdog Service running
controller-1 hostwd --- hostwHdlr.cpp ( 280) hostw_service : Warn : Process Quorum Health not received from PMON ; 1 more misses allowed before self-reboot
( 280) hostw_service : Warn : Process Quorum Health not received from PMON ; 0 more misses allowed before self-reboot
( 326) hostw_service :Error : *** Host watchdog (hostwd) not receiving messages from PMON ***
( 409) hostw_log_and_reboot :Error : *** Host Watchdog declaring system unhealthy ***
( 418) hostw_log_and_reboot :Error : *** Initiating reboot ***
(1087) fork_sysreq_reboot : Info : Forked Fail-Safe (Backup) Reboot Action
(1116) fork_graceful_reboot : Info : *** Graceful Reset Thread ***

Alarms
----------------------------------
| 200.004 | controller-1 experienced a service-affecting failure.

Test Activity
----------------------------------
Loaded system

Workaround
----------------------------------
log console

Tags:

Ghada Khalil (gkhalil) on 2023-01-27

tags:	added: stx.metal
Changed in starlingx:
importance:	Undecided → Low

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.