Mtce host watchdog too long to detect controller overload
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Eric MacDonald |
Bug Description
The maintenance host watchdog timeout is currently set to 5 minutes, and its minimum setting is 1 minute.
Both 5 minutes and 1 minute are too long to detect and react to overload conditions that cause non-RT process stalls of over 15 seconds. The failure needs to be detected, and self-recovery forced by reset, somewhere in the 10-20 second time frame.
This is of particular concern for the Active/Standby System Controllers.
The mtce host watchdog needs to support a timeout interval of 10 seconds or tens of seconds, not just 5 minutes.
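To make the timing argument concrete, here is a minimal sketch of the arithmetic behind it. It is not the hostwd implementation: the `StallDetector` class and `detects_stall` helper are hypothetical names invented for illustration, and the 20 second stall and 15 second timeout are assumed values picked from the ranges mentioned above.

```python
from dataclasses import dataclass


@dataclass
class StallDetector:
    """Hypothetical software watchdog; not the actual hostwd logic."""

    timeout_secs: float
    last_kick: float = 0.0

    def kick(self, now: float) -> None:
        # Record the time of the most recent keep-alive.
        self.last_kick = now

    def expired(self, now: float) -> bool:
        # The watchdog fires only when the gap since the last kick
        # exceeds the configured timeout.
        return now - self.last_kick > self.timeout_secs


def detects_stall(timeout_secs: float, stall_secs: float) -> bool:
    """Return True if a stall of stall_secs trips a watchdog with timeout_secs."""
    wd = StallDetector(timeout_secs)
    wd.kick(now=0.0)                   # last kick lands just before the stall
    return wd.expired(now=stall_secs)  # checked at the moment the stall ends


if __name__ == "__main__":
    # Assumed example: a 20 second stall against three timeout settings.
    for timeout in (300, 60, 15):
        print(timeout, detects_stall(timeout, stall_secs=20))
```

Under these assumptions, the 300 second and 60 second settings never notice the 20 second stall, while a 15 second setting does.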
Severity
--------
Major: The mtce host watchdog needs to support a timeout interval of 10 seconds or tens of seconds, not just 5 minutes.
Steps to Reproduce
------------------
Change the /etc/mtc/
Expected Behavior
------------------
hostwd continues to support pmon quorum monitoring and a modified kernel watchdog period with a simple hostwd.
Actual Behavior
----------------
The hostwd does not permit timeout settings of less than 60 seconds.
With the current implementation, pmon quorum monitoring would fail for a host watchdog setting in the 10-12 second range.
Reproducibility
---------------
Reproducible: 100%
System Configuration
-------
any
Branch/Pull Time/Commit
-------
all
Last Pass
---------
N/A since the hostwd process has been this way since its introduction.
Timestamp/Logs
--------------
N/A
Test Activity
-------------
Feature development
Workaround
----------
None
Changed in starlingx:
status: Triaged → In Progress
stx.5.0 / medium priority - issue w/ recovery handling