pmon-restart increments process failure count
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Low | Eric MacDonald |
Bug Description
Brief Description
-----------------
The StarlingX Maintenance Process Monitor Service provides a 'pmon-restart' convenience utility that is mainly used by the system patching and configuration services to gracefully restart monitored processes.
The pmon-restart service, through a call to respawn_process, increments that process's restarts counter. However, it does not clear that counter after a successful restart. That's the bug.
So, each pmon-restart mistakenly contributes to that process's failure count. This has the effect of pre-loading that process's restart counter by one for every pmon-restart of that process.
In the case that led to the discovery of this issue, a process was pmon-restart'ed 4 times in one day, which incremented that process's restarts counter to 4. Its threshold is 3, so the threshold was already exceeded. Days later, that process experienced a real failure, which resulted in an immediate severity action because the failure threshold had already been exceeded.
The trivial fix for this issue is to clear a process's restarts counter after a successful pmon-restart operation, in the process registration phase of recovery.
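
The following is a minimal Python sketch of the counter behavior described above. It is not the pmon implementation. The class, the function names other than respawn_process, the `clear_on_register` flag, and the "severity action when restarts exceeds the threshold" comparison are all assumptions made for illustration. It replays the reported scenario (threshold 3, four pmon-restarts, then one real failure) with and without clearing the counter at registration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MonitoredProcess:
    """Hypothetical stand-in for a pmon-monitored process."""
    name: str
    restarts_threshold: int
    restarts: int = 0


def respawn_process(proc: MonitoredProcess) -> None:
    # Every respawn counts against the process, whether it was requested
    # by pmon-restart or caused by a real failure.
    proc.restarts += 1


def register_process(proc: MonitoredProcess, clear_on_register: bool) -> None:
    # Assumed model of the fix: clear the counter once the process has
    # successfully restarted and registered again.
    if clear_on_register:
        proc.restarts = 0


def pmon_restart(proc: MonitoredProcess, clear_on_register: bool) -> None:
    # Graceful, operator-requested restart.
    respawn_process(proc)
    register_process(proc, clear_on_register)


def handle_failure(proc: MonitoredProcess) -> str:
    # Assumption: severity action is taken when the counter already
    # exceeds the threshold at the time of a real failure.
    if proc.restarts > proc.restarts_threshold:
        return "severity action"
    respawn_process(proc)
    return "auto recovered"


def simulate(clear_on_register: bool) -> Tuple[int, str]:
    proc = MonitoredProcess(name="example-process", restarts_threshold=3)
    for _ in range(4):
        pmon_restart(proc, clear_on_register)
    count_before_failure = proc.restarts
    return count_before_failure, handle_failure(proc)


if __name__ == "__main__":
    print("without clear:", simulate(clear_on_register=False))
    print("with clear:   ", simulate(clear_on_register=True))
```

In this model the run without the clear reaches the real failure with the counter already at 4, so the first real failure takes the severity action. With the clear, the counter is 0 at that point and the process is simply recovered.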
Severity
--------
Minor:
- Requires multiple events to recreate, including a real process failure.
- System auto recovers.
- Could be considered major if it occurs for monitored processes that have a critical severity level.
Steps to Reproduce
------------------
Select a process that has a restarts threshold of 1 or greater.
pmon-restart that process more times than its restarts threshold.
Kill the same process.
Expected Behavior
------------------
Process is auto recovered without severity action taken.
Actual Behavior
----------------
Process is auto recovered with severity action taken.
Reproducibility
---------------
Issue is 100% reproducible
System Configuration
-------
Any host in any system
Branch/Pull Time/Commit
-------
Issue has existed since the introduction of the pmon-restart convenience utility,
i.e. all versions of StarlingX.
Last Pass
---------
Test escape. This failure scenario was never tested.
Timestamp/Logs
--------------
Not required. Issue is understood by code owner.
Test Activity
-------------
Developer Testing
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Low priority as it's unlikely for users to hit this, but would be nice to fix.