pmon-restart increments process failure count

Bug #1853330 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald
Milestone: (none listed)

Bug Description

Brief Description
-----------------
The StarlingX Maintenance Process Monitor Service provides a 'pmon-restart' convenience utility that is mainly used by the system patching and configuration services to gracefully restart monitored processes.

The pmon-restart service, through a call to respawn_process, increments that process's restarts counter. However, it does not clear that counter after a successful restart. That's the bug.

So, each pmon-restart mistakenly contributes to that process's failure count. This has the effect of pre-loading that process's restart counter by one for every pmon-restart of that process.

In the case that led to the discovery of this issue, a process was pmon-restart'ed 4 times in one day, which incremented that process's restart counter to 4. Its threshold is 3, so the threshold was already exceeded. Days later, that process experienced a real failure, which resulted in an immediate severity action because the failure threshold had already been exceeded.

The trivial fix for this issue is to clear a process's restart counter after a successful pmon-restart operation, in the process registration phase of recovery.
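
The counter behavior described above can be illustrated with a small Python model. This is only a sketch, not the pmon implementation: the `MonitoredProcess` class, the function names, and the process name are hypothetical, and it assumes that "exceeded" means the restart counter is strictly greater than the threshold, as in the 4-versus-3 example.

```python
from dataclasses import dataclass


@dataclass
class MonitoredProcess:
    """Toy stand-in for a pmon-monitored process (not the real pmon structure)."""
    name: str
    threshold: int      # restarts threshold, as read from the process conf file
    restarts: int = 0   # running restart counter


def respawn(proc: MonitoredProcess) -> None:
    """Model of the respawn step: every respawn bumps the restart counter."""
    proc.restarts += 1


def pmon_restart(proc: MonitoredProcess, clear_on_success: bool) -> None:
    """Model of a graceful pmon-restart that always succeeds."""
    respawn(proc)
    if clear_on_success:
        # Behavior with the fix: a successful restart is not a failure,
        # so the counter is cleared once the process has re-registered.
        proc.restarts = 0


def handle_failure(proc: MonitoredProcess) -> str:
    """Model of a real failure. Assumes 'exceeded' means restarts > threshold."""
    if proc.restarts > proc.threshold:
        return "severity action"
    respawn(proc)
    return "auto recovered"


def scenario(clear_on_success: bool) -> str:
    """Four pmon-restarts followed by one real failure, threshold of 3."""
    proc = MonitoredProcess(name="example-process", threshold=3)
    for _ in range(4):
        pmon_restart(proc, clear_on_success)
    return handle_failure(proc)


print("without fix:", scenario(clear_on_success=False))
print("with fix:   ", scenario(clear_on_success=True))
```

In this model the unfixed path reaches the real failure with the counter already at 4, so the severity action fires at once. With the counter cleared after each successful pmon-restart, the same failure is handled as an ordinary first failure.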

Severity
--------
Minor:
- Requires multiple events to recreate, including a real process failure.
- System auto recovers.
- Could be considered major if it occurs for monitored processes that have a critical severity level.

Steps to Reproduce
------------------
1. Select a process that has a restarts threshold of 1 or greater.
2. pmon-restart that process more times than its restarts threshold.
3. Kill the same process.

Expected Behavior
------------------
Process is auto recovered without severity action taken.

Actual Behavior
----------------
Process is auto recovered with severity action taken.

Reproducibility
---------------
Issue is 100% reproducible

System Configuration
--------------------
Any host in any system

Branch/Pull Time/Commit
-----------------------
Issue has existed since the introduction of the pmon-restart convenience utility,
i.e. in all versions of StarlingX.

Last Pass
---------
Test escape. This failure scenario was never tested.

Timestamp/Logs
--------------
Not required. Issue is understood by code owner.

Test Activity
-------------
Developer Testing

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

Low priority as it's unlikely for users to hit this, but would be nice to fix.

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.metal
Eric MacDonald (rocksolidmtce) wrote :

Update implemented.
Testing in progress.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/695295

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/695295
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=a42301c19be61bd133bd6d509aec89b340a6e51b
Submitter: Zuul
Branch: master

commit a42301c19be61bd133bd6d509aec89b340a6e51b
Author: Eric MacDonald <email address hidden>
Date: Wed Nov 20 14:37:16 2019 -0500

    Make successful pmon-restart clear failed restarts count

    The pmon-restart service, through a call to respawn_process,
    increments that process's restarts counter but does not clear
    that counter after a successful restart.

    So, each pmon-restart mistakenly contributes to that process's
    failure count. This has the effect of pre-loading that process's
    restart counter by one for every pmon-restart of that process.

    The effect is best described by example.
    Say a process is pmon-restart'ed 4 times during one day which
    increments that process's restart counter to 4. So assuming its
    conf file specifies its threshold is 3, it has already exceeded
    its threshold. Then, if even days later that process experiences
    a real failure, pmon will immediately take the severity action
    because the failure threshold had already been exceeded.

    This update ensures a process's restart counter is cleared
    after successful pmon-restart operation ; in the process pid
    registration phase of recovery.

    Test Plan:

    PASS: Verify pmon-restart continues to work.
    PASS: Verify proper thresholding of failed process following
          many pmon-restart operations.
    PEND: Verify pmon-restart and process failure automated test script
          against this update. 5 loops, all processes.

    Change-Id: Ib01446f2e053846cd30cb0ca0e06d7c987cdf581
    Closes-Bug: 1853330
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released