Comment 4 for bug 1853330

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/695295
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=a42301c19be61bd133bd6d509aec89b340a6e51b
Submitter: Zuul
Branch: master

commit a42301c19be61bd133bd6d509aec89b340a6e51b
Author: Eric MacDonald <email address hidden>
Date: Wed Nov 20 14:37:16 2019 -0500

    Make successful pmon-restart clear failed restarts count

    The pmon-restart service, through a call to respawn_process,
    increments the target process's restarts counter but does not
    clear that counter after a successful restart.

    So, each pmon-restart mistakenly contributes to that process's
    failure count. This has the effect of pre-loading that process's
    restart counter by one for every pmon-restart of that process.

    The effect is best described by example.
    Say a process is pmon-restart'ed 4 times during one day, which
    increments that process's restart counter to 4. Assuming its
    conf file specifies a threshold of 3, it has already exceeded
    its threshold. Then, if that process experiences a real failure
    even days later, pmon will immediately take the severity action
    because the failure threshold has already been exceeded.

    This update ensures a process's restart counter is cleared
    after a successful pmon-restart operation, in the process pid
    registration phase of recovery.

    Test Plan:

    PASS: Verify pmon-restart continues to work.
    PASS: Verify proper thresholding of failed process following
          many pmon-restart operations.
    PEND: Verify pmon-restart and process failure automated test script
          against this update. 5 loops, all processes.

    Change-Id: Ib01446f2e053846cd30cb0ca0e06d7c987cdf581
    Closes-Bug: 1853330
    Signed-off-by: Eric MacDonald <email address hidden>
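
The counter behavior described in the commit message can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the pmon implementation: respawn_process is the one name taken from the commit message, while Process, register_pid, pmon_restart, handle_failure and the clear_on_success flag are hypothetical names introduced for illustration. The exact threshold comparison pmon uses is not given in the message, so the `>=` check below is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical stand-in for a pmon-monitored process."""
    name: str
    threshold: int     # restart threshold, as read from the process's conf file
    restarts: int = 0  # restarts counter discussed in the commit message


def respawn_process(proc: Process) -> None:
    # Per the commit message, respawning increments the restarts counter.
    proc.restarts += 1


def register_pid(proc: Process, clear_on_success: bool) -> None:
    # Hypothetical pid registration phase of recovery. With the fix,
    # a successful pmon-restart clears the counter at this point.
    if clear_on_success:
        proc.restarts = 0


def pmon_restart(proc: Process, clear_on_success: bool) -> None:
    # A commanded restart: respawn the process, then register its new pid.
    respawn_process(proc)
    register_pid(proc, clear_on_success)


def handle_failure(proc: Process) -> str:
    # A real process failure. Assumption: the severity action is taken
    # once the counter has reached the threshold; otherwise respawn.
    if proc.restarts >= proc.threshold:
        return "severity-action"
    respawn_process(proc)
    return "respawn"


if __name__ == "__main__":
    before_fix = Process("demo", threshold=3)
    after_fix = Process("demo", threshold=3)
    for _ in range(4):
        pmon_restart(before_fix, clear_on_success=False)
        pmon_restart(after_fix, clear_on_success=True)
    print("before fix:", before_fix.restarts, handle_failure(before_fix))
    print("after fix: ", after_fix.restarts, handle_failure(after_fix))
```

In this model, four pmon-restarts without the clear leave the counter at 4, so the first real failure goes straight to the severity action. With the clear in place the counter is back at 0 after each pmon-restart, and the first real failure is handled as an ordinary respawn.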