increase pmon wait time for SM to start

Bug #1998349 reported by Bin Qian
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bin Qian
Milestone: (none)

Bug Description

The maintenance Process Monitor (pmon) frequently fails to recover a failed Service Management (SM) process because of a 'spawn timeout' error.

SM's pmon.conf file sets startuptime = 5 seconds, but on Debian the SM process frequently takes longer than that to start and write its PID file.
To make SM recovery reliable, increase this timeout to 15 seconds.
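The 'spawn timeout' described above comes down to a bounded wait for the PID file. A minimal Python sketch of that wait, assuming pmon simply polls for the PID file until `startuptime` seconds have elapsed (the function name and polling interval are hypothetical; pmon itself is not written this way):

```python
import os
import time


def wait_for_pidfile(pidfile: str, startuptime: float, poll: float = 0.1) -> bool:
    """Poll for pidfile; True if it appears within startuptime seconds.

    Hypothetical helper: a False result corresponds to the 'spawn timeout'
    failure described in the bug report.
    """
    deadline = time.monotonic() + startuptime
    while True:
        if os.path.exists(pidfile):
            return True   # the process started and wrote its PID file in time
        if time.monotonic() >= deadline:
            return False  # startuptime elapsed without a PID file: spawn timeout
        time.sleep(poll)
```

With `startuptime = 5`, an SM that needs six seconds to write its PID file is reported as failed even though it would have come up; raising the value to 15 widens that window.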

Tags: stx.8.0 stx.ha
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/866179

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/866179
Committed: https://opendev.org/starlingx/ha/commit/8b5ee400b53856a71d158165238b0e86c814f6c4
Submitter: "Zuul (22348)"
Branch: master

commit 8b5ee400b53856a71d158165238b0e86c814f6c4
Author: Bin Qian <email address hidden>
Date: Wed Nov 30 15:36:37 2022 +0000

    Update pmon SM wait time to 15 seconds

    SM's pmon.conf file sets startuptime = 5 seconds, but on Debian
    the SM process frequently takes longer than that to start and
    write its PID file.
    To make SM recovery reliable, increase this timeout
    to 15 seconds.

    Closes-bug: 1998349

    Signed-off-by: Bin Qian <email address hidden>
    Change-Id: I64476e394e346c9b8cf5b5aca2ad04ba463b9728

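This first fix is a one-value change in SM's pmon.conf. As a minimal sketch, assuming the file is INI-style with a `[process]` section and `;` inline comments (the section name and comment syntax are assumptions; only the `startuptime` key comes from this report, and all other keys are omitted), the value pmon would read can be checked with Python's `configparser`:

```python
import configparser

# Hypothetical excerpt of SM's pmon.conf: the [process] section and the
# comment style are assumptions; only startuptime comes from the bug report.
PMON_CONF = """
[process]
startuptime = 15 ; seconds to wait for SM's PID file after spawning it
"""


def read_startuptime(text: str) -> int:
    """Return the startuptime value (in seconds) from a pmon.conf-style string."""
    parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
    parser.read_string(text)
    return parser.getint("process", "startuptime")
```

For example, `read_startuptime(PMON_CONF)` returns `15`.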
Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/866504

OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/866504
Committed: https://opendev.org/starlingx/ha/commit/88aeba251be2dfaa46a895a221319ec3521ee1f3
Submitter: "Zuul (22348)"
Branch: master

commit 88aeba251be2dfaa46a895a221319ec3521ee1f3
Author: Bin Qian <email address hidden>
Date: Fri Dec 2 17:38:23 2022 -0500

    Update SM lsb script for quick start

    The pidof command returns a subprocess ID when the SM main process
    terminates. This results in a false positive that SM is already
    running, so the start action is skipped.

    Change the SM LSB script to detect whether the returned ID belongs
    to a subprocess and, if so, attempt to kill it to speed up recovery
    of SM.

    Revert startuptime from 15 seconds back to 5.

    Test Cases:
        Kill the SM process and observe that SM starts immediately after
        the subprocess is killed. SM recovers within 2 seconds (measured
        from the last log entry of the old SM to the first log entry of
        the new one).

    Change-Id: Ida834e7dd31a493ee6193b4d8ee73ebd97513de2
    Closes-Bug: 1998349
    Signed-off-by: Bin Qian <email address hidden>
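The commit message does not say exactly how the LSB script tells a leftover subprocess apart from the main SM process. Below is a minimal Python sketch of the start decision. It assumes the main SM PID is the one recorded in SM's PID file, that `pidof` output has already been parsed into integers, and that leftovers are killed with SIGKILL. All three points, and the function names, are hypothetical and not taken from the actual script:

```python
import os
import signal
from typing import Callable, Iterable, Optional


def classify_pidof(pidof_pids: Iterable[int], main_pid: Optional[int]) -> str:
    """Classify `pidof sm` output against the main PID (assumed to come from the PID file)."""
    pids = set(pidof_pids)
    if not pids:
        return "stopped"   # nothing named sm is running
    if main_pid is not None and main_pid in pids:
        return "running"   # the main SM process is alive
    return "orphaned"      # only leftover subprocesses: the old false positive


def start_action(pidof_pids: Iterable[int], main_pid: Optional[int],
                 kill: Callable[[int, int], None] = os.kill) -> str:
    """Return "skip" if SM is really running; otherwise kill leftovers and return "start"."""
    pids = sorted(set(pidof_pids))
    if classify_pidof(pids, main_pid) == "running":
        return "skip"
    # Here pids is empty ("stopped") or holds only leftover subprocess IDs ("orphaned").
    for pid in pids:
        try:
            kill(pid, signal.SIGKILL)  # assumed signal; the commit only says "kill"
        except ProcessLookupError:
            pass  # the subprocess already exited on its own
    return "start"
```

Injecting `kill` lets the decision be exercised without signalling real processes. Under the old behaviour, the "orphaned" case was treated as "running", so the start was skipped.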

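The recovery time in the test case is the gap between two log timestamps: the last line written by the killed SM and the first line written by its replacement. A small sketch of that arithmetic, assuming a hypothetical ISO-8601-style timestamp format (the real SM log format is not given in this report):

```python
from datetime import datetime

# Assumed timestamp layout; adjust to the actual SM log format.
LOG_FMT = "%Y-%m-%dT%H:%M:%S.%f"


def recovery_seconds(last_old_log: str, first_new_log: str, fmt: str = LOG_FMT) -> float:
    """Seconds between the old SM's last log timestamp and the new SM's first one."""
    t_old = datetime.strptime(last_old_log, fmt)
    t_new = datetime.strptime(first_new_log, fmt)
    return (t_new - t_old).total_seconds()
```

With made-up inputs, `recovery_seconds("2000-01-01T00:00:00.250", "2000-01-01T00:00:01.900")` returns `1.65`, which would fall within the 2-second target.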
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
importance: Undecided → Medium
tags: added: stx.8.0 stx.ha