pm-qos-mgr robustness: remove RPM from controller and make daemon restartable

Bug #1840356 reported by Jim Gauld
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
The pm-qos-mgr package should be installed only on worker nodes, so it needs to be excluded from controllers. It is currently excluded only from storage nodes. This is not a functional problem, but it would require unnecessary patching on controllers if the package ever has to be updated. The packaging should match that of worker-utils.

The pm-qos-mgr process does not automatically restart if it is killed or stopped abnormally. The service file packaged in the RPM needs to be modified so that, when the process stops abnormally, systemd restarts it automatically.

Proposed changes:
Add "Restart=on-abnormal" to the [Service] section.

For example, the service file should look like this:
stx-config/pm-qos-mgr/src/pm-qos-mgr.service
. . .
[Service]
Type=simple
ExecStart=/usr/bin/pm-qos-mgr
Restart=on-abnormal
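
One way to confirm the proposed setting is to read the Restart= value back from the unit file. The following is a minimal sketch, not part of the proposed change: the helper name unit_restart_policy is hypothetical, and the installed path in the usage comment is an assumption. With on-abnormal, systemd restarts the service when it is terminated by an unclean signal, a timeout, or a watchdog expiry, but not when it exits with a non-zero status.

```shell
#!/bin/sh
# Hypothetical helper: print the Restart= value found in the [Service]
# section of a systemd unit file. Prints nothing if the key is absent.
# If the key appears more than once, the last assignment is printed.
unit_restart_policy() {
    awk '
        /^\[/ { in_service = ($0 == "[Service]"); next }
        in_service && /^Restart=/ { sub(/^Restart=/, ""); value = $0 }
        END { if (value != "") print value }
    ' "$1"
}

# Usage (the installed path below is an assumption):
#   unit_restart_policy /usr/lib/systemd/system/pm-qos-mgr.service
```

On a running node, `systemctl show -p Restart pm-qos-mgr.service` reports the same property as systemd actually loaded it.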

Align with the packaging done for worker-utils, e.g. in ./stx-metal:
./bsp-files/filter_out_from_controller:worker-utils
./bsp-files/filter_out_from_storage:worker-utils

Need to add the following:
./bsp-files/filter_out_from_controller:pm-qos-mgr

The following already exists.
./bsp-files/filter_out_from_storage:pm-qos-mgr
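
The filter check above can be sketched as a small script. This assumes each filter_out_from_* file lists one package name per line (inferred from the grep-style output shown, not confirmed here); the function names pkg_filtered and report_filters are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if package $2 appears as an exact,
# whole-line entry in filter file $1.
pkg_filtered() {
    grep -qxF -- "$2" "$1"
}

# Report, per node type, whether the package is filtered out.
# $1 = directory holding the filter_out_from_* files, $2 = package name.
report_filters() {
    for node in controller storage; do
        if pkg_filtered "$1/filter_out_from_$node" "$2"; then
            echo "$node: $2 excluded"
        else
            echo "$node: $2 NOT excluded"
        fi
    done
}

# Usage, from the root of the metal repository:
#   report_filters ./bsp-files pm-qos-mgr
```

Matching whole lines with -x avoids a false positive from a longer package name that merely contains pm-qos-mgr.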

Severity
--------
Minor: System/Feature is usable with minor issue.

Steps to Reproduce
------------------
Install load on 2+2 Standard configuration.

On a controller, observe that this RPM is installed when it should not be:
rpm -qa|grep pm-qos-mgr
pm-qos-mgr-1.0-1.tis.x86_64

sudo pkill -9 -f /usr/bin/pm-qos-mgr
After process dies, it does not get restarted with a new PID.
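
The kill-and-check step can be scripted roughly as follows. This is a sketch under assumptions: the function names and the 30-second default wait are hypothetical, and the functions are meant to be sourced into an interactive shell so that the -f pattern does not match the calling script's own command line.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a process exists again ($2 is
# non-empty) and its PID differs from the one recorded before the kill ($1).
restarted_with_new_pid() {
    [ -n "$2" ] && [ "$1" != "$2" ]
}

# Kill the daemon, then poll once per second for a replacement process.
# $1 = command-line pattern, $2 = seconds to wait (assumed default: 30).
check_restart() {
    old_pid=$(pgrep -o -f "$1")
    sudo pkill -9 -f "$1"
    waited=0
    while [ "$waited" -lt "${2:-30}" ]; do
        new_pid=$(pgrep -o -f "$1")
        if restarted_with_new_pid "$old_pid" "$new_pid"; then
            echo "restarted: $old_pid -> $new_pid"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "not restarted within ${2:-30}s"
    return 1
}

# Usage:
#   check_restart /usr/bin/pm-qos-mgr
```

A non-zero return corresponds to the behavior reported here: the process stays dead and no new PID appears.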

Expected Behavior
-----------------
The pm-qos-mgr RPM should be installed only on worker nodes, including AIO nodes (controller,worker).
The pm-qos-mgr process should restart if it is killed.

Actual Behavior
---------------
pm-qos-mgr is installed on every node type except storage, including controllers.
After the process is killed, it does not get restarted with a new PID.

Reproducibility
---------------
100%

System Configuration
--------------------
Multi-node system (2+2 Standard)

Branch/Pull Time/Commit
-----------------------
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-08-14_20-59-00"
SRC_BUILD_ID="10"

JOB="StarlingX_2.0_build"
BUILD_BY="jenkins"
BUILD_NUMBER="10"
BUILD_HOST="yow-cgts4-lx.wrs.com"
BUILD_DATE="2019-08-14 21:00:26 -0400"

Introduced with this:
stx-config
commit 76b1a7a16f536f1187053a22c485d8343e8cc727
Author: Jim Gauld <email address hidden>
Date: Tue May 28 16:34:12 2019 -0400
    Introduce PM QoS cpu latency manager for kubelet

Last Pass
---------
n/a

Timestamp/Logs
--------------
n/a

Test Activity
-------------
Developer testing.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - robustness/cleanup

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Jim Gauld (jgauld)
tags: added: stx.3.0 stx.containers
Revision history for this message
Frank Miller (sensfan22) wrote :

Assigning to Al to implement a solution.

Changed in starlingx:
assignee: Jim Gauld (jgauld) → Al Bailey (albailey1974)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (master)

Fix proposed to branch: master
Review: https://review.opendev.org/681547

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/681550

Revision history for this message
Al Bailey (albailey1974) wrote :

After speaking with Eric, Jim and Don, it seems that this process should be managed by pmon rather than by adding that change to the service file.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/681550
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=fbdcfa4c4d65202b34d5e6ddb69ee7138eb542ef
Submitter: Zuul
Branch: master

commit fbdcfa4c4d65202b34d5e6ddb69ee7138eb542ef
Author: Al Bailey <email address hidden>
Date: Wed Sep 11 12:15:37 2019 -0500

    Only install pm-qos-mgr on worker and AIO nodes

    The pm-qos-mgr package should only be installed on worker nodes.
    This package needs to be excluded from controllers.
    It had only been excluded from storage.
    This is not a functional problem, but impacted patching
    on controllers if we ever have to update it.

    pm-qos-mgr should match the packaging (filters) of worker-utils.

    Change-Id: I7a68d01be1e4d7dd6f3ef327ccce795362643515
    Partial-Bug: 1840356
    Signed-off-by: Al Bailey <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (master)

Reviewed: https://review.opendev.org/681547
Committed: https://git.openstack.org/cgit/starlingx/utilities/commit/?id=01a06ea6e50c1dafde44a6d840d09080fd692ccd
Submitter: Zuul
Branch: master

commit 01a06ea6e50c1dafde44a6d840d09080fd692ccd
Author: Al Bailey <email address hidden>
Date: Wed Sep 11 13:07:48 2019 -0500

    Allow pm-qos-mgr to restart if killed abnormally

    The pm-qos-mgr process does not automatically restart if it is
    killed/stopped abnormally.

    Configuring pm-qos-mgr to be managed by pmon.

    Change-Id: Ifb632d71d63dab6f6b1935880843870ba742f196
    Depends-On: https://review.opendev.org/#/c/681550/
    Partial-Bug: 1840356
    Signed-off-by: Al Bailey <email address hidden>

Revision history for this message
Al Bailey (albailey1974) wrote :

Both commits have merged; this bug is now fixed.

Changed in starlingx:
status: In Progress → Fix Committed
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Committed → Fix Released