The confirm_stop function of some OCF scripts has a flaw

Bug #2064368 reported by Tee Ngo
This bug affects 1 person
Affects   | Status       | Importance | Assigned to | Milestone
StarlingX | Fix Released | Medium     | Li Zhu      |

Bug Description

Brief Description
-----------------
A bug in DC OCF scripts could lead to orphaned processes of the affected services following service restart or controller swact. The orphan processes would eventually exit. This bug was introduced when switching from CentOS (python2) to Debian (python3).

Severity
--------
Minor

Steps to Reproduce
------------------
- Deploy a DC system with many subclouds
- Soak the system for a few hours
- Restart an audit service such as dcmanager-audit-worker with the command "sudo sm-restart service dcmanager-audit" while the workers are performing periodic audits

Expected Behavior
------------------
The dcmanager-audit-worker processes are shut down and new ones are spawned.

Actual Behavior
----------------
Sometimes the old worker processes are not stopped or killed during the restart and become orphans. These processes linger, auditing the same subclouds as the new worker processes, until their queues are empty. They eventually exit.

Reproducibility
---------------
Infrequently

System Configuration
--------------------
Distributed Cloud system

Branch/Pull Time/Commit
-----------------------
Apr. 29th, 2024 master load

Last Pass
---------
StarlingX 7.0

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Other - issue discovered by chance

Workaround
----------
Manually kill the orphan processes.
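
Before killing anything by hand, it helps to tell the lingering workers apart from the freshly spawned ones. The following is only a sketch: stale_workers is a hypothetical helper, and the idea that process age separates old workers from new ones is an assumption, not something stated in this report. It filters "PID ELAPSED_SECONDS COMMAND" lines by name and minimum age.

```shell
#!/bin/sh
# Hypothetical helper (not part of StarlingX): read lines of the form
# "PID ELAPSED_SECONDS COMMAND..." on stdin and print those whose text
# contains $1 and whose elapsed time is at least $2 seconds.
stale_workers() {
    awk -v pat="$1" -v min="$2" 'index($0, pat) && $2 + 0 >= min + 0 { print }'
}

# Example against live processes (procps ps; etimes is elapsed seconds,
# the trailing "=" suppresses the column headers):
#   ps -eo pid=,etimes=,args= | stale_workers dcmanager-audit-worker 600
# Review the output, then kill the stale PIDs by hand: kill <pid>
```

The 600-second threshold in the example is arbitrary; pick a value shorter than the time since the restart so that only processes that predate it are listed.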

Tee Ngo (teewrs)
summary: - Flaw in the confirm stop of DC OCF scripts
+ Flaw in some OCF scripts can lead to orphan processes
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/917825

Changed in starlingx:
status: New → In Progress
Tee Ngo (teewrs)
summary: - Flaw in some OCF scripts can lead to orphan processes
+ The confirm_stop function of some OCF scripts has a flaw
Changed in starlingx:
importance: Undecided → Medium
Tee Ngo (teewrs)
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
assignee: Tee Ngo (teewrs) → Li Zhu (lzhu1)
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/917825
Committed: https://opendev.org/starlingx/distcloud/commit/7ce8b728696b7d336f646d52de99811c9d93d416
Submitter: "Zuul (22348)"
Branch: master

commit 7ce8b728696b7d336f646d52de99811c9d93d416
Author: Li Zhu <email address hidden>
Date: Tue Apr 30 22:43:30 2024 -0400

    Fix up confirm_stop functions of DC OCF scripts

    The DC OCF scripts were not updated over the switch to Debian
    in StarlingX 8.0. As a result, it could lead to orphan processes
    over the service restart or controller swact. The orphan processes
    consume resources and perform duplicate/obsolete tasks (e.g.
    auditing the same subclouds as the corresponding worker processes)
    until their work queues are empty.

    This commit fixes up the pgrep option to restore the functionality
    of the confirm_stop function of the OCF script. Processes that
    fail to be terminated will get killed.

    Test Plan:
      - Deploy a small DC system. Verify that all DC services can
        be started, stopped and restarted by SM.
      - Deploy a large DC system with many subclouds. Reduce the
        thread_pool_size of dcmanager-audit-worker. Let the system
        soak for a couple of hours. Restart the service in the
        middle of the audit cycle. Verify that dcmanager-audit-worker
        service was successfully restarted and there are no orphan
        processes.

    Closes-Bug: 2064368
    Change-Id: Ie5cbc89cde374e32d4e0a3799a9f8833c071d206
    Signed-off-by: Tee Ngo <email address hidden>
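
The commit message does not quote the pgrep change itself. As an illustration only, the sketch below shows one way a python2-to-python3 switch can silently disable a confirm_stop check: a pattern anchored to a python2 interpreter path never matches a python3 command line, so nothing is found and nothing is killed. The patterns, the /usr/bin/example-worker path, and the function names are all hypothetical and are not taken from the DC OCF scripts.

```shell
#!/bin/sh
# Illustrative patterns only: interpreter paths and the worker binary
# path are assumptions, not values from the actual OCF scripts.
OLD_PATTERN='^(python|/usr/bin/python|/usr/bin/python2) /usr/bin/example-worker( |$)'
NEW_PATTERN='^(python|/usr/bin/python|/usr/bin/python2|/usr/bin/python3) /usr/bin/example-worker( |$)'

# Return 0 if the command line in $2 matches the extended regex in $1.
cmdline_matches() {
    printf '%s\n' "$2" | grep -Eq -e "$1"
}

# Hypothetical confirm_stop: after a normal stop attempt, SIGKILL any
# process whose full command line (-f) still matches the pattern.
confirm_stop_sketch() {
    pattern=$1
    if pgrep -f "$pattern" >/dev/null; then
        pkill -KILL -f "$pattern"
    fi
    return 0
}
```

With these patterns, cmdline_matches "$OLD_PATTERN" "/usr/bin/python3 /usr/bin/example-worker" fails while the same call with NEW_PATTERN succeeds, which is the kind of mismatch that would leave surviving processes undetected.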

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.10.0 stx.distcloud