The DC OCF scripts were not updated during the switch to Debian
in StarlingX 8.0. As a result, a service restart or controller
swact could leave orphan processes behind. These orphan processes
consume resources and perform duplicate or obsolete tasks (e.g.
auditing the same subclouds as the corresponding worker processes)
until their work queues are empty.
This commit fixes the pgrep option to restore the functionality
of the confirm_stop function of the OCF scripts. Processes that
fail to terminate are killed.
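The confirm-stop behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the distcloud OCF scripts: the function signature, the grace period, and the match pattern are all assumptions made for the example. It relies only on the standard `pgrep -f` and `pkill -f` options, which match against the full command line.

```shell
#!/bin/sh
# Hypothetical sketch; not the actual distcloud OCF script.

# confirm_stop PATTERN [GRACE_SECONDS]
# Send SIGTERM to every process whose full command line matches PATTERN,
# wait up to GRACE_SECONDS for them to exit, then SIGKILL any survivors.
confirm_stop() {
    pattern=$1
    grace=${2:-5}

    # Nothing matches: the service is already fully stopped.
    pgrep -f "$pattern" >/dev/null || return 0

    # Ask the remaining processes to terminate gracefully.
    pkill -TERM -f "$pattern"

    # Poll once per second until they exit or the grace period runs out.
    while [ "$grace" -gt 0 ] && pgrep -f "$pattern" >/dev/null; do
        sleep 1
        grace=$((grace - 1))
    done

    # Anything still alive ignored SIGTERM, so kill it outright.
    if pgrep -f "$pattern" >/dev/null; then
        pkill -KILL -f "$pattern"
    fi
    return 0
}

# Example call (the pattern is an assumption for illustration):
# confirm_stop '^sleep 31337$' 2
```

Anchoring the pattern with `^` keeps the match from picking up unrelated processes that merely mention the same string in their arguments, which matters when the pattern is applied to full command lines.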
Test Plan:
- Deploy a small DC system. Verify that all DC services can
be started, stopped and restarted by SM.
- Deploy a large DC system with many subclouds. Reduce the
thread_pool_size of dcmanager-audit-worker. Let the system
soak for a couple of hours. Restart the service in the
middle of the audit cycle. Verify that the dcmanager-audit-worker
service was successfully restarted and there are no orphan
processes.
Closes-Bug: 2064368
Change-Id: Ie5cbc89cde374e32d4e0a3799a9f8833c071d206
Signed-off-by: Tee Ngo <email address hidden>
Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/917825
Committed: https://opendev.org/starlingx/distcloud/commit/7ce8b728696b7d336f646d52de99811c9d93d416
Submitter: "Zuul (22348)"
Branch: master
commit 7ce8b728696b7d336f646d52de99811c9d93d416
Author: Li Zhu <email address hidden>
Date: Tue Apr 30 22:43:30 2024 -0400
Fix up confirm_stop functions of DC OCF scripts
The DC OCF scripts were not updated during the switch to Debian
in StarlingX 8.0. As a result, a service restart or controller
swact could leave orphan processes behind. These orphan processes
consume resources and perform duplicate or obsolete tasks (e.g.
auditing the same subclouds as the corresponding worker processes)
until their work queues are empty.
This commit fixes the pgrep option to restore the functionality
of the confirm_stop function of the OCF scripts. Processes that
fail to terminate are killed.
Test Plan:
- Deploy a small DC system. Verify that all DC services can
be started, stopped and restarted by SM.
- Deploy a large DC system with many subclouds. Reduce the
thread_pool_size of dcmanager-audit-worker. Let the system
soak for a couple of hours. Restart the service in the
middle of the audit cycle. Verify that the dcmanager-audit-worker
service was successfully restarted and there are no orphan
processes.
Closes-Bug: 2064368
Change-Id: Ie5cbc89cde374e32d4e0a3799a9f8833c071d206
Signed-off-by: Tee Ngo <email address hidden>