Ansible playbooks running in subprocesses are not stopped when dcmanager/orchestrator is terminated

Bug #1972013 reported by Kyle MacLeod
This bug affects 1 person
Affects     Status         Importance   Assigned to    Milestone
StarlingX   Fix Released   Medium       Kyle MacLeod

Bug Description

Brief Description
This issue was observed while testing large-scale subcloud prestaging. If dcmanager-orchestrator is abruptly restarted during orchestration (manual restart, uncontrolled host-swact, service crash), the ansible-playbook processes running in subprocesses are not terminated, leaving hundreds of processes running. As a result, retrying orchestration can lead to multiple playbooks attempting to prestage the same subcloud at the same time.

Severity
Major

Steps to Reproduce
Create and apply a prestage strategy for a large number of subclouds
Perform host-swact while the strategy is being applied

Expected Behavior
The sub-processes are cleaned up/terminated

Actual Behavior
Playbooks running in sub-processes continue to prestage the subclouds

Reproducibility
100% reproducible

System Configuration
Distributed cloud

Load info
StarlingX master

Last Pass
This was not observed before

Timestamp/Logs
N/A. This issue is readily reproducible

Alarms
N/A

Test Activity
Developer Testing

Workaround
Manually kill all ansible-playbook processes:
pgrep -f ansible-playbook | xargs kill -9

Kyle MacLeod (kmacleod)
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/840981

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/840981
Committed: https://opendev.org/starlingx/distcloud/commit/b24837a73d41cf526a928beb281188b624e05f7a
Submitter: "Zuul (22348)"
Branch: master

commit b24837a73d41cf526a928beb281188b624e05f7a
Author: Kyle MacLeod <email address hidden>
Date: Fri May 6 13:51:19 2022 -0400

    Registration-based subprocess cleanup on service shutdown

    Introduce a helper class SubprocessCleanup in dccommon
    which allows a worker to register a subprocess that must
    be cleaned up (killed) upon service exit.

    There are two parts to this mechanism:
    1. Registration:
        - The subprocess is registered for cleanup when
          spawned (see utils.run_playbook_with_timeout)
        - Subprocess is also spawned using setsid in order to
          start a new process group + session
    2. The Service calls subprocess_cleanup upon stopping.
        - All registered subprocesses are terminated
          using the os.killpg() call to terminate the
          entire subprocess process group.

    Caveat: This mechanism only handles clean process
    exit cases. If the process crashes or is killed
    non-gracefully via SIGKILL, the cleanup will not happen.

    Closes-Bug: 1972013

    Test Plan:

    PASS:

    Orchestrated prestaging:

    * Perform system host-swact while prestaging packages in progress
      - ansible-playbook is terminated
      - prestaging task is marked as prestaging-failed

    * Perform system host-swact while prestaging images in progress
      - ansible-playbook is terminated
      - prestaging task is marked as prestaging-failed

    * Restart dcmanager-orchestrator service for the same
      two cases as above
      - behaviour is the same as for swact

    * Kill dcmanager-orchestrator service while prestaging in progress

    Non-Orchestrated prestaging:

    * Perform host-swact and service restart for non-orchestrated prestaging
      - ansible-playbook is terminated
      - subcloud deploy status marked as prestaging-failed

    Swact during large-scale subcloud add
      - initiate large number of subcloud add operations
      - swact during 'installing' state
      - swact during 'bootstrapping' state
      - verify that ansible playbooks are killed
      - verify that deploy status is updated with -failed state

    Not covered:

    Tested a sudo 'pkill -9 dcmanager-manager' (ungraceful SIGKILL)
      - in this case the ansible subprocess tree is not cleaned up
      - this is expected - we aren't handling a non-clean shutdown

    Signed-off-by: Kyle MacLeod <email address hidden>
    Change-Id: I714398017b71c99edeeaa828933edd8163fb67cd
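
The registration mechanism described in the commit message can be illustrated with a minimal Python sketch. This is not the code merged in dccommon: the method names `register`, `unregister` and `shutdown_cleanup`, the helper `spawn_registered`, and the choice of SIGTERM are assumptions made for illustration. It also uses `start_new_session=True`, which makes the child call setsid, as a stand-in for however the real change invokes setsid.

```python
import os
import signal
import subprocess
import threading


class SubprocessCleanup:
    """Illustrative registry of subprocess groups to terminate on service stop.

    Hypothetical sketch; the real dccommon class may differ in names and details.
    """

    _lock = threading.Lock()
    _pgids = set()

    @classmethod
    def register(cls, proc):
        # The child was started in its own session, so it is the leader of a
        # new process group and its process group ID equals its PID.
        with cls._lock:
            cls._pgids.add(proc.pid)

    @classmethod
    def unregister(cls, proc):
        # Call this once the subprocess has exited normally, so that a
        # recycled PID is never signalled later.
        with cls._lock:
            cls._pgids.discard(proc.pid)

    @classmethod
    def shutdown_cleanup(cls):
        # Intended to be called from the service's stop handler.
        with cls._lock:
            pgids = list(cls._pgids)
            cls._pgids.clear()
        for pgid in pgids:
            try:
                # Signal the whole group so children of the playbook die too.
                os.killpg(pgid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # The group has already exited.


def spawn_registered(cmd):
    """Start cmd in a new session/process group and register it for cleanup."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    SubprocessCleanup.register(proc)
    return proc
```

As the caveat in the commit message notes, a scheme like this only helps when the service's stop path actually runs. If the parent is killed with SIGKILL, `shutdown_cleanup` is never called and the registered process groups keep running.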

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud