StarlingX

Ansible playbooks running in subprocesses are not stopped when dcmanager/orchestrator is terminated

Bug #1972013 reported by Kyle MacLeod on 2022-05-06

This bug affects 1 person

Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Kyle MacLeod

Bug Description

Brief Description
This issue was observed while testing large-scale subcloud prestaging. If dcmanager-orchestrator is abruptly restarted during orchestration (manual restart, uncontrolled host-swact, service crash), the ansible-playbook processes running in subprocesses are not terminated, leaving hundreds of processes running. As a result, retrying orchestration can lead to multiple playbooks attempting to prestage the same subcloud at the same time.

Severity
Major

Steps to Reproduce
Create and apply a prestage strategy for a large number of subclouds
Perform host-swact while the strategy is being applied

Expected Behavior
The sub-processes are cleaned up/terminated

Actual Behavior
Playbooks running in sub-processes continue to prestage the subclouds

Reproducibility
100% reproducible

System Configuration
Distributed cloud

Load info
StarlingX master

Last Pass
This was not observed before

Timestamp/Logs
N/A. This issue is readily reproducible

Alarms
N/A

Test Activity
Developer Testing

Workaround
Manually kill all ansible-playbook processes:
pgrep -f ansible-playbook | xargs kill -9

Tags:

Kyle MacLeod (kmacleod) on 2022-05-06

Changed in starlingx:
assignee:	nobody → Kyle MacLeod (kmacleod)

OpenStack Infra (hudson-openstack) wrote on 2022-05-06: Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/840981

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-05-19: Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/840981
Committed: https://opendev.org/starlingx/distcloud/commit/b24837a73d41cf526a928beb281188b624e05f7a
Submitter: "Zuul (22348)"
Branch: master

commit b24837a73d41cf526a928beb281188b624e05f7a
Author: Kyle MacLeod <email address hidden>
Date: Fri May 6 13:51:19 2022 -0400

Registration-based subprocess cleanup on service shutdown

    Introduce a helper class SubprocessCleanup in dccommon
    which allows a worker to register a subprocess that must
    be cleaned up (killed) upon service exit.

    There are two parts to this mechanism:
    1. Registration:
        - The subprocess is registered for cleanup when
          spawned (see utils.run_playbook_with_timeout)
        - Subprocess is also spawned using setsid in order to
          start a new process group + session
    2. The Service calls subprocess_cleanup upon stopping.
        - All registered subprocesses are terminated
          using the os.killpg() call to terminate the
          entire subprocess process group.

    Caveat: This mechanism only handles clean process
    exit cases. If the process crashes or is killed
    non-gracefully via SIGKILL, the cleanup will not happen.

Closes-Bug: 1972013

Test Plan:

PASS:

Orchestrated prestaging:

    * Perform system host-swact while prestaging packages in progress
      - ansible-playbook is terminated
      - prestaging task is marked as prestaging-failed

    * Perform system host-swact while prestaging images in progress
      - ansible-playbook is terminated
      - prestaging task is marked as prestaging-failed

    * Restart dcmanager-orchestrator service for the same
      two cases as above
      - behaviour is the same as for swact

    * Kill dcmanager-orchestrator service while prestaging in progress

Non-Orchestrated prestaging:

    * Perform host-swact and service restart for non-orchestrated prestaging
      - ansible-playbook is terminated
      - subcloud deploy status marked as prestaging-failed

    Swact during large-scale subcloud add
      - initiate large number of subcloud add operations
      - swact during 'installing' state
      - swact during 'bootstrapping' state
      - verify that ansible playbooks are killed
      - verify that deploy status is updated with -failed state

Not covered:

    Tested a sudo 'pkill -9 dcmanager-manager' (ungraceful SIGKILL)
      - in this case the ansible subprocess tree is not cleaned up
      - this is expected - we aren't handling a non-clean shutdown

Signed-off-by: Kyle MacLeod <email address hidden>
Change-Id: I714398017b71c99edeeaa828933edd8163fb67cd
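
The registration and cleanup mechanism described in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in the review: the method names register, unregister and shutdown_cleanup, the helper run_with_timeout (a stand-in for utils.run_playbook_with_timeout), and the SIGTERM-then-SIGKILL grace period are all hypothetical. The sketch uses Popen's start_new_session=True, which calls setsid() in the child, so the child's process group ID equals its PID.

```python
import os
import signal
import subprocess
import threading


class SubprocessCleanup:
    """Hypothetical registry of subprocesses to kill on service shutdown."""

    _lock = threading.Lock()
    _registered = {}  # pid -> subprocess.Popen

    @classmethod
    def register(cls, proc):
        with cls._lock:
            cls._registered[proc.pid] = proc

    @classmethod
    def unregister(cls, proc):
        with cls._lock:
            cls._registered.pop(proc.pid, None)

    @classmethod
    def shutdown_cleanup(cls, grace_seconds=5):
        """Terminate the process group of every registered subprocess."""
        with cls._lock:
            procs = list(cls._registered.values())
            cls._registered.clear()
        for proc in procs:
            try:
                # The child is a session leader, so pgid == pid.
                os.killpg(proc.pid, signal.SIGTERM)
            except ProcessLookupError:
                continue  # the whole group is already gone
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                # Assumed escalation policy: force-kill after the grace period.
                os.killpg(proc.pid, signal.SIGKILL)
                proc.wait()


def run_with_timeout(cmd, timeout):
    """Run cmd in its own session, registered for cleanup while it runs."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    SubprocessCleanup.register(proc)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        raise
    finally:
        SubprocessCleanup.unregister(proc)
```

In this sketch the service's stop handler would call SubprocessCleanup.shutdown_cleanup(). As the caveat in the commit message notes, nothing runs if the service itself is killed with SIGKILL, so the registered process groups survive in that case.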

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2022-05-20

Changed in starlingx:
importance:	Undecided → Medium
tags:	added: stx.7.0 stx.distcloud
