docker stop/restart kills child processes immediately

Bug #1799642 reported by Rabi Mishra
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
kolla     Fix Released   High         Rabi Mishra
tripleo   Incomplete     High         Unassigned

Bug Description

Kolla uses dumb-init[1] as PID 1 for service containers. When a container is stopped or restarted, SIGTERM is sent to dumb-init, which forwards it to all children in the root session. So when a service has child processes (workers), they also seem to receive SIGTERM and are killed abruptly[2], rather than the parent waiting for them to finish.

[1] https://github.com/openstack/kolla/blob/master/docker/base/Dockerfile.j2#L403
[2] https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L623
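
To illustrate the forwarding behaviour described above, here is a minimal sketch (POSIX only) that contrasts signalling a whole process group with signalling only the direct child. It does not use dumb-init itself: `os.killpg` stands in for its default mode and `os.kill` for `--single-child`. The `SERVICE` script, the marker files and the `forward_sigterm` helper are hypothetical names made up for this example.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Hypothetical stand-in for a service that forks one worker. Each process
# drops a marker file when it receives SIGTERM, then exits.
SERVICE = r"""
import os, signal, sys, time
outdir, role = sys.argv[1], "service"

def on_term(signum, frame):
    open(os.path.join(outdir, role + ".got_sigterm"), "w").close()
    os._exit(0)

signal.signal(signal.SIGTERM, on_term)
if os.fork() == 0:
    role = "worker"
open(os.path.join(outdir, role + ".ready"), "w").close()
while True:
    time.sleep(0.05)
"""


def wait_for(outdir, names, timeout=10.0):
    """Poll until every marker file in `names` exists."""
    deadline = time.monotonic() + timeout
    while not all(os.path.exists(os.path.join(outdir, n)) for n in names):
        if time.monotonic() > deadline:
            raise TimeoutError("service did not become ready")
        time.sleep(0.02)


def forward_sigterm(single_child):
    """Return the sorted roles that saw SIGTERM for the given forwarding mode."""
    with tempfile.TemporaryDirectory() as outdir:
        # start_new_session=True makes the service a session and group
        # leader, so its pid is also the process group id.
        proc = subprocess.Popen(
            [sys.executable, "-c", SERVICE, outdir], start_new_session=True
        )
        try:
            wait_for(outdir, ["service.ready", "worker.ready"])
            if single_child:
                os.kill(proc.pid, signal.SIGTERM)    # direct child only
            else:
                os.killpg(proc.pid, signal.SIGTERM)  # every group member
            proc.wait()
            time.sleep(0.5)  # let the worker record the signal, if it got one
            return sorted(
                name.split(".")[0]
                for name in os.listdir(outdir)
                if name.endswith(".got_sigterm")
            )
        finally:
            try:
                os.killpg(proc.pid, signal.SIGKILL)  # clean up any survivor
            except ProcessLookupError:
                pass


if __name__ == "__main__":
    print("process group:", forward_sigterm(single_child=False))
    print("single child: ", forward_sigterm(single_child=True))
```

With group-wide delivery both the service and its worker should record the signal. With single-child delivery only the service should, which leaves it free to decide how and when to stop its worker.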

Here is what I've noticed with the heat-engine container:

1. Stopping/restarting the container

.7/site-packages/heat/engine/service.py:2343
2018-10-23 11:36:19.944 27 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:424
2018-10-23 11:36:19.945 28 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:424
2018-10-23 11:36:19.946 26 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:424
2018-10-23 11:36:19.946 29 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:424
2018-10-23 11:36:19.950 6 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2018-10-23 11:36:19.951 6 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-10-23 11:36:19.951 6 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
2018-10-23 11:36:19.951 6 DEBUG oslo_service.service [-] Stop services. stop /usr/lib/python2.7/site-packages/oslo_service/service.py:699
2018-10-23 11:36:19.951 6 INFO heat.engine.service [-] All threads were gone, terminating engine
2018-10-23 11:36:19.952 6 DEBUG oslo_service.service [-] Killing children. stop /usr/lib/python2.7/site-packages/oslo_service/service.py:704
2018-10-23 11:36:19.952 6 INFO oslo_service.service [-] Waiting on 4 children to exit
2018-10-23 11:36:19.981 6 INFO oslo_service.service [-] Child 27 killed by signal 15
2018-10-23 11:36:19.995 6 INFO oslo_service.service [-] Child 29 killed by signal 15
2018-10-23 11:36:20.006 6 INFO oslo_service.service [-] Child 28 killed by signal 15
2018-10-23 11:36:20.009 6 INFO oslo_service.service [-] Child 26 killed by signal 15

2. When SIGTERM is sent (with kill) only to the main heat-engine process

.7/site-packages/heat/engine/service.py:2343
2018-10-23 12:00:56.221 6 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2018-10-23 12:00:56.221 6 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-10-23 12:00:56.221 6 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
2018-10-23 12:00:56.222 6 DEBUG oslo_service.service [-] Stop services. stop /usr/lib/python2.7/site-packages/oslo_service/service.py:699
2018-10-23 12:00:56.222 6 INFO heat.engine.service [-] All threads were gone, terminating engine
2018-10-23 12:00:56.222 6 DEBUG oslo_service.service [-] Killing children. stop /usr/lib/python2.7/site-packages/oslo_service/service.py:704
2018-10-23 12:00:56.222 6 INFO oslo_service.service [-] Waiting on 4 children to exit
2018-10-23 12:00:56.223 28 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:424
2018-10-23 12:00:56.224 27 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:424
2018-10-23 12:00:56.224 25 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:424
2018-10-23 12:00:56.224 26 DEBUG heat.engine.service [-] Attempting to stop engine service... _stop_rpc_server /usr/lib/python2.7/site-packages/heat/engine/service.py:424
2018-10-23 12:01:03.638 25 INFO heat.engine.service [-] Engine service is stopped successfully
2018-10-23 12:01:03.638 25 DEBUG heat.engine.service [-] Attempting to stop engine listener... stop /usr/lib/python2.7/site-packages/heat/engine/service.py:306
2018-10-23 12:01:08.626 25 INFO heat.engine.service [-] Engine listener is stopped successfully
2018-10-23 12:01:08.627 25 INFO heat.engine.worker [-] Stopping engine_worker in engine e0a6bf2e-99d3-461d-abc6-ca44b9e1a2ef.
2018-10-23 12:01:08.678 25 INFO heat.engine.service [-] Waiting stack None processing to be finished
2018-10-23 12:01:08.722 25 INFO heat.engine.service [-] Stack None processing was finished
2018-10-23 12:01:08.732 27 INFO heat.engine.service [-] Engine service is stopped successfully
2018-10-23 12:01:08.732 25 INFO heat.engine.service [req-ff93becb-f183-44b7-b3f5-b8dfa47526d0 - - - - -] Service cafb13f5-241e-49dc-9830-3bb175286474 is deleted
2018-10-23 12:01:08.732 27 DEBUG heat.engine.service [-] Attempting to stop engine listener... stop /usr/lib/python2.7/site-packages/heat/engine/service.py:306
2018-10-23 12:01:08.732 25 INFO heat.engine.service [req-ff93becb-f183-44b7-b3f5-b8dfa47526d0 - - - - -] All threads were gone, terminating engine
2018-10-23 12:01:08.733 25 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-10-23 12:01:08.733 25 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
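
One possible reading of the difference between the two logs is that, in the first case, each worker receives SIGTERM twice: once forwarded by dumb-init and once from the parent's "Killing children" step. The sketch below (POSIX only) models that under a loudly stated assumption: the worker's SIGTERM handler restores the default disposition before starting a graceful stop, so a second SIGTERM terminates it immediately. This is a simplified model, not oslo.service code; `WORKER` and `run_worker` are hypothetical names.

```python
import signal
import subprocess
import sys
import time

# Hypothetical worker. ASSUMPTION: on the first SIGTERM it resets the handler
# to the default action and then spends one second on "in-flight work".
WORKER = r"""
import signal, sys, time
stopping = False

def on_term(signum, frame):
    global stopping
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    stopping = True

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while not stopping:
    time.sleep(0.05)
time.sleep(1.0)  # simulated graceful shutdown work
sys.exit(0)
"""


def run_worker(sigterm_count):
    """Start a worker, send it `sigterm_count` SIGTERMs, return its exit status."""
    with subprocess.Popen(
        [sys.executable, "-c", WORKER], stdout=subprocess.PIPE
    ) as proc:
        proc.stdout.readline()  # wait until the handler is installed
        for _ in range(sigterm_count):
            proc.send_signal(signal.SIGTERM)
            time.sleep(0.2)  # let the worker react before the next signal
        return proc.wait()


if __name__ == "__main__":
    # A negative return code means the worker was terminated by that signal.
    print("one SIGTERM :", run_worker(1))
    print("two SIGTERMs:", run_worker(2))
```

Under this model a single SIGTERM lets the worker finish and exit with status 0, while a second one ends it with signal 15, which would match the "Child N killed by signal 15" lines in the first log.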

I'm not sure whether dumb-init with --single-child would help, whether we have to do some signal rewriting, or whether something needs to be fixed in oslo.service.

Rabi Mishra (rabi) wrote :

I'm experimenting in kolla with https://review.openstack.org/#/c/612887/, but I'm not sure whether it will actually fix the issue.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → stein-1
tags: added: containers pike-backport-potential queens-backport-potential rocky-backport-potential
Rabi Mishra (rabi)
Changed in kolla:
assignee: nobody → Rabi Mishra (rabi)
Changed in kolla:
status: New → In Progress
Changed in tripleo:
milestone: stein-1 → stein-2
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla (master)

Reviewed: https://review.openstack.org/612887
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=b06d8387f50ac9c536941ffddb00c5bdb45753b6
Submitter: Zuul
Branch: master

commit b06d8387f50ac9c536941ffddb00c5bdb45753b6
Author: Rabi Mishra <email address hidden>
Date: Wed Oct 24 11:24:07 2018 +0530

    Use dumb-init with --single-child

    We would probably like to forward signals only to the direct child
    which in turn takes care its children and not to all worker child
    processes.

    Change-Id: Id91ebb8b0ecc43946845de386350af0536dd661f
    Related-Bug: #1799642

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/620238

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/621709

OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla (stable/rocky)

Reviewed: https://review.openstack.org/620238
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=e11c541b3f291c13c6ed553ef115fd74468dfc64
Submitter: Zuul
Branch: stable/rocky

commit e11c541b3f291c13c6ed553ef115fd74468dfc64
Author: Rabi Mishra <email address hidden>
Date: Wed Oct 24 11:24:07 2018 +0530

    Use dumb-init with --single-child

    We would probably like to forward signals only to the direct child
    which in turn takes care its children and not to all worker child
    processes.

    Change-Id: Id91ebb8b0ecc43946845de386350af0536dd661f
    Depends-On: https://review.openstack.org/620274/
    Related-Bug: #1799642
    (cherry picked from commit b06d8387f50ac9c536941ffddb00c5bdb45753b6)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla (master)

Reviewed: https://review.openstack.org/624967
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=6258a920fdf8432f39696815765f3320afe92fca
Submitter: Zuul
Branch: master

commit 6258a920fdf8432f39696815765f3320afe92fca
Author: Mark Goddard <email address hidden>
Date: Thu Dec 13 12:07:59 2018 +0000

    Clear ENTRYPOINT configuration for Bifrost

    Bifrost was broken by the recent kolla change [1] to use an ENTRYPOINT
    for dumb-init. The container failed to start because dumb-init was
    trying to run /sbin/init, which expects to be pid 1.

    [1] Id91ebb8b0ecc43946845de386350af0536dd661f

    Change-Id: Id77ecfca09dfda8da984589f70a26433214ee3af
    Closes-Bug: #1808326
    Related-Bug: #1799642

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/625930

OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla (stable/rocky)

Reviewed: https://review.openstack.org/625930
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=fe7ff62f68689362799cc9147cd456dd633255fe
Submitter: Zuul
Branch: stable/rocky

commit fe7ff62f68689362799cc9147cd456dd633255fe
Author: Mark Goddard <email address hidden>
Date: Thu Dec 13 12:07:59 2018 +0000

    Clear ENTRYPOINT configuration for Bifrost

    Bifrost was broken by the recent kolla change [1] to use an ENTRYPOINT
    for dumb-init. The container failed to start because dumb-init was
    trying to run /sbin/init, which expects to be pid 1.

    [1] Id91ebb8b0ecc43946845de386350af0536dd661f

    Change-Id: Id77ecfca09dfda8da984589f70a26433214ee3af
    Closes-Bug: #1808326
    Related-Bug: #1799642
    (cherry picked from commit 6258a920fdf8432f39696815765f3320afe92fca)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla (stable/queens)

Reviewed: https://review.openstack.org/621709
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=1dacd1944e285ef70f60c7e9982fa2ab180b144e
Submitter: Zuul
Branch: stable/queens

commit 1dacd1944e285ef70f60c7e9982fa2ab180b144e
Author: Rabi Mishra <email address hidden>
Date: Wed Oct 24 11:24:07 2018 +0530

    Use dumb-init with --single-child

    We would probably like to forward signals only to the direct child
    which in turn takes care its children and not to all worker child
    processes.

    (cherry picked from commit 6258a920fdf8432f39696815765f3320afe92fca)

    This change also includes a cherry pick of a fix for bifrost:

    Clear ENTRYPOINT configuration for Bifrost

    Bifrost was broken by the recent kolla change [1] to use an ENTRYPOINT
    for dumb-init. The container failed to start because dumb-init was
    trying to run /sbin/init, which expects to be pid 1.

    [1] Id91ebb8b0ecc43946845de386350af0536dd661f

    (cherry picked from commit b06d8387f50ac9c536941ffddb00c5bdb45753b6)

    Change-Id: Id91ebb8b0ecc43946845de386350af0536dd661f
    Depends-On: https://review.openstack.org/621871/
    Closes-Bug: #1808326
    Related-Bug: #1799642
    Related-Bug: #1799642

tags: added: in-stable-queens
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Mark Goddard (mgoddard)
Changed in kolla:
status: In Progress → Fix Released
importance: Undecided → High
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Incomplete
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Mark Goddard (mgoddard) wrote :

This bug is marked incomplete for tripleo. Please remove the milestone to stop the spam with each RC :)

Changed in tripleo:
milestone: victoria-1 → none