"service_kill" should stop the process when the namespace does not exist

Bug #1868607 reported by Rodolfo Alonso
This bug affects 1 person
Affects   Status         Importance   Assigned to      Milestone
tripleo   Fix Released   High         Rodolfo Alonso

Bug Description

The "service_kill" script should kill the running container even if the namespace in which the container was started has been deleted. If a namespace is deleted while a container is running inside it (e.g. the Neutron dnsmasq sidecar container), the container keeps running.

However, when the process is stopped [1], the script is called and fails at [2] because the namespace name returned is empty (as it would be for the root namespace). In this case, the script should skip the namespace and change the signal to "SIGKILL" to terminate this unreferenced process.

If the script exits with 1, Neutron retries the DHCP agent resync and fails again, in an endless loop, always executing the same script: http://paste.openstack.org/show/791043/
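As a rough illustration of the fallback proposed above (not the actual service_kill.j2 template), the decision could be isolated in a small shell function. The name plan_kill and its output format are made up for this sketch; only "ip netns identify" and the SIGKILL fallback come from this report.

```shell
#!/bin/sh
# Hypothetical helper, NOT the real service_kill.j2 logic.
# $1: output of `ip netns identify <PID>` (empty when no named namespace matches)
# $2: signal requested by the caller (e.g. 15)
# Prints "<signal> <mode>", where <mode> is "netns:<name>" or "root".
plan_kill() {
    netns="$1"
    signal="$2"
    if [ -z "$netns" ]; then
        # No namespace name: do not try to enter a namespace, force SIGKILL.
        printf '%s %s\n' "KILL" "root"
    else
        # Namespace still exists: keep the requested signal and the namespace.
        printf '%s %s\n' "$signal" "netns:$netns"
    fi
}

# Possible use in a kill script (needs root and iproute2, not executed here):
#   netns=$(ip netns identify "$pid")
#   plan_kill "$netns" "$sig"
```

With this shape the script has no path that exits with 1 just because the namespace name is empty, which is the condition that triggers the resync loop described above.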

The kill script's log reports the error:
"""
Tue Mar 3 13:59:29 UTC 2020 Deleting container neutron-dnsmasq-qdhcp-3d1b1411-f8d2-4348-a500-e7af340f2d1c (6a2ac5c799fb7175b20ffa768b2b297f2c9de8e3f5271d99cf41ccff3dc94b64)
6a2ac5c799fb7175b20ffa768b2b297f2c9de8e3f5271d99cf41ccff3dc94b64
Tue Mar 3 13:59:30 UTC 2020 No network namespace detected, exiting
Tue Mar 3 13:59:30 UTC 2020 No network namespace detected, exiting
Tue Mar 3 13:59:31 UTC 2020 No network namespace detected, exiting
"""

Related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1809634

This bug can be reproduced by creating a namespace, running a container inside it and then deleting the namespace:
$ ip netns add ns01
$ nsenter --net=/run/netns/ns01 podman run --name fistro -v /var/run/netns:/var/run/netns --privileged -it fedora sleep 20000
$ ps aux | grep 20000
root 3855 0.0 0.1 1256440 30960 pts/2 Sl+ 17:45 0:00 podman run --name fistro -v /var/run/netns:/var/run/netns --privileged -it fedora sleep 20000
root 3967 0.0 0.0 2340 760 pts/0 Ss+ 17:45 0:00 sleep 20000
root 4299 0.0 0.0 15780 1244 pts/3 S+ 18:03 0:00 ag 20000
$ ip netns identify 3855
ns01
$ ip netns delete ns01
$ ip netns identify 3855
<empty string>

[1] https://github.com/openstack/neutron/blob/b319d64388cd4e9c1a26247aaf7da0556bcd476e/neutron/agent/linux/external_process.py#L102
[2] https://github.com/openstack/tripleo-ansible/blob/e556e74d7672f37bb454e0046fecf915f32dc94e/tripleo_ansible/roles/tripleo_systemd_wrapper/templates/service_kill.j2#L21
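When following the reproduction above, note that an empty result from "ip netns identify" only says that no named namespace matches the process. A possible way to check whether two processes really share a network namespace is to compare their /proc/<PID>/ns/net links. The function name same_netns and the optional PROC_ROOT argument are assumptions made for this sketch.

```shell
#!/bin/sh
# same_netns PID_A PID_B [PROC_ROOT]
# Hypothetical helper: returns 0 if both PIDs point at the same network
# namespace, 1 if they differ, 2 if either link cannot be read.
# PROC_ROOT defaults to /proc and exists only to make the function testable.
same_netns() {
    proc_root="${3:-/proc}"
    ns_a=$(readlink "$proc_root/$1/ns/net") || return 2
    ns_b=$(readlink "$proc_root/$2/ns/net") || return 2
    # The link target looks like "net:[4026531992]"; equal targets mean
    # the same namespace.
    [ "$ns_a" = "$ns_b" ]
}
```

On the reproduction host, "same_netns 3855 1" (run as root) should still return 1 after "ip netns delete ns01", since deleting the name does not move the running process out of its namespace.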

Changed in tripleo:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/714517

Changed in tripleo:
status: New → In Progress
Changed in tripleo:
importance: Undecided → High
milestone: none → ussuri-3
tags: added: train-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/714517
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=c516df9e519745eaea1d8699e8374dff47fb6d24
Submitter: Zuul
Branch: master

commit c516df9e519745eaea1d8699e8374dff47fb6d24
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Mar 23 18:36:38 2020 +0000

    Force container deletion if namespace does not exist in service_kill

    When a service is stopped using "service_kill" script and the
    namespace where the container is running does not exist, the
    container process should be forced to stop from the root namespace.

    A namespace where a process is running can be deleted without
    stopping the mentioned process. "ip netns identify <PID>" then
    returns an empty string (root namespace).

    This patch will prevent an endless loop in Neutron DHCP agent. As
    reported in the related bug, when a DHCP agent is resync, the DHCP
    helper (metadata proxy) is stopped. In case this process stop raises
    an exception (for example if the namespace does not exist), schedules
    again a resync, creating an endless loop.

    Change-Id: I9bac918fcde80e6a2336bc3cf1e6972512298118
    Closes-Bug: #1868607

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/715019

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/715019
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=9d5ef54e1b7edd4602a6831b3733cf4277b77c83
Submitter: Zuul
Branch: stable/train

commit 9d5ef54e1b7edd4602a6831b3733cf4277b77c83
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Mar 23 18:36:38 2020 +0000

    Force container deletion if namespace does not exist in service_kill

    When a service is stopped using "service_kill" script and the
    namespace where the container is running does not exist, the
    container process should be forced to stop from the root namespace.

    A namespace where a process is running can be deleted without
    stopping the mentioned process. "ip netns identify <PID>" then
    returns an empty string (root namespace).

    This patch will prevent an endless loop in Neutron DHCP agent. As
    reported in the related bug, when a DHCP agent is resync, the DHCP
    helper (metadata proxy) is stopped. In case this process stop raises
    an exception (for example if the namespace does not exist), schedules
    again a resync, creating an endless loop.

    Change-Id: I9bac918fcde80e6a2336bc3cf1e6972512298118
    Closes-Bug: #1868607
    (cherry picked from commit c516df9e519745eaea1d8699e8374dff47fb6d24)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 0.5.0

This issue was fixed in the openstack/tripleo-ansible 0.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 1.3.0

This issue was fixed in the openstack/tripleo-ansible 1.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/730657

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/730831

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by Rodolfo Alonso Hernandez (<email address hidden>) on branch: master
Review: https://review.opendev.org/730657
Reason: Superseded by https://review.opendev.org/#/c/730831/

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/731120

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/731121

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/731120
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=05f19f2c596149c19f5005b4b31ccfdb11bc388d
Submitter: Zuul
Branch: stable/ussuri

commit 05f19f2c596149c19f5005b4b31ccfdb11bc388d
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue May 26 13:59:42 2020 +0000

    Force container deletion if namespace does not exist in service_kill

    When a service is stopped using "service_kill" script and the
    namespace where the container is running does not exist, the
    container process should be forced to stop from the root namespace.

    A namespace where a process is running can be deleted without
    stopping the mentioned process. "ip netns identify <PID>" then
    returns an empty string (root namespace).

    If the namespace where a container was executed is deleted,
    "service_kill" script should execute a container related command
    from the root namespace. To access to the root namespace from
    inside a container, running in another namespace, it is necessary
    to gain access via "nsenter", specifying the parameter "--all" to
    access to all namespaces of the target process.

    This patch will prevent an endless loop in Neutron DHCP agent. As
    reported in the related bug, when a DHCP agent is resync, the DHCP
    helper (metadata proxy) is stopped. In case this process stop raises
    an exception (for example if the namespace does not exist), schedules
    again a resync, creating an endless loop.

    This patch combines [1] and [2] in this repository.
    [1]https://review.opendev.org/#/c/714517/
    [2]https://review.opendev.org/#/c/730657/

    Change-Id: Ifb7dbfb93a7cf0b50ef15652d83d87f65bdb6221
    Closes-Bug: #1868607
    (cherry picked from commit 0bc1383a60c4ab249d16402c37adcea988b84c53)
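The "nsenter --all" approach described in the commit message above could look roughly like the following dry-run sketch. It only prints the command instead of running it. The function name build_root_kill_cmd, the choice of PID 1 as the target process and the use of "podman kill" are assumptions for illustration; the merged template may differ.

```shell
#!/bin/sh
# build_root_kill_cmd CONTAINER [TARGET_PID]
# Hypothetical helper: prints a command that enters all namespaces of
# TARGET_PID and force-kills CONTAINER from there. Nothing is executed.
build_root_kill_cmd() {
    container="$1"
    # Assumption: PID 1 is visible and lives in the host (root) namespaces.
    target_pid="${2:-1}"
    printf 'nsenter --target %s --all podman kill --signal KILL %s\n' \
        "$target_pid" "$container"
}
```

For the container from the reproduction steps, "build_root_kill_cmd fistro" prints the nsenter command line that a kill script could then run as root.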

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/730831
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=0bc1383a60c4ab249d16402c37adcea988b84c53
Submitter: Zuul
Branch: master

commit 0bc1383a60c4ab249d16402c37adcea988b84c53
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue May 26 13:59:42 2020 +0000

    Force container deletion if namespace does not exist in service_kill

    When a service is stopped using "service_kill" script and the
    namespace where the container is running does not exist, the
    container process should be forced to stop from the root namespace.

    A namespace where a process is running can be deleted without
    stopping the mentioned process. "ip netns identify <PID>" then
    returns an empty string (root namespace).

    If the namespace where a container was executed is deleted,
    "service_kill" script should execute a container related command
    from the root namespace. To access to the root namespace from
    inside a container, running in another namespace, it is necessary
    to gain access via "nsenter", specifying the parameter "--all" to
    access to all namespaces of the target process.

    This patch will prevent an endless loop in Neutron DHCP agent. As
    reported in the related bug, when a DHCP agent is resync, the DHCP
    helper (metadata proxy) is stopped. In case this process stop raises
    an exception (for example if the namespace does not exist), schedules
    again a resync, creating an endless loop.

    This patch combines [1] and [2] in this repository.
    [1]https://review.opendev.org/#/c/714517/
    [2]https://review.opendev.org/#/c/730657/

    Change-Id: Ifb7dbfb93a7cf0b50ef15652d83d87f65bdb6221
    Closes-Bug: #1868607

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/731121
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bfbb55e145b4f71811653af54aedf382a8635f9c
Submitter: Zuul
Branch: stable/train

commit bfbb55e145b4f71811653af54aedf382a8635f9c
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue May 26 13:59:42 2020 +0000

    Force container deletion if namespace does not exist in service_kill

    When a service is stopped using "service_kill" script and the
    namespace where the container is running does not exist, the
    container process should be forced to stop from the root namespace.

    A namespace where a process is running can be deleted without
    stopping the mentioned process. "ip netns identify <PID>" then
    returns an empty string (root namespace).

    If the namespace where a container was executed is deleted,
    "service_kill" script should execute a container related command
    from the root namespace. To access to the root namespace from
    inside a container, running in another namespace, it is necessary
    to gain access via "nsenter", specifying the parameter "--all" to
    access to all namespaces of the target process.

    This patch will prevent an endless loop in Neutron DHCP agent. As
    reported in the related bug, when a DHCP agent is resync, the DHCP
    helper (metadata proxy) is stopped. In case this process stop raises
    an exception (for example if the namespace does not exist), schedules
    again a resync, creating an endless loop.

    This patch combines [1] and [2] in this repository.
    [1]https://review.opendev.org/#/c/714517/
    [2]https://review.opendev.org/#/c/730657/

    Change-Id: Ifb7dbfb93a7cf0b50ef15652d83d87f65bdb6221
    Closes-Bug: #1868607
    (cherry picked from commit 0bc1383a60c4ab249d16402c37adcea988b84c53)
    (cherry picked from commit 05f19f2c596149c19f5005b4b31ccfdb11bc388d)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.
