Service kill script has bad substitution

Bug #1860155 reported by Nate Johnston
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Undecided
Assigned to: Nate Johnston

Bug Description

L3 HA routers aren't cleaned up properly because there is a problem with killing keepalived containers.

Error in L3 agent logs:

2020-01-16 04:46:55.521 121536 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['keepalived-kill', '15', '888052'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2020-01-16 04:46:55.686 121536 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: + exec
+ trap 'exec 2>&4 1>&3' 0 1 2 3
+ exec

2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent [-] Error while deleting router 82565834-bf99-431f-9092-e68fae912344: neutron_lib.exceptions.ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: + exec
+ trap 'exec 2>&4 1>&3' 0 1 2 3
+ exec
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 506, in _safe_router_removed
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent self._router_removed(ri, router_id)
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 542, in _router_removed
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent self.router_info[router_id] = ri
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent self.force_reraise()
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent six.reraise(self.type_, self.value, self.tb)
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent raise value
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 539, in _router_removed
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent ri.delete()
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/neutron/agent/l3/ha_router.py", line 479, in delete
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent self.disable_keepalived()
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/neutron/agent/l3/ha_router.py", line 190, in disable_keepalived
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent self.keepalived_manager.disable()
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/keepalived.py", line 453, in disable
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent pm.disable(sig='15')
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 113, in disable
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent utils.execute(cmd, run_as_root=self.run_as_root)
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py", line 147, in execute
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent returncode=returncode)
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent neutron_lib.exceptions.ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: + exec
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent + trap 'exec 2>&4 1>&3' 0 1 2 3
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent + exec
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent
2020-01-16 04:46:55.687 121536 ERROR neutron.agent.l3.agent

Error in kill-script.log:
+ SIG=15
+ PID=881230
++ ip netns identify 881230
+ NETNS=qrouter-82565834-bf99-431f-9092-e68fae912344
+ '[' xqrouter-82565834-bf99-431f-9092-e68fae912344 == x ']'
+ CLI='nsenter --net=/run/netns/qrouter-82565834-bf99-431f-9092-e68fae912344 --preserve-credentials -m -t 1 podman'
+ '[' -f /proc/881230/cgroup ']'
++ awk 'BEGIN {FS="[-.]"} /name=/{print $3}' /proc/881230/cgroup
+ CT_ID=31d9a79a18faa70cff94cbe1ea96073ff2a96932f45b31aaa6792792c7e589e3
++ nsenter --net=/run/netns/qrouter-82565834-bf99-431f-9092-e68fae912344 --preserve-credentials -m -t 1 podman inspect -f '{{.Name}}' 31d9a79a18faa70cff94cbe1ea96073ff2a96932f45b31aaa6792792c7e589e3
+ CT_NAME=neutron-keepalived-qrouter-82565834-bf99-431f-9092-e68fae912344
+ case $SIG in
/etc/neutron/kill_scripts/keepalived-kill: line 50: Unknown action ${SIG} for ${$CT_NAME} ${CT_ID}: bad substitution

I'm not sure what real problems this may cause for users. It will certainly leave keepalived processes running on the host, and it may also be the reason that some tests from tempest.scenario.test_network_basic_ops.TestNetworkBasicOps fail.
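The error on line 50 of the script can be reproduced in isolation. The sketch below is an illustration, not an excerpt from keepalived-kill: it assumes bash is on the PATH, reuses the SIG, CT_NAME and CT_ID values from the log above, and the names faulty, err and status are made up for the demo. It also suggests why the problem stayed hidden: bash only rejects ${$CT_NAME} when the line is actually expanded, which in the log happens only after the case statement is reached with signal 15.

```shell
#!/bin/bash
# Values copied from the kill-script.log excerpt in this report.
SIG=15
CT_NAME=neutron-keepalived-qrouter-82565834-bf99-431f-9092-e68fae912344
CT_ID=31d9a79a18faa70cff94cbe1ea96073ff2a96932f45b31aaa6792792c7e589e3

# Single quotes keep the text unexpanded until the child shell evaluates it.
faulty='echo "Unknown action ${SIG} for ${$CT_NAME} ${CT_ID}"'

# Run the faulty line in a child bash so the failure does not abort this demo.
# LC_ALL=C keeps the error message in English.
err=$(LC_ALL=C SIG="$SIG" CT_NAME="$CT_NAME" CT_ID="$CT_ID" bash -c "$faulty" 2>&1) \
  && status=0 || status=$?

echo "exit status: $status"
echo "stderr: $err"
```

The child shell should exit non-zero with a "bad substitution" message and print nothing on stdout, which matches the empty Stdout and exit code 1 seen in the L3 agent log.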

Changed in tripleo:
assignee: nobody → Nate Johnston (nate-johnston)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/703123

Changed in tripleo:
status: New → In Progress

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/703128

Changed in tripleo:
milestone: none → ussuri-2

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/703123
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=b45d4c6d219e8e27219bca341acdfd634155d6f6
Submitter: Zuul
Branch: master

commit b45d4c6d219e8e27219bca341acdfd634155d6f6
Author: Nate Johnston <email address hidden>
Date: Fri Jan 17 11:46:26 2020 -0500

    Fix substitution in kill-script

    In the kill-script there is a string "Unknown action ${SIG} for
    ${$CT_NAME} ${CT_ID}" which results in a "bad substitution" error, as
    there is no variable named with what the contents of the CT_NAME
    environment variable contains. Remove the extraneous '$'.

    Change-Id: I4c76071083bf5cb4f876d3b78c379822a8bd8db1
    Fixes-Bug: #1860155
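For reference, a minimal sketch of the corrected expansion. The function name log_unknown_action and the short container name and ID are hypothetical and chosen for the demo; only the message format comes from the commit above. Note that bash rejects ${$CT_NAME} outright rather than attempting a lookup by name; indirect expansion in bash is spelled ${!NAME}, and a plain ${CT_NAME} is all the message needs.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not taken from the actual kill script.
log_unknown_action() {
  local SIG=$1 CT_NAME=$2 CT_ID=$3
  # Corrected form: ${CT_NAME} instead of ${$CT_NAME}.
  echo "Unknown action ${SIG} for ${CT_NAME} ${CT_ID}"
}

# Made-up container name and ID, shortened for readability.
msg=$(log_unknown_action 15 neutron-keepalived-qrouter-example 31d9a79a18fa)
echo "$msg"
```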

Changed in tripleo:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/703128
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=06dc258a28784db98e44a3de488c204f01b97613
Submitter: Zuul
Branch: master

commit 06dc258a28784db98e44a3de488c204f01b97613
Author: Nate Johnston <email address hidden>
Date: Fri Jan 17 12:11:29 2020 -0500

    Add handling of signal 15 in kill script

    The reason bug #1860155 was triggered was because the kill script did
    not have a stanza for handling the signal that was passed in, which is
    signal 15. Since signal 15 is unhandled, keepalived processes will
    still stick around. Add handling for signal 15.

    Change-Id: I632a3ef5ec137df10f647335f6354589c2316fd0
    Related-bug: #1860155
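A rough sketch of what a signal 15 stanza in such a case statement could look like. This is an assumption about the shape of the change, not the merged code: the function name handle_signal is hypothetical, the branch for signal 9 is modelled on the stop-then-remove behaviour described elsewhere in this report, and CLI is stubbed with echo so that nothing is actually stopped or signalled.

```shell
#!/bin/bash
# Stub: print the podman command instead of running it.
CLI="echo podman"

# Hypothetical dispatcher; the real script's structure may differ.
handle_signal() {
  local sig=$1 ct_name=$2 ct_id=$3
  case $sig in
    9)
      # Hard kill: stop the container, then remove it.
      $CLI stop "$ct_id"
      $CLI rm "$ct_id"
      ;;
    15)
      # Graceful termination: forward SIGTERM to the container.
      $CLI kill --signal TERM "$ct_id"
      ;;
    *)
      echo "Unknown action ${sig} for ${ct_name} ${ct_id}"
      return 1
      ;;
  esac
}

# Made-up container name and ID.
out=$(handle_signal 15 neutron-keepalived-example abc123)
echo "$out"
```

With a stanza for 15 in place, the default branch is no longer reached for the signal that the L3 agent passes to keepalived-kill, so the script has a chance to act on the container instead of exiting with status 1.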

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/704463

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/704463
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=e204f16e5f3ca238a9c8157fab7ba673efb71db9
Submitter: Zuul
Branch: stable/train

commit e204f16e5f3ca238a9c8157fab7ba673efb71db9
Author: Nate Johnston <email address hidden>
Date: Mon Jan 27 18:04:52 2020 -0500

    Fix kill-script

    This change squashes two changes that are being backported from the
    tripleo-ansible repo. They are to the same file, which was relocated to
    tripleo-ansible between Train and Ussuri.

    Change https://review.opendev.org/703123: In the kill-script there is a
    string "Unknown action ${SIG} for ${$CT_NAME} ${CT_ID}" which results in
    a "bad substitution" error, as there is no variable named with what the
    contents of the CT_NAME environment variable contains. Remove the
    extraneous '$'.

    Change https://review.opendev.org/703128: The reason bug #1860155 was
    triggered was because the kill script did not have a stanza for handling
    the signal that was passed in, which is signal 15. Since signal 15 is
    unhandled, keepalived processes will still stick around. Add handling
    for signal 15.

    Change-Id: Ib47fc73b498b6366efa4ae5b16855bd45cb3ec91
    Fixes-Bug: #1860155

tags: added: in-stable-train

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/706379

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/706379
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=60342041a3eb2c091916e36e4bf66199937db335
Submitter: Zuul
Branch: stable/train

commit 60342041a3eb2c091916e36e4bf66199937db335
Author: Alex Schultz <email address hidden>
Date: Thu Nov 7 16:06:54 2019 -0700

    [TRAIN] Backport tripleo-systemd-wrapper (squash)

    This is a combination of 4 commits.

    This is the 1st commit message:

    Implement tripleo-systemd-wrapper role

    This patch adds a new role that will be used to manage side containers
    with systemd instead of docker.socket or nsenter. The main use case here
    is Neutron, although this role is designed to work with any service.

    This role will create a series of systemd files to monitor a file which
    gets mounted into a container. Additionally a wrapper script is
    generated which is mounted in the container that will provide the
    arguments that should be used to launch new containers.

    Blueprint: safe-side-containers
    Change-Id: I4821b7ca0260e4dfd1717ba976cef700d160f84f
    Co-Authored-By: Dan Prince <email address hidden>
    Co-Authored-By: Emilien Macchi <email address hidden>
    Co-Authored-By: Alex Schultz <email address hidden>
    (cherry picked from commit 699249f1790dd5646556173bf5331e7e71135ad4)

    This is the commit message #2:

    Remove --rm=true from sidecar container sync

    Neutron uses kill-scripts which remove the container after stopping it.
    If the container is launched with docker and --rm=true, the container
    will automatically be cleaned up and the $(CLI) rm <container id> in the
    kill script will error out because the container can't be found.

    Related-Bug: #1858662
    Change-Id: I3d7940cb0816adce58e0fa778469dcec95302f67
    (cherry picked from commit 17d97f2618e56be606c9307551efdaadda8dff69)

    This is the commit message #3:

    Fix substitution in kill-script

    In the kill-script there is a string "Unknown action ${SIG} for
    ${$CT_NAME} ${CT_ID}" which results in a "bad substitution" error, as
    there is no variable named with what the contents of the CT_NAME
    environment variable contains. Remove the extraneous '$'.

    Change-Id: I4c76071083bf5cb4f876d3b78c379822a8bd8db1
    Fixes-Bug: #1860155
    (cherry picked from commit b45d4c6d219e8e27219bca341acdfd634155d6f6)

    This is the commit message #4:

    Add handling of signal 15 in kill script

    The reason bug #1860155 was triggered was because the kill script did
    not have a stanza for handling the signal that was passed in, which is
    signal 15. Since signal 15 is unhandled, keepalived processes will
    still stick around. Add handling for signal 15.

    Change-Id: I632a3ef5ec137df10f647335f6354589c2316fd0
    Related-bug: #1860155
    (cherry picked from commit 06dc258a28784db98e44a3de488c204f01b97613)

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/730872

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/730872
Reason: oops, sorry, I had the old base :)
