systemd sync script for sidecar containers is unable to spawn new processes

Bug #1868082 reported by Daniel Alvarez
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Daniel Alvarez

Bug Description

Currently, the sync script searches the list of running processes and, if the target process is not found, starts it.

This is the logic:

IFS=$'\n'
for LINE in $(cat {{ tripleo_systemd_wrapper_service_dir }}/{{ tripleo_systemd_wrapper_service_name }}/processes); do
    NETNS=$(echo $LINE | awk '{ print $1 }')
    IFS=$' ' ARGS=$(echo $LINE | sed -e "s|$NETNS ||" | xargs)
    # TODO(emilien) investigate if we should rather run docker/podman ps instead of ps on the host
    if ! ps -e -o pid,command | grep "$(echo $NETNS | sed 's|^[^-]*\-||')" | grep -v grep &> /dev/null; then
        start_service $NETNS $ARGS
    fi
done

However, the wrapper command invoked by the Neutron agent may itself still show up in the 'ps' output. Because that command line contains the network ID, the sync script thinks the target process is running and skips starting it. This can delay the start of the sidecar container until the next iteration of the sync script (1 minute). For the metadata container (for both ML2/OVS and ML2/OVN) that may already be too late, because cloud-init in the instance has given up by then.

Example of the 'ps' output when the issue is being hit:

Mar 19 11:12:31 compute-0 sync[92924]: 92914 /usr/bin/python3 /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec ovnmeta-31c07b24-f920-4a24-a499-8a5bc16ed44d haproxy -f /var/lib/neutron/ovn-metadata-proxy/31c07b24-f920-4a24-a499-8a5bc16ed44d.conf
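
The false positive can be reproduced offline with the 'ps' line above. The following is only a sketch: PS_OUTPUT stands in for the real 'ps -e -o pid,command' output, and the helper names is_running_original and is_running_filtered, as well as the exclusion patterns (neutron-rootwrap, 'netns exec'), are assumptions made for illustration, not necessarily what the merged patch uses.

```shell
#!/bin/bash
# Simulated 'ps -e -o pid,command' output: only the rootwrap wrapper is alive,
# the real haproxy process has not been spawned yet.
PS_OUTPUT='92914 /usr/bin/python3 /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec ovnmeta-31c07b24-f920-4a24-a499-8a5bc16ed44d haproxy -f /var/lib/neutron/ovn-metadata-proxy/31c07b24-f920-4a24-a499-8a5bc16ed44d.conf'

NETNS='ovnmeta-31c07b24-f920-4a24-a499-8a5bc16ed44d'
# Same idea as the sync script: strip everything up to the first '-'
# so that only the network ID is left.
NET_ID=$(echo "$NETNS" | sed 's|^[^-]*-||')

# Check as done in the original sync script: any line with the ID counts.
is_running_original() {
    echo "$PS_OUTPUT" | grep "$NET_ID" | grep -v grep > /dev/null
}

# Hypothetical filtered check: also drop lines that belong to the wrapper
# invocation, so only the target process itself can match.
is_running_filtered() {
    echo "$PS_OUTPUT" | grep "$NET_ID" \
        | grep -v -e grep -e neutron-rootwrap -e 'netns exec' > /dev/null
}

if is_running_original; then ORIGINAL_RESULT=running; else ORIGINAL_RESULT=missing; fi
if is_running_filtered; then FILTERED_RESULT=running; else FILTERED_RESULT=missing; fi

echo "original check: $ORIGINAL_RESULT"
echo "filtered check: $FILTERED_RESULT"
```

With the original check the wrapper line is mistaken for the target process, so start_service would be skipped. With the filtered check the process is reported as missing and would be started in the same iteration.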

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/713852

Changed in tripleo:
assignee: nobody → Daniel Alvarez (dalvarezs)
status: New → In Progress
Changed in tripleo:
importance: Undecided → Critical
milestone: none → ussuri-3
tags: added: train-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/714099

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/713852
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=0df3ebd7592eac90f76650770f59c89fba4ebe39
Submitter: Zuul
Branch: master

commit 0df3ebd7592eac90f76650770f59c89fba4ebe39
Author: Daniel Alvarez <email address hidden>
Date: Thu Mar 19 12:21:15 2020 +0100

    Filter out wrapper commands from the ps output

    This patch is filtering out the wrapper execution from the ps output
    in the sync script. By doing this, it'll effectively detect when
    the target process is not running and start it. Otherwise, there might
    be cases where the process start is postponed until next iteration
    of the sync script (1 minute) and it may be already too late.

    This is causing tests to fail as the metadata service is not provisioned
    in time for instances to fetch their SSH keys.

    Change-Id: I530e257f343ffc551db9e984f9a27b20c397bfb1
    Co-Authored-By: Jakub Libosvar <email address hidden>
    Closes-Bug: #1868082
    Signed-off-by: Daniel Alvarez <email address hidden>

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/714099
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=546bc0d0f14c52114a31dceb47f7501682731ef0
Submitter: Zuul
Branch: stable/train

commit 546bc0d0f14c52114a31dceb47f7501682731ef0
Author: Daniel Alvarez <email address hidden>
Date: Thu Mar 19 12:21:15 2020 +0100

    Filter out wrapper commands from the ps output

    This patch is filtering out the wrapper execution from the ps output
    in the sync script. By doing this, it'll effectively detect when
    the target process is not running and start it. Otherwise, there might
    be cases where the process start is postponed until next iteration
    of the sync script (1 minute) and it may be already too late.

    This is causing tests to fail as the metadata service is not provisioned
    in time for instances to fetch their SSH keys.

    Change-Id: I530e257f343ffc551db9e984f9a27b20c397bfb1
    Co-Authored-By: Jakub Libosvar <email address hidden>
    Closes-Bug: #1868082
    Signed-off-by: Daniel Alvarez <email address hidden>
    (cherry-picked from 0df3ebd7592eac90f76650770f59c89fba4ebe39)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 0.5.0

This issue was fixed in the openstack/tripleo-ansible 0.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 1.3.0

This issue was fixed in the openstack/tripleo-ansible 1.3.0 release.
