ansible-2.14: content-provider jobs are broken

Bug #1996612 reported by Cédric Jeanneret
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned

Bug Description

Apparently, with the new ansible-2.14, the content-provider jobs are broken and fail with timeouts:

tripleo-ci-centos-9-content-provider TIMED_OUT
tripleo-ci-centos-9-content-provider-zed TIMED_OUT

This can be seen, for instance, in these two changes:
https://review.opendev.org/c/openstack/tripleo-ansible/+/864392
https://review.opendev.org/c/openstack/tripleo-operator-ansible/+/864409

"Fun" fact: both changes aim to fix the compatibility issues... but they fail as well, apparently.

Digging into the log[1], we can see the job does nothing but this:
"Attempting python interpreter discovery"
It does so for every single container to build, and that apparently takes a lot of time.
The only task that actually seems to complete is:
"tripleo_container_image_build : Ensure {{ tcib_path }} exists"
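To get a feel for how much of the log is interpreter discovery, a quick count per host can help. This is only a rough sketch: it assumes the -vvv log lines carry the host name in angle brackets right before the message (as in the EXEC line quoted later in this report), and `count_discovery` plus the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: "<host> Attempting python interpreter discovery".
# This pattern is a guess based on the -vvv output quoted in this report.
DISCOVERY = re.compile(r"<(?P<host>[^>]+)> Attempting python interpreter discovery")


def count_discovery(lines):
    """Return a Counter mapping host name -> number of discovery attempts."""
    counts = Counter()
    for line in lines:
        match = DISCOVERY.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Hypothetical sample lines; a real run would iterate over the log file instead.
sample = [
    "p=68068 u=stack n=ansible | <aodh-api> Attempting python interpreter discovery",
    "p=68068 u=stack n=ansible | <designate-backend-bind9> Attempting python interpreter discovery",
    "p=68068 u=stack n=ansible | <aodh-api> Attempting python interpreter discovery",
    "p=68068 u=stack n=ansible | <aodh-api> EXEC /bin/sh -c '/usr/bin/python3.9 && sleep 0'",
]
counts = count_discovery(sample)
for host, attempts in counts.most_common():
    print(f"{host}: {attempts}")
```

For the real log, something like `count_discovery(open("container_image_build.log"))` should work, since the function only needs an iterable of lines.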

While checking what was done, I also stumbled on something odd in the generated script[2], which led to a quick fix[3], but I don't think it's related... Is it?

[1] https://e500f5844c497d7c1455-bb0af7d0ed113130252cfd767637324e.ssl.cf2.rackcdn.com/864409/2/check/tripleo-ci-centos-9-content-provider/9fbeef1/logs/undercloud/home/zuul/container_image_build.log
[2] https://e500f5844c497d7c1455-bb0af7d0ed113130252cfd767637324e.ssl.cf2.rackcdn.com/864409/2/check/tripleo-ci-centos-9-content-provider/9fbeef1/logs/undercloud/home/zuul/tripleo_container_image_build.sh
[3] https://review.opendev.org/c/openstack/tripleo-operator-ansible/+/864519

Marios Andreou (marios-b) wrote :

Note that, for now, we are pinning to ansible 2.13 with https://review.opendev.org/c/openstack/tripleo-quickstart/+/864498 to prevent this bug from blocking the gates.

Once this is fixed, we can unpin by reverting tripleo-quickstart/+/864498.

Cédric Jeanneret (cjeanner) wrote :

While testing locally:
running the build command[1] apparently launches 99 processes such as:

/usr/bin/python3.9 /usr/bin/ansible-playbook -i /tmp/tripleos0h7wdgf/hosts.yaml -vvv /tmp/tripleos0h7wdgf/tripleo-multi-playbook.yaml

Now, the content of the files involved here isn't THAT weird imho - the playbook looks like:
- connection: local
  gather_facts: true
  hosts: localhost
  name: Generate localhost facts
- connection: local
  gather_facts: false
  hosts: all
  name: Generate container file(s)
  roles:
  - role: tripleo_container_image_build
  vars:
    tcib_args:
      TRIPLEO_ANSIBLE_REQ: /usr/share/openstack-tripleo-common-containers/container-images/kolla/tripleo-ansible-ee/requirements.yaml

which makes sense.

The inventory, on the other hand.... it explains why ansible tries to get the python_interpreter so many times[2]. I'm wondering if we couldn't pass the python interpreter as a host var via the inventory - especially since we're passing, right now, the tcib_python_version to the CLI... Maybe something to consider?

In any case: the build is stuck here. Nothing is showing up in the log. The last line therein is:
2022-11-15 13:50:32,019 p=68068 u=stack n=ansible | <designate-backend-bind9> EXEC /bin/sh -c '/usr/bin/python3.9 && sleep 0'

and it's been like that for as long as it took to write this comment: over 7 minutes (data gathering included).

The ansible.log is in the same state, so it's not doing things in the background I'd say... And the process listing doesn't show anything like that.

I see the following possibilities:
- either ensure we're not running too many things in parallel in the second play (using the "serial: X" option)
- or see what happens if we pass ansible_python_interpreter via the inventory
- ... any other ideas?
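Both ideas can be tried on the generated files before touching the CLI. The following is a minimal sketch, not what tripleoclient actually does: it assumes the inventory and play are available as plain dicts shaped like the ones in this comment, and `pin_interpreter` and `limit_batch` are hypothetical helper names. `ansible_python_interpreter` and `serial` are the standard Ansible host var and play keyword.

```python
import copy
import json


def pin_interpreter(inventory, interpreter):
    """Return a copy of the inventory with ansible_python_interpreter set on every host."""
    pinned = copy.deepcopy(inventory)
    for hostvars in pinned["all"]["hosts"].values():
        hostvars["ansible_python_interpreter"] = interpreter
    return pinned


def limit_batch(play, serial):
    """Return a copy of the play with the 'serial' keyword set to the given batch size."""
    limited = dict(play)
    limited["serial"] = serial
    return limited


# Trimmed-down stand-ins for the generated hosts.yaml and the second play.
inventory = {
    "all": {
        "hosts": {
            "aodh-api": {"ansible_connection": "local"},
            "designate-backend-bind9": {"ansible_connection": "local"},
        }
    }
}
play = {
    "connection": "local",
    "gather_facts": False,
    "hosts": "all",
    "name": "Generate container file(s)",
    "roles": [{"role": "tripleo_container_image_build"}],
}

pinned = pin_interpreter(inventory, "/usr/bin/python3.9")
limited = limit_batch(play, 4)  # 4 is an arbitrary batch size for the example

# JSON is valid YAML, so this output can be saved and fed to ansible-playbook as-is.
print(json.dumps(pinned, indent=2))
print(json.dumps([limited], indent=2))
```

With the interpreter pinned per host, Ansible should skip discovery for those hosts, which would at least tell us whether discovery itself is the problem or just where the run happens to stall.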

[1] openstack tripleo container image build \
        --base registry.access.redhat.com/ubi9:latest \
        --debug --distro centos \
        --exclude neutron-mlnx-agent \
        --extra-config /home/stack/extra_config.yaml \
        --namespace tripleomastercentos9 \
        --prefix openstack \
        --push --registry 127.0.0.1:5000 \
        --tag adaac75f69ae93d6ae76ee320b90dc0a \
        --volume /etc/yum.repos.d:/etc/distro.repos.d:z \
        --volume /etc/pki/rpm-gpg:/etc/pki/rpm-gpg:z \
        --volume /etc/dnf/vars:/etc/dnf/vars:z \
        --work-dir /home/stack/container-builds \
        --tcib-extras tcib_release=9 \
        --tcib-extras tcib_python_version=3.9 >/home/stack/container_image_build.log 2>&1

[2]
all:
  hosts:
    aodh-api:
      ansible_connection: local
      tcib_actions:
      - run: dnf -y install {{ tcib_packages.common | join(' ') }} && dnf clean all
          && rm -rf /var/cache/dnf
      - run: mkdir -p /var/www/cgi-bin/aodh && chmod 755 /var/www/cgi-bin/aodh &&
          cp -a /usr/bin/aodh-api /var/www/cgi-bin/aodh/ && sed -i -r 's,^(Listen
          80),#\1,' /etc/httpd/conf/httpd.conf && sed -i -r 's,^(Listen 443),#\1,'
          /etc/httpd/conf.d/ssl.conf
      - run: ln -s /usr/share/openstack-tripleo-common/healthcheck/aodh-api /openstack/healthcheck
          && chmod a+rx /openstack/healthcheck
      tcib_distro: ...


Cédric Jeanneret (cjeanner) wrote :

After some more testing with Takashi++, we found the following:
- 2.14 seems to blow past the "serial" limit and runs everything at once, up to the point where it can't make any progress
- 2.13 is far more conservative with the number of things running at once, especially hosts.

A proposal is to tie the "serial" option for the playbook to the number of CPUs, with a way to override it via an option to the container build CLI. At least, this is working fine locally; we'll do some more tests within the CI.
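As a rough sketch of that proposal (not the code from the proposed patch): pick the batch size from the CPU count unless the operator gives an explicit value. The `default_serial` name and the `override` parameter are made up here, and the real CLI option may look different.

```python
import os


def default_serial(override=None):
    """Return the batch size for 'serial': the override if given, else the CPU count."""
    if override is not None:
        if override < 1:
            raise ValueError("serial must be a positive integer")
        return override
    # os.cpu_count() can return None when the count is undetermined; fall back to 1.
    return os.cpu_count() or 1


print(default_serial())   # depends on the machine
print(default_serial(8))  # explicit override wins
```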

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/python-tripleoclient/+/864717

Changed in tripleo:
assignee: nobody → Cédric Jeanneret (cjeanner)
assignee: Cédric Jeanneret (cjeanner) → nobody
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/864803

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ci (master)

Change abandoned by "Takashi Kajinami <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/864803
Reason: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/864838 is the right one it seems.

Rabi Mishra (rabi) wrote :
OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-tripleoclient (master)

Change abandoned by "Takashi Kajinami <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/python-tripleoclient/+/864717

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/864838
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/c625becad164a5ad8f719c738520cb7e0e115694
Submitter: "Zuul (22348)"
Branch: master

commit c625becad164a5ad8f719c738520cb7e0e115694
Author: Takashi Kajinami <email address hidden>
Date: Thu Nov 17 13:22:15 2022 +0900

    Do not use --debug for image build

    Since Ansible was bumped to 2.14, we've observed the container image
    build process gets stuck in the middle of ansible tasks to generate
    Docker/Buildah files, because of a bug[1] with ansible-runner.

    This removes --debug option from the build command to avoid -vvv option
    in the ansible command, to workaround the above bug.

    [1] https://github.com/ansible/ansible-runner/issues/1164

    Related-Bug: #1996612
    Change-Id: I53c688077c65da03d8c3cf104862e02cefc2c615
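
For anyone scripting the build elsewhere, the workaround boils down to not passing --debug. A minimal sketch, assuming the command is held as an argument list; `drop_debug` is a hypothetical helper, and the argv below is an abbreviated copy of the build command quoted earlier in this report.

```python
import shlex


def drop_debug(argv):
    """Return argv without the --debug flag, leaving every other argument untouched."""
    return [arg for arg in argv if arg != "--debug"]


# Abbreviated from the build command quoted earlier in this bug.
argv = [
    "openstack", "tripleo", "container", "image", "build",
    "--base", "registry.access.redhat.com/ubi9:latest",
    "--debug", "--distro", "centos",
    "--push", "--registry", "127.0.0.1:5000",
]
cleaned = drop_debug(argv)
print(shlex.join(cleaned))
```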

Alan Pevec (apevec)
Changed in tripleo:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/865540

Sandeep Yadav (sandeepyadav93) wrote :

We are now hitting the same issue in the master/zed container build[0].

Proposed the same workaround for the container build job in the build-containers role[1].

[0] https://logserver.rdoproject.org/57/42657/20/check/periodic-tripleo-ci-build-containers-ubi-9-quay-push-master/c964a70/logs/build.log
[1] https://review.opendev.org/c/openstack/tripleo-ci/+/865540

Changed in tripleo:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ci (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ci/+/865540
Committed: https://opendev.org/openstack/tripleo-ci/commit/909c83bc625fac1479b281d325e29bb7b2388c3f
Submitter: "Zuul (22348)"
Branch: master

commit 909c83bc625fac1479b281d325e29bb7b2388c3f
Author: Sandeep Yadav <email address hidden>
Date: Thu Nov 24 18:20:24 2022 +0530

    Do not use --debug for image build

    Since Ansible was bumped to 2.14, we've observed the container image
    build process gets stuck in the middle of ansible tasks to generate
    Docker/Buildah files, because of a bug[1] with ansible-runner.

    This removes --debug option from the build command to avoid -vvv option
    in the ansible command, to workaround the above bug.

    Same workaround is added for content-provider already[2], adding same
    workaround for build-containers role.

    [1] https://github.com/ansible/ansible-runner/issues/1164
    [2] https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/864838

    Related-Bug: #1996612
    Change-Id: I498c9cac7815d3d0682835d2bf943594dad2203c

Changed in tripleo:
status: In Progress → Fix Released