standalone job deploying ceph failing with Error: container-init binary not found on the host: stat /usr/libexec/podman/catatonit: no such file or directory

Bug #1985981 reported by Sandeep Yadav
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

We run the sc010 kvm job in both the vexx cloud and the internal cloud.

The job that runs in the internal cloud fails with the following error:

~~~
2022-08-12 07:38:27,832 p=89450 u=root n=ansible | 2022-08-12 07:38:27.831192 | fa163e0d-40f2-7933-9109-000000000070 | FATAL | Run cephadm bootstrap
.
.
Non-zero exit code 125 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint ceph --init -e CONTAINER_IMAGE=quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph -e NODE_NAME=standalone.localdomain -e CEPH_USE_RANDOM_NONCE=1 quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph --version
ceph: stderr Error: container-init binary not found on the host: stat /usr/libexec/podman/catatonit: no such file or directory
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 9106, in <module>
    main()
  File "/usr/sbin/cephadm", line 9094, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 1969, in _default_image
    return func(ctx)
  File "/usr/sbin/cephadm", line 4707, in command_bootstrap
    image_ver = CephContainer(ctx, ctx.image, 'ceph', ['--version']).run().strip()
  File "/usr/sbin/cephadm", line 3739, in run
    out, _, _ = call_throws(self.ctx, self.run_cmd(),
  File "/usr/sbin/cephadm", line 1636, in call_throws
    raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint ceph --init -e CONTAINER_IMAGE=quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph -e NODE_NAME=standalone.localdomain -e CEPH_USE_RANDOM_NONCE=1 quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph --version: Error: container-init binary not found on the host: stat /usr/libexec/podman/catatonit: no such file or directory", "stderr_lines": ["Verifying podman|docker is present...", "Verifying lvm2 is present...", "Verifying time synchronization is in place...", "Unit chronyd.service is enabled and running", "Repeating the final host check...", "podman (/bin/podman) version 4.1.1 is present", "systemctl is present", "lvcreate is present", "Unit chronyd.service is enabled and running", "Host looks OK", "Cluster fsid: e1f5356e-8579-59d7-a01c-bd09ff028582", "Verifying IP 192.168.42.1 port 3300 ...", "Verifying IP 192.168.42.1 port 6789 ...", "Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network", "Adjusting default settings to suit single-host cluster...", "Pulling container image quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph...", "Non-zero exit code 125 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint ceph --init -e CONTAINER_IMAGE=quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph -e NODE_NAME=standalone.localdomain -e CEPH_USE_RANDOM_NONCE=1 quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph --version", "ceph: stderr Error: container-init binary not found on the host: stat /usr/libexec/podman/catatonit: no such file or directory", "Traceback (most recent call last):", " File "/usr/sbin/cephadm", line 9106, in <module>", " main()", " File "/usr/sbin/cephadm", line 9094, in main", " r = ctx.func(ctx)", " File "/usr/sbin/cephadm", line 1969, in _default_image", " return func(ctx)", " File "/usr/sbin/cephadm", line 4707, in 
command_bootstrap", " image_ver = CephContainer(ctx, ctx.image, 'ceph', ['--version']).run().strip()", " File "/usr/sbin/cephadm", line 3739, in run", " out, _, _ = call_throws(self.ctx, self.run_cmd(),", " File "/usr/sbin/cephadm", line 1636, in call_throws", " raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')", "RuntimeError: Failed command: /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint ceph --init -e CONTAINER_IMAGE=quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph -e NODE_NAME=standalone.localdomain -e CEPH_USE_RANDOM_NONCE=1 quay.rdoproject.org/tripleomastercentos9/daemon:current-ceph --version: Error: container-init binary not found on the host: stat /usr/libexec/podman/catatonit: no such file or directory"], "stdout": "", "stdout_lines": []}
~~~

The same sc010 kvm job passes in the vexx cloud:

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-scenario010-kvm-standalone-master&skip=0

According to [1], this can happen when the catatonit package is missing; it is only a weak dependency of podman.

[1] https://unix.stackexchange.com/questions/619212/podman-run-with-init-gives-me-error-container-init-binary-not-found-on-the-h
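A quick way to confirm this on a node is to check for the binary that podman reports as missing. The following is a minimal sketch, not part of the CI tooling: the function name `find_container_init` is hypothetical, and the only path it checks by default is the one quoted in the error message above.

```python
import os
from typing import Iterable, Optional

# Path copied from the error message in this report.
DEFAULT_INIT_PATH = "/usr/libexec/podman/catatonit"


def find_container_init(
    candidates: Iterable[str] = (DEFAULT_INIT_PATH,),
) -> Optional[str]:
    """Return the first candidate that is an executable regular file, else None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


if __name__ == "__main__":
    found = find_container_init()
    if found:
        print(f"container-init binary present: {found}")
    else:
        print(f"container-init binary missing: {DEFAULT_INIT_PATH}")
```

If the function returns None on a host, `podman run --init` is expected to fail there in the same way as in the log above.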

From the logs, I can confirm that podman-catatonit.x86_64 is missing in the internal job but present in the job running in the vexx cloud.

The two jobs also differ in the podman version and in the repository the podman package was installed from.

Vexx job:
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario010-kvm-standalone-master/97aa4c5/logs/undercloud/var/log/extra/package-list-installed.txt.gz

~~~
podman.x86_64 2:4.1.1-3.el9 @appstream
podman-catatonit.x86_64 2:4.1.1-3.el9 @appstream
~~~

Internal job:
~~~
podman.x86_64 2:4.1.1-6.el9 @quickstart-centos-appstreams
~~~
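The comparison above can be repeated for any two package lists. This is a rough sketch that assumes each line has the three whitespace-separated columns shown in the excerpts (name.arch, version, repo); the real package-list-installed.txt files may contain header or wrapped lines that need extra handling. The function names are hypothetical.

```python
from typing import Dict, Set


def parse_package_list(text: str) -> Dict[str, str]:
    """Map 'name.arch' to 'epoch:version-release' for each usable line."""
    packages: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines that lack a version column.
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


def missing_packages(reference: str, candidate: str) -> Set[str]:
    """Return packages listed in reference but absent from candidate."""
    return set(parse_package_list(reference)) - set(parse_package_list(candidate))


# Excerpts copied from the two jobs in this report.
VEXX = """\
podman.x86_64 2:4.1.1-3.el9 @appstream
podman-catatonit.x86_64 2:4.1.1-3.el9 @appstream
"""
INTERNAL = """\
podman.x86_64 2:4.1.1-6.el9 @quickstart-centos-appstreams
"""

print(sorted(missing_packages(VEXX, INTERNAL)))
# prints ['podman-catatonit.x86_64']
```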

The failure is now seen in most of the standalone jobs that deploy ceph.

Revision history for this message
chandan kumar (chkumar246) wrote :
summary: - Sc010 kvm internal job failing with Error: container-init binary not
- found on the host: stat /usr/libexec/podman/catatonit: no such file or
+ standalone ceph job failing with Error: container-init binary not found
+ on the host: stat /usr/libexec/podman/catatonit: no such file or
directory"
description: updated
tags: added: promotion-blocker
Revision history for this message
chandan kumar (chkumar246) wrote : Re: standalone ceph job failing with Error: container-init binary not found on the host: stat /usr/libexec/podman/catatonit: no such file or directory"

https://gitlab.com/redhat/centos-stream/rpms/podman/-/commit/1868ce5b4b3c39b0e65a1351472cf65813e9fa07 converts the catatonit dependency to a soft dependency, which broke the jobs.
https://gitlab.com/redhat/centos-stream/rpms/podman/-/commit/d75618b2e1aefe123e4d9860fcf6d6e817ede4e9 adds a catatonit requirement for the gating tests, but I am not sure it will fix the issue.

summary: - standalone ceph job failing with Error: container-init binary not found
- on the host: stat /usr/libexec/podman/catatonit: no such file or
- directory"
+ standalone job deploying ceph failing with Error: container-init binary
+ not found on the host: stat /usr/libexec/podman/catatonit: no such file
+ or directory"
Revision history for this message
chandan kumar (chkumar246) wrote :

Proposed https://review.opendev.org/c/openstack/tripleo-quickstart/+/853142 to install catatonit.
Testing this patch is currently blocked by the vexxhost node_failure issues tracked in https://bugs.launchpad.net/tripleo/+bug/1986502.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart/+/853142
Committed: https://opendev.org/openstack/tripleo-quickstart/commit/4a440d17212dd4d6f69a59a41fccdd08a844995a
Submitter: "Zuul (22348)"
Branch: master

commit 4a440d17212dd4d6f69a59a41fccdd08a844995a
Author: Chandan Kumar (raukadah) <email address hidden>
Date: Mon Aug 15 11:27:43 2022 +0530

    Temporarily install catatonit

    https://gitlab.com/redhat/centos-stream/rpms/podman/-/commit/1868ce5b4b3c39b0e65a1351472cf65813e9fa07
    convert catatonit dependency to soft dep as catatonit.

    It means catatonit or podman-catatonit is not installed
    as a part of podman installation leading to
    stat /usr/libexec/podman/catatonit: no such file or directory
    failure.

    Installing catatonit fixes the issue.

    Related-Bug: #1985981

    Signed-off-by: Chandan Kumar (raukadah) <email address hidden>
    Change-Id: I2d3cf750840f32a35b850fddd620207786dc120b

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Just a note that https://gitlab.com/redhat/centos-stream/rpms/podman/-/commit/d75618b2e1aefe123e4d9860fcf6d6e817ede4e9 will add catatonit back as a hard requirement. We will remove our workaround once this change arrives via the rpm.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart/+/855587
Committed: https://opendev.org/openstack/tripleo-quickstart/commit/b10da3f993b1be0709cfe047b292c091fa7f3554
Submitter: "Zuul (22348)"
Branch: master

commit b10da3f993b1be0709cfe047b292c091fa7f3554
Author: Chandan Kumar (raukadah) <email address hidden>
Date: Fri Sep 2 11:02:28 2022 +0530

    Downgrade containers-common to 1-40

    containers-common-1-43 adds the new keypath[1] which will
    work with latest podman[2] which is not available in
    podman-4.1.1-6. It breaks the deployment.

    Downgrading containers-common to 1-40 fixes the issue
    till we get a new podman version.

    On release file changes, wallaby jobs are failing with
    ```
    Depsolve Error occurred: \n Problem: problem with installed package catatonit-3:0.1.7-7.el9.x86_64\n
    - package podman-2:4.2.0-3.el9.x86_64 conflicts with catatonit provided by catatonit-3:0.1.7-7.el9.x86_64
    ```
    during overcloud deployment. It blocks the above changes.

    We need to revert https://review.opendev.org/c/openstack/tripleo-quickstart/+/853142
    the change in this patch itself and get this patch in.

    Links:
    [1]. https://gitlab.com/redhat/centos-stream/rpms/containers-common/-/commit/04645c4a84442da3324eea8f6538a5768e69919a
    [2]. https://github.com/containers/image/commit/d218ff3d4611d35295615adf0913352a76684220

    Related-Bug: #1988500
    Related-Bug: #1988514
    Closes-Bug: #1985981

    Signed-off-by: Chandan Kumar (raukadah) <email address hidden>
    Change-Id: Ie0aea674228f011881f42b9515a2e0a73198abed
