deployed ceph attempts to download container from 192.168.24.1 before it's ready

Bug #1978998 reported by John Fulton
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: John Fulton
Milestone: (not set)

Bug Description

periodic-tripleo-ci-centos-9-scenario001-standalone failed to download the ceph container during bootstrap.

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/105b6ec/logs/undercloud/home/zuul/ansible.log.txt.gz

2022-06-16 13:15:42,927 p=78755 u=root n=ansible | 2022-06-16 13:15:42.926608 | fa163e75-edd1-8fa8-5429-00000000006d | FATAL | Run cephadm bootstrap | standalone.localdomain | error={"changed": true, "cmd": "/usr/sbin/cephadm --image 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph \\bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid cc82bd4c-d566-5468-9fa7-51072fb08780 --config /home/ceph-admin/assimilate_ceph.conf \\--single-host-defaults \\--skip-monitoring-stack --skip-dashboard --log-to-file --skip-mon-network \\--mon-ip 192.168.42.1\n", "delta": "0:01:06.369211", "end": "2022-06-16 13:15:42.898100", "msg": "non-zero return code", "rc": 1, "start": "2022-06-16 13:14:36.528889", "stderr": "Verifying podman|docker is present...\nVerifying lvm2 is present...\nVerifying time synchronization is in place...\nUnit chronyd.service is enabled and running\nRepeating the final host check...\npodman (/bin/podman) version 4.1.0 is present\nsystemctl is present\nlvcreate is present\nUnit chronyd.service is enabled and running\nHost looks OK\nCluster fsid: cc82bd4c-d566-5468-9fa7-51072fb08780\nVerifying IP 192.168.42.1 port 3300 ...\nVerifying IP 192.168.42.1 port 6789 ...\nInternal network (--cluster-network) has not been provided, OSD replication will default to the public_network\nAdjusting default settings to suit single-host cluster...\nPulling container image 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph...\nNon-zero exit code 125 from /bin/podman pull 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph\n/bin/podman: stderr Trying to pull 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph...\n/bin/podman: stderr time=\"2022-06-16T13:14:44-04:00\" level=warning msg=\"Failed, retrying in 1s ... (1/3). 
Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph: pinging container registry 192.168.24.1:8787: Get \\\"https://192.168.24.1:8787/v2/\\\": dial tcp 192.168.24.1:8787: connect: network is unreachable\"\n/bin/podman: stderr time=\"2022-06-16T13:14:59-04:00\" level=warning msg=\"Failed, retrying in 1s ... (2/3). Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph: pinging container registry 192.168.24.1:8787: Get \\\"https://192.168.24.1:8787/v2/\\\": dial tcp 192.168.24.1:8787: connect: network is unreachable\"\n/bin/podman: stderr time=\"2022-06-16T13:15:04-04:00\" level=warning msg=\"Failed, retrying in 1s ... (3/3). Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph: pinging container registry 192.168.24.1:8787: Get \\\"https://192.168.24.1:8787/v2/\\\": dial tcp 192.168.24.1:8787: connect: network is unreachable\"\n/bin/podman: stderr Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph: pinging container registry 192.168.24.1:8787: Get \"https://192.168.24.1:8787/v2/\": dial tcp 192.168.24.1:8787: connect: network is unreachable\nERROR: Failed command: /bin/podman pull 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph", "stderr_lines": ["Verifying podman|docker is present...", "Verifying lvm2 is present...", "Verifying time synchronization is in place...", "Unit chronyd.service is enabled and running", "Repeating the final host check...", "podman (/bin/podman) version 4.1.0 is present", "systemctl is present", "lvcreate is present", "Unit chronyd.service is enabled and running", "Host looks OK", "Cluster fsid: cc82bd4c-d566-5468-9fa7-51072fb08780", "Verifying IP 192.168.42.1 port 3300 ...", "Verifying IP 192.168.42.1 port 6789 ...", "Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network", "Adjusting default settings to suit 
single-host cluster...", "Pulling container image 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph...", "Non-zero exit code 125 from /bin/podman pull 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph", "/bin/podman: stderr Trying to pull 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph...", "/bin/podman: stderr time=\"2022-06-16T13:14:44-04:00\" level=warning msg=\"Failed, retrying in 1s ... (1/3). Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph: pinging container registry 192.168.24.1:8787: Get \\\"https://192.168.24.1:8787/v2/\\\": dial tcp 192.168.24.1:8787: connect: network is unreachable\"", "/bin/podman: stderr time=\"2022-06-16T13:14:59-04:00\" level=warning msg=\"Failed, retrying in 1s ... (2/3). Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph: pinging container registry 192.168.24.1:8787: Get \\\"https://192.168.24.1:8787/v2/\\\": dial tcp 192.168.24.1:8787: connect: network is unreachable\"", "/bin/podman: stderr time=\"2022-06-16T13:15:04-04:00\" level=warning msg=\"Failed, retrying in 1s ... (3/3). Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph: pinging container registry 192.168.24.1:8787: Get \\\"https://192.168.24.1:8787/v2/\\\": dial tcp 192.168.24.1:8787: connect: network is unreachable\"", "/bin/podman: stderr Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph: pinging container registry 192.168.24.1:8787: Get \"https://192.168.24.1:8787/v2/\": dial tcp 192.168.24.1:8787: connect: network is unreachable", "ERROR: Failed command: /bin/podman pull 192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph"], "stdout": "", "stdout_lines": []}

John Fulton (jfulton-org) wrote :

192.168.24.1 cannot be used as a container registry to bootstrap ceph because the 24.1 address is not yet configured on the server (that happens after ceph is deployed).

When https://review.opendev.org/834352 was written, this was known, but I observed that I didn't need to worry about it because a local mirror is used in place of the undercloud's registry.

The question is: why is 192.168.24.1 being used instead of the local mirror in the periodic job?

Relevant Ansible to start reading to understand what's happening:

https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/standalone/tasks/main.yml#L2-L3

https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/standalone/tasks/containers.yml
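
The ordering problem described above (the registry address is not configured until after ceph is deployed) can be checked by hand before running the bootstrap. The following is only a minimal sketch, not part of TripleO or cephadm: the function name `registry_reachable` is hypothetical, and it only tests whether a TCP connection to the registry address can be opened.

```python
import socket


def registry_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it does not speak the
    registry protocol, it only checks that the address answers.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "network is unreachable", "connection refused" and timeouts.
        return False


if __name__ == "__main__":
    # Address and port taken from the failing podman pull in this report.
    print(registry_reachable("192.168.24.1", 8787))
```

If this returns False on the host before 'openstack overcloud ceph deploy' runs, a podman pull from that address would be expected to fail in the same way as in the log above.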

John Fulton (jfulton-org) wrote :

Why do I see this:

  Error: initializing source docker://192.168.24.1:8787/tripleomastercentos9/daemon:current-ceph

when I see this in [1]:

  ceph_namespace: 198.72.124.73:5001/tripleomastercentos9

[1] https://411deec051d768e42c33-554a55c2926ce345d8a8f0805ecfe993.ssl.cf5.rackcdn.com/846159/4/check/tripleo-ci-centos-9-scenario001-standalone/a9a4889/logs/undercloud/home/zuul/containers-prepare-parameters.yaml

John Fulton (jfulton-org) wrote :

Why is containers-prepare-parameters.yaml [1] of a passing non-periodic job different from that of a periodic job [2]?

Specifically: "push_destination: 192.168.24.1:8787" in [2] caused the reported error.

How did that get in there?

When the TQE standalone role ran the containers task in the passing job, we saw [3]:

2022-06-15 19:12:57.738167 | primary | changed: [undercloud] => (item={'original': 'ceph_namespace', 'replace': 'ceph_namespace: 104.130.132.214:5001/tripleomastercentos9'})

When the same role ran in the failing periodic job [4] we saw instead:

2022-06-16 13:13:21.588887 | primary | skipping: [undercloud] => (item={'original': 'ceph_namespace', 'replace': 'ceph_namespace: quay.rdoproject.org/tripleomastercentos9'})

[1] https://411deec051d768e42c33-554a55c2926ce345d8a8f0805ecfe993.ssl.cf5.rackcdn.com/846159/4/check/tripleo-ci-centos-9-scenario001-standalone/a9a4889/logs/undercloud/home/zuul/containers-prepare-parameters.yaml

[2] https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/standalone/tasks/containers.yml

[3] https://2f69333936e3feb7cea6-be6253c0e82f1539fed391a5717e06a0.ssl.cf5.rackcdn.com/834352/81/gate/tripleo-ci-centos-9-scenario001-standalone/f66353d/job-output.txt

[4] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-master/105b6ec/job-output.txt
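
To compare the containers-prepare-parameters.yaml of a passing job with that of a failing one, a quick text scan for push_destination can help. This is a rough sketch under stated assumptions: the function name `find_push_destinations` is hypothetical, the sample YAML is illustrative rather than copied from either job, and a plain regular expression is used instead of a YAML parser to keep the example standard-library only.

```python
import re
from typing import List

# Matches lines such as "  - push_destination: 192.168.24.1:8787".
_PUSH_DEST = re.compile(
    r"^[ \t]*(?:-[ \t]*)?push_destination:[ \t]*(\S+)", re.MULTILINE
)


def find_push_destinations(yaml_text: str) -> List[str]:
    """Return every push_destination value that is not 'false'."""
    found = []
    for match in _PUSH_DEST.finditer(yaml_text):
        value = match.group(1).strip("'\"")
        if value.lower() != "false":
            found.append(value)
    return found


if __name__ == "__main__":
    # Illustrative content only; the values mirror those quoted in this report.
    sample = (
        "parameter_defaults:\n"
        "  ContainerImagePrepare:\n"
        "  - push_destination: 192.168.24.1:8787\n"
        "    set:\n"
        "      ceph_namespace: quay.rdoproject.org/tripleomastercentos9\n"
    )
    print(find_push_destinations(sample))
```

A non-empty result for a file used before the undercloud registry exists points at the condition reported in this bug.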

John Fulton (jfulton-org) wrote :

The non-periodic jobs consume containers from a content-provider. It's enabled here:

https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/standalone/tasks/containers.yml#L128

Periodic jobs were configured to test newer ceph containers, which is fine, but we must not use push_destination with them; i.e., we can't push them to an undercloud registry that is not yet set up.

I think we can use standalone_container_ceph_updates|bool, introduced by the patch below, to identify the periodic job and deal with this.

https://github.com/openstack/tripleo-quickstart-extras/commit/78b758851678a4df86248185dbd0fae0979b1494

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart-extras (master)
Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/846231
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/0b586c6dcee71297fd875ec6b96fc0492bb09e7b
Submitter: "Zuul (22348)"
Branch: master

commit 0b586c6dcee71297fd875ec6b96fc0492bb09e7b
Author: John Fulton <email address hidden>
Date: Thu Jun 16 18:07:54 2022 -0400

    Override Ceph --container-namespace for periodic jobs

    If the standalone job is periodic, use 'openstack overcloud
    ceph deploy --container-namespace' to pull the container
    directly from quay.rdoproject.org.

    Standalone scenarios usually consume containers from a
    content-provider but periodic jobs pull them directly
    from quay.rdoproject.org. The periodic jobs then push
    them to the undercloud as a container registry. Non-
    periodic jobs do not push to the undercloud registry.

    When I982dedb53582fbd76391165c3ca72954c129b84a merged,
    periodic standalone jobs broke because the undercloud
    container registry was not configured when 'openstack
    overcloud ceph deploy' was run. Because the container
    prepare file has a push_destination, deployed ceph
    assumes the containers were prepared in advance so it
    swaps out the container namespace as indicated by
    push_destination directive. Though this still happens
    we override it again with --container-namespace.

    Change-Id: I1abfbbd23ca93c01393d05057806ba9cc846fbed
    Closes-bug: #1978998
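
The behaviour described in the commit message above can be modelled roughly as follows. This is a simplified sketch, not the TripleO implementation: the function name `effective_ceph_namespace` and the dictionary layout are assumptions made for illustration, and only a string push_destination is handled.

```python
from typing import Optional


def effective_ceph_namespace(
    prepare_entry: dict, cli_namespace: Optional[str] = None
) -> str:
    """Model which namespace the ceph container is pulled from.

    Hypothetical model: a --container-namespace override wins; otherwise
    a string push_destination replaces the registry part of ceph_namespace.
    """
    if cli_namespace:
        # An explicit override takes precedence over everything else.
        return cli_namespace

    namespace = prepare_entry["set"]["ceph_namespace"]
    push_destination = prepare_entry.get("push_destination")
    if isinstance(push_destination, str) and push_destination:
        _registry, sep, path = namespace.partition("/")
        if sep:
            # Swap the registry host for the push destination.
            return f"{push_destination}/{path}"
    return namespace


if __name__ == "__main__":
    periodic = {
        "push_destination": "192.168.24.1:8787",
        "set": {"ceph_namespace": "quay.rdoproject.org/tripleomastercentos9"},
    }
    # Without an override the namespace is swapped to the undercloud registry.
    print(effective_ceph_namespace(periodic))
    # With an override the image is pulled directly from quay.rdoproject.org.
    print(effective_ceph_namespace(
        periodic, "quay.rdoproject.org/tripleomastercentos9"))
```

In this model the first call yields 192.168.24.1:8787/tripleomastercentos9, which matches the image path in the failing podman pull, and the second call yields the quay.rdoproject.org namespace that the fix passes explicitly.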

Changed in tripleo:
status: In Progress → Fix Released