tripleo-ci-centos-9-scenario010-standalone and ovn master jobs are failing deploy "Error: /etc/ceph/ceph.conf contained an empty fsid definition"

Bug #1981634 reported by Ronelle Landy
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Ronelle Landy

Bug Description

tripleo-ci-centos-9-scenario010-ovn-provider-standalone and tripleo-ci-centos-9-scenario010-standalone jobs are failing on master in check and gate during deploy with:

 WARNING | ERROR: Can't run container nova_libvirt_init_secret
stderr: time="2022-07-13T16:33:53Z" level=info msg="podman filtering at log level info"
time="2022-07-13T16:33:53Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
time="2022-07-13T16:33:53Z" level=info msg="Setting parallel job count to 25"
time="2022-07-13T16:33:54Z" level=info msg="Sysctl net.ipv4.ping_group_range=0 0 ignored in containers.conf, since Network Namespace set to host"
time="2022-07-13T16:33:54Z" level=info msg="User mount overriding libpod mount at \"/etc/hosts\""
time="2022-07-13T16:33:54Z" level=info msg="Running conmon under slice machine.slice and unitName libpod-conmon-88e127f88f2f1a5800af0ee9d1b157445059d50c1fc28688cbff0df2d41617df.scope"
time="2022-07-13T16:33:54Z" level=info msg="Got Conmon PID as 161096"
time="2022-07-13T16:33:54Z" level=info msg="Received shutdown.Stop(), terminating!" PID=161001
2022-07-13 16:34:22.216396 | fa163e39-8da3-dfab-de6e-00000000387d | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_4 | standalone | error={"changed": false, "msg": "Failed containers: nova_libvirt_init_secret"}

https://a9f1aef221b9e8d1cf76-922433d163de5a07cac84d974d42345f.ssl.cf1.rackcdn.com/849688/2/check/tripleo-ci-centos-9-scenario010-ovn-provider-standalone/e93e550/logs/undercloud/var/log/extra/podman/containers/nova_libvirt_init_secret/stdout.log shows:

------------------------------------------------
Initializing virsh secrets for: ceph:openstack
Error: /etc/ceph/ceph.conf contained an empty fsid definition
Check your ceph configuration
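
A minimal sketch of how a ceph.conf could be checked for the condition reported above. This is not the check that nova_libvirt_init_secret actually runs; the function name, the [global] section lookup, and the sample values are assumptions made for illustration.

```python
import configparser


def empty_or_missing(conf_text, keys=("fsid", "mon_host"), section="global"):
    """Return the keys that are absent or blank in the given section."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section(section):
        return list(keys)
    # A line such as "fsid = " parses to an empty string, which counts as empty.
    return [k for k in keys if not parser.get(section, k, fallback="").strip()]


# Placeholder contents, not taken from the failing job.
broken = "[global]\nfsid = \nmon_host = \n"
healthy = (
    "[global]\n"
    "fsid = 11111111-2222-3333-4444-555555555555\n"
    "mon_host = 192.0.2.10\n"
)

print(empty_or_missing(broken))
print(empty_or_missing(healthy))
```

Running a check like this against the collected /etc/ceph/ceph.conf would flag both an empty value and a key that is missing altogether.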

Job history shows the error started on 07/13:

https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-scenario010-ovn-provider-standalone&skip=0

https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-scenario010-ovn-provider-standalone&job_name=tripleo-ci-centos-9-scenario010-standalone&skip=0

Other logs:

https://4cec70d2de8d73a9678a-a966260cdcfda0650aae15fc442adef2.ssl.cf1.rackcdn.com/776942/1/check/tripleo-ci-centos-9-scenario010-standalone/8ceddd7/logs/undercloud/var/log/extra/podman/containers/nova_libvirt_init_secret/stdout.log

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → zed-1
importance: Undecided → Critical
status: New → Triaged
tags: added: promotion-blocker
summary: - ripleo-ci-centos-9-scenario010-standalone and ovn master jobs are
+ tripleo-ci-centos-9-scenario010-standalone and ovn master jobs are
failing deploy "Error: /etc/ceph/ceph.conf contained an empty fsid
definition"
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/849755

Ronelle Landy (rlandy) wrote :

https://review.opendev.org/c/openstack/tripleo-ansible/+/849732 was merged to unblock gate - leaving bug open for ceph team to continue working here

Ronelle Landy (rlandy) wrote :

<gfidente> well I think the gate blocker bug can be closed
<gfidente> but #1981467 is the original bug we still need to fix

Changed in tripleo:
assignee: nobody → Ronelle Landy (rlandy)
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "Ronelle Landy <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/849755

John Fulton (jfulton-org) wrote :

Analysis

We know /etc/ceph/ceph.conf contained an empty fsid definition [1] but why?
Both fsid and mon_host are missing from /etc/ceph/ceph.conf [2]
The missing values are present in config-download dir's ceph_client.yml [3]
As per cephadm-extra-vars-ansible.yml [4] the path to ceph_client.yml is set [5] to [3]
The deployed ceph ansible tasks and config-download ansible tasks logged to /home/zuul/ansible.log [6]
The external deploy steps tasks logged to config-download/cephadm/cephadm_command.log [7]
We see in [7] that "Save tripleo_ceph_client_vars file | standalone -> localhost" from [8] ran
We see in [6] that "Render ceph config for the Ceph Clients" [9] ran with the template [10]
Rendered template behaves as if tripleo_ceph_client_fsid and tripleo_ceph_client_mon_ips are empty
Those variables should have been set by [11] but [6] shows "Load variables produced by.." was skipped
So during config-download deploy steps (not external deploy steps) tripleo_ceph_client_vars was empty.
When the job was green task "Load variables produced by the cephadm provisioning process" wasn't skipped [12]

[1] https://a9f1aef221b9e8d1cf76-922433d163de5a07cac84d974d42345f.ssl.cf1.rackcdn.com/849688/2/check/tripleo-ci-centos-9-scenario010-ovn-provider-standalone/e93e550/logs/undercloud/var/log/extra/podman/containers/nova_libvirt_init_secret/stdout.log

[2] https://a9f1aef221b9e8d1cf76-922433d163de5a07cac84d974d42345f.ssl.cf1.rackcdn.com/849688/2/check/tripleo-ci-centos-9-scenario010-ovn-provider-standalone/e93e550/logs/undercloud/etc/ceph/ceph.conf

[3] https://a9f1aef221b9e8d1cf76-922433d163de5a07cac84d974d42345f.ssl.cf1.rackcdn.com/849688/2/check/tripleo-ci-centos-9-scenario010-ovn-provider-standalone/e93e550/logs/undercloud/home/zuul/tripleo-deploy/standalone-ansible-35o06oew/cephadm/ceph_client.yml

[4] https://a9f1aef221b9e8d1cf76-922433d163de5a07cac84d974d42345f.ssl.cf1.rackcdn.com/849688/2/check/tripleo-ci-centos-9-scenario010-ovn-provider-standalone/e93e550/logs/undercloud/home/zuul/tripleo-deploy/standalone-ansible-35o06oew/cephadm/cephadm-extra-vars-ansible.yml

[5] tripleo_ceph_client_vars: "/home/zuul/tripleo-deploy/standalone-ansible-35o06oew/cephadm/ceph_client.yml"

[6] https://a9f1aef221b9e8d1cf76-922433d163de5a07cac84d974d42345f.ssl.cf1.rackcdn.com/849688/2/check/tripleo-ci-centos-9-scenario010-ovn-provider-standalone/e93e550/logs/undercloud/home/zuul/ansible.log

[7] https://a9f1aef221b9e8d1cf76-922433d163de5a07cac84d974d42345f.ssl.cf1.rackcdn.com/849688/2/check/tripleo-ci-centos-9-scenario010-ovn-provider-standalone/e93e550/logs/undercloud/home/zuul/tripleo-deploy/standalone-ansible-35o06oew/cephadm/cephadm_command.log

[8] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/export.yaml#L86

[9] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_ceph_client/tasks/main.yml#L62-L73

[10] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_ceph_client/templates/ceph_conf.j2#L14-L19

[11] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_ceph_client/tasks/main.yml#L27-L34

[12] ...
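
The failure mode described in the analysis can be sketched as follows. This is an illustration under stated assumptions, not the tripleo_ceph_client role: CONF_TEMPLATE is a hypothetical stand-in for ceph_conf.j2, the exported file is read as JSON instead of YAML to stay within the standard library, the keys inside the file are assumed to match the template variable names, and the load is assumed to be skipped when the file is not found at the expected path.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the real ceph_conf.j2 template.
CONF_TEMPLATE = "[global]\nfsid = {fsid}\nmon_host = {mon_host}\n"


def load_client_vars(path):
    """Return exported vars, or {} when the file is absent (the 'skipped' case)."""
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)


def render_conf(client_vars):
    """Render the config; unset variables silently become empty strings."""
    return CONF_TEMPLATE.format(
        fsid=client_vars.get("tripleo_ceph_client_fsid", ""),
        mon_host=client_vars.get("tripleo_ceph_client_mon_ips", ""),
    )


with tempfile.TemporaryDirectory() as tmp:
    exported = os.path.join(tmp, "ceph_client.json")
    with open(exported, "w") as fh:
        # Placeholder values, not taken from the failing job.
        json.dump(
            {
                "tripleo_ceph_client_fsid": "11111111-2222-3333-4444-555555555555",
                "tripleo_ceph_client_mon_ips": "192.0.2.10",
            },
            fh,
        )
    good = render_conf(load_client_vars(exported))
    # Looking in a different directory finds nothing, so the load is skipped.
    bad = render_conf(load_client_vars(os.path.join(tmp, "elsewhere", "ceph_client.json")))

print(good)
print(bad)
```

The second render succeeds without any error, which is why the problem only surfaces later when a consumer of ceph.conf rejects the empty fsid.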


OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)
John Fulton (jfulton-org) wrote :

During overcloud deployment ...
1. tripleo_run_cephadm is run by config-download [1] and logs to ~/ansible.log
2. tripleo_cephadm is run in a nested ansible [1] and logs to ~/config-download/<stack>/cephadm/cephadm_command.log
3. tripleo_ceph_client is run by config-download [1] and logs to ~/ansible.log

The FSID and MonIPs are exported by role tripleo_cephadm [2] (during step 2 above)
The FSID and MonIPs are imported by role tripleo_ceph_client [3] (during step 3 above)
Thus, both deploy_steps and external_deploy_steps MUST reference the same tripleo_ceph_client_vars path
However, {{ playbook_dir }} is not the same path in both 1 and 2
Everything was fine until [4] which wasn't really necessary because of [5]

[1] https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/deployment/ansible_config_download.html#deploy-steps-playbook-yaml

[2] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/export.yaml#L86

[3] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_ceph_client/tasks/main.yml#L32-L34

[4] https://github.com/openstack/tripleo-ansible/commit/93cc215909db617aece85cb53c0279d04c04c69a
[5] https://github.com/openstack/python-tripleoclient/commit/fda3233451db678642fd57f920487db7ca63dc30
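
A small sketch of the path mismatch described in this comment. The two directory values are assumptions for illustration (the outer config-download directory and its cephadm subdirectory for the nested run), and resolve() is a hypothetical helper, not code from either role.

```python
from pathlib import PurePosixPath


def resolve(playbook_dir, explicit=None):
    """Return the client vars path: an explicit value if given, else one derived from playbook_dir."""
    if explicit:
        return PurePosixPath(explicit)
    return PurePosixPath(playbook_dir) / "ceph_client.yml"


# Assumed playbook_dir values for the two ansible runs.
outer_dir = "/home/zuul/tripleo-deploy/standalone-ansible-35o06oew"
nested_dir = "/home/zuul/tripleo-deploy/standalone-ansible-35o06oew/cephadm"

# Deriving the path independently in each run gives two different files.
exporter_path = resolve(nested_dir)
importer_path = resolve(outer_dir)
print(exporter_path == importer_path)

# Passing the same explicit path to both runs removes the mismatch.
shared = nested_dir + "/ceph_client.yml"
print(resolve(nested_dir, shared) == resolve(outer_dir, shared))
```

The point of the sketch is only that the exporter and the importer must agree on one path, regardless of which directory each run considers its playbook_dir.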

John Fulton (jfulton-org) wrote :

I had written:
> Everything was fine until [4] which wasn't really necessary because of [5]

However, that's not true.

If we revert [4] then we won't be able to support multiple stack exports. stack2 would overwrite stack1 and then you couldn't export stack1.

stack1: /home/USER/ceph_client.yaml
stack2: /home/USER/ceph_client.yaml

Instead we need to go back to the config-download/<stack> solution.

stack1: /home/USER/config-download/stack1/ceph_client.yaml
stack2: /home/USER/config-download/stack2/ceph_client.yaml

We just need the config-download/<stack> solution to be implemented consistently across roles. I have a solution in mind I will link next.

[4] https://github.com/openstack/tripleo-ansible/commit/93cc215909db617aece85cb53c0279d04c04c69a
[5] https://github.com/openstack/python-tripleoclient/commit/fda3233451db678642fd57f920487db7ca63dc30
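
The overwrite problem and the per-stack layout can be sketched like this. The helper names are hypothetical and the dictionary simply stands in for a filesystem where a later write to the same path replaces the earlier one.

```python
from pathlib import PurePosixPath


def shared_export_path(home, stack):
    """Single file in the home directory; the stack name is ignored."""
    return PurePosixPath(home) / "ceph_client.yaml"


def per_stack_export_path(home, stack):
    """One file per stack under config-download/<stack>."""
    return PurePosixPath(home) / "config-download" / stack / "ceph_client.yaml"


def export_all(path_for, home, stacks):
    """Simulate writing each stack's export; same path means overwrite."""
    files = {}
    for stack in stacks:
        files[path_for(home, stack)] = stack
    return files


stacks = ["stack1", "stack2"]
shared_files = export_all(shared_export_path, "/home/USER", stacks)
per_stack_files = export_all(per_stack_export_path, "/home/USER", stacks)

print(shared_files)
print(per_stack_files)
```

With the shared path only the last stack's export survives, while the per-stack layout keeps one export for each stack.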

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by "John Fulton <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/850037
Reason: https://bugs.launchpad.net/tripleo/+bug/1981634/comments/10

John Fulton (jfulton-org) wrote :

https://bugs.launchpad.net/tripleo/+bug/1981634 tracked the gate blocker and is resolved by https://review.opendev.org/c/openstack/tripleo-ansible/+/849732

https://bugs.launchpad.net/tripleo/+bug/1981467 tracks correct exports and a new fix is needed. Future updates will be in 1981467. Analysis here was to avoid re-introducing 1981634 but that now seems to be understood.
