multinode overcloud deploy with internal ceph fails: client configured before server

Bug #1925373 reported by John Fulton
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   High         Rabi Mishra

Bug Description

Deployment with cephadm for internal Ceph [1] failed during client configuration [2] because the client configuration ran before the server configuration. In the generated external_deploy_steps_tasks_step2.yaml the client block comes before the server block [3]. They need to be switched back to the previous order, with the server block first.

[1]
    time openstack overcloud -v deploy \
          --disable-validations \
          --deployed-server \
          --libvirt-type qemu \
          --stack $STACK \
          --templates ~/templates \
          -r roles.yaml \
          -n ../network-data.yaml \
          -e ~/templates/environments/deployed-server-environment.yaml \
          -e ~/templates/environments/network-isolation.yaml \
          -e ~/templates/environments/network-environment.yaml \
          -e ~/templates/environments/disable-telemetry.yaml \
          -e ~/templates/environments/low-memory-usage.yaml \
          -e ~/templates/environments/docker-ha.yaml \
          -e ~/templates/environments/podman.yaml \
          -e ~/containers-prepare-parameter.yaml \
          -e ~/re-generated-container-prepare.yaml \
          -e ~/templates/environments/cephadm/cephadm.yaml \
          -e ~/oc0-domain.yaml \
          -e deployed-metal-$STACK.yaml \
          -e overrides.yaml \
          -e cephadm-overrides.yaml

[2]
2021-04-21 15:46:23,140 p=365448 u=stack n=ansible | PLAY [External deployment step 2] **********************************************
2021-04-21 15:46:23,146 p=365448 u=stack n=ansible | host: undercloud, task: TASK: meta (flush_handlers)
2021-04-21 15:46:23,153 p=365448 u=stack n=ansible | META: ran handlers
2021-04-21 15:46:23,168 p=365448 u=stack n=ansible | host: undercloud, task: TASK: External deployment step 2
2021-04-21 15:46:23,175 p=365448 u=stack n=ansible | 2021-04-21 15:46:23.174617 | 24420180-c6bf-5879-2864-0000000000ad | TASK | External deployment step 2
2021-04-21 15:46:23,192 p=365448 u=stack n=ansible | host: undercloud, task: TASK: include_tasks
2021-04-21 15:46:23,192 p=365448 u=stack n=ansible | undercloud still blocked
2021-04-21 15:46:23,194 p=365448 u=stack n=ansible | 2021-04-21 15:46:23.193657 | 24420180-c6bf-5879-2864-0000000000ad | OK | External deployment step 2 | undercloud | result={
    "changed": false,
    "msg": "Use --start-at-task 'External deployment step 2' to resume from this task"
}
2021-04-21 15:46:23,206 p=365448 u=stack n=ansible | host: undercloud, task: TASK: include_tasks
2021-04-21 15:46:23,230 p=365448 u=stack n=ansible | host: undercloud, task: TASK: meta (flush_handlers)
2021-04-21 15:46:23,231 p=365448 u=stack n=ansible | undercloud still blocked
2021-04-21 15:46:23,245 p=365448 u=stack n=ansible | host: undercloud, task: TASK: meta (flush_handlers)
2021-04-21 15:46:23,246 p=365448 u=stack n=ansible | undercloud still blocked
2021-04-21 15:46:23,270 p=365448 u=stack n=ansible | 2021-04-21 15:46:23.269730 | ad304063-b45d-4176-b098-19fc87f0bfde | INCLUDED | /home/stack/overcloud-deploy/oc0/config-download/oc0/external_deploy_steps_tasks_step2.yaml | undercloud
2021-04-21 15:46:23,284 p=365448 u=stack n=ansible | host: undercloud, task: TASK: configure ceph clients
2021-04-21 15:46:23,288 p=365448 u=stack n=ansible | 2021-04-21 15:46:23.287804 | 24420180-c6bf-5879-2864-000000006425 | TASK | configure ceph clients
2021-04-21 15:46:23,308 p=365448 u=stack n=ansible | host: undercloud, task: TASK: tripleo client role

[3]
- block:
  - include_role:
      name: tripleo_ceph_client
    name: configure ceph clients
    vars:
      tripleo_ceph_client_config_home: /var/lib/tripleo-config/ceph
      tripleo_ceph_client_vars: /home/stack/ceph_client.yml
  - include_role:
      name: tripleo_ceph_client
    loop: '{{ ceph_external_multi_config }}'
    name: tripleo client role
    vars:
      multiple: '{{ item }}'
      tripleo_ceph_client_config_home: /var/lib/tripleo-config/ceph
    when:
    - ceph_external_multi_config is defined
  name: Configure Ceph Clients
  tags:
  - ceph
  when: step|int == 2
- block:
  - include_role:
      name: tripleo_run_cephadm
      tasks_from: prepare.yml
    name: create cephadm working directory and related files
    vars:
      ceph_admin_extra_vars:
        distribute_private_key: true
        ssh_servers: '{{ groups[''ceph_mon''] | union(groups[''ceph_osd'']|default([]))
          | union(groups[''ceph_mgr'']|default([])) | union(groups[''ceph_rgw'']|default([]))
          | union(groups[''ceph_mds'']|default([])) | union(groups[''ceph_nfs'']|default([]))
          | union(groups[''ceph_rbdmirror'']|default([])) | unique }}'
        tripleo_admin_generate_key: false
        tripleo_admin_user: ceph-admin
      ceph_config_overrides: {}
      ceph_default_overrides:
        global:
          osd_pool_default_pg_num: 16
          osd_pool_default_pgp_num: 16
          osd_pool_default_size: 3
      ceph_keys:
        extra_keys: []
        manila:
          key: AQCaQYBgAAAAABAAcgRMOQO0wti+gh4Z0vIuIA==
          name: manila
        openstack_client:
          key: AQCaQYBgAAAAABAAyujI++5rTY8mgDGtS55kUw==
          name: openstack
        radosgw:
          key: AQCaQYBgAAAAABAAhxv8fJYqE4EVngXNmPOX1w==
          name: radosgw
      ceph_osd_spec:
        data_devices:
          all: true
      ceph_pools:
        cinder_backup_pool:
          enabled: true
          name: backups
        cinder_pool:
          cinder_extra_pools: []
          enabled: true
          name: volumes
        extra_pools: []
        glance_pool:
          enabled: true
          name: images
        gnocchi_pool:
          enabled: false
          name: ''
        nova_pool:
          enabled: true
          name: vms
        pg_num: 16
      ceph_spec_fqdn: false
      cephadm_extra_vars:
        ceph_container_registry_auth: false
        ceph_container_registry_password: ''
        ceph_container_registry_username: ''
        cephfs: cephfs
        cluster_network: 172.16.12.0/24
        public_network: 172.16.11.0/24
        tripleo_ceph_client_vars: /home/stack/ceph_client.yml
        tripleo_cephadm_cluster: ceph
        tripleo_cephadm_container_cli: podman
        tripleo_cephadm_container_image: ceph-ci/daemon
        tripleo_cephadm_container_ns: undercloud.ctlplane.mydomain.tld:8787
        tripleo_cephadm_container_tag: v6.0.0-stable-6.0-pacific-centos-8-x86_64
        tripleo_cephadm_crush_rules: []
        tripleo_cephadm_dashboard_enabled: false
        tripleo_cephadm_fsid: 2ffdc3ca-cf4d-4b4b-9b80-a645bdb5fc30
      manila_pools:
        data: manila_data
        data_pg_num: 16
        metadata: manila_metadata
        metadata_pg_num: 16
      tripleo_cephadm_dynamic_spec: true
      tripleo_run_cephadm_spec_path: '{{ playbook_dir }}/cephadm/ceph_spec.yaml'
  - include_role:
      name: tripleo_run_cephadm
      tasks_from: enable_ceph_admin_user.yml
    name: Prepare cephadm user and keys
  - include_role:
      name: tripleo_run_cephadm
    name: Deploy the ceph cluster using cephadm
  name: ceph_base_external_deploy_task
  tags:
  - ceph
  when: step|int == 2

John Fulton (jfulton-org) wrote :

Ceph server tasks [1] ran before Ceph client tasks [2] until I hit LP 1925373 today.
The list_concat call lists the CephBase external_deploy_tasks before the Ceph client tasks [3].
list_concat uses Python's list extend [4], which preserves order and always appends to the end of a list.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/cephadm/ceph-base.yaml#L599
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/cephadm/ceph-client.yaml#L120
[3] https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/cephadm/ceph-client.yaml#L110
[4] https://github.com/openstack/heat/blob/stable/wallaby/heat/engine/hot/functions.py#L1644
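
The ordering expectation above can be sketched in plain Python. This is only an illustration of how extend behaves, not Heat's implementation; the concat helper and the two one-element task lists are hypothetical, and the names merely mirror the blocks in this report.

```python
# Hypothetical task lists; only the names come from this report.
server_tasks = [{"name": "ceph_base_external_deploy_task"}]
client_tasks = [{"name": "Configure Ceph Clients"}]


def concat(*lists):
    """Concatenate lists in argument order, the way list.extend does."""
    result = []
    for items in lists:
        result.extend(items)  # appends to the end, never reorders
    return result


merged = concat(server_tasks, client_tasks)
names = [task["name"] for task in merged]
print(names)
```

With plain concatenation the server block stays ahead of the client block, which is the order the templates ask for.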

John Fulton (jfulton-org) wrote :

The incorrectly generated external_deploy_steps_tasks_step2.yaml is made by this:

 https://github.com/openstack/tripleo-common/blob/master/tripleo_common/utils/config.py#L122

John Fulton (jfulton-org) wrote :

The tasks are already in the wrong order when tripleo-common gets them [1] from Heat [2].

So why did the list order change?

[1] http://paste.openstack.org/show/804701/
[2] https://github.com/openstack/tripleo-common/blob/master/tripleo_common/utils/config.py#L207-L212

John Fulton (jfulton-org) wrote :

It seems that the config list returned from Heat [1] is in a different order than usual.
So my external deploy steps tasks are not running in the order I need them to (both are step 2).
Is there anything that might have changed the order of self.stack_outputs.get('RoleConfig')?

It will be difficult if I can only be assured of the order by step number, because of the following dependencies:

step1: network
step2: ceph and ceph clients
step3: clients like cinder which use ceph client config

[1] https://github.com/openstack/tripleo-common/blob/master/tripleo_common/utils/config.py#L295
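
The ordering requirement inside step 2 can be written down as a small check. This is a minimal sketch, assuming the step 2 tasks are available as a list of dicts with a name key, as in the generated external_deploy_steps_tasks_step2.yaml; first_index and server_before_client are hypothetical helpers, not part of tripleo-common.

```python
def first_index(blocks, name):
    """Return the position of the first block with this name, or None."""
    for position, block in enumerate(blocks):
        if block.get("name") == name:
            return position
    return None


def server_before_client(blocks, server_name, client_name):
    """True when the server block precedes the client block."""
    server = first_index(blocks, server_name)
    client = first_index(blocks, client_name)
    if server is None or client is None:
        # Assumption: with only one of the two blocks there is nothing to order.
        return True
    return server < client


SERVER = "ceph_base_external_deploy_task"
CLIENT = "Configure Ceph Clients"

# Order seen in the failing deployment: client block first.
failing = [{"name": CLIENT}, {"name": SERVER}]
# Order expected from the templates: server block first.
expected = [{"name": SERVER}, {"name": CLIENT}]

print(server_before_client(failing, SERVER, CLIENT))
print(server_before_client(expected, SERVER, CLIENT))
```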

Rabi Mishra (rabi) wrote :

List orders are preserved in heat and python. Where do you notice the order changed? There is no link to any logs other than a pasted traceback here.

The few upstream ones I checked seem correct.

https://58e68a3f865bbcbd837c-fe0602cb6638bdb19de60d870f6f964c.ssl.cf1.rackcdn.com/786000/1/check/tripleo-ci-centos-8-scenario001-standalone/9ea8c35/logs/undercloud/home/zuul/standalone-ansible-ddpwtdi4/external_deploy_steps_tasks_step2.yaml

John Fulton (jfulton-org) wrote :

Unfortunately CI is not reproducing it.

A log is a file containing the history of events which happened. I can give you the files directly from my machine instead of pasting them. Please find attached a tarball containing:

1. external_deploy_steps_tasks_step2.yaml
This was generated directly by config-download. You can see that the ordering has the client configuration before the server configuration, which is the opposite of what is in THT:

https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/cephadm/ceph-client.yaml#L110

2. config.py
This is a small modification to tripleo_common/utils/config.py to log the contents of what is returned from Heat [1]. I then re-ran the deployment and it generated two logs.

3. task.log
4. config.log

[1] https://github.com/openstack/tripleo-common/blob/master/tripleo_common/utils/config.py#L113

John Fulton (jfulton-org) wrote :

I reset config.py and added only three lines [1] which save the config for external_deploy_steps_tasks in the following:

 https://github.com/openstack/tripleo-common/blob/master/tripleo_common/utils/config.py#L295

I've attached the generated config.yaml and config.py in the tarball.

[1]
(undercloud) [CentOS-8 - stack@undercloud tripleo-common]$ git diff
diff --git a/tripleo_common/utils/config.py b/tripleo_common/utils/config.py
index 57b0883d..f6ee6c43 100644
--- a/tripleo_common/utils/config.py
+++ b/tripleo_common/utils/config.py
@@ -293,6 +293,9 @@ class Config(object):

         role_config = self.get_role_config()
         for config_name, config in six.iteritems(role_config):
+            if config_name == 'external_deploy_steps_tasks':
+                with self._open_file("/home/stack/config.yaml") as my_log:
+                    yaml.safe_dump(config, my_log, default_flow_style=False)

             # External tasks are in RoleConfig and not defined per role.
             # So we don't use the RoleData to create the per step playbooks.
(undercloud) [CentOS-8 - stack@undercloud tripleo-common]$
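
The same debugging idea can be tried outside of tripleo-common. The sketch below is an assumption-laden stand-in for the diff above: dump_config_entry and the sample role_config are hypothetical, and it writes JSON with the standard library instead of calling yaml.safe_dump, so the output format differs from the attached config.yaml.

```python
import json
import os
import tempfile


def dump_config_entry(role_config, config_name, path):
    """Write one entry of role_config to path; return True if it was found."""
    if config_name not in role_config:
        return False
    with open(path, "w") as log_file:
        json.dump(role_config[config_name], log_file, indent=2)
    return True


# Hypothetical stand-in for what get_role_config() returns.
role_config = {
    "external_deploy_steps_tasks": [
        {"name": "Configure Ceph Clients"},
        {"name": "ceph_base_external_deploy_task"},
    ],
}

log_path = os.path.join(tempfile.mkdtemp(), "config.json")
found = dump_config_entry(role_config, "external_deploy_steps_tasks", log_path)

# Reading the file back shows the list in the order it was received.
with open(log_path) as log_file:
    saved = json.load(log_file)
print([block["name"] for block in saved])
```

Saving the entry before any playbook is written separates the order Heat returned from anything config-download might do to it later.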

Rabi Mishra (rabi) wrote :

OK, this looks like a regression from https://github.com/openstack/tripleo-heat-templates/commit/ef240c1f62a6afb584ef111fbef2f027a474414f

'list_concat_unique' does not seem to preserve the order as it removes duplicates from the beginning of the list.

I've proposed a heat patch https://review.opendev.org/c/openstack/heat/+/787662 to fix it.
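
The behavior described above can be illustrated with a minimal sketch. It is not Heat's code and not the proposed patch; dedupe_keep_last and dedupe_keep_first are hypothetical helpers that only contrast removing duplicates from the beginning of a list with keeping the first occurrence.

```python
def dedupe_keep_last(items):
    """Drop earlier duplicates, so the last occurrence of each item survives."""
    result = list(items)
    for item in items:
        while result.count(item) > 1:
            result.remove(item)  # list.remove deletes the first match
    return result


def dedupe_keep_first(items):
    """Drop later duplicates, so the first occurrence keeps its position."""
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result


# Hypothetical concatenated list in which the server task appears twice.
tasks = ["ceph_server", "ceph_client", "ceph_server"]

print(dedupe_keep_last(tasks))
print(dedupe_keep_first(tasks))
```

When an item that should run first is also repeated later in the concatenated list, removing from the beginning pushes it behind the items that depend on it, while keeping the first occurrence leaves the original order intact.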

John Fulton (jfulton-org) wrote :

Rabi,

I tested [1] your patch [2] and confirm it fixed the bug for me. People using TripleO can use [1] as a workaround until it lands. Since you wrote the fixing patch I've changed the bug assignee to you and I'm updating the status of the bug too.

Thanks,
  John

[1] https://github.com/fultonj/wallaby/tree/main/workarounds/heat
[2] https://review.opendev.org/c/openstack/heat/+/787662

Changed in tripleo:
assignee: John Fulton (jfulton-org) → nobody
assignee: nobody → Rabi Mishra (rabi)
status: Triaged → In Progress
Rabi Mishra (rabi) wrote :

Heat patch merged and backported to stable/wallaby.

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 14.2.0

This issue was fixed in the openstack/heat 14.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 17.0.0.0rc1

This issue was fixed in the openstack/heat 17.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 15.1.0

This issue was fixed in the openstack/heat 15.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 16.1.0

This issue was fixed in the openstack/heat 16.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat train-eol

This issue was fixed in the openstack/heat train-eol release.
