Deployment of multiple overcloud stacks using single undercloud failed with "msg": "Cloud overcloud-two was not found."

Bug #1905667 reported by Sandeep Yadav
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
tripleo | New | Low | Unassigned |

Bug Description

Description:

While creating a new job that deploys multiple overclouds from a single undercloud, we hit an error where the deployment failed with "msg": "Cloud overcloud-two was not found."

Logs:

~~~
2020-11-24 11:29:55,147 p=22669 u=mistral n=ansible | TASK [Clean up legacy Cinder keystone catalog entries] *************************
2020-11-24 11:29:55,147 p=22669 u=mistral n=ansible | Tuesday 24 November 2020 11:29:55 -0500 (0:00:00.398) 0:42:19.289 ******
2020-11-24 11:29:57,500 p=22669 u=mistral n=ansible | fatal: [undercloud]: FAILED! => {"changed": false, "msg": "Cloud overcloud-two was not found."}
~~~

~~~
TASK [tripleo-keystone-resources : Create default domain] **********************
Monday 23 November 2020 05:01:32 -0500 (0:00:00.191) 0:30:41.385 *******
fatal: [undercloud]: FAILED! => {"changed": false, "msg": "Cloud overcloud-two was not found."}
~~~

~~~
TASK [Manage Cinder Volume Type] ***********************************************
Wednesday 25 November 2020 10:07:43 -0500 (0:00:00.222) 0:18:56.253 ****
fatal: [undercloud]: FAILED! => {"changed": false, "cmd": "if ! openstack volume type show \"tripleo\"; then\n openstack volume type create --public \"tripleo\"\nfi\n", "delta": "0:00:03.360400", "end": "2020-11-25 10:07:47.360509", "msg": "non-zero return code", "rc": 1, "start": "2020-11-25 10:07:44.000109", "stderr": "Cloud overcloud-two was not found.\nCloud overcloud-two was not found.", "stderr_lines": ["Cloud overcloud-two was not found.", "Cloud overcloud-two was not found."], "stdout": "", "stdout_lines": []}
~~~

This happened because the CI job ran the following steps:
* Deploy the overcloud stack
* Run tempest on the overcloud stack (which created /root/.config/openstack/clouds.yaml)
  https://opendev.org/openstack/tripleo-quickstart-extras/src/branch/master/playbooks/tasks/tempest.yml#L137
* Deploy the overcloud-two stack, which failed with "msg": "Cloud overcloud-two was not found." because /root/.config/openstack/clouds.yaml contained no overcloud-two entry. Some Ansible tasks have an unnecessary become: true, which caused Ansible to look for clouds.yaml in the root user's directory first.
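
The lookup behaviour described in the steps above can be sketched roughly as follows. This is a simplified illustration, not the openstacksdk implementation: it assumes the commonly documented search order (current directory, then ~/.config/openstack, then /etc/openstack, first file found wins) and ignores overrides such as OS_CLIENT_CONFIG_FILE. The function names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def candidate_config_files(home: str, cwd: str) -> List[Path]:
    """Return clouds.yaml locations in a simplified search order.

    `home` is the home directory of the user the task runs as, so a
    task run with become: true would pass root's home here.
    """
    return [
        Path(cwd) / "clouds.yaml",
        Path(home) / ".config" / "openstack" / "clouds.yaml",
        Path("/etc/openstack/clouds.yaml"),
    ]


def first_existing(paths: List[Path]) -> Optional[Path]:
    """Return the first path that exists as a file, or None."""
    for path in paths:
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    # The same task resolves different files depending on the effective user.
    for home in ("/home/zuul", "/root"):
        print(home, "->", [str(p) for p in candidate_config_files(home, "/tmp")])
```

Under these assumptions, a task that runs as root finds /root/.config/openstack/clouds.yaml before it ever reaches a shared location, so a cloud defined only for the unprivileged user is never seen.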

~~~
[zuul@undercloud ~]$ openstack --os-cloud sandeep endpoint list
+----------------------------------+-----------+--------------+--------------+---------+-----------+---------------------------+
| ID | Region | Service Name | Service Type | Enabled | Interface | URL |
+----------------------------------+-----------+--------------+--------------+---------+-----------+---------------------------+
| 0a864587b43641a0a70e8759268de30b | regionOne | keystone | identity | True | admin | http://10.9.122.22:35357 |
| 216ec335717848e7817c31aecc0a0955 | regionOne | keystone | identity | True | public | http://10.9.122.93:5000 |
| 5bacb43660794ebab48081ee393a9296 | regionOne | keystone | identity | True | internal | http://172.21.33.167:5000 |
+----------------------------------+-----------+--------------+--------------+---------+-----------+---------------------------+
[zuul@undercloud ~]$ sudo -i
[root@undercloud ~]# openstack --os-cloud sandeep endpoint list
Cloud sandeep was not found.
[root@undercloud ~]# cat /root/.config/openstack/clouds.yaml
clouds:
  overcloud:
    auth:
      auth_url: http://10.9.122.93:5000
      password: UQKWmvfZsI3Weni9yTuzP6jkj
      project_domain_name: Default
      project_name: admin
      user_domain_name: Default
      username: admin
    cacert: ''
    identity_api_version: '3'
    region_name: regionOne
  undercloud:
    auth:
      auth_url: https://10.9.122.2:13000
      password: fbZ6YXx42kUPN9JGBRViIiBaQ
      project_domain_name: Default
      project_name: admin
      user_domain_name: Default
      username: admin
    cacert: /etc/pki/ca-trust/source/anchors/cm-local-ca.pem
    identity_api_version: '3'
    region_name: regionOne
~~~
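
As a quick way to confirm which clouds a given clouds.yaml defines, something like the sketch below could be used. It is deliberately naive: it is not a YAML parser and only assumes the two-space block layout shown in the file above. The function name and the trimmed sample text are illustrative, not taken from any TripleO tooling.

```python
from typing import List


def cloud_names(clouds_yaml_text: str) -> List[str]:
    """Naively list the entries directly under the top-level 'clouds:' key.

    Assumes block-style YAML indented with two spaces per level.
    """
    names: List[str] = []
    in_clouds = False
    for line in clouds_yaml_text.splitlines():
        # Skip blank lines and comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # A new top-level key starts; track whether it is 'clouds:'.
            in_clouds = line.strip() == "clouds:"
        elif in_clouds and indent == 2 and line.rstrip().endswith(":"):
            names.append(line.strip()[:-1])
    return names


# Trimmed-down version of the root user's file shown above.
SAMPLE = """\
clouds:
  overcloud:
    auth:
      auth_url: http://10.9.122.93:5000
    region_name: regionOne
  undercloud:
    auth:
      auth_url: https://10.9.122.2:13000
    region_name: regionOne
"""

if __name__ == "__main__":
    names = cloud_names(SAMPLE)
    print(names)
    print("overcloud-two defined:", "overcloud-two" in names)
```

For the sample, overcloud-two is absent, which matches the "Cloud overcloud-two was not found." error seen when the task runs as root.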

1) We need to either modify tripleo-quickstart-extras so that it does not copy clouds.yaml to the root user's directory, or change our job definition to run tempest after both stacks are deployed.

2) On a side note: we use become: true on some tasks, such as those creating keystone resources, and we probably should not.

We can probably remove "become: true" from the following places:
~~~
https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/cinder/cinder-api-container-puppet.yaml#L419
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/cinder/cinder-api-container-puppet.yaml#L484
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/keystone/keystone-container-puppet.yaml#L773
~~~

We have tested removing become: true from the above places in a local environment; stack deployment is successful with these changes.
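
To find other tasks that might carry the same setting, a small audit helper along these lines could be run over a template's text. It is only a sketch: it matches lines textually rather than parsing YAML, so it cannot tell whether the task actually talks to the OpenStack API. The function name and the sample task body are hypothetical.

```python
import re
from typing import List, Tuple

# Matches "become: true" or "become: yes", optionally as the first key of a
# list item and optionally followed by a trailing comment.
BECOME_RE = re.compile(r"^\s*(-\s+)?become:\s*(true|yes)\s*(#.*)?$", re.IGNORECASE)


def find_become_lines(template_text: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line enabling become."""
    return [
        (number, line.strip())
        for number, line in enumerate(template_text.splitlines(), start=1)
        if BECOME_RE.match(line)
    ]


# Hypothetical task list; the shell command is a placeholder.
SAMPLE_TASKS = """\
- name: Manage Cinder Volume Type
  become: true
  shell: echo placeholder
- name: Another task
  debug:
    msg: hello
"""

if __name__ == "__main__":
    for number, text in find_become_lines(SAMPLE_TASKS):
        print(f"line {number}: {text}")
```

Each hit still needs a manual check, since some tasks legitimately require root.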

Sandeep Yadav (sandeepyadav93) wrote :
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
milestone: xena-1 → xena-2
Changed in tripleo:
milestone: xena-2 → xena-3
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "James Slagle <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/764283
Reason: Abandoning this patch per the TripleO Patch Abandonment guidelines
(https://specs.openstack.org/openstack/tripleo-specs/specs/policy/patch-abandonment.html).
If you wish to have this restored and cannot do so yourself, please reach out
via #tripleo on OFTC or the OpenStack Dev mailing list.

swogat pradhan (swogat) wrote :

I am also facing the same issue when trying a DCN deployment.
Reference ticket: https://bugs.launchpad.net/tripleo/+bug/2003919
Do I need to remove the become: true section in the nova deployment template as well?
Or will copying clouds.yaml to root's directory and changing the cloud name solve the issue?

swogat pradhan (swogat) wrote :

I am also facing a similar issue, but in the Nova section, while trying to set up DCN:
2023-01-27 20:07:21.523012 | 48d539a1-1679-deee-75e4-0000000000fe | TASK | Nova: Manage aggregate and availability zone and add hosts to the zone
Using module file /usr/lib/python3.6/site-packages/ansible/modules/cloud/openstack/os_nova_host_aggregate.py
Pipelining is enabled.
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: stack
<localhost> EXEC /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-urbvkhqfmaljfracvpejeknzcglnfdxd ; OS_CLOUD=dcn01 /usr/bin/python3'"'"' && sleep 0'
The full traceback is:
  File "/tmp/ansible_os_nova_host_aggregate_payload_xmki19vw/ansible_os_nova_host_aggregate_payload.zip/ansible/module_utils/openstack.py", line 159, in openstack_cloud_from_module
    interface=module.params['interface'],
  File "/usr/lib/python3.6/site-packages/openstack/__init__.py", line 63, in connect
    options=options, **kwargs)
  File "/usr/lib/python3.6/site-packages/openstack/config/__init__.py", line 36, in get_cloud_region
    return config.get_one(options=parsed_options, **kwargs)
  File "/usr/lib/python3.6/site-packages/openstack/config/loader.py", line 1107, in get_one
    config = self._get_base_cloud_config(cloud, profile)
  File "/usr/lib/python3.6/site-packages/openstack/config/loader.py", line 509, in _get_base_cloud_config
    name=name))
2023-01-27 20:07:24.783699 | 48d539a1-1679-deee-75e4-0000000000fe | FATAL | Nova: Manage aggregate and availability zone and add hosts to the zone | undercloud | error={
    "changed": false,
    "invocation": {
        "module_args": {
            "api_timeout": null,
            "auth": null,
            "auth_type": null,
            "availability_zone": "dcn01",
            "ca_cert": null,
            "client_cert": null,
            "client_key": null,
            "hosts": [
                "dcn01-hci-0.bdxworld.com",
                "dcn01-hci-1.bdxworld.com",
                "dcn01-hci-2.bdxworld.com"
            ],
            "interface": "public",
            "metadata": null,
            "name": "dcn01",
            "region_name": null,
            "state": "present",
            "timeout": 180,
            "validate_certs": null,
            "wait": true
        }
    },
    "msg": "Cloud dcn01 was not found."
}

Reference case: https://bugs.launchpad.net/tripleo/+bug/2003919

Do I need to remove the become: true parameter in the nova deployment template as well, or do I need to copy the clouds.yaml file to root's directory and change the cloud details to match my edge site?
