RoleNetIpMap missing ctlplane entries for computes from DeployedServerPortMap

Bug #1903775 reported by John Fulton
This bug affects 2 people
Affects: tripleo
Status: Invalid
Importance: Wishlist
Assigned to: Harald Jensås

Bug Description

`openstack overcloud admin authorize` is only creating tripleo-admin on my controller nodes but not my computes (or ceph-storage): http://paste.openstack.org/show/799889/

My ansible logs show only the controller node IPs when running tripleo_ansible/playbooks/cli-enable-ssh-admin.yaml : http://paste.openstack.org/show/799890/

As a result, the deploy fails when it needs to do something with nodes besides the controllers
because ansible can't reach them: http://paste.openstack.org/show/799888/

Why might get_overcloud_hosts() be returning an incomplete list?

https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/workflows/deployment.py#L200

Tags: metalsmith
John Fulton (jfulton-org) wrote :

I don't think the functions that query Heat are at fault: the attached stack output shows that RoleNetIpMap IS missing the ctlplane IPs for the compute and ceph-storage nodes, but not for the controllers. So why is Heat missing them?

$ openstack stack show oc0 -f value > stack_show_oc0
$ less stack_show_oc0
...
  {
    "output_key": "BlacklistedIpAddresses",
    "description": "List of blacklisted ctlplane IP addresses",
    "output_value": [
      "",
      "",
      "",
      "",
      "",
      "",
      "",
      ""
    ]
  },
  {
    "output_key": "RoleNetIpMap",
    "description": "Mapping of each network to a list of IPs for each role",
    "output_value": {
      "Compute": {
        "ctlplane": [
          "",
          ""
        ],
        "storage_cloud_0": [
          "172.16.11.93",
          "172.16.11.227"
        ],
        "internal_api_cloud_0": [
          "172.16.13.92",
          "172.16.13.200"
        ],
        "tenant_cloud_0": [
          "172.16.14.71",
          "172.16.14.110"
        ]
      },
      "Controller": {
        "ctlplane": [
          "192.168.24.12",
          "192.168.24.8",
          "192.168.24.18"
        ],
        "storage_cloud_0": [
          "172.16.11.166",
          "172.16.11.94",
          "172.16.11.184"
        ],
        "storage_mgmt_cloud_0": [
          "172.16.12.143",
          "172.16.12.75",
          "172.16.12.175"
        ],
        "internal_api_cloud_0": [
          "172.16.13.198",
          "172.16.13.221",
          "172.16.13.108"
        ],
        "tenant_cloud_0": [
          "172.16.14.116",
          "172.16.14.188",
          "172.16.14.64"
        ],
        "external_cloud_0": [
          "192.168.100.81",
          "192.168.100.34",
          "192.168.100.84"
        ]
      },
      "CephStorage": {
        "ctlplane": [
          "",
          "",
          ""
        ],
        "storage_cloud_0": [
          "172.16.11.89",
          "172.16.11.114",
          "172.16.11.159"
        ],
        "storage_mgmt_cloud_0": [
          "172.16.12.94",
          "172.16.12.168",
          "172.16.12.210"
        ]
      }
    }
  },
...
$
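
A quick way to spot this condition is to scan the RoleNetIpMap value for empty strings. The following is only a sketch: the function name `roles_missing_ips` is made up for illustration, and it assumes the output value has already been loaded into a Python dict with the same shape as the paste above (trimmed here to the ctlplane lists).

```python
from typing import Dict, List


def roles_missing_ips(role_net_ip_map: Dict[str, Dict[str, List[str]]],
                      network: str = "ctlplane") -> Dict[str, int]:
    """Return {role: number of empty entries} for the given network."""
    missing = {}
    for role, networks in role_net_ip_map.items():
        # An empty string means Heat had no IP to report for that node.
        empty = sum(1 for ip in networks.get(network, []) if not ip)
        if empty:
            missing[role] = empty
    return missing


# Trimmed copy of the RoleNetIpMap output_value shown above.
role_net_ip_map = {
    "Compute": {"ctlplane": ["", ""]},
    "Controller": {"ctlplane": ["192.168.24.12", "192.168.24.8", "192.168.24.18"]},
    "CephStorage": {"ctlplane": ["", "", ""]},
}

print(roles_missing_ips(role_net_ip_map))
# {'Compute': 2, 'CephStorage': 3}
```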

John Fulton (jfulton-org) wrote :

The same stack_show_oc0's DeployedServerPortMap has the missing control-plane IPs for the computes and ceph storage nodes.

It looks like those IPs came from overcloud-baremetal-deployed-0.yaml, which was generated by running 'openstack overcloud node provision ... --output overcloud-baremetal-deployed-0.yaml' and then passed as input to 'openstack overcloud deploy ... -e overcloud-baremetal-deployed-0.yaml'.
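
To compare what DeployedServerPortMap holds against RoleNetIpMap, the ctlplane IPs can be pulled out per hostname. This is a rough sketch, not anything from tripleoclient: the helper name `ctlplane_ips` is hypothetical, and it assumes the `<hostname>-ctlplane` key naming and the `fixed_ips` / `ip_address` layout seen in the generated environment files quoted in this report. The sample IP is the one `metalsmith list` reports for oc0-ceph-0.

```python
from typing import Dict, List


def ctlplane_ips(port_map: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each hostname to the ctlplane IPs listed in DeployedServerPortMap."""
    suffix = "-ctlplane"
    result = {}
    for port_name, port in port_map.items():
        if not port_name.endswith(suffix):
            continue  # ignore ports on other networks
        hostname = port_name[: -len(suffix)]
        result[hostname] = [ip["ip_address"] for ip in port.get("fixed_ips", [])]
    return result


# Assumed shape of one DeployedServerPortMap entry.
deployed_server_port_map = {
    "oc0-ceph-0-ctlplane": {"fixed_ips": [{"ip_address": "192.168.24.11"}]},
}

print(ctlplane_ips(deployed_server_port_map))
# {'oc0-ceph-0': ['192.168.24.11']}
```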

John Fulton (jfulton-org) wrote :

Is something wrong with my HostnameMap which is causing certain merges to get missed and thus data omitted?

overcloud-baremetal-deployed-0.yaml contains
~~~
CephStorageHostnameFormat: '%stackname%-cephstorage-%index%'
...
DeployedServerPortMap:
  oc0-ceph-0-ctlplane:
...
HostnameMap:
  overcloud-0-cephstorage-0: oc0-ceph-0
~~~

`metalsmith list` contains:

+------------+------------+------------------------+
| Node Name  | Hostname   | IP Addresses           |
+------------+------------+------------------------+
| oc0-ceph-0 | oc0-ceph-0 | ctlplane=192.168.24.11 |
+------------+------------+------------------------+

https://github.com/openstack/tripleo-heat-templates/tree/master/deployed-server

John Fulton (jfulton-org) wrote :

I was able to get around this after I modified the overcloud-baremetal-deployed-0.yaml generated by 'openstack overcloud node provision' to replace "cephstorage" with "ceph" and "novacompute" with "compute" [1].

The nodes did not get the hostname specified by CephStorageHostnameFormat or ComputeHostnameFormat so those variables were set to what the host actually got. I assume they could then correctly line up during overcloud deployment.

I'm not sure we should consider this user error on my part though. If the answer to either of the following questions is yes, then work could be done under this bug to close it.

Since 'openstack overcloud node provision' was passed ~/metalsmith-0.yaml [2] as input, should it have generated hostnames like the ones in that file?

Should we put a check in tripleoclient so that if you define values in the HostnameMap that cannot be matched against a role hostname, the deployment fails with an error saying that the two didn't line up?
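
Such a check could look roughly like the sketch below. It is not tripleoclient code: the function names are hypothetical, and it assumes that HostnameMap keys are expected to equal `<Role>HostnameFormat` expanded with the stack name and an index from 0 to `<Role>Count - 1`. The stack name "overcloud" in the example is also an assumption, chosen to match the key prefix in the sample data.

```python
from typing import Dict, List, Set


def expected_hostnames(params: Dict[str, object], stack_name: str) -> Set[str]:
    """Expand every <Role>HostnameFormat for indexes 0..<Role>Count-1."""
    names = set()
    for key, fmt in params.items():
        if not key.endswith("HostnameFormat"):
            continue
        role = key[: -len("HostnameFormat")]
        count = int(params.get(role + "Count", 0))
        for index in range(count):
            names.add(str(fmt).replace("%stackname%", stack_name)
                              .replace("%index%", str(index)))
    return names


def unmatched_hostname_map_keys(params: Dict[str, object],
                                stack_name: str) -> List[str]:
    """Return HostnameMap keys that no role hostname format can produce."""
    expected = expected_hostnames(params, stack_name)
    return sorted(k for k in params.get("HostnameMap", {}) if k not in expected)


# Hypothetical parameter_defaults where the format and the map disagree.
params = {
    "ComputeCount": 1,
    "ComputeHostnameFormat": "%stackname%-compute-%index%",
    "ControllerCount": 1,
    "ControllerHostnameFormat": "%stackname%-controller-%index%",
    "HostnameMap": {
        "overcloud-controller-0": "oc0-controller0",
        "overcloud-novacompute-0": "oc0-compute0",
    },
}

bad = unmatched_hostname_map_keys(params, stack_name="overcloud")
if bad:
    print("HostnameMap keys with no matching role hostname:", bad)
```

A real validation would also need to decide which stack name to expand with, which this sketch leaves to the caller.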

[1] http://paste.openstack.org/show/799974/
[2]
(undercloud) [CentOS-8.2 - stack@undercloud ceph-ansible]$ cat ~/metalsmith-0.yaml
---
- name: Controller
  count: 3
  instances:
    - hostname: oc0-controller-0
      name: oc0-controller-0
    - hostname: oc0-controller-1
      name: oc0-controller-1
    - hostname: oc0-controller-2
      name: oc0-controller-2
- name: Compute
  count: 2
  instances:
    - hostname: oc0-compute-0
      name: oc0-compute-0
    - hostname: oc0-compute-1
      name: oc0-compute-1
- name: CephStorage
  count: 3
  instances:
    - hostname: oc0-ceph-0
      name: oc0-ceph-0
    - hostname: oc0-ceph-1
      name: oc0-ceph-1
    - hostname: oc0-ceph-2
      name: oc0-ceph-2
(undercloud) [CentOS-8.2 - stack@undercloud ceph-ansible]$

tags: added: metalsmith
summary: - tripleo-admin user is created on controllers but not computes
+ RoleNetIpMap missing hosts from DeployedServerPortMap
summary: - RoleNetIpMap missing hosts from DeployedServerPortMap
+ RoleNetIpMap missing ctlplane entries for computes from
+ DeployedServerPortMap
John Fulton (jfulton-org) wrote :

A user shouldn't have to modify the file generated by 'openstack overcloud node provision'

 sed -i -e s/novacompute/compute/g -e s/cephstorage/ceph/g deployed-metal.yaml

Changed in tripleo:
assignee: nobody → Harald Jensås (harald-jensas)
Harald Jensås (harald-jensas) wrote :

- name: Controller
  count: 1
  defaults:
    profile: control
  instances:
    - hostname: oc0-controller0
      name: baremetal-88166-leaf1-0
- name: Compute
  count: 1
  defaults:
    profile: compute-leaf2
  instances:
    - hostname: oc0-compute0
      name: baremetal-88166-leaf2-0

Resulting overcloud-baremetal-environment.yaml:

parameter_defaults:
  ComputeCount: 1
  ComputeHostnameFormat: '%stackname%-novacompute-%index%'
  ControllerCount: 1
  ControllerHostnameFormat: '%stackname%-controller-%index%'
  DeployedServerPortMap:
    oc0-compute0-ctlplane:
      fixed_ips:
      - ip_address: 192.168.26.25
    oc0-controller0-ctlplane:
      fixed_ips:
      - ip_address: 192.168.25.12
  HostnameMap:
    overcloud-controller-0: oc0-controller0
    overcloud-novacompute-0: oc0-compute0
resource_registry:
  OS::TripleO::DeployedServer::ControlPlanePort: /usr/share/openstack-tripleo-heat-templates/deployed-server/deployed-neutron-port.yaml

This looks fine, and also deploys just fine in my environment, no need to 'sed -i -e s/novacompute/compute/g'

If the '%ROLE%HostnameFormat' is modified in an environment provided after the overcloud-baremetal-environment.yaml, i.e. another override setting """ ComputeHostnameFormat: '%stackname%-compute-%index%' """, then the HostnameMap entries would be incorrect.
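
To illustrate that ordering problem, here is a simplified sketch. It models environment handling as a plain last-wins merge of `parameter_defaults`, which is an assumption and much cruder than what Heat actually does; `merge_parameter_defaults` is a hypothetical helper, and the stack name "overcloud" is assumed so that it matches the HostnameMap key prefix above.

```python
from typing import Dict


def merge_parameter_defaults(*environments: Dict[str, dict]) -> Dict[str, object]:
    """Merge parameter_defaults in order; later environments win per key."""
    merged: Dict[str, object] = {}
    for env in environments:
        merged.update(env.get("parameter_defaults", {}))
    return merged


# Subset of the generated overcloud-baremetal-environment.yaml shown above.
generated = {
    "parameter_defaults": {
        "ComputeCount": 1,
        "ComputeHostnameFormat": "%stackname%-novacompute-%index%",
        "HostnameMap": {"overcloud-novacompute-0": "oc0-compute0"},
    }
}
# A later environment that overrides only the hostname format.
override = {
    "parameter_defaults": {
        "ComputeHostnameFormat": "%stackname%-compute-%index%",
    }
}

merged = merge_parameter_defaults(generated, override)
expected_key = (merged["ComputeHostnameFormat"]
                .replace("%stackname%", "overcloud")
                .replace("%index%", "0"))

# The HostnameMap still carries the old key, so nothing matches the new name.
print(expected_key, expected_key in merged["HostnameMap"])
# overcloud-compute-0 False
```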

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/roles_data.yaml

John Fulton (jfulton-org) wrote :

I tried to reproduce this yesterday using a fresh checkout of the tripleo main branch and was not able to [1].

The only thing I can think to do at this point is to add a validation.

Should we put a check in tripleoclient so that if you define values in the HostnameMap that cannot be matched against a role hostname, the deployment fails with an error saying that the two didn't line up?

[1]
input:

$ cat metal-big.yaml | curl -F 'sprunge=<-' http://sprunge.us
http://sprunge.us/BrCtum

output:

$ cat deployed-metal-big.yaml | curl -F 'sprunge=<-' http://sprunge.us
http://sprunge.us/XpkS7D

deployed overcloud fine

$ openstack stack show oc0 -f value > stack_show_oc0

$ egrep "ctlplane|CephStorage" -A 3 stack_show_oc0 | less
...
--
      "CephStorage": {
        "ctlplane": [
          "192.168.24.11",
          "192.168.24.22",
          "192.168.24.9"
...

Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
importance: High → Wishlist
John Fulton (jfulton-org) wrote :

I'm unable to reproduce the bug at this point.

Changed in tripleo:
status: Triaged → Invalid