[TLS-E] missing hosts in freeIPA for internalapi network

Bug #1886915 reported by Cédric Jeanneret
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Hello,

While trying to deploy a master overcloud with TLS-E, using the new tripleo-ipa ansible thingy, I hit the following error:

<13>Jul 9 06:40:38 puppet-user: (file & line not available)
<13>Jul 9 06:40:38 puppet-user: Warning: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5
<13>Jul 9 06:40:38 puppet-user: (file: /etc/puppet/hiera.yaml)
<13>Jul 9 06:40:38 puppet-user: Warning: Undefined variable '::deploy_config_name';
<13>Jul 9 06:40:38 puppet-user: (file & line not available)
<13>Jul 9 06:40:38 puppet-user: Warning: Undefined variable '::nova::params::vncproxy_service_name'; class nova::params has not been evaluated
<13>Jul 9 06:40:38 puppet-user: (file & line not available)
<13>Jul 9 06:40:38 puppet-user: Warning: Unknown variable: '::deployment_type'. (file: /etc/puppet/modules/tripleo/manifests/profile/base/database/mysql/client.pp, line: 89, column: 8)
<13>Jul 9 06:40:38 puppet-user: error: Could not connect to cluster (is it running?)
<13>Jul 9 06:40:39 puppet-user: Notice: Compiled catalog for oc-0-ctl-0.mydomain.tld in environment production in 1.54 seconds
<13>Jul 9 06:40:39 puppet-user: Notice: /Stage[main]/Main/Package_manifest[/var/lib/tripleo/installed-packages/overcloud_Controller1]/ensure: created
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Apache_dirs/File[/etc/pki/tls/certs/httpd]/ensure: created
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Apache_dirs/File[/etc/pki/tls/private/httpd]/ensure: created
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Libvirt_vnc_dirs/File[/etc/pki/libvirt-vnc]/ensure: created
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Haproxy_dirs/File[/etc/pki/tls/certs/haproxy]/ensure: created
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Haproxy_dirs/File[/etc/pki/tls/private/haproxy]/ensure: created
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Rabbitmq/File[/usr/bin/certmonger-rabbitmq-refresh.sh]/ensure: defined content as '{md5}9228c38b6f9fdaf73919c2802cb062af'
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Novnc_proxy/File[/usr/bin/certmonger-novnc-proxy-refresh.sh]/ensure: defined content as '{md5}0abda7696e15def437a4169f35377be8'
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Profile::Base::Database::Mysql::Client/File[/etc/my.cnf.d/tripleo.cnf]/ensure: created
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Tripleo::Profile::Base::Database::Mysql::Client/Augeas[tripleo-mysql-client-conf]/returns: executed successfully
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/File_line[pcsd_bind_addr]/ensure: created
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/User[hacluster]/password: changed [redacted] to [redacted]
<13>Jul 9 06:40:40 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/User[hacluster]/groups: groups changed to ['haclient']
<13>Jul 9 06:40:43 puppet-user: Notice: /Stage[main]/Pacemaker::Service/Service[pcsd]/ensure: ensure changed 'stopped' to 'running'
<13>Jul 9 06:40:44 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]/returns: executed successfully
<13>Jul 9 06:40:45 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Triggered 'refresh' from 3 events
<13>Jul 9 06:40:48 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: executed successfully
<13>Jul 9 06:40:50 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: executed successfully
<13>Jul 9 06:40:50 puppet-user: Notice: /Stage[main]/Pacemaker::Service/Service[corosync]/enable: enable changed 'false' to 'true'
<13>Jul 9 06:40:51 puppet-user: Notice: /Stage[main]/Pacemaker::Service/Service[pacemaker]/enable: enable changed 'false' to 'true'
<13>Jul 9 06:41:13 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: executed successfully
<13>Jul 9 06:41:13 puppet-user: Notice: /Stage[main]/Tripleo::Trusted_cas/Tripleo::Trusted_ca[undercloud-ca]/File[/etc/pki/ca-trust/source/anchors/undercloud-ca.pem]/ensure: defined content as '{md5}f949572e3b1a6e342d112737b08382cb'
<13>Jul 9 06:41:13 puppet-user: Notice: /Stage[main]/Tripleo::Trusted_cas/Tripleo::Trusted_ca[undercloud-ca]/Exec[trust-ca-undercloud-ca]: Triggered 'refresh' from 1 event
<13>Jul 9 06:41:13 puppet-user: Notice: /Stage[main]/Tripleo::Profile::Base::Certmonger_user/Tripleo::Certmonger::Haproxy[haproxy-ctlplane]/File[/usr/bin/certmonger-haproxy-refresh.sh]/ensure: defined content as '{md5}b1dae1387c0941dd9d18d05f011ef371'
<13>Jul 9 06:41:14 puppet-user: Notice: /Stage[main]/Certmonger/Service[certmonger]/enable: enable changed 'false' to 'true'
<13>Jul 9 06:41:14 puppet-user: Notice: /Stage[main]/Certmonger/Service[certmonger]: Triggered 'refresh' from 3 events
<13>Jul 9 06:41:14 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Mysql/Certmonger_certificate[mysql]/ensure: created
<13>Jul 9 06:41:15 puppet-user: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I mysql -f /etc/pki/tls/certs/mysql.crt -c IPA -N CN=oc-0-ctl-0.internalapi.mydomain.tld -K mysql/oc-0-ctl-0.internalapi.mydomain.tld -D overcloud.internalapi.mydomain.tld -D oc-0-ctl-0.internalapi.mydomain.tld -w -k /etc/pki/tls/private/mysql.key' returned 3: New signing request \"mysql\" added.
<13>Jul 9 06:41:15 puppet-user: Error: /Stage[main]/Tripleo::Certmonger::Mysql/Certmonger_certificate[mysql]: Could not evaluate: Could not get certificate: Server at https://lab-nat-vm.mydomain.tld/ipa/xml failed request, will retry: 4001 (RPC failed at server. The host 'oc-0-ctl-0.internalapi.mydomain.tld' does not exist to add a service to.).
<13>Jul 9 06:41:15 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Rabbitmq/Certmonger_certificate[rabbitmq]/ensure: created
<13>Jul 9 06:41:15 puppet-user: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I rabbitmq -f /etc/pki/tls/certs/rabbitmq.crt -c IPA -N CN=oc-0-ctl-0.internalapi.mydomain.tld -K rabbitmq/oc-0-ctl-0.internalapi.mydomain.tld -D oc-0-ctl-0.internalapi.mydomain.tld -C /usr/bin/certmonger-rabbitmq-refresh.sh -w -k /etc/pki/tls/private/rabbitmq.key' returned 3: New signing request \"rabbitmq\" added.
<13>Jul 9 06:41:15 puppet-user: Error: /Stage[main]/Tripleo::Certmonger::Rabbitmq/Certmonger_certificate[rabbitmq]: Could not evaluate: Could not get certificate: Server at https://lab-nat-vm.mydomain.tld/ipa/xml failed request, will retry: 4001 (RPC failed at server. The host 'oc-0-ctl-0.internalapi.mydomain.tld' does not exist to add a service to.).
<13>Jul 9 06:41:15 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Rabbitmq/File[/etc/pki/tls/certs/rabbitmq.crt]: Dependency Certmonger_certificate[rabbitmq] has failures: true
<13>Jul 9 06:41:15 puppet-user: Warning: /Stage[main]/Tripleo::Certmonger::Rabbitmq/File[/etc/pki/tls/certs/rabbitmq.crt]: Skipping because of failed dependencies
<13>Jul 9 06:41:15 puppet-user: Warning: /Stage[main]/Tripleo::Certmonger::Rabbitmq/File[/etc/pki/tls/private/rabbitmq.key]: Skipping because of failed dependencies
<13>Jul 9 06:41:15 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Novnc_proxy/Certmonger_certificate[novnc-proxy]/ensure: created
<13>Jul 9 06:41:15 puppet-user: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I novnc-proxy -f /etc/pki/tls/certs/novnc_proxy.crt -c IPA -N CN=oc-0-ctl-0.internalapi.mydomain.tld -K novnc-proxy/oc-0-ctl-0.internalapi.mydomain.tld -D oc-0-ctl-0.internalapi.mydomain.tld -w -k /etc/pki/tls/private/novnc_proxy.key' returned 3: New signing request \"novnc-proxy\" added.
<13>Jul 9 06:41:15 puppet-user: Error: /Stage[main]/Tripleo::Certmonger::Novnc_proxy/Certmonger_certificate[novnc-proxy]: Could not evaluate: Could not get certificate: Server at https://lab-nat-vm.mydomain.tld/ipa/xml failed request, will retry: 4001 (RPC failed at server. The host 'oc-0-ctl-0.internalapi.mydomain.tld' does not exist to add a service to.).
<13>Jul 9 06:41:15 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Novnc_proxy/File[/etc/pki/tls/certs/novnc_proxy.crt]: Dependency Certmonger_certificate[novnc-proxy] has failures: true
<13>Jul 9 06:41:15 puppet-user: Warning: /Stage[main]/Tripleo::Certmonger::Novnc_proxy/File[/etc/pki/tls/certs/novnc_proxy.crt]: Skipping because of failed dependencies
<13>Jul 9 06:41:15 puppet-user: Warning: /Stage[main]/Tripleo::Certmonger::Novnc_proxy/File[/etc/pki/tls/private/novnc_proxy.key]: Skipping because of failed dependencies
<13>Jul 9 06:41:15 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Ovn_dbs/Certmonger_certificate[ovn_dbs]/ensure: created
<13>Jul 9 06:41:16 puppet-user: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I ovn_dbs -f /etc/pki/tls/certs/ovn_dbs.crt -c IPA -N CN=oc-0-ctl-0.internalapi.mydomain.tld -K ovn_dbs/oc-0-ctl-0.internalapi.mydomain.tld -D oc-0-ctl-0.internalapi.mydomain.tld -w -k /etc/pki/tls/private/ovn_dbs.key' returned 3: New signing request \"ovn_dbs\" added.
<13>Jul 9 06:41:16 puppet-user: Error: /Stage[main]/Tripleo::Certmonger::Ovn_dbs/Certmonger_certificate[ovn_dbs]: Could not evaluate: Could not get certificate: Server at https://lab-nat-vm.mydomain.tld/ipa/xml failed request, will retry: 4001 (RPC failed at server. The host 'oc-0-ctl-0.internalapi.mydomain.tld' does not exist to add a service to.).
<13>Jul 9 06:41:16 puppet-user: Notice: /Stage[main]/Tripleo::Certmonger::Ovn_dbs/File[/etc/pki/tls/certs/ovn_dbs.crt]: Dependency Certmonger_certificate[ovn_dbs] has failures: true
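
The rejected host names can be pulled out of a log like the one above with a few lines of Python. This is only a minimal sketch, not part of tripleo or certmonger: the function name `missing_hosts` and the shortened sample lines are illustrative, and the regular expression assumes nothing more than the "The host '...' does not exist" wording seen in the errors above.

```python
import re

# Matches the FreeIPA error wording seen in the log above.
HOST_ERROR = re.compile(r"The host '([^']+)' does not exist")


def missing_hosts(log_text):
    """Return the sorted, de-duplicated host names that FreeIPA rejected."""
    return sorted(set(HOST_ERROR.findall(log_text)))


# Two shortened sample lines in the same shape as the puppet output.
sample = (
    "Error: ... 4001 (RPC failed at server. The host "
    "'oc-0-ctl-0.internalapi.mydomain.tld' does not exist to add a service to.).\n"
    "Error: ... 4001 (RPC failed at server. The host "
    "'oc-0-ctl-0.internalapi.mydomain.tld' does not exist to add a service to.).\n"
)
print(missing_hosts(sample))
```

Every failing request in this log points at the same internalapi host name, so the set collapses to a single entry.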

When checking the IPA content, I indeed don't see the host used to get the certificate:

[CentOS-8.2 - stack@undercloud ~]$ sudo ipa host-find
---------------
5 hosts matched
---------------
  Host name: lab-nat-vm.mydomain.tld
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  SSH public key fingerprint: SHA256:MQLCxozGf0OkGhEvdyq0lI3yDrEgfcvYchcK6i81KBY (ssh-ed25519), SHA256:fOlibTW4MAGLn33NXg4Aer8r4BXlHkChLTPGYvWs9YY (ecdsa-sha2-nistp256), SHA256:+haFyWKGWYGnKfPhrk+RkSIP2Yne1m461ZJalVWtpNA (ssh-rsa)

  Host name: oc-0-ctl-0.ctlplane.mydomain.tld
  Principal name: <email address hidden>
  Principal alias: <email address hidden>

  Host name: oc-0-ctl-0.mydomain.tld
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  SSH public key fingerprint: SHA256:YpZL1iuwZ1DRXouxot8lGTeKodkQDnQuQw2b9NO3khU (ecdsa-sha2-nistp256), SHA256:JuGjnt5phB3SjYLHL0LvdFDuy7NlFa7uq2PzEDMy7TE (ssh-ed25519), SHA256:PXwgn/suCMzema1kk2z2U04vQex9Iv8KoQYANk8dhnQ (ssh-rsa)

  Host name: overcloud.ctlplane.mydomain.tld
  Principal name: <email address hidden>
  Principal alias: <email address hidden>

  Host name: undercloud.mydomain.tld
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  SSH public key fingerprint: SHA256:UdMwT6gdiC2ZwN3cKlw+O9YcnhcaGCixk5cy1SO5phg (ssh-ed25519), SHA256:7VxkzcOa7GC98DCmirsP3v7POtvAWOkI+z6qehyerkY (ecdsa-sha2-nistp256), SHA256:4gh6K2xfB/2V0ZZq2TxLIuWeC1IJdtX04HSqexdqIvw (ssh-rsa)
----------------------------
Number of entries returned 5
----------------------------

My guess is that, at some point, we skip the creation of the hosts for all the subnets, which leads to this error.
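
The check done by eye above can be sketched in Python. This is a rough illustration under assumptions: `registered_hosts` is a made-up helper name, the listing is a trimmed copy of the `ipa host-find` output above, and the parser relies only on the "Host name:" lines in that output.

```python
def registered_hosts(host_find_output):
    """Collect the values of the 'Host name:' lines from `ipa host-find` output."""
    hosts = set()
    for line in host_find_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Host name":
            hosts.add(value.strip())
    return hosts


# Trimmed copy of the listing above (principal and SSH key lines dropped).
listing = """
  Host name: lab-nat-vm.mydomain.tld
  Host name: oc-0-ctl-0.ctlplane.mydomain.tld
  Host name: oc-0-ctl-0.mydomain.tld
  Host name: overcloud.ctlplane.mydomain.tld
  Host name: undercloud.mydomain.tld
"""

# The host name certmonger asked FreeIPA about.
wanted = "oc-0-ctl-0.internalapi.mydomain.tld"
print(wanted in registered_hosts(listing))
```

Only the canonical and ctlplane names of the controller are registered, which is why the internalapi lookup fails.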

Cheers,

C.

Grzegorz Grasza (xek) wrote :

Please also attach the contents of the service_metadata_settings ansible group var and the output of ipa service-find.

That should provide us with more info.

/ Greg

Cédric Jeanneret (cjeanner) wrote :

Used command:
#!/bin/bash
# This file is managed by ansible
set -ex

export DEPLOY_TEMPLATES=/usr/share/openstack-tripleo-heat-templates/
export DEPLOY_STACK=overcloud-0
export DEPLOY_TIMEOUT_ARG=90
export DEPLOY_NETWORKS_FILE=/home/stack/oc0-network-data.yaml
source /home/stack/stackrc;
openstack overcloud deploy --templates $DEPLOY_TEMPLATES\
 --stack $DEPLOY_STACK --timeout $DEPLOY_TIMEOUT_ARG \
 -e /usr/share/openstack-tripleo-heat-templates/environments/net-multiple-nics.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/enable-swap.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
 -e /home/stack/ipa.yaml \
 -e /home/stack/containers-prepare-parameter.yaml \
 -e /home/stack/generated-container-prepare.yaml \
 -e /home/stack/domain.yaml \
 --environment-directory /home/stack/overcloud-0-yml -n $DEPLOY_NETWORKS_FILE

The "ipa.yaml" contains:
resource_registry:
  OS::TripleO::Services::IpaClient: /usr/share/openstack-tripleo-heat-templates/deployment/ipa/ipaservices-baremetal-ansible.yaml

parameter_defaults:
  IdMServer: lab-nat-vm.mydomain.tld
  IdMDomain: mydomain.tld
  DnsServers:
    - "192.168.122.35"

Output of ipa service-find:
------------------
9 services matched
------------------
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Keytab: True ...

Grzegorz Grasza (xek) wrote :

The service_metadata_settings contain the following, so the services are not created:

service_metadata_settings:
  compact_service_HTTP:
  - ctlplane
  compact_service_haproxy:
  - ctlplane
  managed_service_haproxyctlplane: haproxy/overcloud.ctlplane.mydomain.tld

Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
milestone: xena-1 → xena-2
Changed in tripleo:
milestone: xena-2 → xena-3
Cédric Jeanneret (cjeanner) wrote :

Hey Xek,

Sorry for the long silence.

I'm back on that topic, and am wondering what might be missing in my env, though I would think it's more something related to the generation of that service_metadata_settings dict.

My env involves custom roles, generated using:
openstack overcloud roles generate -o /tmp/roledata.yaml Compute Controller CephStorage

It then updates the network data for each role, in order to match custom network names. For instance, "InternalApi" is named "internal_api_cloud_0", and the mapping is done like this:

- name: InternalApi
  vip: true
  name_lower: internal_api_cloud_0
  service_net_map_replace: internal_api
  subnets:
    internal_api_cloud_0_subnet:
      ip_subnet: '172.16.13.0/24'
      allocation_pools: [{'start': '172.16.13.4', 'end': '172.16.13.250'}]
      vlan: 13

The same goes for ALL the other networks - the only network that isn't affected is the ctlplane.
At the role level, my custom role-data has the following, for instance:

- name: Compute
  description: |
    Basic Compute Node role
  CountDefault: 1
  # Create external Neutron bridge (unset if using ML2/OVS without DVR)
  tags:
    - compute
    - external_bridge
  networks:
    InternalApi:
      subnet: internal_api_cloud_0_subnet
    Tenant:
      subnet: tenant_cloud_0_subnet
    Storage:
      subnet: storage_cloud_0_subnet

So I think there's something fishy with the following code:

  IncomingMetadataSettings:
    type: OS::Heat::Value
    properties:
      value:
        yaql:
          # Filter null values and values that contain don't contain
          # 'metadata_settings', get the values from that key and get the
          # unique ones. Also, filter values for networks not associated with
          # this role.
          expression: let(role_networks => $.data.role_networks) -> list(coalesce($.data.role_data, []).where($ != null).where($.containsKey('metadata_settings')).metadata_settings.flatten().distinct().where($ != null and $.containsKey('network')).where($role_networks.contains($.network)))
          data:
            role_data: {get_param: RoleData}
            role_networks:
              - ctlplane
{%- for network in networks if network.name in role.networks %}
  {%- if network.service_net_map_replace is defined %}
              - {{network.service_net_map_replace}}
  {%- else %}
              - {{network.name_lower}}
  {%- endif %} ...
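
For readers less used to yaql, the filter above can be approximated in Python. This is a sketch only: the function name is borrowed from the resource name, and the sample `role_data` is made up to show how an entry is dropped when its `network` value is not in `role_networks`. The thread does not confirm which network names the real data carried.

```python
def incoming_metadata_settings(role_data, role_networks):
    """Approximate the yaql filter: keep settings whose network is in the role."""
    kept = []
    for service in role_data or []:
        # Skip null entries and entries without 'metadata_settings'.
        if not service or "metadata_settings" not in service:
            continue
        for setting in service["metadata_settings"] or []:
            # Skip null settings and settings without a 'network' key.
            if not setting or "network" not in setting:
                continue
            # Keep unique settings whose network belongs to this role.
            if setting["network"] in role_networks and setting not in kept:
                kept.append(setting)
    return kept


# Hypothetical data: one setting on ctlplane, one on a custom network name.
role_data = [
    {"metadata_settings": [{"service": "haproxy", "network": "ctlplane", "type": "vip"}]},
    {"metadata_settings": [{"service": "mysql", "network": "internal_api_cloud_0", "type": "node"}]},
    {"metadata_settings": None},
    None,
]
# What the template would render with service_net_map_replace: internal_api.
role_networks = ["ctlplane", "internal_api"]

print([s["service"] for s in incoming_metadata_settings(role_data, role_networks)])
```

With these made-up values only the ctlplane entry survives, which would match a service_metadata_settings dict that contains nothing but ctlplane.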


Cédric Jeanneret (cjeanner) wrote :
Harald Jensås (harald-jensas) wrote :

In comment #2 (https://bugs.launchpad.net/tripleo/+bug/1886915/comments/2) I don't see a custom roles file used? No '-r' or '--roles-file' parameter?

Cédric Jeanneret (cjeanner) wrote :

Ah, right. That was the old thing.

New command is nearly the same though:
#!/bin/bash
# This file is managed by ansible
set -xeo pipefail

export DEPLOY_TEMPLATES=/usr/share/openstack-tripleo-heat-templates/
export DEPLOY_STACK=overcloud-0
export DEPLOY_TIMEOUT_ARG=90
export DEPLOY_ROLES_FILE=/home/stack/oc0-role-data.yaml
export DEPLOY_NETWORKS_FILE=/home/stack/oc0-network-data.yaml
source /home/stack/stackrc; openstack overcloud deploy --templates $DEPLOY_TEMPLATES --stack $DEPLOY_STACK --timeout $DEPLOY_TIMEOUT_ARG \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/enable-swap.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
 -e /home/stack/ipa.yaml \
 -e /home/stack/containers-prepare-parameter.yaml \
 -e /home/stack/generated-container-prepare.yaml \
 -e /home/stack/oc0-domain.yaml \
 -e /home/stack/overcloud-baremetal-deployed-0.yaml \
 -e /home/stack/overcloud-networks-provisioned-0.yaml \
 -e /home/stack/overcloud-vips-provisioned-0.yaml \
 --environment-directory /home/stack/overcloud-0-yml \
 -r $DEPLOY_ROLES_FILE -n $DEPLOY_NETWORKS_FILE \
 --disable-validations --skip-nodes-and-networks --deployed-server >/home/stack/overcloud_deploy_overcloud-0.log 2>&1

But in the end, I think it's, once again, the mix with "service_net_map_replace" in the network env... I'm running a new test right now, back in a few.

Cédric Jeanneret (cjeanner) wrote :

Some more data:
- removing the "service_net_map_replace" allows a lot more things to be created in FreeIPA
- but it still fails on some certificate requests, apparently the domain used isn't the right one

The failing certificate requests are about the following names/services:
ca-error: Server at https://lab-nat-vm.mydomain.tld/ipa/json failed request, will retry: 4001 (The host 'oc0-controller-0.storage.mydomain.tld' does not exist to add a service to.).
ca-error: Server at https://lab-nat-vm.mydomain.tld/ipa/json failed request, will retry: 4001 (The host 'oc0-controller-0.storagemgmt.mydomain.tld' does not exist to add a service to.).
ca-error: Server at https://lab-nat-vm.mydomain.tld/ipa/json failed request, will retry: 4001 (The host 'oc0-controller-0.internalapi.mydomain.tld' does not exist to add a service to.).
ca-error: Server at https://lab-nat-vm.mydomain.tld/ipa/json failed request, will retry: 4001 (The host 'oc0-controller-0.external.mydomain.tld' does not exist to add a service to.).

Here, we can see the "external", "internalapi" and so on. Those are the default names of the networks. In my env, they are different:
- external_cloud_0
- internal_api_cloud_0
- (basically, all but ctlplane are suffixed with _cloud_0)
- ctlplane

Those custom names are then changed, the "_" is removed, and we therefore end with:
- externalcloud0
- internalapicloud0
- ....
- ctlplane

We can see, in the "FreeIPA registered hosts", that we have the correct names, with the correct *cloud0.mydomain.tld.
Same goes for the known services.
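
The renaming described above is just an underscore strip. A minimal sketch, with `dns_label` being an illustrative name rather than anything from tripleo:

```python
def dns_label(name_lower):
    """Drop underscores so the network name can be used as a DNS label."""
    return name_lower.replace("_", "")


# Custom network names from this environment.
custom_networks = ["external_cloud_0", "internal_api_cloud_0", "ctlplane"]
for name in custom_networks:
    print(name, "->", dns_label(name))
```

The resulting labels are the ones that show up in the registered FreeIPA host names, while the failing certificate requests still use the default "external", "internalapi" and so on.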

So there's something, somewhere, using the default names, preventing the certificate generation from happening with the correct names.

I'll dig in there and try to find where this issue happens.

FreeIPA registered hosts:
[CentOS-8 - stack@undercloud ~]$ ipa host-find | grep Principal
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Principal name: <email address hidden>
  Principal alias: <email address hidden>
  Principal name: host/oc0-controller-0.storagemgmtcloud...

Cédric Jeanneret (cjeanner) wrote :

Digging further.

After some searches and grepping, it seems the certificate requests are using a pre-generated "fqdn_*" variable, such as:

dns: '{{ fqdn_storage_cloud_0 }}'

In this case, that variable value is:
fqdn_storage_cloud_0: oc0-compute-0.storage.mydomain.tld

We can see the network "storage". It's the default name, not the name_lower it claims to be in the actual generation:
puppet/role.role.j2.yaml: fqdn_{{network.name_lower}}: {get_attr: [NetHostMap, value, {{network.name_lower}}, fqdn]}

And this is the case for all the generated things:
config-download/overcloud-0/host_vars/oc0-compute-0:fqdn_canonical: oc0-compute-0.mydomain.tld
config-download/overcloud-0/host_vars/oc0-compute-0:fqdn_ctlplane: oc0-compute-0.ctlplane.mydomain.tld
config-download/overcloud-0/host_vars/oc0-compute-0:fqdn_internal_api_cloud_0: oc0-compute-0.internalapi.mydomain.tld
config-download/overcloud-0/host_vars/oc0-compute-0:fqdn_storage_cloud_0: oc0-compute-0.storage.mydomain.tld
config-download/overcloud-0/host_vars/oc0-compute-0:fqdn_tenant_cloud_0: oc0-compute-0.tenant.mydomain.tld
config-download/overcloud-0/host_vars/oc0-controller-0:fqdn_canonical: oc0-controller-0.mydomain.tld
config-download/overcloud-0/host_vars/oc0-controller-0:fqdn_ctlplane: oc0-controller-0.ctlplane.mydomain.tld
config-download/overcloud-0/host_vars/oc0-controller-0:fqdn_external_cloud_0: oc0-controller-0.external.mydomain.tld
config-download/overcloud-0/host_vars/oc0-controller-0:fqdn_internal_api_cloud_0: oc0-controller-0.internalapi.mydomain.tld
config-download/overcloud-0/host_vars/oc0-controller-0:fqdn_storage_cloud_0: oc0-controller-0.storage.mydomain.tld
config-download/overcloud-0/host_vars/oc0-controller-0:fqdn_storage_mgmt_cloud_0: oc0-controller-0.storagemgmt.mydomain.tld
confi...
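
The mismatch in the listing above can be spotted mechanically. This is a sketch under assumptions: `mismatched_fqdns` is a hypothetical helper, the input is a dict copied from the oc0-compute-0 lines above, and the "expected" label is taken to be the variable suffix with underscores removed.

```python
def mismatched_fqdns(host_vars):
    """Yield (variable, fqdn) pairs whose network label differs from the variable's."""
    for var, fqdn in host_vars.items():
        if not var.startswith("fqdn_") or var == "fqdn_canonical":
            continue
        # Assumed expectation: variable suffix without underscores.
        expected = var[len("fqdn_"):].replace("_", "")
        labels = fqdn.split(".")
        # labels[0] is the short hostname, labels[1] the network label.
        if len(labels) < 2 or labels[1] != expected:
            yield var, fqdn


# Values copied from the oc0-compute-0 host_vars lines above.
compute = {
    "fqdn_canonical": "oc0-compute-0.mydomain.tld",
    "fqdn_ctlplane": "oc0-compute-0.ctlplane.mydomain.tld",
    "fqdn_internal_api_cloud_0": "oc0-compute-0.internalapi.mydomain.tld",
    "fqdn_storage_cloud_0": "oc0-compute-0.storage.mydomain.tld",
    "fqdn_tenant_cloud_0": "oc0-compute-0.tenant.mydomain.tld",
}
for var, fqdn in mismatched_fqdns(compute):
    print(var, fqdn)
```

Only ctlplane passes, since it is the one network whose name was not customised.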


Cédric Jeanneret (cjeanner) wrote :

Found the issue:
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/puppet/role.role.j2.yaml#L373-L388

here, it uses network.name.lower() instead of, say, network.name_lower|replace('_', '')

dang.

Harald Jensås (harald-jensas) wrote :

Hmm, so the question is: should we use network.name.lower() in '/extraconfig/nova_metadata/krb-service-principals/role.role.j2.yaml' as well?

It seems we do the replace('_', '') here[3] instead ...

I think there has been an assumption that "name -> name_lower" is a computed "camelcase -> underscore" translation. (I.e "NetworkName -> network_name" is valid, "NetworkName -> network_name_cloud%index%" is not ...)

I think we may want to use network.name.lower() in all the places instead? [1] and [2] etc ?

I have a feeling that changing NetHostMap to use network.name_lower|replace('_', '') might raise other issues?

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/apache/apache-baremetal-puppet.j2.yaml#L77
[2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/extraconfig/nova_metadata/krb-service-principals/role.role.j2.yaml#L64-L68
[3] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/extraconfig/nova_metadata/krb-service-principals/role.role.j2.yaml#L117
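
The assumption mentioned in this comment ("name -> name_lower" being a computed camelcase to underscore translation) can be written down as a check. A sketch only: `camel_to_snake` and `follows_convention` are hypothetical helpers, not functions from tripleo-heat-templates.

```python
import re


def camel_to_snake(name):
    """Convert 'InternalApi' to 'internal_api'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def follows_convention(network):
    """True when name_lower is the plain snake_case form of name."""
    return network.get("name_lower") == camel_to_snake(network["name"])


default_net = {"name": "InternalApi", "name_lower": "internal_api"}
custom_net = {"name": "InternalApi", "name_lower": "internal_api_cloud_0"}

for net in (default_net, custom_net):
    same_label = net["name"].lower() == net["name_lower"].replace("_", "")
    print(net["name_lower"], follows_convention(net), same_label)
```

When the convention holds, name.lower() and name_lower|replace('_', '') give the same label, so templates mixing the two agree. With a custom name_lower they diverge.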

Cédric Jeanneret (cjeanner) wrote :

Hey Harald,

Well, if we discard that name_lower, what would be its use? Maybe my env is doing it wrong (again), but here's the goal:

Since my tool supports multi-overcloud, it has to create per-overcloud dedicated networks. In order to do this, it defines the different networks in a custom "network-data", as follows:

- name: Storage
  vip: true
  name_lower: storage_cloud_{{ item }}
  subnets:
    storage_cloud_{{ item }}_subnet:
      ip_subnet: '172.16.{{ base_ip }}1.0/24'
      allocation_pools: [{'start': '172.16.{{ base_ip }}1.4', 'end': '172.16.{{ base_ip }}1.250'}]
      vlan: {{ base_ip }}1

Then, I generate a custom role-data and update the network associated to each role, basically replacing mentioned subnets by the value actually set in the network-data.

So if we're using the "name.lower()", it means every OC will get the same name, right? Or is that "name_lower" just useless with network-v2?
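
The concern can be illustrated with a small sketch. The two helper names are made up, and the network entries mimic the template above rendered for two overclouds (item = 0 and item = 1):

```python
def label_from_name(network):
    """Label as derived with network.name.lower()."""
    return network["name"].lower()


def label_from_name_lower(network):
    """Label as derived with network.name_lower|replace('_', '')."""
    return network["name_lower"].replace("_", "")


# The Storage network as rendered for overcloud 0 and overcloud 1.
clouds = [{"name": "Storage", "name_lower": f"storage_cloud_{i}"} for i in range(2)]

print([label_from_name(n) for n in clouds])
print([label_from_name_lower(n) for n in clouds])
```

With name.lower() both overclouds end up with the same "storage" label, while the name_lower form keeps them apart.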

If so, we might want to do some deprecation tests, and flag potential issues beforehand.

At least, using "service_net_map_replace" may create issues (I stumbled on that first), at least in the krb-service-principals/role.role.j2.yaml role you mention.

I can bootstrap some env today if you want to have a look with me.

Cédric Jeanneret (cjeanner) wrote :

Sooo. A bit more digging. There aren't THAT many occurrences of name.lower() - even better, there are some that are actually nice:

{{ network.name_lower|default(network.name.lower()) }}

See the listing below.

It would be pretty nice to use that same {{ network.name_lower|default(network.name.lower()) }} everywhere, while ensuring we're replacing the "_". Note that it's not always the case either; it probably depends on the actual usage.

That |replace('_', '') is needed for actual fqdn or host names, for instance:

overcloud.j2.yaml: default: overcloud.{{network.name.lower()}}.localdomain
overcloud.j2.yaml: 'ci-overcloud.{{network.name.lower()}}.tripleo.org'.
overcloud.j2.yaml: default: overcloud.{{network.name.lower()}}.localdomain
overcloud.j2.yaml: 'ci-overcloud.{{network.name.lower()}}.tripleo.org'.
overcloud.j2.yaml: default: overcloud.{{network.name.lower()}}.localdomain
overcloud.j2.yaml: 'ci-overcloud.{{network.name.lower()}}.tripleo.org'.

And, seeing this, yes, we'll have to use "name_lower|replace()|default()", especially for the multi-overcloud part, UNLESS we're able to change the actual "name" - this is part of my big question regarding my own env...

Full listing:
environments/deployed-ports.j2.yaml: OS::TripleO::Network::Ports::{{network.name}}VipPort: ../network/ports/deployed_vip_{{network.name_lower|default(network.name.lower())}}.yaml
environments/deployed-ports.j2.yaml: OS::TripleO::{{role.name}}::Ports::{{network.name}}Port: ../network/ports/deployed_{{network.name_lower|default(network.name.lower())}}.yaml
environments/network-isolation-no-tunneling.j2.yaml: OS::TripleO::Network::{{network.name}}: ../network/{{network.name_lower|default(network.name.lower())}}.yaml
environments/network-isolation-no-tunneling.j2.yaml: OS::TripleO::Network::Ports::{{network.name}}VipPort: ../network/ports/{{network.name_lower|default(network.name.lower())}}.yaml
environments/network-isolation-no-tunneling.j2.yaml: OS::TripleO::{{role.name}}::Ports::{{network.name}}Port: ../network/ports/{{network.name_lower|default(network.name.lower())}}.yaml
environments/network-isolation-v6-all.j2.yaml: OS::TripleO::Network::{{network.name}}: ../network/{{network.name_lower|default(network.name.lower())}}_v6.yaml
environments/network-isolation-v6-all.j2.yaml: OS::TripleO::Network::Ports::{{network.name}}VipPort: ../network/ports/external_resource_{{network.name_lower|default(network.name.lower())}}_v6.yaml
environments/network-isolation-v6-all.j2.yaml: OS::TripleO::Network::Ports::{{network.name}}VipPort: ../network/ports/{{network.name_lower|default(network.name.lower())}}_v6.yaml
environments/network-isolation-v6-all.j2.yaml: OS::TripleO::{{role.name}}::Ports::{{network.name}}Port: ../network/ports/{{network.name_lower|default(network.name.lower())}}_v6.yaml
environments/network-isolation-v6.j2.yaml: OS::TripleO::Network::{{network.name}}: ../network/{{network.name_lower|default(network.name.lower())}}_v6.yaml
environments/network-isolation-v6.j2.yaml: OS::TripleO::Network::{{network.name}}: ../network/{{network.name_lower|default(network.name.lower())}}.yaml
environments/network-isolation-v6.j2.yaml: OS::TripleO::Network::Po...


OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)
Changed in tripleo:
status: Triaged → In Progress
Harald Jensås (harald-jensas) wrote :

FQDNs were always based on "network.name" passed through the lower() filter. If we change that to use network.name_lower we may end up changing FQDNs on upgrade? That seems scary to me.

It seems to me your multi-overcloud env created neutron networks with per-cloud names. But the hostnames/FQDNs are still the same across all the different overclouds? Based on [1] that would be the case?

AFAICT, the "network.name_lower" used in [2] is just there to do some filtering.
I think if we changed [2] and [3] to use "network.name" instead of "network.name_lower" it would fix the issue? (I proposed a fix doing that) ... OR, did non TLS-e deployments always use "network.name" with a "toLower()" filter, while TLSe deployments use "network.name_lower" with a replace('_', '') filter?

Did/Does this really work pre-Network-V2?

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/puppet/role.role.j2.yaml#L373-L388
[2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/extraconfig/nova_metadata/krb-service-principals/role.role.j2.yaml#L64-L68
[3] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/apache/apache-baremetal-puppet.j2.yaml#L77

Cédric Jeanneret (cjeanner) wrote :

I can't say how it worked before network-v2 - for me, it didn't work back then (probably due to the service_net_map_replace in my network-data? can't say...).

I suspect it worked because only default names are used in CI. Once we get something more specific/custom, it's not new we're hitting issues (I already found such things months/years ago when investigating new features :)).

Right now, I'd expect the fqdn to be constructed with the following:
HOSTNAME.{{network.name_lower|default(network.name.lower())|replace('_', '')}}.{{CloudDomain}}

In my case, this would look like this:
oc0-controller-0.internalapicloud0.mydomain.tld
oc0-compute-1.internalapicloud0.mydomain.tld
oc1-controller-0.internalapicloud1.mydomain.tld
...
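
As a rough illustration of the expression above, here is a minimal Python sketch. It assumes the Jinja2 filters behave like plain string operations; the function name `expected_fqdn` and the `name_lower` value `internal_api_cloud_0` are hypothetical, chosen only to reproduce the hostnames listed.

```python
from typing import Optional


def expected_fqdn(hostname: str, name: str, name_lower: Optional[str],
                  cloud_domain: str) -> str:
    """Mimic name_lower|default(name.lower())|replace('_', '')."""
    # Jinja2's default() only falls back when the value is undefined,
    # which is modeled here as None.
    segment = name_lower if name_lower is not None else name.lower()
    # replace('_', '') strips every underscore from the network segment.
    segment = segment.replace("_", "")
    return f"{hostname}.{segment}.{cloud_domain}"


# Hypothetical inputs: with or without name_lower, the segment is the same.
print(expected_fqdn("oc0-controller-0", "InternalApiCloud0",
                    "internal_api_cloud_0", "mydomain.tld"))
print(expected_fqdn("oc0-controller-0", "InternalApiCloud0",
                    None, "mydomain.tld"))
```

With these assumed inputs, both calls yield the same network segment, so the fallback to network.name.lower() and an underscore-separated name_lower would agree.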

Maybe I can just use "InternalApiCloud0" as name in the network-data.yaml? Though I'm pretty sure it will hit some other issues...

Revision history for this message
Harald Jensås (harald-jensas) wrote :

> Maybe I can just use "InternalApiCloud0" as name in the network-data.yaml?
> Though I'm pretty sure it will hit some other issues...

I think that can work, just ensure name_lower is internal_api so that the ServiceNetMap does not need an update as well.

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

So, with a custom network.name, custom network.name_lower, custom network subnet name, and a custom role-data that's supposed to link everything together, I almost have a success.

The error is now this (check at the end for the full trace):
'fqdn_storagecloud0' is undefined

After some digging, it's true: it doesn't exist with this form, but with another one:
fqdn_storage_cloud_0
This format is the same for all the others.
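
The mismatch can be illustrated with a short Python sketch. It assumes, hypothetically, that the network's name_lower is `storage_cloud_0`, that the defined variable keeps name_lower unchanged, and that the certificate request strips underscores before building the variable name. The helper names and the FQDN value are made up for illustration and are not taken from the attached files.

```python
def defined_key(name_lower: str) -> str:
    # Assumed: the variable that exists keeps name_lower as-is.
    return f"fqdn_{name_lower}"


def requested_key(name_lower: str) -> str:
    # Assumed: the certificate request strips underscores first.
    return "fqdn_" + name_lower.replace("_", "")


# Hypothetical set of defined variables for one host.
defined = {
    defined_key("storage_cloud_0"): "oc0-controller-0.storagecloud0.mydomain.tld",
}

wanted = requested_key("storage_cloud_0")
# The lookup misses because the two sides normalize the name differently.
print(wanted, wanted in defined)
```

Under these assumptions the requested name is never among the defined ones, which matches the "is undefined" error.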

Attached here is my "oc0-network-data.yaml" - showing what I mean by "custom names". I'll attach the custom role-data content right after it.

error={"msg": "{% set unique_providers = [] %}
{% for item in certificate_requests %}
{% set _ = unique_providers.append(item.provider |d(__certificate_provider_default)) %}
{% endfor %}
{{ unique_providers | unique }} : [
  {'ca': 'ipa',
  'dns': '{{ fqdn_ctlplane }}',
  'key_size': '2048',
  'name': 'httpd-ctlplane',
  'principal': 'HTTP/{{ fqdn_ctlplane }}@{{ idm_realm }}',
  'run_after': 'cp /etc/pki/tls/certs/httpd-ctlplane.crt /etc/pki/tls/certs/httpd/httpd-ctlplane.crt \
                cp /etc/pki/tls/private/httpd-ctlplane.key /etc/pki/tls/private/httpd/httpd-ctlplane.key \
                pkill -USR1 httpd\
               '
  },
  {'ca': 'ipa',
  'dns': '{{ fqdn_storagecloud0 }}',
  'key_size': '2048',
  'name': 'httpd-storagecloud0',
  'principal': 'HTTP/{{ fqdn_storagecloud0 }}@{{ idm_realm }}',
  'run_after': 'cp /etc/pki/tls/certs/httpd-storagecloud0.crt /etc/pki/tls/certs/httpd/httpd-storagecloud0.crt \
                cp /etc/pki/tls/private/httpd-storagecloud0.key /etc/pki/tls/private/httpd/httpd-storagecloud0.key \
                pkill -USR1 httpd \
                '
  },
  {'ca': 'ipa',
  'dns': '{{ fqdn_storagemgmtcloud0 }}',
  'key_size': '2048',
  'name': 'httpd-storagemgmtcloud0',
  'principal': 'HTTP/{{ fqdn_storagemgmtcloud0 }}@{{ idm_realm }}',
  'run_after': 'cp /etc/pki/tls/certs/httpd-storagemgmtcloud0.crt /etc/...


Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Custom role-data, as mentioned in the previous comment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/812286
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/6bae260bcfa369365d9cfb3112c686f468fcef55
Submitter: "Zuul (22348)"
Branch: master

commit 6bae260bcfa369365d9cfb3112c686f468fcef55
Author: Harald Jensås <email address hidden>
Date: Mon Oct 4 09:52:09 2021 +0200

    Fix TLS-e with custom network names

    Also correct how internal-tls detects external/tenant

    This reverts commit f708ab7a827cc0db211b4709447f77126087347e.
    This partial reverts commit 578bcb2ffad32c6a39d68b5dc360504e95972ffa.

    Reason for revert: https://bugs.launchpad.net/tripleo/+bug/1886915/comments/8

    Closes-Bug: #1886915
    Change-Id: I8c692ae8419c8e537ec05ebc5d670202c57506ac

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815598
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/f6eddad78c5ae3aec266428ce768d65320785348
Submitter: "Zuul (22348)"
Branch: master

commit f6eddad78c5ae3aec266428ce768d65320785348
Author: Harald Jensås <email address hidden>
Date: Wed Oct 27 08:23:03 2021 +0200

    Don't use service_net_map_replace in krb-svc-principals

    Using the service_net_map replace in role_networks and
    as networks keys causes the actual custom networks to
    be filtered. I.e the service principals are not created.

    Not using service_net_map fixes the problem.

    This reverts commit f708ab7a827cc0db211b4709447f77126087347e.
    This partial reverts commit 578bcb2ffad32c6a39d68b5dc360504e95972ffa.

    Releted-Bug: #1946239
    Closes-Bug: #1886915
    Change-Id: I76a87473b2f21576570a55d0a5ef19f642521336

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815649

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815649
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/a45cc6fb532c1d6c49b3bf53b17e2d2fa45bd7a2
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit a45cc6fb532c1d6c49b3bf53b17e2d2fa45bd7a2
Author: Harald Jensås <email address hidden>
Date: Wed Oct 27 08:23:03 2021 +0200

    Don't use service_net_map_replace in krb-svc-principals

    Using the service_net_map replace in role_networks and
    as networks keys causes the actual custom networks to
    be filtered. I.e the service principals are not created.

    Not using service_net_map fixes the problem.

    This reverts commit f708ab7a827cc0db211b4709447f77126087347e.
    This partial reverts commit 578bcb2ffad32c6a39d68b5dc360504e95972ffa.

    Releted-Bug: #1946239
    Closes-Bug: #1886915
    Change-Id: I76a87473b2f21576570a55d0a5ef19f642521336
    (cherry picked from commit f6eddad78c5ae3aec266428ce768d65320785348)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 16.0.0

This issue was fixed in the openstack/tripleo-heat-templates 16.0.0 release.

Revision history for this message
Cristian Le (lecris) wrote :

@cjeanner @harald-jensas, please take a look at https://bugs.launchpad.net/tripleo/+bug/1988550.

The fixes introduced here conflict with how the service principals are requested for `linux-system-roles.certificate` using `fqdn_$NETWORK`. More info in the linked bug.
