Partitioned RabbitMQ Cluster: Hostname controller-0.internalapi.xxx.local is illegal - Could not auto-cluster with...

Bug #1845806 reported by Cagri Ersen
This bug affects 1 person
Affects: tripleo
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Description
===========
I've deployed an OpenStack platform on a farm of 10 bare-metal nodes (3 controllers + 7 computes).

Although the deployment finished successfully, the compute nodes could not register themselves with the controllers. (The outputs of openstack hypervisor list and openstack server list are empty.)

When I checked nova-compute.log on the compute nodes, there were a lot of errors like this:

```
2019-09-28 22:07:36.142 8 ERROR oslo_service.service [req-fcc471fe-4f1c-4a82-b6ca-7254fe004b2f - - - - -] Error starting thread.: MessagingTimeout: Timed out waiting for a reply to message ID 9de103a900054d57a4d7090ba415a386
```

Since the errors are obviously related to RabbitMQ, I checked the RabbitMQ cluster on the controllers, and it seems the cluster was never formed during the deployment.

Here are the RabbitMQ logs from the controllers:

```
============> controller-0 <============
=ERROR REPORT==== 28-Sep-2019::14:50:49 ===
** System NOT running to use fully qualified hostnames **
** Hostname controller-0.internalapi.test.local is illegal **

=WARNING REPORT==== 28-Sep-2019::14:50:49 ===
Could not auto-cluster with <email address hidden>: {badrpc,
                                                                       nodedown}
============> controller-1 <============
=ERROR REPORT==== 28-Sep-2019::14:50:50 ===
** System NOT running to use fully qualified hostnames **
** Hostname controller-0.internalapi.test.local is illegal **

=WARNING REPORT==== 28-Sep-2019::14:50:50 ===
Could not auto-cluster with <email address hidden>: {badrpc,
                                                                       nodedown}
============> controller-2 <============
=ERROR REPORT==== 28-Sep-2019::14:50:50 ===
** System NOT running to use fully qualified hostnames **
** Hostname controller-0.internalapi.test.local is illegal **

=WARNING REPORT==== 28-Sep-2019::14:50:50 ===
Could not auto-cluster with <email address hidden>: {badrpc,
                                                                       nodedown}
```

And these are the cluster_status outputs:

```
============> controller-0 <============
[root@controller-0 ~]# podman exec -it rabbitmq /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-0'
[{nodes,[{disc,['rabbit@controller-0']}]},
 {running_nodes,['rabbit@controller-0']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-0',[]}]}]

============> controller-1 <============
[root@controller-1 ~]# podman exec -it rabbitmq /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-1'
[{nodes,[{disc,['rabbit@controller-1']}]},
 {running_nodes,['rabbit@controller-1']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-1',[]}]}]

============> controller-2 <============
[root@controller-2 ~]# podman exec -it rabbitmq /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-2'
[{nodes,[{disc,['rabbit@controller-2']}]},
 {running_nodes,['rabbit@controller-2']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-2',[]}]}]
```
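
Each output above lists only the local node under running_nodes, which is what a standalone (non-clustered) node looks like. A minimal Python sketch for checking this automatically, assuming the Erlang-term output format shown above (other RabbitMQ releases may print cluster_status differently). The helper name running_nodes and the abbreviated sample are illustrative, not part of any RabbitMQ tooling:

```python
import re
from typing import List


def running_nodes(cluster_status_output: str) -> List[str]:
    """Extract node names from the {running_nodes,[...]} term."""
    match = re.search(r"\{running_nodes,\[(.*?)\]\}", cluster_status_output, re.S)
    if match is None:
        return []
    # Node names are single-quoted Erlang atoms, e.g. 'rabbit@controller-0'.
    return re.findall(r"'([^']+)'", match.group(1))


# Abbreviated copy of the controller-0 output from this report.
sample = """Cluster status of node 'rabbit@controller-0'
[{nodes,[{disc,['rabbit@controller-0']}]},
 {running_nodes,['rabbit@controller-0']},
 {partitions,[]}]"""

nodes = running_nodes(sample)
is_standalone = len(nodes) == 1
print(nodes, "standalone" if is_standalone else "clustered")
```

With three controllers, a healthy cluster would be expected to report three entries under running_nodes on every node.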

So it seems the RabbitMQ auto-cluster function fails because of an FQDN issue.

Here is the part of the rabbitmq.config file related to the cluster_nodes configuration:

```
[root@controller-0 ~]# cat /var/lib/config-data/rabbitmq/etc/rabbitmq/rabbitmq.config |egrep cluster_nodes
    {cluster_nodes, {['<email address hidden>', '<email address hidden>', '<email address hidden>'], disc}},

[root@controller-1 ~]# cat /var/lib/config-data/rabbitmq/etc/rabbitmq/rabbitmq.config |egrep cluster_nodes
    {cluster_nodes, {['<email address hidden>', '<email address hidden>', '<email address hidden>'], disc}},

[root@controller-2 ~]# cat /var/lib/config-data/rabbitmq/etc/rabbitmq/rabbitmq.config |egrep cluster_nodes
    {cluster_nodes, {['<email address hidden>', '<email address hidden>', '<email address hidden>'], disc}},
```
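
The "System NOT running to use fully qualified hostnames" message is what Erlang reports when a node started with short names is asked to contact a node whose host part contains a dot. A small Python sketch of that check against a cluster_nodes line follows. The node names in this report are redacted, so the names used below are assumptions reconstructed from the hostname in the log, and the helper dotted_hosts is hypothetical:

```python
import re
from typing import List


def dotted_hosts(cluster_nodes_line: str) -> List[str]:
    """Return node names whose host part contains a dot.

    Such names are only usable when Erlang runs with long names;
    a short-name node rejects them as illegal.
    """
    names = re.findall(r"'([^']+)'", cluster_nodes_line)
    return [n for n in names if "." in n.partition("@")[2]]


# Assumed, un-redacted form of the cluster_nodes line (hypothetical values).
line = ("{cluster_nodes, {['rabbit@controller-0.internalapi.test.local', "
        "'rabbit@controller-1.internalapi.test.local', "
        "'rabbit@controller-2.internalapi.test.local'], disc}},")

for name in dotted_hosts(line):
    print(f"{name}: host part is an FQDN, requires long node names")
```

If the entries do turn out to be FQDN-based while the nodes themselves run as rabbit@controller-N, one direction worth checking is whether long names are enabled for the broker (RabbitMQ exposes this through the RABBITMQ_USE_LONGNAME environment variable), or whether cluster_nodes should use the short names instead.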

I think the /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/rabbitmq.pp manifest doesn't configure the RabbitMQ services properly.

Steps to reproduce
==================
1. Install undercloud
2. Deploy overcloud with:
openstack overcloud deploy \
  --timeout 120 \
  --templates \
  -r ~/templates/roles_data.yaml \
  -n ~/templates/network_data.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/network-isolation.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/network-environment.yaml \
  -e ~/templates/nic-mapping.yaml \
  -e ~/templates/network.yaml \
  -e ~/templates/node-info.yaml \
  -e ~/templates/scheduler_hints_env.yaml \
  -e ~/templates/ips-from-pool-all.yaml \
  -e ~/templates/fixed-ip-vips.yaml \
  -e ~/templates/ceph-custom-config.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/ceph-ansible/ceph-ansible.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/ceph-ansible/ceph-rgw.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/disable-telemetry.yaml \
  -e ~/templates/misc-settings.yaml \
  -e ~/templates/timezone.yaml
3. After deployment, check the outputs of openstack hypervisor list and openstack server list.
4. Check rabbitmqctl cluster_status in the controllers' rabbitmq containers.

Expected result
===============
The RabbitMQ cluster should be formed properly.

Actual result
=============
Three RabbitMQ containers, each running in standalone mode.

Environment
===========
1. Stein

2. 10 bare-metal nodes (3 controllers + 7 compute (HCI) nodes)

3. My network environment file includes:
  CloudName: overcloud0001.test.local
  CloudDomain: test.local

Changed in tripleo:
importance: Undecided → High
milestone: none → ussuri-1
status: New → Triaged
Bogdan Dobrelya (bogdando) wrote :

I believe we do not configure RABBITMQ_NODENAME in the puppet manifests.
There was also (a long time ago, though) some weirdness in the naming conventions, if I understand correctly; see https://tickets.puppetlabs.com/browse/MODULES-1673

Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3