tripleo

Partitioned RabbitMQ Cluster: Hostname controller-0.internalapi.xxx.local is illegal - Could not auto-cluster with...

Bug #1845806 reported by Cagri Ersen on 2019-09-28

This bug affects 1 person

Affects   Status    Importance  Assigned to  Milestone
tripleo   Triaged   High        Unassigned   victoria-3

Bug Description

Description
===========
I've deployed an OpenStack platform on a farm of 10 bare-metal nodes (3 controllers + 7 computes).

Although the deployment finished successfully, the compute nodes couldn't register themselves with the controllers. (The outputs of openstack hypervisor list and openstack server list are empty.)

When I checked nova-compute.log on the compute nodes, there were a lot of errors like this:

```
2019-09-28 22:07:36.142 8 ERROR oslo_service.service [req-fcc471fe-4f1c-4a82-b6ca-7254fe004b2f - - - - -] Error starting thread.: MessagingTimeout: Timed out waiting for a reply to message ID 9de103a900054d57a4d7090ba415a386
```

Since the errors are clearly related to RabbitMQ, I checked the RabbitMQ cluster on the controllers, and it seems the cluster was never formed during the deployment.

Here are the RabbitMQ logs from the controllers:

```
============> controller-0 <============
=ERROR REPORT==== 28-Sep-2019::14:50:49 ===
** System NOT running to use fully qualified hostnames **
** Hostname controller-0.internalapi.test.local is illegal **

=WARNING REPORT==== 28-Sep-2019::14:50:49 ===
Could not auto-cluster with <email address hidden>: {badrpc,
nodedown}
============> controller-1 <============
=ERROR REPORT==== 28-Sep-2019::14:50:50 ===
** System NOT running to use fully qualified hostnames **
** Hostname controller-0.internalapi.test.local is illegal **

=WARNING REPORT==== 28-Sep-2019::14:50:50 ===
Could not auto-cluster with <email address hidden>: {badrpc,
nodedown}
============> controller-2 <============
=ERROR REPORT==== 28-Sep-2019::14:50:50 ===
** System NOT running to use fully qualified hostnames **
** Hostname controller-0.internalapi.test.local is illegal **

=WARNING REPORT==== 28-Sep-2019::14:50:50 ===
Could not auto-cluster with <email address hidden>: {badrpc,
nodedown}
```

And these are the cluster_status outputs:

```
============> controller-0 <============
[root@controller-0 ~]# podman exec -it rabbitmq /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-0'
[{nodes,[{disc,['rabbit@controller-0']}]},
{running_nodes,['rabbit@controller-0']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{'rabbit@controller-0',[]}]}]

============> controller-1 <============
[root@controller-1 ~]# podman exec -it rabbitmq /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-1'
[{nodes,[{disc,['rabbit@controller-1']}]},
{running_nodes,['rabbit@controller-1']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{'rabbit@controller-1',[]}]}]

============> controller-2 <============
[root@controller-2 ~]# podman exec -it rabbitmq /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-2'
[{nodes,[{disc,['rabbit@controller-2']}]},
{running_nodes,['rabbit@controller-2']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{'rabbit@controller-2',[]}]}]
```
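
Each of the three outputs above lists only the local node under running_nodes, which is what an unformed cluster looks like. A minimal Python sketch of that check, assuming the classic Erlang-term output format shown above (other rabbitmqctl versions may print a different layout). The helper names running_nodes and is_standalone are made up for illustration:

```python
import re
from typing import List


def running_nodes(cluster_status_output: str) -> List[str]:
    """Return the node names listed under running_nodes.

    Assumes the Erlang-term style output quoted in this report.
    """
    match = re.search(r"\{running_nodes,\[(.*?)\]\}", cluster_status_output, re.S)
    if match is None:
        return []
    return re.findall(r"'([^']+)'", match.group(1))


def is_standalone(cluster_status_output: str) -> bool:
    """True when the node sees at most itself as running."""
    return len(running_nodes(cluster_status_output)) <= 1


# Output copied from controller-0 in this report.
sample = """Cluster status of node 'rabbit@controller-0'
[{nodes,[{disc,['rabbit@controller-0']}]},
{running_nodes,['rabbit@controller-0']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{'rabbit@controller-0',[]}]}]
"""

print(running_nodes(sample))
print(is_standalone(sample))
```

With a properly formed three-node cluster, running_nodes would be expected to return three names and is_standalone would return False.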

So it seems the RabbitMQ auto-cluster function fails because of an FQDN issue.

Here is the part of the rabbitmq.config file related to the cluster_nodes configuration:
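
To illustrate the rule behind the log message: an Erlang node started with short names refuses any peer whose host part contains a dot, which is what "System NOT running to use fully qualified hostnames" reports. Below is a rough Python sketch of that check, not RabbitMQ's or Erlang's actual code. The function name is hypothetical, and the peer node name is assumed to be rabbit@ plus the hostname from the log, since the real cluster_nodes entries are redacted in this report:

```python
def rejected_by_short_name_node(peer: str) -> bool:
    """Return True if a short-name Erlang node would treat the peer's host as illegal.

    In short-name mode the host part of a node name must not contain a dot.
    """
    _, _, host = peer.partition("@")
    return "." in host


# Local node name as printed by cluster_status in this report.
local_node = "rabbit@controller-0"

# ASSUMPTION: the "rabbit@" prefix is a guess; only the hostname appears in the log.
assumed_peer = "rabbit@controller-0.internalapi.test.local"

print(rejected_by_short_name_node(assumed_peer))  # True
print(rejected_by_short_name_node(local_node))    # False
```

Under that assumption, the local nodes run with short names while the peers they are told to join carry fully qualified hostnames, so every join attempt is rejected before any connection is made.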

```
[root@controller-0 ~]# cat /var/lib/config-data/rabbitmq/etc/rabbitmq/rabbitmq.config |egrep cluster_nodes
{cluster_nodes, {['<email address hidden>', '<email address hidden>', '<email address hidden>'], disc}},

[root@controller-1 ~]# cat /var/lib/config-data/rabbitmq/etc/rabbitmq/rabbitmq.config |egrep cluster_nodes
{cluster_nodes, {['<email address hidden>', '<email address hidden>', '<email address hidden>'], disc}},

[root@controller-2 ~]# cat /var/lib/config-data/rabbitmq/etc/rabbitmq/rabbitmq.config |egrep cluster_nodes
{cluster_nodes, {['<email address hidden>', '<email address hidden>', '<email address hidden>'], disc}},
```

I think the /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/rabbitmq.pp manifest doesn't configure the RabbitMQ services properly.

Steps to reproduce
==================
1. Install undercloud
2. Deploy overcloud with:
openstack overcloud deploy \
  --timeout 120 \
  --templates \
  -r ~/templates/roles_data.yaml \
  -n ~/templates/network_data.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/network-isolation.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/network-environment.yaml \
  -e ~/templates/nic-mapping.yaml \
  -e ~/templates/network.yaml \
  -e ~/templates/node-info.yaml \
  -e ~/templates/scheduler_hints_env.yaml \
  -e ~/templates/ips-from-pool-all.yaml \
  -e ~/templates/fixed-ip-vips.yaml \
  -e ~/templates/ceph-custom-config.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/ceph-ansible/ceph-ansible.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/ceph-ansible/ceph-rgw.yaml \
  -e ~/custom-tripleo-heat-templates-generated/environments/disable-telemetry.yaml \
  -e ~/templates/misc-settings.yaml \
  -e ~/templates/timezone.yaml
3. After deployment, check the outputs of openstack hypervisor list and openstack server list.
4. Check rabbitmqctl cluster_status in the RabbitMQ containers on the controllers.

Expected result
===============
The RabbitMQ cluster should be formed properly.

Actual result
=============
Three RabbitMQ containers, each running in standalone mode.

Environment
===========
1. Stein

2. 10 bare-metal nodes (3 controllers + 7 compute (HCI) nodes)

3. My network environment file includes:
CloudName: overcloud0001.test.local
CloudDomain: test.local

Bogdan Dobrelya (bogdando) on 2019-09-30

Changed in tripleo:
importance:	Undecided → High
milestone:	none → ussuri-1
status:	New → Triaged

Bogdan Dobrelya (bogdando) wrote on 2019-09-30:

I believe we do not configure RABBITMQ_NODENAME in the Puppet manifests.
There was also (a long time ago, though) some weirdness in the naming conventions, IIUC; see https://tickets.puppetlabs.com/browse/MODULES-1673
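
As an illustration of what a consistent node-name setting could look like, here is a small Python sketch that derives the two relevant RabbitMQ environment variables from a hostname. RABBITMQ_NODENAME and RABBITMQ_USE_LONGNAME are real RabbitMQ settings, but the function itself is hypothetical and is not what the tripleo Puppet manifests do; it only shows that the node name and the long-name flag have to agree:

```python
from typing import Dict


def rabbitmq_env(fqdn: str, use_longname: bool) -> Dict[str, str]:
    """Build matching RABBITMQ_NODENAME / RABBITMQ_USE_LONGNAME values.

    Hypothetical helper for illustration only.
    """
    # Short-name mode keeps only the first label of the hostname.
    host = fqdn if use_longname else fqdn.split(".", 1)[0]
    return {
        "RABBITMQ_NODENAME": f"rabbit@{host}",
        "RABBITMQ_USE_LONGNAME": "true" if use_longname else "false",
    }


# Hostname taken from the RabbitMQ log in this report.
fqdn = "controller-0.internalapi.test.local"

print(rabbitmq_env(fqdn, use_longname=True))
print(rabbitmq_env(fqdn, use_longname=False))
```

Whichever mode is chosen, the entries in cluster_nodes would need to use the same form (all short or all fully qualified) and resolve on every controller.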

Emilien Macchi (emilienm) on 2019-12-19

Changed in tripleo:
milestone:	ussuri-1 → ussuri-2

wes hayutin (weshayutin) on 2020-02-10

Changed in tripleo:
milestone:	ussuri-2 → ussuri-3

wes hayutin (weshayutin) on 2020-04-13

Changed in tripleo:
milestone:	ussuri-3 → ussuri-rc3

wes hayutin (weshayutin) on 2020-05-26

Changed in tripleo:
milestone:	ussuri-rc3 → victoria-1

Emilien Macchi (emilienm) on 2020-07-28

Changed in tripleo:
milestone:	victoria-1 → victoria-3
