3.9/stable + yoga/stable charms are not getting amqp relation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack RabbitMQ Server Charm | Fix Released | High | Alex Kavanagh |
Bug Description
On a deployment of the 3.9/stable and yoga/stable charms on focal, the services aodh, barbican, designate, and octavia all report that the "amqp" relation is incomplete.
Looking at the logs for aodh, aodh-listener.log shows that it is getting a connection refused; there are similar messages in the logs for the other services.
There are also messages in the logs about missing certificate relations, but these are expected since vault has not yet been unsealed.
The bundle for the deployment can be found at:
https:/
and the crashdump can be found at:
https:/
Other occurrences of this bug can be found at: https:/
description: updated
Changed in charm-rabbitmq-server:
  assignee: nobody → Alex Kavanagh (ajkavanagh)
  status: New → Triaged
  importance: Undecided → High
Changed in charm-rabbitmq-server:
  status: Fix Committed → Fix Released
Okay, I've finally worked out what's going on with this bug. The issue is in the cluster code (which I'm still tracking down, but I do have a solid way to reproduce it). The symptom of this bug report is very similar to https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1971451 but has a very different cause.
This bug is due to the rabbitmq charms not thinking the rabbitmq server instances are clustered and thus not setting the relation data to the client units. The lack of clustering is shown (from the crashdump) in the juju show-unit output for the missing server:
  relation-info:
  - relation-id: 40
    endpoint: cluster
    related-endpoint: cluster
    application-data: {}
    local-unit:
      in-scope: true
      data:
        clustered: juju-19cd36-4-lxd-11
        coordinator: '{}'
        egress-subnets: 10.246.168.155/32
        hostname: juju-19cd36-4-lxd-11
        ingress-address: 10.246.168.155
        private-address: 10.246.168.155
        timestamp: "1653364407.7154403"
    related-units:
      rabbitmq-server/0:
        in-scope: true
        data:
          clustered: juju-19cd36-3-lxd-10
          cookie: QZQPPMHUUZEEMDRGIUVO
          coordinator: '{}'
          egress-subnets: 10.246.169.138/32
          hostname: juju-19cd36-3-lxd-10
          ingress-address: 10.246.169.138
          private-address: 10.246.169.138
          timestamp: "1653364664.9672046"
      rabbitmq-server/2:
        in-scope: true
        data:
          cookie: QZQPPMHUUZEEMDRGIUVO
          coordinator: '{}'
          egress-subnets: 10.246.168.142/32
          hostname: juju-19cd36-5-lxd-11
          ingress-address: 10.246.168.142
          private-address: 10.246.168.142
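For reference, spotting which units lack the 'clustered' key in data like the above is a simple dictionary walk. The snippet below uses a hand-built stand-in for the parsed show-unit output, not the actual crashdump data:

```python
# Minimal sketch: given related-units data shaped like the
# `juju show-unit` output above (hand-built stand-in, not the
# real crashdump), list units whose databag lacks 'clustered'.
related_units = {
    "rabbitmq-server/0": {"clustered": "juju-19cd36-3-lxd-10",
                          "coordinator": "{}"},
    "rabbitmq-server/2": {"coordinator": "{}"},  # no 'clustered' key
}

missing = [unit for unit, data in related_units.items()
           if "clustered" not in data]
print(missing)  # → ['rabbitmq-server/2']
```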
Notice that rabbitmq-server/2 is missing the 'clustered' key, which the clustering code uses to determine whether the rabbitmq instance is clustered and whether to send the data to the clients.
This code is in hooks/rabbitmq_utils.py:
def update_peer_cluster_status():
    """Inform peers that this unit is clustered if it is."""
    # check the leader and try to cluster with it
    if clustered_with_leader():
        log('Host already clustered with %s.' % leader_node())

        cluster_rid = relation_id('cluster', local_unit())
        is_clustered = relation_get(attribute='clustered',
                                    rid=cluster_rid,
                                    unit=local_unit())

        log("is_clustered: type(%s), value(%s)" % (type(is_clustered),
                                                   is_clustered), level=DEBUG)
        log('am I clustered?: %s' % bool(is_clustered), level=DEBUG)
        if not is_clustered:
            # NOTE(freyes): this node needs to be marked as clustered, it's
            # part of the cluster according to 'rabbitmqctl cluster_status'
            # (LP: #1691510)
            relation_set(relation_id=cluster_rid,
                         clustered=get_unit_hostname(),
                         timestamp=time.time())
Essentially, it looks like this code is not run on the failing unit; I'm still working out why this is the case.