Okay, I've finally worked out what's going on with this bug. The issue is in the cluster code (which I'm still tracking down, but I do have a solid way to reproduce it). The symptom of this bug report is very similar to https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1971451 but has a very different cause.
This bug is due to the rabbitmq charms not thinking the rabbitmq server instances are clustered and thus not setting the relation data to the client units. The lack of clustering is shown (from the crashdump) in the juju-show-unit of the missing server:
relation-info:
- relation-id: 40
  endpoint: cluster
  related-endpoint: cluster
  application-data: {}
  local-unit:
    in-scope: true
    data:
      clustered: juju-19cd36-4-lxd-11
      coordinator: '{}'
      egress-subnets: 10.246.168.155/32
      hostname: juju-19cd36-4-lxd-11
      ingress-address: 10.246.168.155
      private-address: 10.246.168.155
      timestamp: "1653364407.7154403"
  related-units:
    rabbitmq-server/0:
      in-scope: true
      data:
        clustered: juju-19cd36-3-lxd-10
        cookie: QZQPPMHUUZEEMDRGIUVO
        coordinator: '{}'
        egress-subnets: 10.246.169.138/32
        hostname: juju-19cd36-3-lxd-10
        ingress-address: 10.246.169.138
        private-address: 10.246.169.138
        timestamp: "1653364664.9672046"
    rabbitmq-server/2:
      in-scope: true
      data:
        cookie: QZQPPMHUUZEEMDRGIUVO
        coordinator: '{}'
        egress-subnets: 10.246.168.142/32
        hostname: juju-19cd36-5-lxd-11
        ingress-address: 10.246.168.142
        private-address: 10.246.168.142
Notice that rabbitmq-server/2 is missing the 'clustered' key, which the clustering code uses to determine whether the rabbitmq instance is clustered and whether to send the relation data to the clients.
This code is in hooks/rabbitmq_utils.py:
    def update_peer_cluster_status():
        """Inform peers that this unit is clustered if it is."""
        # check the leader and try to cluster with it
        if clustered_with_leader():
            log('Host already clustered with %s.' % leader_node())

            cluster_rid = relation_id('cluster', local_unit())
            is_clustered = relation_get(attribute='clustered',
                                        rid=cluster_rid,
                                        unit=local_unit())

            log("is_clustered: type(%s), value(%s)" % (type(is_clustered),
                                                       is_clustered), level=DEBUG)
            log('am I clustered?: %s' % bool(is_clustered), level=DEBUG)
            if not is_clustered:
                # NOTE(freyes): this node needs to be marked as clustered, it's
                # part of the cluster according to 'rabbitmqctl cluster_status'
                # (LP: #1691510)
                relation_set(relation_id=cluster_rid,
                             clustered=get_unit_hostname(),
                             timestamp=time.time())
Essentially, it looks like this code is not run on the failing unit. Still working out why this is the case.
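In case it helps anyone checking their own crashdump for the same symptom, here is a rough sketch of how the cluster relation data could be scanned for units that never set the 'clustered' key. This is not charm code: the helper name units_missing_clustered and the dict layout are my own assumptions, modelled on a trimmed copy of the juju show-unit output above after loading it as YAML.

```python
# Hypothetical diagnostic helper, NOT part of the rabbitmq-server charm.
# Assumption: the 'cluster' relation entry from 'juju show-unit' has already
# been loaded into a plain dict with the same shape as the output above.


def units_missing_clustered(cluster_relation):
    """Return the units whose relation data has no (or an empty) 'clustered' key."""
    missing = []

    # The unit that 'juju show-unit' was run against.
    local = cluster_relation.get("local-unit", {})
    if not local.get("data", {}).get("clustered"):
        missing.append("local-unit")

    # Every peer unit on the same relation.
    related = cluster_relation.get("related-units", {})
    for name, unit in sorted(related.items()):
        if not unit.get("data", {}).get("clustered"):
            missing.append(name)

    return missing


# Trimmed sample based on the relation data shown above.
cluster_relation = {
    "endpoint": "cluster",
    "local-unit": {
        "in-scope": True,
        "data": {"clustered": "juju-19cd36-4-lxd-11"},
    },
    "related-units": {
        "rabbitmq-server/0": {
            "in-scope": True,
            "data": {"clustered": "juju-19cd36-3-lxd-10"},
        },
        "rabbitmq-server/2": {
            "in-scope": True,
            "data": {"hostname": "juju-19cd36-5-lxd-11"},
        },
    },
}

print(units_missing_clustered(cluster_relation))  # ['rabbitmq-server/2']
```

Any unit reported here is one where update_peer_cluster_status() either never ran or never reached the relation_set() call.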