Comment 1 for bug 1975605

Alex Kavanagh (ajkavanagh) wrote :

Okay, I've finally worked out what's going on with this bug. The issue is in the clustering code; I'm still tracking down the exact spot, but I do have a reliable way to reproduce it. The symptom in this bug report is very similar to https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1971451 but has a very different cause.

This bug is due to the rabbitmq charm not considering the rabbitmq server instances to be clustered, and thus not setting the relation data for the client units. The lack of clustering is shown (from the crashdump) in the juju-show-unit output of the missing server:

  relation-info:
  - relation-id: 40
    endpoint: cluster
    related-endpoint: cluster
    application-data: {}
    local-unit:
      in-scope: true
      data:
        clustered: juju-19cd36-4-lxd-11
        coordinator: '{}'
        egress-subnets: 10.246.168.155/32
        hostname: juju-19cd36-4-lxd-11
        ingress-address: 10.246.168.155
        private-address: 10.246.168.155
        timestamp: "1653364407.7154403"
    related-units:
      rabbitmq-server/0:
        in-scope: true
        data:
          clustered: juju-19cd36-3-lxd-10
          cookie: QZQPPMHUUZEEMDRGIUVO
          coordinator: '{}'
          egress-subnets: 10.246.169.138/32
          hostname: juju-19cd36-3-lxd-10
          ingress-address: 10.246.169.138
          private-address: 10.246.169.138
          timestamp: "1653364664.9672046"
      rabbitmq-server/2:
        in-scope: true
        data:
          cookie: QZQPPMHUUZEEMDRGIUVO
          coordinator: '{}'
          egress-subnets: 10.246.168.142/32
          hostname: juju-19cd36-5-lxd-11
          ingress-address: 10.246.168.142
          private-address: 10.246.168.142

Notice that rabbitmq-server/2 is missing the 'clustered' key, which is used in parts of the clustering code to determine whether the rabbitmq instance is clustered and whether to send the data to the clients.
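
As a quick way to spot this condition, here is a minimal sketch (not charm code) that scans cluster relation data for units with no 'clustered' value. The dictionary is transcribed by hand from the output above and trimmed to the relevant keys; the helper name units_missing_clustered and the "local-unit" label are made up for illustration.

```python
# Cluster relation data transcribed from the juju-show-unit output above.
# Only the keys relevant to clustering are kept; "local-unit" is a
# placeholder label, since the output does not name the local unit.
cluster_relation = {
    "local-unit": {
        "clustered": "juju-19cd36-4-lxd-11",
        "hostname": "juju-19cd36-4-lxd-11",
    },
    "rabbitmq-server/0": {
        "clustered": "juju-19cd36-3-lxd-10",
        "hostname": "juju-19cd36-3-lxd-10",
    },
    "rabbitmq-server/2": {
        "hostname": "juju-19cd36-5-lxd-11",
    },
}


def units_missing_clustered(relation_data):
    """Return the sorted names of units with no truthy 'clustered' value."""
    return sorted(
        unit
        for unit, data in relation_data.items()
        if not data.get("clustered")
    )


print(units_missing_clustered(cluster_relation))  # ['rabbitmq-server/2']
```

The same check could be run against the parsed YAML of a real juju show-unit dump, but that parsing step is left out here to keep the sketch self-contained.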

The code that sets this key is in hooks/rabbitmq_utils.py:

def update_peer_cluster_status():
    """Inform peers that this unit is clustered if it is."""
    # check the leader and try to cluster with it
    if clustered_with_leader():
        log('Host already clustered with %s.' % leader_node())

        cluster_rid = relation_id('cluster', local_unit())
        is_clustered = relation_get(attribute='clustered',
                                    rid=cluster_rid,
                                    unit=local_unit())
        log("is_clustered: type(%s), value(%s)" % (type(is_clustered),
                                                   is_clustered), level=DEBUG)

        log('am I clustered?: %s' % bool(is_clustered), level=DEBUG)
        if not is_clustered:
            # NOTE(freyes): this node needs to be marked as clustered, it's
            # part of the cluster according to 'rabbitmqctl cluster_status'
            # (LP: #1691510)
            relation_set(relation_id=cluster_rid,
                         clustered=get_unit_hostname(),
                         timestamp=time.time())

Essentially, it looks like this code is not run on the failing unit. Still working out why this is the case.