3.9/stable + yoga/stable charms are not getting amq relation

Bug #1975605 reported by Alexander Balderson
This bug affects 6 people
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: High
Assigned to: Alex Kavanagh

Bug Description

On a deployment of the 3.9/stable and yoga/stable charms on focal, the services aodh, barbican, designate, and octavia all report that the "amqp" relation is incomplete.

Looking at the logs for aodh, aodh-listener.log shows that it is getting a connection refused; there are similar messages in the logs for the other services.

There are also messages in the logs about missing certificate relations, but these are expected since vault has not yet been unsealed.

The bundle for the deployment can be found at:
https://oil-jenkins.canonical.com/artifacts/8f9b5ab2-e9ee-44d9-81ca-f5b6fba90b54/generated/generated/openstack/bundle.yaml

and the crashdump can be found at:
https://oil-jenkins.canonical.com/artifacts/8f9b5ab2-e9ee-44d9-81ca-f5b6fba90b54/generated/generated/openstack/juju-crashdump-openstack-2022-05-24-07.18.29.tar.gz

Other occurrences of this bug can be found at: https://solutions.qa.canonical.com/bugs/bugs/bug/1975605

description: updated
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Okay, I've finally worked out what's going on with this bug. The issue is in the cluster code (which I'm still tracking down, but I do have a solid way to reproduce it). The symptom of this bug report is very similar to https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1971451 but has a very different cause.

This bug is due to the rabbitmq charms not thinking the rabbitmq server instances are clustered and thus not setting the relation data to the client units. The lack of clustering is shown (from the crashdump) in the juju-show-unit of the missing server:

  relation-info:
  - relation-id: 40
    endpoint: cluster
    related-endpoint: cluster
    application-data: {}
    local-unit:
      in-scope: true
      data:
        clustered: juju-19cd36-4-lxd-11
        coordinator: '{}'
        egress-subnets: 10.246.168.155/32
        hostname: juju-19cd36-4-lxd-11
        ingress-address: 10.246.168.155
        private-address: 10.246.168.155
        timestamp: "1653364407.7154403"
    related-units:
      rabbitmq-server/0:
        in-scope: true
        data:
          clustered: juju-19cd36-3-lxd-10
          cookie: QZQPPMHUUZEEMDRGIUVO
          coordinator: '{}'
          egress-subnets: 10.246.169.138/32
          hostname: juju-19cd36-3-lxd-10
          ingress-address: 10.246.169.138
          private-address: 10.246.169.138
          timestamp: "1653364664.9672046"
      rabbitmq-server/2:
        in-scope: true
        data:
          cookie: QZQPPMHUUZEEMDRGIUVO
          coordinator: '{}'
          egress-subnets: 10.246.168.142/32
          hostname: juju-19cd36-5-lxd-11
          ingress-address: 10.246.168.142
          private-address: 10.246.168.142

Notice that rabbitmq-server/2 is missing the 'clustered' key, which is used by the clustering code to determine whether the rabbitmq instance is clustered and whether to send the data to the clients.
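The effect can be sketched with a minimal check: the clustering code only treats the cluster as formed, and only publishes the amqp data to clients, once each peer has set its 'clustered' key. The function and data below are illustrative only (the unit names mirror the crashdump), not the actual charm code.

```python
# Hypothetical sketch: a peer counts as clustered only once it has published
# a 'clustered' key on the cluster relation.

def all_peers_clustered(peer_units):
    """peer_units: mapping of unit name -> its cluster relation data."""
    return all("clustered" in data for data in peer_units.values())

peers = {
    "rabbitmq-server/0": {"clustered": "juju-19cd36-3-lxd-10",
                          "cookie": "QZQPPMHUUZEEMDRGIUVO"},
    # rabbitmq-server/2 never set the key, as in the juju show-unit above:
    "rabbitmq-server/2": {"cookie": "QZQPPMHUUZEEMDRGIUVO"},
}

# The cluster never looks complete, so the amqp relation data is never sent.
assert all_peers_clustered(peers) is False
```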

This code is in hooks/rabbitmq_utils.py:

def update_peer_cluster_status():
    """Inform peers that this unit is clustered if it is."""
    # check the leader and try to cluster with it
    if clustered_with_leader():
        log('Host already clustered with %s.' % leader_node())

        cluster_rid = relation_id('cluster', local_unit())
        is_clustered = relation_get(attribute='clustered',
                                    rid=cluster_rid,
                                    unit=local_unit())
        log("is_clustered: type(%s), value(%s)" % (type(is_clustered),
                                                   is_clustered), level=DEBUG)

        log('am I clustered?: %s' % bool(is_clustered), level=DEBUG)
        if not is_clustered:
            # NOTE(freyes): this node needs to be marked as clustered, it's
            # part of the cluster according to 'rabbitmqctl cluster_status'
            # (LP: #1691510)
            relation_set(relation_id=cluster_rid,
                         clustered=get_unit_hostname(),
                         timestamp=time.time())

Essentially, it looks like this code is not run on the failing unit; I'm still working out why that is the case.

Changed in charm-rabbitmq-server:
assignee: nobody → Alex Kavanagh (ajkavanagh)
status: New → Triaged
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: Triaged → In Progress
Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

If I'm not confusing the bugs, a quick workaround is to manually run the config-changed hook on all rabbitmq-server units; that makes one of them complete whatever was missing, and the relations start to work.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

@Andre, it's possible this is a work-around (I couldn't get it to work), but it does interrupt the deployment and stop it in its tracks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/868514
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/81f08ab7695ab507ade076c33d0cc168a03be221
Submitter: "Zuul (22348)"
Branch: master

commit 81f08ab7695ab507ade076c33d0cc168a03be221
Author: Alex Kavanagh <email address hidden>
Date: Fri Dec 23 13:20:34 2022 +0000

    Fix issue where charms aren't clustered but RMQ is

    Due to the @cache decorator in the code, it was possible to get the
    charm into a state where RMQ is clustered, but the charm doesn't record
    it. The charm 'thinks' it is clustered when it has set the 'clustered'
    key on the 'cluster' relation. Unfortunately, due to the @cached
    decorator it's possible in the 'cluster-relation-changed' hook to have a
    situation where the RMQ instance clusters during the hook execution and
    then, later, when it's supposed to write the 'clustered' key, it reads
    the previously cached value from when it wasn't clustered and therefore
    doesn't set the 'clustered' key. This is just about the only
    opportunity to do it, and so the charm ends up being locked.

    The fix was to clear the @cache values so that the nodes would be
    re-read, and this allows the charm to then write the 'clustered' key.

    Change-Id: I12be41a83323d150ba1cbaeef64041f0bb5e32ce
    Closes-Bug: #1975605
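The stale-read pitfall the commit message describes can be reproduced with a minimal memoizing decorator. This is an illustrative sketch with made-up names, not the actual charmhelpers @cached implementation; it only shows how a value cached early in a hook can mask a state change that happens mid-hook, and why flushing the cache fixes it.

```python
# Minimal sketch of a memoizing cache, in the spirit of charmhelpers' @cached.

cache = {}

def cached(func):
    """Memoize results by function name and arguments."""
    def wrapper(*args):
        key = (func.__name__, args)
        if key not in cache:
            cache[key] = func(*args)
        return cache[key]
    return wrapper

def flush_cache():
    """Analogue of clearing the cache so state is re-read (the fix)."""
    cache.clear()

# Stand-in for querying 'rabbitmqctl cluster_status'.
cluster_state = {"clustered": False}

@cached
def is_clustered():
    return cluster_state["clustered"]

assert is_clustered() is False   # read early in the hook; result is cached
cluster_state["clustered"] = True  # RMQ clusters during the hook execution
assert is_clustered() is False   # stale: the cached value masks the change,
                                 # so the 'clustered' key is never written
flush_cache()
assert is_clustered() is True    # after the flush the new state is seen
```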

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/jammy)
Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Worked around this one with:

juju run -a rabbitmq-server hooks/config-changed

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high for tracking the stable backport appropriately.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/jammy)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/869455
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/8f1986b8b583c81e3351b19eea29988c3ce83715
Submitter: "Zuul (22348)"
Branch: stable/jammy

commit 8f1986b8b583c81e3351b19eea29988c3ce83715
Author: Alex Kavanagh <email address hidden>
Date: Fri Dec 23 13:20:34 2022 +0000

    Fix issue where charms aren't clustered but RMQ is

    Due to the @cache decorator in the code, it was possible to get the
    charm into a state where RMQ is clustered, but the charm doesn't record
    it. The charm 'thinks' it is clustered when it has set the 'clustered'
    key on the 'cluster' relation. Unfortunately, due to the @cached
    decorator it's possible in the 'cluster-relation-changed' hook to have a
    situation where the RMQ instance clusters during the hook execution and
    then, later, when it's supposed to write the 'clustered' key, it reads
    the previously cached value from when it wasn't clustered and therefore
    doesn't set the 'clustered' key. This is just about the only
    opportunity to do it, and so the charm ends up being locked.

    The fix was to clear the @cache values so that the nodes would be
    re-read, and this allows the charm to then write the 'clustered' key.

    Change-Id: I12be41a83323d150ba1cbaeef64041f0bb5e32ce
    Closes-Bug: #1975605
    (cherry picked from commit 81f08ab7695ab507ade076c33d0cc168a03be221)

tags: added: in-stable-jammy
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released