3.9/stable + yoga/stable charms are not getting amq relation

Bug #1975605 reported by Alexander Balderson
This bug affects 6 people
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: High
Assigned to: Alex Kavanagh

Bug Description

On a deployment of the 3.9/stable and yoga/stable charms on focal, the services aodh, barbican, designate, and octavia all report that the "amqp" relation is incomplete.

Looking at the logs for aodh, aodh-listener.log shows that it is getting a connection refused; there are similar messages in the logs for the other services.

There are also messages in the logs about missing certificate relations, but these are expected since vault has not yet been unsealed.

The bundle for the deployment can be found at:
https://oil-jenkins.canonical.com/artifacts/8f9b5ab2-e9ee-44d9-81ca-f5b6fba90b54/generated/generated/openstack/bundle.yaml

and the crashdump can be found at:
https://oil-jenkins.canonical.com/artifacts/8f9b5ab2-e9ee-44d9-81ca-f5b6fba90b54/generated/generated/openstack/juju-crashdump-openstack-2022-05-24-07.18.29.tar.gz

Other occurrences of this bug can be found at: https://solutions.qa.canonical.com/bugs/bugs/bug/1975605

description: updated
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Okay, I've finally worked out what's going on with this bug. The issue is in the cluster code (which I'm still tracking down, but I do have a solid way to reproduce it). The symptom of this bug report is very similar to https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1971451 but has a very different cause.

This bug is due to the rabbitmq charms not thinking the rabbitmq server instances are clustered and thus not setting the relation data to the client units. The lack of clustering is shown (from the crashdump) in the juju-show-unit of the missing server:

  relation-info:
  - relation-id: 40
    endpoint: cluster
    related-endpoint: cluster
    application-data: {}
    local-unit:
      in-scope: true
      data:
        clustered: juju-19cd36-4-lxd-11
        coordinator: '{}'
        egress-subnets: 10.246.168.155/32
        hostname: juju-19cd36-4-lxd-11
        ingress-address: 10.246.168.155
        private-address: 10.246.168.155
        timestamp: "1653364407.7154403"
    related-units:
      rabbitmq-server/0:
        in-scope: true
        data:
          clustered: juju-19cd36-3-lxd-10
          cookie: QZQPPMHUUZEEMDRGIUVO
          coordinator: '{}'
          egress-subnets: 10.246.169.138/32
          hostname: juju-19cd36-3-lxd-10
          ingress-address: 10.246.169.138
          private-address: 10.246.169.138
          timestamp: "1653364664.9672046"
      rabbitmq-server/2:
        in-scope: true
        data:
          cookie: QZQPPMHUUZEEMDRGIUVO
          coordinator: '{}'
          egress-subnets: 10.246.168.142/32
          hostname: juju-19cd36-5-lxd-11
          ingress-address: 10.246.168.142
          private-address: 10.246.168.142

Notice that rabbitmq-server/2 is missing the 'clustered' key, which is used by the clustering code to determine whether the rabbitmq instance is clustered and whether to send the data to the clients.
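The effect can be sketched with a minimal check: the clustering code only treats the cluster as formed, and only publishes the amqp data to clients, once each peer has set its 'clustered' key. The function and data below are illustrative only (the unit names mirror the crashdump), not the actual charm code.

```python
# Hypothetical sketch: a peer counts as clustered only once it has published
# a 'clustered' key on the cluster relation.

def all_peers_clustered(peer_units):
    """peer_units: mapping of unit name -> its cluster relation data."""
    return all("clustered" in data for data in peer_units.values())

peers = {
    "rabbitmq-server/0": {"clustered": "juju-19cd36-3-lxd-10",
                          "cookie": "QZQPPMHUUZEEMDRGIUVO"},
    # rabbitmq-server/2 never set the key, as in the juju show-unit above:
    "rabbitmq-server/2": {"cookie": "QZQPPMHUUZEEMDRGIUVO"},
}

# The cluster never looks complete, so the amqp relation data is never sent.
assert all_peers_clustered(peers) is False
```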

This code is in hooks/rabbitmq_utils.py:

def update_peer_cluster_status():
    """Inform peers that this unit is clustered if it is."""
    # check the leader and try to cluster with it
    if clustered_with_leader():
        log('Host already clustered with %s.' % leader_node())

        cluster_rid = relation_id('cluster', local_unit())
        is_clustered = relation_get(attribute='clustered',
                                    rid=cluster_rid,
                                    unit=local_unit())
        log("is_clustered: type(%s), value(%s)" % (type(is_clustered),
                                                   is_clustered), level=DEBUG)

        log('am I clustered?: %s' % bool(is_clustered), level=DEBUG)
        if not is_clustered:
            # NOTE(freyes): this node needs to be marked as clustered, it's
            # part of the cluster according to 'rabbitmqctl cluster_status'
            # (LP: #1691510)
            relation_set(relation_id=cluster_rid,
                         clustered=get_unit_hostname(),
                         timestamp=time.time())

Essentially, it looks like this code is not run on the failing unit; I'm still working out why that is the case.

Changed in charm-rabbitmq-server:
assignee: nobody → Alex Kavanagh (ajkavanagh)
status: New → Triaged
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: Triaged → In Progress
Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

If I'm not confusing the bugs, a quick workaround is to manually run the config-changed hook on all rabbitmq-server units; that makes one of them complete whatever was missing, and the relations start to work.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

@Andre, it's possible this is a work-around (I couldn't get it to work), but it does interrupt the deployment and stop it in its tracks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/868514
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/81f08ab7695ab507ade076c33d0cc168a03be221
Submitter: "Zuul (22348)"
Branch: master

commit 81f08ab7695ab507ade076c33d0cc168a03be221
Author: Alex Kavanagh <email address hidden>
Date: Fri Dec 23 13:20:34 2022 +0000

    Fix issue where charms aren't clustered but RMQ is

    Due to the @cache decorator in the code, it was possible to get the
    charm into a state where RMQ is clustered, but the charm doesn't record
    it. The charm 'thinks' it is clustered when it has set the 'clustered'
    key on the 'cluster' relation. Unfortunately, due to the @cached
    decorator it's possible in the 'cluster-relation-changed' hook to have a
    situation where the RMQ instance clusters during the hook execution and
    then, later, when it's supposed to write the 'clustered' key, it reads
    the previously cached value from when it wasn't clustered and therefore
    doesn't set the 'clustered' key. This is just about the only
    opportunity to do it, and so the charm ends up being locked.

    The fix was to clear the @cache values so that the nodes would be
    re-read, and this allows the charm to then write the 'clustered' key.

    Change-Id: I12be41a83323d150ba1cbaeef64041f0bb5e32ce
    Closes-Bug: #1975605
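The stale-read pitfall the commit message describes can be reproduced with a minimal memoizing decorator. This is an illustrative sketch with made-up names, not the actual charmhelpers @cached implementation; it only shows how a value cached early in a hook can mask a state change that happens mid-hook, and why flushing the cache fixes it.

```python
# Minimal sketch of a memoizing cache, in the spirit of charmhelpers' @cached.

cache = {}

def cached(func):
    """Memoize results by function name and arguments."""
    def wrapper(*args):
        key = (func.__name__, args)
        if key not in cache:
            cache[key] = func(*args)
        return cache[key]
    return wrapper

def flush_cache():
    """Analogue of clearing the cache so state is re-read (the fix)."""
    cache.clear()

# Stand-in for querying 'rabbitmqctl cluster_status'.
cluster_state = {"clustered": False}

@cached
def is_clustered():
    return cluster_state["clustered"]

assert is_clustered() is False   # read early in the hook; result is cached
cluster_state["clustered"] = True  # RMQ clusters during the hook execution
assert is_clustered() is False   # stale: the cached value masks the change,
                                 # so the 'clustered' key is never written
flush_cache()
assert is_clustered() is True    # after the flush the new state is seen
```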

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/jammy)
Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Worked around this one with:

juju run -a rabbitmq-server hooks/config-changed

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high for tracking the stable backport appropriately.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/jammy)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/869455
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/8f1986b8b583c81e3351b19eea29988c3ce83715
Submitter: "Zuul (22348)"
Branch: stable/jammy

commit 8f1986b8b583c81e3351b19eea29988c3ce83715
Author: Alex Kavanagh <email address hidden>
Date: Fri Dec 23 13:20:34 2022 +0000

    Fix issue where charms aren't clustered but RMQ is

    Due to the @cache decorator in the code, it was possible to get the
    charm into a state where RMQ is clustered, but the charm doesn't record
    it. The charm 'thinks' it is clustered when it has set the 'clustered'
    key on the 'cluster' relation. Unfortunately, due to the @cached
    decorator it's possible in the 'cluster-relation-changed' hook to have a
    situation where the RMQ instance clusters during the hook execution and
    then, later, when it's supposed to write the 'clustered' key, it reads
    the previously cached value from when it wasn't clustered and therefore
    doesn't set the 'clustered' key. This is just about the only
    opportunity to do it, and so the charm ends up being locked.

    The fix was to clear the @cache values so that the nodes would be
    re-read, and this allows the charm to then write the 'clustered' key.

    Change-Id: I12be41a83323d150ba1cbaeef64041f0bb5e32ce
    Closes-Bug: #1975605
    (cherry picked from commit 81f08ab7695ab507ade076c33d0cc168a03be221)

tags: added: in-stable-jammy
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released