Comment 11 for bug 1796886

David Ames (thedac) wrote :

I have been asked to rule out this bug https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1818260 as a possible culprit.

If the following log message appears on the leader node in any of the failed scenarios, then this bug would be a duplicate of that one.

 "check_cluster_memberships(): '<NODENAME>' in nodes but not in charm relations or running_nodes, telling RabbitMQ to forget about it"

Based on my reading of this bug, I don't think that is likely.
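
To make the check above repeatable, here is a minimal sketch that scans log lines for the quoted message. It assumes the message text is logged exactly as quoted; the function and variable names are hypothetical, and it takes lines as input rather than assuming a particular log path.

```python
import re

# Matches the charm log line quoted above and captures the node name.
# Assumption: the message appears verbatim somewhere in the log line.
FORGET_NODE_RE = re.compile(
    r"check_cluster_memberships\(\): '(?P<node>[^']+)' in nodes but not in "
    r"charm relations or running_nodes, telling RabbitMQ to forget about it"
)


def find_forgotten_nodes(log_lines):
    """Return the node names the leader was told to forget, in log order."""
    found = []
    for line in log_lines:
        match = FORGET_NODE_RE.search(line)
        if match:
            found.append(match.group("node"))
    return found
```

An empty result from the leader's log in a failed run would support ruling out bug 1818260.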

Further TRIAGE:

I still think the root problem is that the new node joining the cluster (or something else in the charm) sets its IP address on the amqp relation before the node is ready.

As I mentioned in Comment #1, I think this can be resolved in rabbit_utils.client_node_is_ready(). It could, for example, check that all of the users/passwords in leader-settings are accessible on the new node before setting amqp relation data.
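
A rough sketch of that kind of check, to illustrate the idea only. This is not the charm's code: credentials_ready() and the can_authenticate callable are hypothetical names, and how the charm would actually read leader-settings or probe the local broker is left as an assumption.

```python
def credentials_ready(expected_credentials, can_authenticate):
    """Check that every expected user can log in to the local node.

    expected_credentials: dict of username -> password (hypothetical shape
        for what would be read from leader-settings).
    can_authenticate: callable(username, password) -> bool (hypothetical
        probe against the local RabbitMQ node).

    Returns (ready, missing), where missing is a sorted list of usernames
    that failed the probe.
    """
    missing = [
        user
        for user, password in sorted(expected_credentials.items())
        if not can_authenticate(user, password)
    ]
    return (not missing, missing)
```

The caller would set the node's address on the amqp relation only when ready is True, and otherwise defer and log the missing users so the cause is visible on the next hook run.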

Alok in Comment #6 is describing the symptom, not the root cause. We must stop the new node's IP from being set on the relation until the node is ready to handle requests.

I agree with everyone that the oslo.messaging client code should probably be more robust and handle retries better, but we have more control over the rabbitmq-server charm.