I have deployed an Openstack (ocata) cloud with a 3 node rabbitmq cluster and done a bunch of vm creates etc to check that rpc is working and all is good. Now I add a unit to rabbitmq and I notice that once it is up and running, the rabbitmq cluster is healthy but some of my clients e.g. nova-compute and neutron-metering agent have died. Checking the logs at both ends I seen that they got restarted (when their .conf files got updated with new hosts) and they tried to connect to the newly added unit of rabbit BUT it seems they did so before the new rabbit unit was fully synced/mirrored because the clients get an authentication failure and never retry thus they sit there unusable until they are manually restarted. They also therefore get reported as down in agent/service listings.
nova-compute.log error - https://pastebin.ubuntu.com/p/CKYBJPVnPK/
neutron-metering-agent.log error - https://pastebin.ubuntu.com/p/kjYFnhXdrp/
neutron-openvswitch-agent.log ... (they are all the same)
newly added rabbtimq log from around the time of these errors - https://pastebin.ubuntu.com/p/C26KH69JBZ/ - also these errors all happen before the logs indicating that the cluster is syncing
Let's add more gates in rabbit_ utils.client_ node_is_ ready() to make sure the node has joined the cluster and there is a reasonable expectation it has its data.