oslo.messaging

Services may lose their classic queues after RabbitMQ service restart

Bug #2039693 reported by Andrey Grishin on 2023-10-18

This bug affects 2 people

Affects         Status  Importance  Assigned to  Milestone
oslo.messaging  New     Undecided   Unassigned

Bug Description

Hi,
I'm testing RabbitMQ HA on an OpenStack Zed classic installation deployed with kolla-ansible.
The RabbitMQ service runs in Ubuntu 22.04.1 containers on a 3-host cluster that uses both quorum and classic queues.
I've noticed that when the RabbitMQ service is restarted on the host that owns a classic queue, the queue moves to (is successfully recreated on) one of the other running RabbitMQ hosts. Once the cluster has fully recovered, we restart the service on the queue's new owner. In our configuration the queue effectively migrates between two of the three hosts because of the RabbitMQ setting queue_master_locator = min-masters.
After several such restart iterations, RabbitMQ tells reconnecting clients (captured with Wireshark) that the queue exists when in fact it does not. As a result, the clients do not recreate their queues (reply, fanout, etc.) when reconnecting, which leaves OpenStack services non-functional.
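To illustrate why the queue alternates between only two hosts, here is a simplified model of the min-masters placement rule: a new classic queue master goes to the running node that currently hosts the fewest queue masters. The function name pick_min_masters_node, the input format (one "node master_count" pair per line) and the tie-breaking rule (first node listed wins) are assumptions for illustration; RabbitMQ's real locator may break ties differently and the counts change as queues move.

```shell
#!/usr/bin/env bash
# Hypothetical helper: simplified model of queue_master_locator = min-masters.
# stdin:     lines of "node master_count"
# arguments: nodes that are currently down (e.g. the one being restarted)
# Prints the running node with the fewest masters. On a tie the node listed
# first wins (an assumption, not necessarily RabbitMQ's behaviour).
pick_min_masters_node() {
  local down=" $* "
  awk -v down="$down" '
    NF < 2 { next }                       # ignore blank or malformed lines
    index(down, " " $1 " ") == 0 {        # skip nodes that are down
      if (best == "" || $2 + 0 < min) { best = $1; min = $2 + 0 }
    }
    END { if (best != "") print best }
  '
}
```

With counts of 5, 4 and 9 masters on control-1, control-2 and control-3, restarting control-2 sends the queue to control-1 and restarting control-1 sends it back to control-2; control-3 is never chosen while it hosts the most masters.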

Additional context
Tested on:
RabbitMQ 3.12.0, 3.12.6
Erlang 26.1.1
oslo.messaging 14.0.1
amqp 5.1.1
kombu 5.2.4

I'm using this simple script to automate the check:

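#!/usr/bin/bash
# Repeatedly restart RabbitMQ on the hosts in host_list and stop as soon as
# the monitored reply queue no longer has a consumer.

connect_host="control-3"
host_list="control-2 control-1"
reply="reply_351f90757a75436287b313f0eb076e46"
iter=1
tries=20

while (( iter <= tries )); do
  # host_list is intentionally unquoted so it splits into separate hosts.
  for host in ${host_list}; do
    # Check from connect_host whether the monitored queue still has a consumer.
    if ssh "${connect_host}" "sudo docker exec -u0 rabbitmq rabbitmqctl list_consumers -sq --no-table-headers" | grep -wq "${reply}"; then
      echo "iter:${iter} service restart on: ${host}"
      ssh "${host}" "sudo docker restart rabbitmq" > /dev/null
      # Give the restarted node time to rejoin the cluster.
      sleep 10
    else
      echo "Queue is lost - iter:${iter}, detected before restarting: ${host}"
      exit 1
    fi
  done
  ((++iter))
done

echo "Queue survived ${tries} iterations"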

Script variables:
connect_host - the control host used to run the checks; the queue never moves to it because of queue_master_locator = min-masters
host_list - the hosts between which the queue moves
reply - the classic queue being monitored
tries - upper bound on the number of restart iterations
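The fixed "sleep 10" in the script assumes the restarted node is back within ten seconds, while the test is only meaningful once the cluster has recovered. Below is a sketch of a generic polling helper that could replace it. The name wait_until and its parameters are hypothetical; using rabbitmqctl await_startup inside the container as the readiness check is an assumption, and a node that has started is not necessarily a fully recovered cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    (( i < attempts )) && sleep "$delay"
  done
  return 1
}

# Possible use in the script above instead of "sleep 10" (untested assumption):
#   wait_until 30 2 ssh "${host}" \
#     "sudo docker exec -u0 rabbitmq rabbitmqctl await_startup" > /dev/null || exit 1
```

Polling with a bounded number of attempts keeps the test from racing ahead of a slow node while still failing loudly if a node never comes back.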

How I understand this behavior:

The client has a pool of RabbitMQ nodes to connect to: node1, node2, node3.
The client connects to node1 and creates a classic non-mirrored queue, which is placed on node3.
We restart the RabbitMQ service on node3.
The client creates the queue again; it is placed on node2.
After node3 has successfully rejoined the cluster, we restart the RabbitMQ service on node2.
The client creates the queue again; it is placed on node3.
After several such iterations, the client connected to node1 is told by node1 that the queue still exists, although in fact it does not, so the client does not redeclare the queue.
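The script above infers a lost queue from the absence of its consumer in rabbitmqctl list_consumers. The same idea can be applied to all reply queues at once: take the queue names the broker reports, take the queue names that have a consumer, and print those that appear only in the first set. This is a sketch; the function name and file-based interface are hypothetical, and it assumes the queue name is the first column of both "rabbitmqctl list_queues -sq --no-table-headers name" and "rabbitmqctl list_consumers -sq --no-table-headers". A queue can also be briefly without a consumer while a client reconnects, so a single hit is not proof of loss.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print queues the broker lists that have no consumer.
#   $1: file with list_queues output   (queue name in column 1)
#   $2: file with list_consumers output (queue name in column 1)
#   $3: optional name prefix to restrict the check, e.g. "reply_"
queues_without_consumers() {
  local queues_file=$1 consumers_file=$2 prefix=${3:-}
  awk -v prefix="$prefix" '
    NF == 0 { next }                                  # skip blank lines
    # First file (consumers): remember queues that have a consumer.
    FILENAME == ARGV[1] { has_consumer[$1] = 1; next }
    # Second file (queues): print matching queues with no consumer, once each.
    (prefix == "" || substr($1, 1, length(prefix)) == prefix) &&
      !($1 in has_consumer) && !seen[$1]++ { print $1 }
  ' "$consumers_file" "$queues_file"
}

# Possible use from the control host (assumption, not part of the report):
#   queues_without_consumers \
#     <(sudo docker exec -u0 rabbitmq rabbitmqctl list_queues -sq --no-table-headers name) \
#     <(sudo docker exec -u0 rabbitmq rabbitmqctl list_consumers -sq --no-table-headers) \
#     reply_
```

Matching on FILENAME rather than the common NR == FNR idiom keeps the check correct when the consumer list is empty, which is exactly the case where every reply queue has lost its consumer.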



Andrey Grishin (agrishin) wrote on 2023-10-18:

Also, here is my discussion related to this problem: https://github.com/rabbitmq/rabbitmq-server/discussions/9716

summary:

- Services may lose their queues after RabbitMQ service restart
+ Services may lose their classic queues after RabbitMQ service restart

Andrey Grishin (agrishin) on 2023-10-19

description:

updated
