RabbitMQ failover causes transient queues not to be recreated

Bug #2031512 reported by Yusuf Güngör
This bug affects 4 people
Affects: oslo.messaging
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hi everyone, we are using the Zed version of OpenStack, with oslo.messaging 14.0.1 for all services.

Our RabbitMQ is installed and configured via kolla-ansible.
"amqp_durable_queues = true" is set for all services under their [oslo_messaging_rabbit] sections.

There is also an ha-all policy with "ha-mode: all", "ha-promote-on-shutdown: always" and the pattern ^(?!(amq\.)|(.*_fanout_)|(reply_)).* (sketched below).
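For reference, this setup amounts to roughly the following (a sketch; kolla-ansible generates the actual configuration and applies the policy itself, so vhost and policy names may differ):

    # in each service's configuration file
    [oslo_messaging_rabbit]
    amqp_durable_queues = true

    # the HA policy, roughly equivalent to:
    rabbitmqctl set_policy --apply-to queues ha-all \
      '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' \
      '{"ha-mode":"all","ha-promote-on-shutdown":"always"}'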

After rebooting all controller nodes one by one, some reply queues are not recreated. In other words, the non-HA queues (those excluded by the pattern above, such as the reply queues) are not recreated.

Do you have any idea how to fix this?

The logs look like the following:

/var/log/kolla/nova/nova-conductor.log:2023-08-16 10:52:13.970 27 WARNING oslo_messaging._drivers.amqpdriver [None req-bda2bd6c-5c18-48ce-924c-6d9af2fb28b6 - - - - - -] reply_60d6e5b31df946a391621751856800ce doesn't exist, drop reply to 29951416f27546cdbbc70146e619439b: oslo_messaging.exceptions.MessageUndeliverable
/var/log/kolla/nova/nova-conductor.log:2023-08-16 10:52:13.971 27 ERROR oslo_messaging._drivers.amqpdriver [None req-bda2bd6c-5c18-48ce-924c-6d9af2fb28b6 - - - - - -] The reply 29951416f27546cdbbc70146e619439b failed to send after 60 seconds due to a missing queue (reply_60d6e5b31df946a391621751856800ce). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable

/var/log/kolla/neutron/neutron-server.log:2023-08-15 17:36:24.333 40 ERROR oslo_messaging._drivers.amqpdriver [None req-0b7ed5e1-9488-4095-8d92-c9aeb333f011 - - - - - -] The reply d79b3d1485574cf9b0f418f0e920421c failed to send after 60 seconds due to a missing queue (reply_c8d743b4e7184a7ca0be09c43ed03608). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
/var/log/kolla/neutron/neutron-server.log:2023-08-15 17:36:47.533 54 WARNING oslo_messaging._drivers.amqpdriver [None req-be715e9b-7139-4e92-9e78-a103ec828d9d - - - - - -] reply_5c0c508ae6a44d6c859b21b9eec8a828 doesn't exist, drop reply to 14f52a5591a24f0890b0c956c799b750: oslo_messaging.exceptions.MessageUndeliverable

Revision history for this message
Takashi Kajinami (kajinamit) wrote:

You have to dig into RabbitMQ to find out why these queues are deleted and not promoted.

Can you check the RabbitMQ log and see if there are any records about the "broken" queues during failover?
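
For anyone checking this, queue state and policy assignment can be inspected with something like the following (a sketch; the log path assumes kolla's usual layout, and the queue name is taken from the logs above):

    # list reply queues with their durability, auto-delete flag and effective policy
    rabbitmqctl list_queues name durable auto_delete policy | grep reply_

    # search the RabbitMQ logs for records about the missing queue around the failover
    grep reply_60d6e5b31df946a391621751856800ce /var/log/kolla/rabbitmq/rabbit@*.log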

Changed in oslo.messaging:
status: New → Incomplete
Revision history for this message
Andrew Bonney (andrewbonney) wrote (last edit):

I suspect we have the same issue. I've seen cases where services of the same version reconnect and re-create their queues successfully, but also cases where reconnection happens and the queues are not reinstated. I've copied some logs from these cases here:

In this case, the nova-compute service lost connection to RMQ when it went down and the queue was destroyed. Upon reconnection it successfully re-created the queue: https://gist.githubusercontent.com/andrewbonney/fb28609b8266d6691007181518054d60/raw/769128e10f18031217d08123f8ba7271b2850e19/compute101-rmq-ok

In this case, the nova-compute service lost connection to RMQ when it went down and the queue was destroyed. Upon reconnection it failed to re-create the queue, resulting in timeouts waiting for replies: https://gist.githubusercontent.com/andrewbonney/e2391ed01bfda77f55beebba51e60dd4/raw/6e816a03371b47d42a14398b7a7e1bd4f70bf806/compute302-rmq-fail (I've since noted that debug logging isn't turned on here, which is unhelpful; I'll try to get another example with it turned on.)

These are using oslo_messaging 14.2.1 with the same HA policy as the original poster, which excludes reply and fanout queues, relying upon the service to re-create them if the RMQ instance it was connected to goes away. We aren't using durable queues in this case.

Revision history for this message
Andrew Bonney (andrewbonney) wrote:

After a fair bit of debugging I'm currently suspecting a RabbitMQ bug, so I have created a report there (https://github.com/rabbitmq/rabbitmq-server/issues/11000). This has only become apparent as a result of https://github.com/openstack/oslo.messaging/commit/b4b49248bcfcb169f96ab2d47b5d207b1354ffa8, which is present in both oslo.messaging versions mentioned in this bug report so far. In my own testing, if I revert this fix then oslo.messaging gets stuck trying to reconnect to the same member of the RabbitMQ cluster during an outage, but once that cluster member comes back online everything appears to function correctly.

There are a few open oslo.messaging bugs which sound similar to this one, most notably https://bugs.launchpad.net/oslo.messaging/+bug/2039693. The configurations are usually slightly different, but the behaviour is the same and I suspect they share the same underlying issue.

Revision history for this message
Andrew Bonney (andrewbonney) wrote:

After reviewing a response from RabbitMQ and some important bits of their docs (see the last paragraph of the Queue Properties section: https://www.rabbitmq.com/docs/queues#properties), it looks like this is the result of an expected race condition. Whilst their docs don't explicitly call out this case, it's so similar that I expect it's the same issue.

My intention is to try switching our reply queues into HA mode; both OSA and Kolla currently appear to exclude them (^(?!(amq\.)|(.*_fanout_)|(reply_)).*). Given the response from RabbitMQ (see the discussion in https://github.com/rabbitmq/rabbitmq-server/discussions/11001), unless server-named exclusive queues are used, the current model (non-exclusive, classic) won't be supported in the future anyway. As such, the options seem to be either to use HA for the reply queues or to switch to quorum queues (both sketched below).
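
For illustration, the two options would look roughly like this (a sketch, not a tested configuration; rabbit_quorum_queue exists in recent oslo.messaging releases but covers the regular queues, and a separate option for the transient reply/fanout queues only appeared in later releases, so check what the version you run supports):

    # Option 1: mirror reply queues too, by dropping the reply_ exclusion
    rabbitmqctl set_policy --apply-to queues ha-all \
      '^(?!(amq\.)|(.*_fanout_)).*' \
      '{"ha-mode":"all","ha-promote-on-shutdown":"always"}'

    # Option 2: switch to quorum queues in each service's oslo.messaging config
    [oslo_messaging_rabbit]
    rabbit_quorum_queue = true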
