RabbitMQ queues often hang after charm config changes or rabbitmq restarts due to multiple causes (overview/co-ordination bug)

Bug #1943937 reported by Trent Lloyd
Affects: OpenStack RabbitMQ Server Charm
Status: In Progress
Importance: High
Assigned to: Unassigned

Bug Description

[Issue]
RabbitMQ restarts, particularly rolling restarts of multiple nodes, can leave the queues in a bad state that is difficult to diagnose and difficult to recover from; a rolling restart does not resolve it, and recovery requires all nodes to be stopped simultaneously and then started again.

This is often triggered by the charm itself as part of a config-changed event: all the servers get restarted 30 seconds apart (due to the default known-wait=30), while at the same time the charm re-applies some of the queue configuration such as mirroring and HA policies.

This can be reliably and easily reproduced with any cluster-partition-handling value (ignore, autoheal or pause_minority).

This is happening frequently in production deployments (on a weekly basis), causing high-severity cases and cloud downtime with a high impact to users. These issues have persisted for a long time and caused much confusion. I have attempted to comprehensively research and document them, and this bug is the result of that work. As you will see from the data below, there are a large number of related items that require attention. This bug is intended as a 'covering bug' to document the various causes and to spin off smaller bugs for the relevant fixes. There is some overlap between the fixes, and it is possible not all of them will be required depending on which are accepted.

Please note that I appreciate this bug description is VERY long; however, the issue truly appears to be that complex. I will split each individual fix into a separate bug to handle its resolution but wanted to track the overarching, inter-related situation somewhere.

[Test Case]
This issue is best reproduced on Bionic 18.04. It is harder to reproduce on Focal 20.04 due to a number of bug fixes, but still possible, particularly if you also have network partitions.

Generally speaking, restarting all 3 servers at approximately the same time is likely to trigger the issue. In some cases, especially where a cluster partition had previously occurred (even days ago), restarting only 1 or 2 of the servers may also trigger the situation.

I found that the following scenario reliably reproduces it most of the time when used in an OpenStack-on-OpenStack test deployment.

(in parallel at the same time)
rabbitmq-server/0: sudo systemctl restart rabbitmq-server
rabbitmq-server/1: sudo systemctl restart rabbitmq-server

(as soon as one of the above restarts returns to the command prompt)
rabbitmq-server/2: sudo systemctl restart rabbitmq-server
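
If it helps, the same timing can be scripted from the Juju client. This is only a sketch, assuming bash >= 4.3 (for wait -n) and the Juju 2.x "juju run" syntax:

# Restart units 0 and 1 in parallel, then unit 2 as soon as the first of them returns
juju run --unit rabbitmq-server/0 'systemctl restart rabbitmq-server' &
juju run --unit rabbitmq-server/1 'systemctl restart rabbitmq-server' &
wait -n   # returns when the first background restart finishes
juju run --unit rabbitmq-server/2 'systemctl restart rabbitmq-server'
wait      # wait for the remaining background restart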

Depending on the speed of the underlying server and the number of queues created (a basic OpenStack install seems to have around 500 queues), you may need to experiment a little with the exact timing. It can be reproduced with all cluster-partition-handling settings, though the setting affects exactly how reproducible it is with a given timing.

Changing one of the charm config options causes the charm to do such a rolling restart itself and is also likely to reproduce the issue. The default 30 second known-wait between restarts makes it slightly less reliable to reproduce than the above, but it still happens, depending a little on the speed and size of your environment. It is a bit racy.

[Symptoms]

A random subset of the queues will then hang. Some or all of the following symptoms are observed:

(a) The queues disappear entirely from the output of "rabbitmqctl list_queues -p openstack" even though they still exist. The only way to notice their existence and broken state is via the Management Plugin REST API (consumed directly, via the web interface, or via rabbitmqadmin); even there the queues are listed with no statistics or mirrors, essentially only the name (see the sketch after this list for one way to spot them).

(b) Clients fail to use or declare the queue. This action times out after 30 seconds and logs the following error on the server side:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'q-bgp-plugin' in vhost 'openstack' due to timeout

(c) “Old incarnation” errors like the following are persistently logged to the RabbitMQ node log:
=ERROR REPORT==== 17-Sep-2021::06:27:34 ===
Discarding message {'$gen_call',{<0.12580.0>,#Ref<0.2898216055.2000945153.142860>},stat} from <0.12580.0> to <0.2157.0> in an old incarnation (2) of this node (3)

(d) The queue has no active master due to the default ha-promote-on-shutdown=when-synced policy. A warning about this is sometimes logged to the RabbitMQ node log:

=WARNING REPORT==== 16-Sep-2016::10:32:57 ===
Mirrored queue 'test_71' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

(e) In theory, when the original master comes back, the queue should come back to life; however, when the queue hangs as part of items (a) and (b), the original master gets stuck and can never recover, particularly on Bionic's 3.6.10. This seems less common on Focal's 3.8.2 but still happens.

(f) You cannot delete the queue in order to recreate it. Known bug fixed in 3.6.16. https://github.com/rabbitmq/rabbitmq-server/issues/1501

(g) In some cases the queue is alive but fails to synchronise one or more of the slaves, leaving it running with reduced redundancy on only one or two nodes. This happens consistently on Focal's 3.8 as well as Bionic's 3.6.
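
As a quick way to spot the queues described in (a) and (g), the Management API can be queried directly. This is only a sketch: the credentials are placeholders, the port and vhost are the usual defaults, and the field names (state, synchronised_slave_nodes) should be checked against the deployed RabbitMQ version:

# List queues in the 'openstack' vhost that are not running or have fewer
# than 2 synchronised mirrors (requires the management plugin and jq)
curl -s -u monitoring:password http://localhost:15672/api/queues/openstack |
  jq -r '.[] | select((.state // "down") != "running"
                      or ((.synchronised_slave_nodes // []) | length) < 2) | .name'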

[Recovery]
When this happens, a rolling restart of the cluster (systemctl restart rabbitmq-server) does not repair the situation, whether you restart the nodes one at a time or all 3 at the same time (the action most people take). If anything it makes things worse, as a rolling restart is generally what triggers the problem in the first place.

The only way to recover is to stop all 3 nodes, wait until they are all stopped, and then start all 3 again. This reliably recovers all of the queues.
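
For clarity, the recovery procedure is simply a full stop followed by a full start, e.g.:

# On every unit (rabbitmq-server/0, /1 and /2): stop the broker
sudo systemctl stop rabbitmq-server

# Confirm all three nodes are actually stopped before continuing
sudo systemctl is-active rabbitmq-server   # should report "inactive" on every unit

# Only then, start all three again (on every unit)
sudo systemctl start rabbitmq-server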

[Possible Solutions]

I found the following solutions/related bugs. I spent a significant amount of time reproducing and researching the situation and it seems this is caused by an aggregate of a large number of different bugs and possible configuration/charm changes. I will use this as a tracking bug for implementing related fixes in additional bugs focussed on each item.

(1) Move to Quorum Queues long term

In general, the RabbitMQ project has documented multiple times that classic HA queues have a number of these problems which may never be fully solved, that classic HA queues are being deprecated, and that we should move to "Quorum Queues", which use a proper consensus algorithm. While there are some fixes to make the classic queues work better, we should look to add Quorum Queue support, particularly for newer releases.

See for example this discussion about the 'old incarnation' messages, which essentially states that they won't fix it and that you should move to quorum queues instead:
https://github.com/rabbitmq/rabbitmq-server/discussions/2950
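
For illustration only: a quorum queue is a different queue type selected at declaration time with the x-queue-type argument, rather than an HA policy applied afterwards. The queue name below is made up, and in practice the declarations are made by the OpenStack services (oslo.messaging), which is why this needs spec and charm work rather than a one-off command:

# Declare a quorum queue on RabbitMQ 3.8+ (Focal); not available on Bionic's 3.6.10
rabbitmqadmin -u user -p password -V openstack declare queue \
    name=test-quorum durable=true arguments='{"x-queue-type":"quorum"}'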

Action Required: Propose a spec for Quorum Queue support and prioritise implementation. This will need to include a review of the quality of the feature in Focal's 3.8.2; since it was new in that release, there appear to be many bug fixes for it in later point releases.

(2) Set policy ha-promote-on-shutdown=always

By default, a node is only promoted to master if it is synchronised with the old master. In cases where that is not the case, such as a rolling restart of multiple nodes, it's possible none of the nodes are synchronised. In this case a manual trigger is required to synchronise it. This default favours consistency (not losing messages) over availability.

Using this option is recommended in numerous resources including puppet-tripleo, the OpenStack wiki, and upstream bugs and documentation. A full list of those resources is included in the bug linked below.

Having tested this change to the charm, it is possible but much more difficult to reproduce the situation with this fix applied. In one set of tests I could reproduce the issue with the default settings 3 out of 3 times. With ha-promote-on-shutdown=always applied it worked 2 out of 3 times, and the third time all queues were still available but 285/500 of them did not have all 3 mirrors running; restarting only one of the nodes got those queues to re-synchronise.

Thus I think we should go ahead and add ha-promote-on-shutdown=always as a configuration option, enabled by default, as it will eliminate many of these cases.
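
For reference, the policy change itself is a one-liner. The policy name, pattern and the rest of the definition below are illustrative; the charm's actual policy (applied via its set_ha_mode/set_policy helpers) may differ:

# Add ha-promote-on-shutdown=always to the HA policy for the vhost
rabbitmqctl set_policy -p openstack HA '^(?!amq\.).*' \
    '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-shutdown":"always"}'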

Action Required: Implement this policy by default in the charm, work tracked in https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1943929

(3) RabbitMQ 3.8.2 in Focal generally handles this situation better than RabbitMQ 3.6.10 in Bionic.

It is still possible to reproduce, but much harder, and it usually affects fewer queues.

It seems to have a number of related bug fixes; however, I have had a lot of trouble nailing down exactly WHICH fixes are responsible despite a few hours of research. We also need to evaluate the upstream stable 3.6.16 release (and 3.8.x point releases) to see if they contain any relevant fixes, and either backport those exact fixes or get a micro release exception for RabbitMQ 3.6.16. I am concerned that we don't have the Erlang expertise to properly backport the various fixes - some are simple but some had substantial code changes.

Note that RabbitMQ 3.6.16 upstream technically requires a newer Erlang version than that shipped in Bionic, although 3.6.15 still supports the Bionic Erlang. It seems this was done under a newer policy to only support and test the last 2 years of Erlang releases, and I cannot see any indication that they actively believe an Erlang incompatibility exists.

Additionally, 3.6.16 notes 2 backwards-incompatible changes that seem minor/uncommon in practice but are a regression risk. Unfortunately, 3.6.16 also seems to be the release that contains a number of the related fixes.

Action Required: Test 3.6.16 to see if it works better in these scenarios, consider any relevant bugs for backport or get a micro release exception for RabbitMQ. If not possible, we may need to consider shipping a newer Erlang+RabbitMQ in the cloud-archive.

(4) Cannot delete queues without promotable master

Known issue fixed in RabbitMQ 3.6.16

https://github.com/rabbitmq/rabbitmq-server/issues/1501
https://github.com/rabbitmq/rabbitmq-server/commit/6dbfcf5069d78591215d3e20883e4397e8a299e0

Action Required: Open a bug to backport this change if possible, unless a 3.6.16 micro-release exception is granted.

(5) Upstream recommends to avoid rapid queue and mirror policy changes at the same time

From the comments in https://github.com/rabbitmq/rabbitmq-server/issues/889

Currently the charm makes policy changes each time config-changed runs, which happens at the same time all the nodes are restarted and slaves are therefore being added and removed. In particular, it is often still making these changes after its own restart, once the 30 seconds have passed and the 2nd and 3rd nodes are doing their restarts.

This code should be improved to check whether the policy actually changed and only apply it if that is true (set_ha_mode, set_policy); see the sketch below.
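
A minimal sketch of that idea at the rabbitmqctl level (the real fix belongs in the charm's Python helpers; the policy name "HA" and definition here are illustrative, and a robust version would compare via the management API rather than grep):

# Only (re)apply the HA policy when the installed definition is missing or differs
desired='{"ha-mode":"all","ha-sync-mode":"automatic"}'
if ! rabbitmqctl -q list_policies -p openstack | grep -F -q "$desired"; then
    rabbitmqctl set_policy -p openstack HA '^(?!amq\.).*' "$desired"
fi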

We could also consider having the restarts on secondary nodes take note of whether the cluster was stable when the hook started; if it is no longer stable once the delay has passed, wait a bit longer (maybe 2-3 minutes) for the cluster to stabilise before doing their own local restart, as the 30 seconds often seems not to be enough. Alternatively, implement some other kind of restart synchronisation mechanism or increase the default 30 second known-wait.

Action Required: Existing bug https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1909031 can track this work

(6) Queue crashes on startup

Another user documented a queue crash on startup and proposed a fix. This fix was not accepted, mostly because they did not propose a revised fix. However, this issue seems to be solved, possibly in Focal's 3.8.2 (and broken in Bionic's 3.6.10), but I could not locate the bug or commit that affected the same code area. More research is required.

https://github.com/rabbitmq/rabbitmq-server/issues/2009

Action Required: Determine which commit fixes this and consider backport

(7) There are some possible Erlang-related bugs

https://github.com/rabbitmq/rabbitmq-server/discussions/2950 suggests that OTP 23 may prevent some of the old incarnation related failures, as OTP 22 and earlier would only use 2-bit creation values but OTP 23 now uses 32-bit creation values.

16.04 has Erlang OTP 18 (Xenial)
18.04 has Erlang OTP 20 (Bionic)
20.04 has Erlang OTP 22 (Focal)
21.04 has Erlang OTP 23 (Hirsute)

There are also 2 other known bugs in Erlang 21 that we may be able to fix:
<A TCP related bug, link to be added later>
<Another bug, will find link later>
<Consider reviewing all bugs in the OTP stable point releases for relevant bugs>

Action Required: Research related Erlang bugs further and consider backporting the fixes.

(8) There is no nagios check for this failure

Because the problems here are due to some of the queues being "stuck", the existing checks for cluster partitions do not detect the failure. Additionally, the checks that create and send messages on a queue are not sufficient: a random subset of the queues is affected by this issue in my testing, so the nagios test queue may or may not work.

It is possible to detect this situation reliably using the Management API; unfortunately that is not enabled by default. The existing RabbitMQ partitions nagios check also depends on management_plugin=true, so by default we have no nagios reporting of partitions either.

I have drafted a new nagios check that uses the Management API to check that all of the queues are (a) alive and not stuck, (b) actually synchronised to all 3 nodes, and (c) have an active master. This reliably detects the issue in all cases I was able to reproduce, so at worst manual intervention can rescue the cluster. A rough sketch of the approach is below.
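
A sketch of such a check against the Management API (not the exact draft mentioned above; it assumes management_plugin=true, placeholder credentials, a 3-node cluster, and field names as exposed by the 3.6/3.8 management API):

#!/bin/bash
# check_rabbitmq_queues (sketch): CRITICAL if any queue in the vhost is not
# running, has no master node, or has fewer than 2 synchronised mirrors.
VHOST=openstack
URL="http://localhost:15672/api/queues/${VHOST}"
bad=$(curl -s -u nagios:password "$URL" | jq -r '
    .[] | select(
        (.state // "down") != "running"
        or ((.node // "") == "")
        or ((.synchronised_slave_nodes // []) | length) < 2
    ) | .name')
if [ -n "$bad" ]; then
    echo "CRITICAL: unhealthy queues: $(echo "$bad" | tr '\n' ' ')"
    exit 2
fi
echo "OK: all queues running, mastered and synchronised"
exit 0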

Action Required: Enable management_plugin by default, implement new nagios check.

Tracked in the following 2 bugs
"Nagios does not detect queues which are not running or have no master"
https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1943936

"check_rabbitmq_cluster partition check is not enabled by default (due to management_plugin=false)"
https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1930547

Note: The management plugin appears to cause RabbitMQ to hang on Xenial-Queens specifically. It may not be possible to implement this fix for Xenial-Queens unless that is fixed. More information in the above bugs.

(9) queue_master_locator=min-masters not creating queues evenly when bindings are involved #1519

The charm recently added support for queue_master_locator=min-masters, but it was found to cause some problems in practice. This bug possibly explains why: in some cases it would locate all of the masters on the same node, which is exactly what the change was trying to avoid.

Fixed in 3.7.5 upstream. Not backported to 3.6.x upstream.

https://github.com/rabbitmq/rabbitmq-server/issues/1519
https://github.com/rabbitmq/rabbitmq-server/pull/1541

Action Required: Backport the fix

(10) Revise the cluster-partition-handling default

There has been debate and multiple changes to the default cluster-partition-handling strategy, seemingly with the understanding that this setting was causing some of these failures during deployment and at runtime.

For example, the switch to autoheal was done in this bug, which describes the exact same symptoms I describe here but which I have now shown are not related to cluster-partition-handling at all:
https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1802315

This has led to a silly situation where there is a refusal to change the default again, yet the primary user of the charm (new OpenStack deployments) overrides the default anyway.

Now that we have a thorough understanding of the related issues, we may have enough data to justify revising the default again, backed by further testing. That default should most likely be pause_minority.
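
In the meantime, existing deployments can opt in with the existing charm option (assuming the application is deployed under the name rabbitmq-server):

juju config rabbitmq-server cluster-partition-handling=pause_minority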

Action Required: Raise a bug to reconsider the cluster-partition-handling default once the other fixes have been released and a more concrete testing of new and existing deployments including load testing has been devised.

Tags: sts
Trent Lloyd (lathiat)
summary: RabbitMQ queues often hang after charm config changes or rabbitmq
- restarts due to multiple causes
+ restarts due to multiple causes (overview/co-ordination bug)
Trent Lloyd (lathiat)
Changed in charm-rabbitmq-server:
assignee: nobody → Trent Lloyd (lathiat)
Liam Young (gnuoy) wrote :

Thank you for taking the time to do such a thorough analysis of these RabbitMQ issues.

Changed in charm-rabbitmq-server:
status: New → Confirmed
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: Confirmed → In Progress
Trent Lloyd (lathiat)
Changed in charm-rabbitmq-server:
assignee: Trent Lloyd (lathiat) → nobody
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-rabbitmq-server (master)

Change abandoned by "Billy Olsen <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/813146
Reason: This patch set has been pending an update for a period of time without a response from the author.
