Allow removal of classic queue mirroring for internal RabbitMQ
Backport note: This patch has been updated to retain the existing
behaviour by default. A temporary variable,
rabbitmq_remove_ha_all_policy, has been added which may be set to true
in order to remove the ha-all policy. In order to support changing the
policy without upgrading, the ha-all policy is removed on deploys,
in addition to upgrades.
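As a sketch of the equivalent manual operation (the actual change is an Ansible task in this patch; the default vhost `/` and the policy name `ha-all` are the conventional values and require a running broker):

```shell
# List current policies on the default vhost to confirm ha-all is present.
rabbitmqctl list_policies -p /

# Remove the ha-all mirroring policy; existing queues stop being mirrored.
rabbitmqctl clear_policy -p / ha-all
```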
When OpenStack is deployed with Kolla-Ansible, by default there
are no durable queues or exchanges created by the OpenStack
services in RabbitMQ. In Rabbit terminology, not being durable
is referred to as `transient`, and this means that the queue
is generally held in memory.
Whether OpenStack services create durable or transient queues is
traditionally controlled by the Oslo Notification config option:
`amqp_durable_queues`. In Kolla-Ansible, this remains set to
the default of `False` in all services. The only `durable`
objects are the `amq*` exchanges which are internal to RabbitMQ.
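For reference, the option lives in the `[oslo_messaging_rabbit]` section of each service's configuration; a minimal fragment (the service config file name varies per service) looks like:

```ini
[oslo_messaging_rabbit]
# Kolla-Ansible leaves this at the oslo.messaging default:
# transient (non-durable) queues.
amqp_durable_queues = false
```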
More recently, Oslo Notification has introduced support for
Quorum queues [7]. These are a successor to durable classic
queues; however, it isn't yet clear whether they are a good fit
for OpenStack in general [8].
For clustered RabbitMQ deployments, Kolla-Ansible configures all
queues as `replicated` [1]. Replication occurs over all nodes
in the cluster. RabbitMQ refers to this as 'mirroring of classic
queues'.
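As an illustration of how such a policy selects queues (the exact pattern Kolla-Ansible templates may differ; `^(?!amq\.).*` is a common choice that matches everything except RabbitMQ's internal `amq.*` objects):

```python
import re

# Illustrative ha-all policy pattern: match every queue/exchange name
# except the internal amq.* objects. This is an assumed pattern, not
# necessarily the one Kolla-Ansible ships.
HA_ALL_PATTERN = re.compile(r"^(?!amq\.).*")

def queues_matched_by_policy(names):
    """Return the names a policy with this pattern would apply to."""
    return [n for n in names if HA_ALL_PATTERN.match(n)]

names = ["notifications.info", "compute.host-a", "amq.direct"]
print(queues_matched_by_policy(names))  # amq.direct is excluded
```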
In summary, this means that a multi-node Kolla-Ansible deployment
will end up with a large number of transient, mirrored queues
and exchanges. However, the RabbitMQ documentation warns against
this, stating that 'For replicated queues, the only reasonable
option is to use durable queues' [2]. This is discussed
further in the following bug report: [3].
Whilst we could try enabling the `amqp_durable_queues` option
for each service (this is suggested in [4]), there are
a number of complexities with this approach, not limited to:
1) RabbitMQ is planning to remove classic queue mirroring in
favor of 'Quorum queues' in a forthcoming release [5].
2) Durable queues will be written to disk, which may cause
performance problems at scale. Note that this includes
Quorum queues which are always durable.
3) Potential for race conditions and other complexity
discussed recently on the mailing list under:
`[ops] [kolla] RabbitMQ High Availability`
The remaining option, proposed here, is to use classic
non-mirrored queues everywhere, and rely on services to recover
if the node hosting a queue or exchange they are using fails.
There is some discussion of this approach in [6]. The downside
of potential message loss needs to be weighed against the real
upsides of increasing the performance of RabbitMQ, and moving
to a configuration which is officially supported and hopefully
more stable. In the future, we can then consider promoting
specific queues to quorum queues, in cases where message loss
can result in failure states which are hard to recover from.
Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/833043
Committed: https://opendev.org/openstack/kolla-ansible/commit/425ead5792661dc4616e30a9a0af5ec506f9bdfb
Submitter: "Zuul (22348)"
Branch: stable/xena
commit 425ead5792661dc4616e30a9a0af5ec506f9bdfb
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000
[1] https://www.rabbitmq.com/ha.html
[2] https://www.rabbitmq.com/queues.html
[3] https://github.com/rabbitmq/rabbitmq-server/issues/2045
[4] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
[5] https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/
[6] https://fuel-ccp.readthedocs.io/en/latest/design/ref_arch_1000_nodes.html#replication
[7] https://bugs.launchpad.net/oslo.messaging/+bug/1942933
[8] https://www.rabbitmq.com/quorum-queues.html#use-cases
Partial-Bug: #1954925
Change-Id: I91d0e23b22319cf3fdb7603f5401d24e3b76a56e
(cherry picked from commit 6bfe1927f0e10eb0f0a92a8d3451757a46ccdd33)