RabbitMQ redeploy fails with emulator Discarding message
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla-ansible | In Progress | High | Unassigned |
Bug Description
# Steps to reproduce
Not reliably reproducible, but I have seen it multiple times.
* Build a new rabbitmq image
* update rabbitmq_tag or run kolla-ansible pull -t rabbitmq
* kolla-ansible deploy -t rabbitmq
# Expected results
* RabbitMQ cluster restarts using the new image.
* RabbitMQ cluster works correctly.
# Actual results
* RabbitMQ cluster restarts using the new image.
* RabbitMQ cluster is broken, affecting most of OpenStack.
We see the following log messages in /var/log/
emulator Discarding message {'$gen_
<0.3624.13> Channel error on connection <0.3456.13> (1.2.3.4:36454 -> 1.2.3.4:5671, vhost: '/', user: 'openstack'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'q-reports-plugin' in vhost '/' due to timeout
# Environment
Most recently seen on CentOS 8 using Kolla Ansible Train.
Before:
erlang-
rabbitmq-
After:
erlang-
rabbitmq-
Also seen on CentOS stream 8 using Kolla Ansible Victoria.
Before and after (same image, but different tag, causing a restart):
erlang-
rabbitmq-
# Links
https:/
https:/
https:/
Changed in kolla-ansible: | |
importance: | Undecided → High |
Mark Goddard (mgoddard) wrote : | #1 |
Mark Goddard (mgoddard) wrote : | #2 |
Workaround:
Stop all nodes in the cluster:
kolla-ansible stop -t rabbitmq
or:
docker stop rabbitmq
Start all nodes in the cluster one by one:
kolla-ansible deploy -t rabbitmq
or:
docker start rabbitmq
Mark Goddard (mgoddard) wrote : | #3 |
jovial linked me to this bug report, which could be relevant: https:/
John Garbutt (johngarbutt) wrote : | #4 |
I also think we are using a bad HA setting; we should think about:
{"ha-mode"
The reference for that is this:
https:/
My theory is that this makes the transient queues we create for the RPC call response queues less likely to be an issue, as we will have less RabbitMQ load.
Interesting, openstack-ansible does this:
https:/
rabbitmq_
- name: "HA"
pattern: '^(?!(amq\
tags: "ha-mode=all"
And TripleO does:
https:/
ha-all ^(?!amq\.).* queues {"ha-mode"
I think following openstack-ansible is a good idea here; more on what they are doing here:
https:/
Not just because the commit came from one of the creators of rabbitmq :)
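The openstack-ansible pattern quoted above is truncated in this report. A small Python sketch of how such a negative-lookahead policy pattern behaves (the full regex below is my reconstruction based on the snippet, not a quote from the report):

```python
import re

# Hypothetical reconstruction of the openstack-ansible HA policy pattern:
# mirror everything EXCEPT RabbitMQ-internal amq.* objects, fanout queues,
# and transient RPC reply queues.
pattern = re.compile(r'^(?!(amq\.)|(.*_fanout_)|(reply_)).*')

assert pattern.match('cinder-scheduler')              # service queue: mirrored
assert pattern.match('reply_abc123') is None          # RPC reply queue: not mirrored
assert pattern.match('neutron_fanout_abc') is None    # fanout queue: not mirrored
assert pattern.match('amq.gen-abc') is None           # RabbitMQ internal: not mirrored
```

The point of excluding reply and fanout queues is that they are transient and recreated by clients on reconnect, so mirroring them only adds replication load.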
The fix was raised in an oslo meeting it turns out.
John Garbutt (johngarbutt) wrote : | #5 |
I also wonder if enable_
https:/
I have seen issues with neutron agents: when RabbitMQ fails as described above, they eventually go inactive. It feels somewhat related.
Changed in kolla-ansible: | |
status: | New → Triaged |
Thierry (golvanig) wrote : | #6 |
Thanks for sharing all this.
You saved my life :)
I am running Kolla-Ansible (3 controllers, 6 Computes) with the latest Victoria, and got this annoying issue.
I just tried the new pattern in "definitions.
I also changed the 'rabbitmq_
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master) | #7 |
Fix proposed to branch: master
Review: https:/
Changed in kolla-ansible: | |
status: | Triaged → In Progress |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master) | #8 |
Related fix proposed to branch: master
Review: https:/
John Garbutt (johngarbutt) wrote : | #9 |
Just to check @golvanig, which specific changes helped you the most? If you could comment on my related patches, that would be brilliant.
Thierry (golvanig) wrote : | #10 |
I had fewer crashes after increasing the number of cores in 'rabbitmq_
But what finally stopped the crashes was the template change.
OpenStack Infra (hudson-openstack) wrote : | #11 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master) | #12 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 6bfe1927f0e10eb
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000
Remove classic queue mirroring for internal RabbitMQ
When OpenStack is deployed with Kolla-Ansible, by default there
are no durable queues or exchanges created by the OpenStack
services in RabbitMQ. In Rabbit terminology, not being durable
is referred to as `transient`, and this means that the queue
is generally held in memory.
Whether OpenStack services create durable or transient queues is
traditionally controlled by the Oslo Notification config option:
`amqp_
the default of `False` in all services. The only `durable`
objects are the `amq*` exchanges which are internal to RabbitMQ.
More recently, Oslo Notification has introduced support for
Quorum queues [7]. These are a successor to durable classic
queues, however it isn't yet clear if they are a good fit for
OpenStack in general [8].
For clustered RabbitMQ deployments, Kolla-Ansible configures all
queues as `replicated` [1]. Replication occurs over all nodes
in the cluster. RabbitMQ refers to this as 'mirroring of classic
queues'.
In summary, this means that a multi-node Kolla-Ansible deployment
will end up with a large number of transient, mirrored queues
and exchanges. However, the RabbitMQ documentation warns against
this, stating that 'For replicated queues, the only reasonable
option is to use durable queues' [2]. This is discussed
further in the following bug report: [3].
Whilst we could try enabling the `amqp_durable_
for each service (this is suggested in [4]), there are
a number of complexities with this approach, not limited to:
1) RabbitMQ is planning to remove classic queue mirroring in
favor of 'Quorum queues' in a forthcoming release [5].
2) Durable queues will be written to disk, which may cause
performance problems at scale. Note that this includes
Quorum queues which are always durable.
3) Potential for race conditions and other complexity
discussed recently on the mailing list under:
`[ops] [kolla] RabbitMQ High Availability`
The remaining option, proposed here, is to use classic
non-mirrored queues everywhere, and rely on services to recover
if the node hosting a queue or exchange they are using fails.
There is some discussion of this approach in [6]. The downside
of potential message loss needs to be weighed against the real
upsides of increasing the performance of RabbitMQ, and moving
to a configuration which is officially supported and hopefully
more stable. In the future, we can then consider promoting
specific queues to quorum queues, in cases where message loss
can result in failure states which are hard to recover fro...
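For reference, the durable-queues option discussed in the commit is set per service in oslo.messaging configuration. A hedged sketch of what enabling it would look like (the commit deliberately does not take this route; this only illustrates the alternative mentioned in [4]):

```ini
[oslo_messaging_rabbit]
# Make queues and exchanges durable (persisted to disk) instead of
# transient. Shown for illustration; the patch above argues against
# enabling this by default due to performance and race-condition concerns.
amqp_durable_queues = true
```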
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/xena) | #13 |
Fix proposed to branch: stable/xena
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/xena) | #14 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/xena
commit 425ead5792661dc
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000
Allow removal of classic queue mirroring for internal RabbitMQ
Backport note: This patch has been updated to retain the existing
behaviour by default. A temporary variable,
rabbitmq_
in order to remove the ha-all policy. In order to support changing the
policy without upgrading, the ha-all policy is removed on deploys,
in addition to upgrades.
When OpenStack is deployed with Kolla-Ansible, by default there
are no durable queues or exchanges created by the OpenStack
services in RabbitMQ. In Rabbit terminology, not being durable
is referred to as `transient`, and this means that the queue
is generally held in memory.
Whether OpenStack services create durable or transient queues is
traditionally controlled by the Oslo Notification config option:
`amqp_
the default of `False` in all services. The only `durable`
objects are the `amq*` exchanges which are internal to RabbitMQ.
More recently, Oslo Notification has introduced support for
Quorum queues [7]. These are a successor to durable classic
queues, however it isn't yet clear if they are a good fit for
OpenStack in general [8].
For clustered RabbitMQ deployments, Kolla-Ansible configures all
queues as `replicated` [1]. Replication occurs over all nodes
in the cluster. RabbitMQ refers to this as 'mirroring of classic
queues'.
In summary, this means that a multi-node Kolla-Ansible deployment
will end up with a large number of transient, mirrored queues
and exchanges. However, the RabbitMQ documentation warns against
this, stating that 'For replicated queues, the only reasonable
option is to use durable queues' [2]. This is discussed
further in the following bug report: [3].
Whilst we could try enabling the `amqp_durable_
for each service (this is suggested in [4]), there are
a number of complexities with this approach, not limited to:
1) RabbitMQ is planning to remove classic queue mirroring in
favor of 'Quorum queues' in a forthcoming release [5].
2) Durable queues will be written to disk, which may cause
performance problems at scale. Note that this includes
Quorum queues which are always durable.
3) Potential for race conditions and other complexity
discussed recently on the mailing list under:
`[ops] [kolla] RabbitMQ High Availability`
The remaining option, proposed here, is to use classic
non-mirrored queues everywhere, and rely on services to recover
if the node hosting a queue or exchange they are using fails.
There is some discussion of this approach in [6]. The downside...
tags: | added: in-stable-xena |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/wallaby) | #15 |
Fix proposed to branch: stable/wallaby
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/victoria) | #16 |
Fix proposed to branch: stable/victoria
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/wallaby) | #17 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/wallaby
commit 8e1c98d987e73f7
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000
Allow removal of classic queue mirroring for internal RabbitMQ
Backport note: This patch has been updated to retain the existing
behaviour by default. A temporary variable,
rabbitmq_
in order to remove the ha-all policy. In order to support changing the
policy without upgrading, the ha-all policy is removed on deploys,
in addition to upgrades.
When OpenStack is deployed with Kolla-Ansible, by default there
are no durable queues or exchanges created by the OpenStack
services in RabbitMQ. In Rabbit terminology, not being durable
is referred to as `transient`, and this means that the queue
is generally held in memory.
Whether OpenStack services create durable or transient queues is
traditionally controlled by the Oslo Notification config option:
`amqp_
the default of `False` in all services. The only `durable`
objects are the `amq*` exchanges which are internal to RabbitMQ.
More recently, Oslo Notification has introduced support for
Quorum queues [7]. These are a successor to durable classic
queues, however it isn't yet clear if they are a good fit for
OpenStack in general [8].
For clustered RabbitMQ deployments, Kolla-Ansible configures all
queues as `replicated` [1]. Replication occurs over all nodes
in the cluster. RabbitMQ refers to this as 'mirroring of classic
queues'.
In summary, this means that a multi-node Kolla-Ansible deployment
will end up with a large number of transient, mirrored queues
and exchanges. However, the RabbitMQ documentation warns against
this, stating that 'For replicated queues, the only reasonable
option is to use durable queues' [2]. This is discussed
further in the following bug report: [3].
Whilst we could try enabling the `amqp_durable_
for each service (this is suggested in [4]), there are
a number of complexities with this approach, not limited to:
1) RabbitMQ is planning to remove classic queue mirroring in
favor of 'Quorum queues' in a forthcoming release [5].
2) Durable queues will be written to disk, which may cause
performance problems at scale. Note that this includes
Quorum queues which are always durable.
3) Potential for race conditions and other complexity
discussed recently on the mailing list under:
`[ops] [kolla] RabbitMQ High Availability`
The remaining option, proposed here, is to use classic
non-mirrored queues everywhere, and rely on services to recover
if the node hosting a queue or exchange they are using fails.
There is some discussion of this approach in [6]. The downs...
tags: | added: in-stable-wallaby |
tags: | added: in-stable-victoria |
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/victoria) | #18 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/victoria
commit 2764844ee2ff939
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000
Allow removal of classic queue mirroring for internal RabbitMQ
Backport note: This patch has been updated to retain the existing
behaviour by default. A temporary variable,
rabbitmq_
in order to remove the ha-all policy. In order to support changing the
policy without upgrading, the ha-all policy is removed on deploys,
in addition to upgrades.
When OpenStack is deployed with Kolla-Ansible, by default there
are no durable queues or exchanges created by the OpenStack
services in RabbitMQ. In Rabbit terminology, not being durable
is referred to as `transient`, and this means that the queue
is generally held in memory.
Whether OpenStack services create durable or transient queues is
traditionally controlled by the Oslo Notification config option:
`amqp_
the default of `False` in all services. The only `durable`
objects are the `amq*` exchanges which are internal to RabbitMQ.
More recently, Oslo Notification has introduced support for
Quorum queues [7]. These are a successor to durable classic
queues, however it isn't yet clear if they are a good fit for
OpenStack in general [8].
For clustered RabbitMQ deployments, Kolla-Ansible configures all
queues as `replicated` [1]. Replication occurs over all nodes
in the cluster. RabbitMQ refers to this as 'mirroring of classic
queues'.
In summary, this means that a multi-node Kolla-Ansible deployment
will end up with a large number of transient, mirrored queues
and exchanges. However, the RabbitMQ documentation warns against
this, stating that 'For replicated queues, the only reasonable
option is to use durable queues' [2]. This is discussed
further in the following bug report: [3].
Whilst we could try enabling the `amqp_durable_
for each service (this is suggested in [4]), there are
a number of complexities with this approach, not limited to:
1) RabbitMQ is planning to remove classic queue mirroring in
favor of 'Quorum queues' in a forthcoming release [5].
2) Durable queues will be written to disk, which may cause
performance problems at scale. Note that this includes
Quorum queues which are always durable.
3) Potential for race conditions and other complexity
discussed recently on the mailing list under:
`[ops] [kolla] RabbitMQ High Availability`
The remaining option, proposed here, is to use classic
non-mirrored queues everywhere, and rely on services to recover
if the node hosting a queue or exchange they are using fails.
There is some discussion of this approach in [6]. The down...
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master) | #19 |
Change abandoned by "Mark Goddard <email address hidden>" on branch: master
Review: https:/
Reason: Superseded by https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master) | #20 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master) | #21 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 94f3ce0c78998e2
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 16:20:32 2021 +0000
RabbitMQ: Support setting ha-promote-
By default ha-promote-
issues with RabbitMQ automatically recovering when nodes are restarted.
https:/
Rather than waiting for operator intervention, it is better to allow
recovery to happen, even if that means we may lose some messages.
A few failed and timed-out operations are better than a totally broken
cloud. This is achieved using ha-promote-
Note, when a node failure is detected, this is already the default
behaviour from 3.7.5 onwards:
https:/
This patch adds the option to change the ha-promote-
definition, using the flag `rabbitmq_
value is unset by default to avoid any unexpected changes to the
RabbitMQ definitions.json file, as that would trigger an unexpected
restart of RabbitMQ during the next deploy.
Related-Bug: #1954925
Change-Id: I2146bda2c72dda
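As a hedged illustration of the flag discussed in this commit (the policy name, pattern, and values below are assumptions, not copied from the patch), a RabbitMQ definitions.json policy carrying it might look like:

```json
{
  "policies": [
    {
      "name": "ha-all",
      "vhost": "/",
      "pattern": "^(?!amq\\.).*",
      "apply-to": "queues",
      "definition": {
        "ha-mode": "all",
        "ha-promote-on-shutdown": "always"
      }
    }
  ]
}
```

`ha-promote-on-shutdown: always` lets an unsynchronised mirror be promoted when the master is cleanly shut down, trading possible message loss for automatic recovery.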
OpenStack Infra (hudson-openstack) wrote : | #22 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 6cf22b0cb1f2dc4
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 17:34:44 2021 +0000
Improve RabbitMQ performance by reducing ha replicas
Currently we do not follow the RabbitMQ advice on replicas here:
https:/
Here we reduce the number of replicas to n // 2 + 1 as advised
above. The hope is that this helps speed up recovery from RabbitMQ
issues.
Related-Bug: #1954925
Change-Id: Ib6bcb26c499c98
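The `n // 2 + 1` rule from this commit can be expressed directly (a small sketch; the function name is mine, not from the patch):

```python
def replica_count(cluster_size: int) -> int:
    """Majority-based replica count (n // 2 + 1), per the RabbitMQ
    guidance referenced above: mirroring a queue to every node adds
    replication load without improving availability."""
    return cluster_size // 2 + 1

# A typical 3-controller deployment mirrors each queue to 2 nodes
# instead of all 3.
assert replica_count(3) == 2
assert replica_count(5) == 3
```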
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/zed) | #23 |
Related fix proposed to branch: stable/zed
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/yoga) | #24 |
Related fix proposed to branch: stable/yoga
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/xena) | #25 |
Related fix proposed to branch: stable/xena
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/xena) | #26 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/xena
commit a060f45bab2b88f
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 16:20:32 2021 +0000
RabbitMQ: Support setting ha-promote-
By default ha-promote-
issues with RabbitMQ automatically recovering when nodes are restarted.
https:/
Rather than waiting for operator intervention, it is better to allow
recovery to happen, even if that means we may lose some messages.
A few failed and timed-out operations are better than a totally broken
cloud. This is achieved using ha-promote-
Note, when a node failure is detected, this is already the default
behaviour from 3.7.5 onwards:
https:/
This patch adds the option to change the ha-promote-
definition, using the flag `rabbitmq_
value is unset by default to avoid any unexpected changes to the
RabbitMQ definitions.json file, as that would trigger an unexpected
restart of RabbitMQ during the next deploy.
Related-Bug: #1954925
Change-Id: I2146bda2c72dda
(cherry picked from commit 94f3ce0c78998e2
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master) | #27 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit a87810db7e5bccd
Author: Matt Crees <email address hidden>
Date: Tue Feb 7 09:56:43 2023 +0000
Set RabbitMQ ha-promote-
Changes the default value of `rabbitmq-
`"always"`.
We are seeing issues with RabbitMQ automatically recovering when nodes
are restarted. https:/
Rather than waiting for operator intervention, it is better to allow
recovery to happen, even if that means we may lose some messages.
A few failed and timed-out operations are better than a totally broken
cloud. This is achieved using ha-promote-
Note, when a node failure is detected, this is already the default
behaviour from 3.7.5 onwards:
https:/
Related-Bug: #1954925
Change-Id: I484a81163f703f
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/zed) | #28 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/zed
commit 300f584710c840e
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 16:20:32 2021 +0000
RabbitMQ: Support setting ha-promote-
By default ha-promote-
issues with RabbitMQ automatically recovering when nodes are restarted.
https:/
Rather than waiting for operator intervention, it is better to allow
recovery to happen, even if that means we may lose some messages.
A few failed and timed-out operations are better than a totally broken
cloud. This is achieved using ha-promote-
Note, when a node failure is detected, this is already the default
behaviour from 3.7.5 onwards:
https:/
This patch adds the option to change the ha-promote-
definition, using the flag `rabbitmq_
value is unset by default to avoid any unexpected changes to the
RabbitMQ definitions.json file, as that would trigger an unexpected
restart of RabbitMQ during the next deploy.
Related-Bug: #1954925
Change-Id: I2146bda2c72dda
(cherry picked from commit 94f3ce0c78998e2
tags: | added: in-stable-zed |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/yoga) | #29 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/yoga
commit f01896ffdee1c95
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 16:20:32 2021 +0000
RabbitMQ: Support setting ha-promote-
By default ha-promote-
issues with RabbitMQ automatically recovering when nodes are restarted.
https:/
Rather than waiting for operator intervention, it is better to allow
recovery to happen, even if that means we may lose some messages.
A few failed and timed-out operations are better than a totally broken
cloud. This is achieved using ha-promote-
Note, when a node failure is detected, this is already the default
behaviour from 3.7.5 onwards:
https:/
This patch adds the option to change the ha-promote-
definition, using the flag `rabbitmq_
value is unset by default to avoid any unexpected changes to the
RabbitMQ definitions.json file, as that would trigger an unexpected
restart of RabbitMQ during the next deploy.
Related-Bug: #1954925
Change-Id: I2146bda2c72dda
(cherry picked from commit 94f3ce0c78998e2
tags: | added: in-stable-yoga |
The broken node crashed, with lots of these:
2021-12-14 17:27:13.394 [error] <0.15156.2> CRASH REPORT Process <0.15156.2> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 769
Then when it starts up we see many of these per second, continuing indefinitely:
2021-12-14 17:27:28.205 [error] <0.10727.0> Discarding message {'$gen_call',{<0.10727.0>,#Ref<0.3559579180.2146172930.215375>},stat} from <0.10727.0> to <0.6502.1> in an old incarnation (1639495131) of this node (1639502843)
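The two incarnation numbers in the log line above appear to be Unix timestamps of when each incarnation of the node started, which decodes to a restart window consistent with the crash timestamps (a quick sketch):

```python
from datetime import datetime, timezone

# Incarnation values from the "Discarding message" log line; they appear
# to be Unix timestamps of each node incarnation's start time.
old_incarnation, new_incarnation = 1639495131, 1639502843

old = datetime.fromtimestamp(old_incarnation, tz=timezone.utc)
new = datetime.fromtimestamp(new_incarnation, tz=timezone.utc)

print(old)  # 2021-12-14 15:18:51+00:00 -- before the 17:27 crash
print(new)  # 2021-12-14 17:27:23+00:00 -- matches the restart in the log
```

Messages addressed to the pre-restart incarnation are discarded by the restarted node, which is why the error repeats indefinitely until senders reconnect.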