RabbitMQ redeploy fails with emulator Discarding message

Bug #1954925 reported by Mark Goddard
This bug affects 4 people
Affects: kolla-ansible
Status: In Progress
Importance: High
Assigned to: Unassigned

Bug Description

# Steps to reproduce

Not reliably reproducible, but I have seen it multiple times.

* Build a new rabbitmq image
* Update rabbitmq_tag or run kolla-ansible pull -t rabbitmq
* kolla-ansible deploy -t rabbitmq

# Expected results

* RabbitMQ cluster restarts using the new image.
* RabbitMQ cluster works correctly.

# Actual results

* RabbitMQ cluster restarts using the new image.
* RabbitMQ cluster is broken, affecting most of OpenStack.

We see the following log messages in /var/log/kolla/rabbitmq/rabbitmq-*.log, typically for one node in the RabbitMQ cluster.

emulator Discarding message {'$gen_call',{<0.11156.0>,#Ref<0.1345069768.2975596546.131037>},stat} from <0.11156.0> to <0.7260.1> in an old incarnation (1639495131) of this node (1639507118)

<0.3624.13> Channel error on connection <0.3456.13> (1.2.3.4:36454 -> 1.2.3.4:5671, vhost: '/', user: 'openstack'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'q-reports-plugin' in vhost '/' due to timeout
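
One way to confirm which node is affected (assuming the default Kolla container name 'rabbitmq') is to check the cluster state on each controller:

docker exec rabbitmq rabbitmqctl cluster_status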

# Environment

Most recently seen on CentOS 8 using Kolla Ansible Train.

Before:
erlang-22.3.4.1-1.el8.x86_64
rabbitmq-server-3.7.26-1.el8.noarch

After:
erlang-22.3.4.21-1.el8.x86_64
rabbitmq-server-3.7.28-1.el8.noarch

Also seen on CentOS stream 8 using Kolla Ansible Victoria.

Before and after (same image, but different tag, causing a restart):

erlang-23.3.4.7-1.el8.x86_64
rabbitmq-server-3.8.26-1.el8.noarch

# Links

https://groups.google.com/g/rabbitmq-users/c/Uyo-YjfEuB4
https://groups.google.com/g/rabbitmq-users/c/QNTkoKrg2H0
https://groups.google.com/g/rabbitmq-users/c/OQr3IB_ddEM

Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: Undecided → High
Revision history for this message
Mark Goddard (mgoddard) wrote :

The broken node crashed, with lots of these:

2021-12-14 17:27:13.394 [error] <0.15156.2> CRASH REPORT Process <0.15156.2> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 769

Then when it starts up we see many of these per second, continuing indefinitely:

2021-12-14 17:27:28.205 [error] <0.10727.0> Discarding message {'$gen_call',{<0.10727.0>,#Ref<0.3559579180.2146172930.215375>},stat} from <0.10727.0> to <0.6502.1> in an old incarnation (1639495131) of this node (1639502843)

Revision history for this message
Mark Goddard (mgoddard) wrote :

Workaround:

Stop all nodes in the cluster:

kolla-ansible stop -t rabbitmq

or:

docker stop rabbitmq

Start all nodes in the cluster one by one:

kolla-ansible deploy -t rabbitmq

or:

docker start rabbitmq
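
A sketch of the node-by-node start, assuming the default Kolla container name 'rabbitmq', run on each controller in turn:

docker start rabbitmq
docker exec rabbitmq rabbitmqctl await_startup
docker exec rabbitmq rabbitmqctl cluster_status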

Revision history for this message
Mark Goddard (mgoddard) wrote :

jovial linked me to this bug report, which could be relevant: https://github.com/rabbitmq/rabbitmq-server/issues/2045

Revision history for this message
John Garbutt (johngarbutt) wrote :

I also think we are using a bad HA setting; we should consider:
{"ha-mode":"exactly","ha-params":2}

The reference for that is here:
https://www.rabbitmq.com/ha.html#replication-factor

My theory is that this makes the transient queues we create for the RPC call response queues less likely to be an issue, as RabbitMQ will be under less load.

Interestingly, openstack-ansible does this:
https://github.com/openstack/openstack-ansible-rabbitmq_server/blob/34819a10ace4f800d20c2d36035bbfca3ab9671e/defaults/main.yml#L275
rabbitmq_openstack_policies:
  - name: "HA"
    pattern: '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'
    tags: "ha-mode=all"

And TripleO does:
https://github.com/openstack/puppet-tripleo/blob/fdca31a2009a0aaf3f3ee9c5e30083ac59bf067f/manifests/profile/pacemaker/rabbitmq_bundle.pp#L344

ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0

I think following openstack-ansible is a good idea here; there is more on what they are doing in this commit:
https://github.com/openstack/openstack-ansible-rabbitmq_server/commit/52ad552129afc715dc978c61edf881090fcf48c0
Not just because the commit came from one of the creators of rabbitmq :)

It turns out the fix was raised in an Oslo meeting.
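
For illustration only (Kolla-Ansible normally manages policies via definitions.json), a policy of that shape could be applied by hand on the default vhost with something like this, with the policy name 'HA' borrowed from the openstack-ansible snippet above:

rabbitmqctl set_policy -p / HA '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode":"exactly","ha-params":2}'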

Revision history for this message
John Garbutt (johngarbutt) wrote :

I also wonder if the enable_cancel_on_failover oslo_messaging setting is related: when an HA failover happens it stops clients retrying all the time, which might be part of why RabbitMQ is so busy logging these errors, but it's a total guess:
https://github.com/openstack/oslo.messaging/commit/196fa877a90d7eb0f82ec9e1c194eef3f98fc0b1

I have seen issues with Neutron agents when RabbitMQ fails as described above: eventually they go inactive. It feels a bit related.
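
If anyone wants to experiment with it, enabling that option would be a per-service oslo.messaging config change along these lines (a sketch only, based on the option name in the linked commit):

[oslo_messaging_rabbit]
enable_cancel_on_failover = true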

Mark Goddard (mgoddard)
Changed in kolla-ansible:
status: New → Triaged
Revision history for this message
Thierry (golvanig) wrote :

Thanks for sharing all this.
You saved my life :)
I am running Kolla-Ansible (3 controllers, 6 computes) with the latest Victoria, and hit this annoying issue.
I just tried the new pattern in "definitions.json.j2" and it works!
I also changed 'rabbitmq_server_additional_erl_args', but the 'pattern' was the final fix.
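
For anyone else trying this: the change amounts to replacing the ha-all policy's pattern in definitions.json.j2 with the openstack-ansible regex, i.e. a policy entry roughly like the following (illustrative only, not the exact Kolla template):

{"vhost": "/", "name": "ha-all", "pattern": "^(?!(amq\\.)|(.*_fanout_)|(reply_)).*", "apply-to": "all", "definition": {"ha-mode": "all"}, "priority": 0}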

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Changed in kolla-ansible:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/822135

Revision history for this message
John Garbutt (johngarbutt) wrote :

Just to check @golvanig, which specific changes helped you the most? If you could comment on my related patches, that would be brilliant.

Revision history for this message
Thierry (golvanig) wrote :

I had fewer crashes after increasing the number of cores in 'rabbitmq_server_additional_erl_args', but what finally ended the crashes was the template change.
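
For reference, that variable is set in globals.yml; an illustrative (not prescriptive) value that raises the Erlang scheduler count on a host with spare cores might be:

rabbitmq_server_additional_erl_args: "+S 4:4"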

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/822187

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/824994
Committed: https://opendev.org/openstack/kolla-ansible/commit/6bfe1927f0e10eb0f0a92a8d3451757a46ccdd33
Submitter: "Zuul (22348)"
Branch: master

commit 6bfe1927f0e10eb0f0a92a8d3451757a46ccdd33
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000

    Remove classic queue mirroring for internal RabbitMQ

    When OpenStack is deployed with Kolla-Ansible, by default there
    are no durable queues or exchanges created by the OpenStack
    services in RabbitMQ. In Rabbit terminology, not being durable
    is referred to as `transient`, and this means that the queue
    is generally held in memory.

    Whether OpenStack services create durable or transient queues is
    traditionally controlled by the Oslo Notification config option:
    `amqp_durable_queues`. In Kolla-Ansible, this remains set to
    the default of `False` in all services. The only `durable`
    objects are the `amq*` exchanges which are internal to RabbitMQ.

    More recently, Oslo Notification has introduced support for
    Quorum queues [7]. These are a successor to durable classic
    queues, however it isn't yet clear if they are a good fit for
    OpenStack in general [8].

    For clustered RabbitMQ deployments, Kolla-Ansible configures all
    queues as `replicated` [1]. Replication occurs over all nodes
    in the cluster. RabbitMQ refers to this as 'mirroring of classic
    queues'.

    In summary, this means that a multi-node Kolla-Ansible deployment
    will end up with a large number of transient, mirrored queues
    and exchanges. However, the RabbitMQ documentation warns against
    this, stating that 'For replicated queues, the only reasonable
    option is to use durable queues: [2]`. This is discussed
    further in the following bug report: [3].

    Whilst we could try enabling the `amqp_durable_queues` option
    for each service (this is suggested in [4]), there are
    a number of complexities with this approach, not limited to:

    1) RabbitMQ is planning to remove classic queue mirroring in
       favor of 'Quorum queues' in a forthcoming release [5].
    2) Durable queues will be written to disk, which may cause
       performance problems at scale. Note that this includes
       Quorum queues which are always durable.
    3) Potential for race conditions and other complexity
       discussed recently on the mailing list under:
       `[ops] [kolla] RabbitMQ High Availability`

    The remaining option, proposed here, is to use classic
    non-mirrored queues everywhere, and rely on services to recover
    if the node hosting a queue or exchange they are using fails.
    There is some discussion of this approach in [6]. The downside
    of potential message loss needs to be weighed against the real
    upsides of increasing the performance of RabbitMQ, and moving
    to a configuration which is officially supported and hopefully
    more stable. In the future, we can then consider promoting
    specific queues to quorum queues, in cases where message loss
    can result in failure states which are hard to recover fro...

Read more...
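
To check whether an existing deployment still carries the mirroring policy this change removes, the policies can be listed on any cluster node (default Kolla container name assumed):

docker exec rabbitmq rabbitmqctl list_policies -p /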

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/833043

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/833043
Committed: https://opendev.org/openstack/kolla-ansible/commit/425ead5792661dc4616e30a9a0af5ec506f9bdfb
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 425ead5792661dc4616e30a9a0af5ec506f9bdfb
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000

    Allow removal of classic queue mirroring for internal RabbitMQ

    Backport note: This patch has been updated to retain the existing
    behaviour by default. A temporary variable,
    rabbitmq_remove_ha_all_policy, has been added which may be set to true
    in order to remove the ha-all policy. In order to support changing the
    policy without upgrading, the ha-all policy is removed on deploys,
    in addition to upgrades.

    When OpenStack is deployed with Kolla-Ansible, by default there
    are no durable queues or exchanges created by the OpenStack
    services in RabbitMQ. In Rabbit terminology, not being durable
    is referred to as `transient`, and this means that the queue
    is generally held in memory.

    Whether OpenStack services create durable or transient queues is
    traditionally controlled by the Oslo Notification config option:
    `amqp_durable_queues`. In Kolla-Ansible, this remains set to
    the default of `False` in all services. The only `durable`
    objects are the `amq*` exchanges which are internal to RabbitMQ.

    More recently, Oslo Notification has introduced support for
    Quorum queues [7]. These are a successor to durable classic
    queues, however it isn't yet clear if they are a good fit for
    OpenStack in general [8].

    For clustered RabbitMQ deployments, Kolla-Ansible configures all
    queues as `replicated` [1]. Replication occurs over all nodes
    in the cluster. RabbitMQ refers to this as 'mirroring of classic
    queues'.

    In summary, this means that a multi-node Kolla-Ansible deployment
    will end up with a large number of transient, mirrored queues
    and exchanges. However, the RabbitMQ documentation warns against
    this, stating that 'For replicated queues, the only reasonable
    option is to use durable queues: [2]`. This is discussed
    further in the following bug report: [3].

    Whilst we could try enabling the `amqp_durable_queues` option
    for each service (this is suggested in [4]), there are
    a number of complexities with this approach, not limited to:

    1) RabbitMQ is planning to remove classic queue mirroring in
       favor of 'Quorum queues' in a forthcoming release [5].
    2) Durable queues will be written to disk, which may cause
       performance problems at scale. Note that this includes
       Quorum queues which are always durable.
    3) Potential for race conditions and other complexity
       discussed recently on the mailing list under:
       `[ops] [kolla] RabbitMQ High Availability`

    The remaining option, proposed here, is to use classic
    non-mirrored queues everywhere, and rely on services to recover
    if the node hosting a queue or exchange they are using fails.
    There is some discussion of this approach in [6]. The downside...

Read more...
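
Per the backport note above, opting in on the stable branches is a single setting in globals.yml:

rabbitmq_remove_ha_all_policy: true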

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/835501

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/835502

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/835501
Committed: https://opendev.org/openstack/kolla-ansible/commit/8e1c98d987e73f7ff12f814c8b8f6215f2e55bbc
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 8e1c98d987e73f7ff12f814c8b8f6215f2e55bbc
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000

    Allow removal of classic queue mirroring for internal RabbitMQ

    Backport note: This patch has been updated to retain the existing
    behaviour by default. A temporary variable,
    rabbitmq_remove_ha_all_policy, has been added which may be set to true
    in order to remove the ha-all policy. In order to support changing the
    policy without upgrading, the ha-all policy is removed on deploys,
    in addition to upgrades.

    When OpenStack is deployed with Kolla-Ansible, by default there
    are no durable queues or exchanges created by the OpenStack
    services in RabbitMQ. In Rabbit terminology, not being durable
    is referred to as `transient`, and this means that the queue
    is generally held in memory.

    Whether OpenStack services create durable or transient queues is
    traditionally controlled by the Oslo Notification config option:
    `amqp_durable_queues`. In Kolla-Ansible, this remains set to
    the default of `False` in all services. The only `durable`
    objects are the `amq*` exchanges which are internal to RabbitMQ.

    More recently, Oslo Notification has introduced support for
    Quorum queues [7]. These are a successor to durable classic
    queues, however it isn't yet clear if they are a good fit for
    OpenStack in general [8].

    For clustered RabbitMQ deployments, Kolla-Ansible configures all
    queues as `replicated` [1]. Replication occurs over all nodes
    in the cluster. RabbitMQ refers to this as 'mirroring of classic
    queues'.

    In summary, this means that a multi-node Kolla-Ansible deployment
    will end up with a large number of transient, mirrored queues
    and exchanges. However, the RabbitMQ documentation warns against
    this, stating that 'For replicated queues, the only reasonable
    option is to use durable queues: [2]`. This is discussed
    further in the following bug report: [3].

    Whilst we could try enabling the `amqp_durable_queues` option
    for each service (this is suggested in [4]), there are
    a number of complexities with this approach, not limited to:

    1) RabbitMQ is planning to remove classic queue mirroring in
       favor of 'Quorum queues' in a forthcoming release [5].
    2) Durable queues will be written to disk, which may cause
       performance problems at scale. Note that this includes
       Quorum queues which are always durable.
    3) Potential for race conditions and other complexity
       discussed recently on the mailing list under:
       `[ops] [kolla] RabbitMQ High Availability`

    The remaining option, proposed here, is to use classic
    non-mirrored queues everywhere, and rely on services to recover
    if the node hosting a queue or exchange they are using fails.
    There is some discussion of this approach in [6]. The downs...

Read more...

tags: added: in-stable-wallaby
tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/835502
Committed: https://opendev.org/openstack/kolla-ansible/commit/2764844ee2ff9393a4eebd90a9a912588af0a180
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 2764844ee2ff9393a4eebd90a9a912588af0a180
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000

    Allow removal of classic queue mirroring for internal RabbitMQ

    Backport note: This patch has been updated to retain the existing
    behaviour by default. A temporary variable,
    rabbitmq_remove_ha_all_policy, has been added which may be set to true
    in order to remove the ha-all policy. In order to support changing the
    policy without upgrading, the ha-all policy is removed on deploys,
    in addition to upgrades.

    When OpenStack is deployed with Kolla-Ansible, by default there
    are no durable queues or exchanges created by the OpenStack
    services in RabbitMQ. In Rabbit terminology, not being durable
    is referred to as `transient`, and this means that the queue
    is generally held in memory.

    Whether OpenStack services create durable or transient queues is
    traditionally controlled by the Oslo Notification config option:
    `amqp_durable_queues`. In Kolla-Ansible, this remains set to
    the default of `False` in all services. The only `durable`
    objects are the `amq*` exchanges which are internal to RabbitMQ.

    More recently, Oslo Notification has introduced support for
    Quorum queues [7]. These are a successor to durable classic
    queues, however it isn't yet clear if they are a good fit for
    OpenStack in general [8].

    For clustered RabbitMQ deployments, Kolla-Ansible configures all
    queues as `replicated` [1]. Replication occurs over all nodes
    in the cluster. RabbitMQ refers to this as 'mirroring of classic
    queues'.

    In summary, this means that a multi-node Kolla-Ansible deployment
    will end up with a large number of transient, mirrored queues
    and exchanges. However, the RabbitMQ documentation warns against
    this, stating that 'For replicated queues, the only reasonable
    option is to use durable queues: [2]`. This is discussed
    further in the following bug report: [3].

    Whilst we could try enabling the `amqp_durable_queues` option
    for each service (this is suggested in [4]), there are
    a number of complexities with this approach, not limited to:

    1) RabbitMQ is planning to remove classic queue mirroring in
       favor of 'Quorum queues' in a forthcoming release [5].
    2) Durable queues will be written to disk, which may cause
       performance problems at scale. Note that this includes
       Quorum queues which are always durable.
    3) Potential for race conditions and other complexity
       discussed recently on the mailing list under:
       `[ops] [kolla] RabbitMQ High Availability`

    The remaining option, proposed here, is to use classic
    non-mirrored queues everywhere, and rely on services to recover
    if the node hosting a queue or exchange they are using fails.
    There is some discussion of this approach in [6]. The down...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by "Mark Goddard <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/822132
Reason: Superseded by https://review.opendev.org/c/openstack/kolla-ansible/+/867771

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/872863

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/822135
Committed: https://opendev.org/openstack/kolla-ansible/commit/94f3ce0c78998e29fcc034a9b0844f9d6d602807
Submitter: "Zuul (22348)"
Branch: master

commit 94f3ce0c78998e29fcc034a9b0844f9d6d602807
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 16:20:32 2021 +0000

    RabbitMQ: Support setting ha-promote-on-shutdown

    By default ha-promote-on-shutdown=when-synced. However we are seeing
    issues with RabbitMQ automatically recovering when nodes are restarted.
    https://www.rabbitmq.com/ha.html#cluster-shutdown

    Rather than waiting for operator interventions, it is better we allow
    recovery to happen, even if that means we may lose some messages.
    A few failed and timed out operations are better than a totally broken
    cloud. This is achieved using ha-promote-on-shutdown=always.

    Note, when a node failure is detected, this is already the default
    behaviour from 3.7.5 onwards:
    https://www.rabbitmq.com/ha.html#promoting-unsynchronised-mirrors

    This patch adds the option to change the ha-promote-on-shutdown
    definition, using the flag `rabbitmq_ha_promote_on_shutdown`. This
    value is unset by default to avoid any unexpected changes to the
    RabbitMQ definitions.json file, as that would trigger an unexpected
    restart of RabbitMQ during the next deploy.

    Related-Bug: #1954925

    Change-Id: I2146bda2c72ddac2c9923c6941b0596395fd9ab5
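
With this patch, opting in means setting the new flag in globals.yml, for example:

rabbitmq_ha_promote_on_shutdown: "always"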

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/822187
Committed: https://opendev.org/openstack/kolla-ansible/commit/6cf22b0cb1f2dc4d8910409284fa5757a7dd67a1
Submitter: "Zuul (22348)"
Branch: master

commit 6cf22b0cb1f2dc4d8910409284fa5757a7dd67a1
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 17:34:44 2021 +0000

    Improve RabbitMQ performance by reducing ha replicas

    Currently we do not follow the RabbitMQ advice on replicas here:
    https://www.rabbitmq.com/ha.html#replication-factor

    Here we reduce the number of replicas to n // 2 + 1 as advised
    above. The hope is that this helps speed up recovery from rabbit
    issues.

    Related-Bug: #1954925
    Change-Id: Ib6bcb26c499c9884faa4a0cd51abaec00cacb096
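
As a worked example of n // 2 + 1: a three-node cluster gets 3 // 2 + 1 = 2 replicas, i.e. a policy definition of {"ha-mode": "exactly", "ha-params": 2}, while a five-node cluster would get 3.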

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/878336

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/878338

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/878340

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/878340
Committed: https://opendev.org/openstack/kolla-ansible/commit/a060f45bab2b88fa1fe778ea21da36669e6f26db
Submitter: "Zuul (22348)"
Branch: stable/xena

commit a060f45bab2b88fa1fe778ea21da36669e6f26db
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 16:20:32 2021 +0000

    RabbitMQ: Support setting ha-promote-on-shutdown

    By default ha-promote-on-shutdown=when-synced. However we are seeing
    issues with RabbitMQ automatically recovering when nodes are restarted.
    https://www.rabbitmq.com/ha.html#cluster-shutdown

    Rather than waiting for operator interventions, it is better we allow
    recovery to happen, even if that means we may lose some messages.
    A few failed and timed out operations are better than a totally broken
    cloud. This is achieved using ha-promote-on-shutdown=always.

    Note, when a node failure is detected, this is already the default
    behaviour from 3.7.5 onwards:
    https://www.rabbitmq.com/ha.html#promoting-unsynchronised-mirrors

    This patch adds the option to change the ha-promote-on-shutdown
    definition, using the flag `rabbitmq_ha_promote_on_shutdown`. This
    value is unset by default to avoid any unexpected changes to the
    RabbitMQ definitions.json file, as that would trigger an unexpected
    restart of RabbitMQ during the next deploy.

    Related-Bug: #1954925

    Change-Id: I2146bda2c72ddac2c9923c6941b0596395fd9ab5
    (cherry picked from commit 94f3ce0c78998e29fcc034a9b0844f9d6d602807)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/872863
Committed: https://opendev.org/openstack/kolla-ansible/commit/a87810db7e5bccdc863dd5cb5158ca5193eb5fd3
Submitter: "Zuul (22348)"
Branch: master

commit a87810db7e5bccdc863dd5cb5158ca5193eb5fd3
Author: Matt Crees <email address hidden>
Date: Tue Feb 7 09:56:43 2023 +0000

    Set RabbitMQ ha-promote-on-shutdown=always

    Changes the default value of `rabbitmq-ha-promote-on-shutdown` to
    `"always"`.

    We are seeing issues with RabbitMQ automatically recovering when nodes
    are restarted. https://www.rabbitmq.com/ha.html#cluster-shutdown

    Rather than waiting for operator interventions, it is better we allow
    recovery to happen, even if that means we may lose some messages.
    A few failed and timed out operations are better than a totally broken
    cloud. This is achieved using ha-promote-on-shutdown=always.

    Note, when a node failure is detected, this is already the default
    behaviour from 3.7.5 onwards:
    https://www.rabbitmq.com/ha.html#promoting-unsynchronised-mirrors

    Related-Bug: #1954925
    Change-Id: I484a81163f703fa27112df22473d657e2a9ab964

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/878336
Committed: https://opendev.org/openstack/kolla-ansible/commit/300f584710c840e44ccf219f29f042338857e446
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 300f584710c840e44ccf219f29f042338857e446
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 16:20:32 2021 +0000

    RabbitMQ: Support setting ha-promote-on-shutdown

    By default ha-promote-on-shutdown=when-synced. However we are seeing
    issues with RabbitMQ automatically recovering when nodes are restarted.
    https://www.rabbitmq.com/ha.html#cluster-shutdown

    Rather than waiting for operator interventions, it is better we allow
    recovery to happen, even if that means we may lose some messages.
    A few failed and timed out operations are better than a totally broken
    cloud. This is achieved using ha-promote-on-shutdown=always.

    Note, when a node failure is detected, this is already the default
    behaviour from 3.7.5 onwards:
    https://www.rabbitmq.com/ha.html#promoting-unsynchronised-mirrors

    This patch adds the option to change the ha-promote-on-shutdown
    definition, using the flag `rabbitmq_ha_promote_on_shutdown`. This
    value is unset by default to avoid any unexpected changes to the
    RabbitMQ definitions.json file, as that would trigger an unexpected
    restart of RabbitMQ during the next deploy.

    Related-Bug: #1954925

    Change-Id: I2146bda2c72ddac2c9923c6941b0596395fd9ab5
    (cherry picked from commit 94f3ce0c78998e29fcc034a9b0844f9d6d602807)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/878338
Committed: https://opendev.org/openstack/kolla-ansible/commit/f01896ffdee1c95a6f404ec379cdc38af475f701
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit f01896ffdee1c95a6f404ec379cdc38af475f701
Author: John Garbutt <email address hidden>
Date: Fri Dec 17 16:20:32 2021 +0000

    RabbitMQ: Support setting ha-promote-on-shutdown

    By default ha-promote-on-shutdown=when-synced. However we are seeing
    issues with RabbitMQ automatically recovering when nodes are restarted.
    https://www.rabbitmq.com/ha.html#cluster-shutdown

    Rather than waiting for operator interventions, it is better we allow
    recovery to happen, even if that means we may lose some messages.
    A few failed and timed out operations are better than a totally broken
    cloud. This is achieved using ha-promote-on-shutdown=always.

    Note, when a node failure is detected, this is already the default
    behaviour from 3.7.5 onwards:
    https://www.rabbitmq.com/ha.html#promoting-unsynchronised-mirrors

    This patch adds the option to change the ha-promote-on-shutdown
    definition, using the flag `rabbitmq_ha_promote_on_shutdown`. This
    value is unset by default to avoid any unexpected changes to the
    RabbitMQ definitions.json file, as that would trigger an unexpected
    restart of RabbitMQ during the next deploy.

    Related-Bug: #1954925

    Change-Id: I2146bda2c72ddac2c9923c6941b0596395fd9ab5
    (cherry picked from commit 94f3ce0c78998e29fcc034a9b0844f9d6d602807)

tags: added: in-stable-yoga