Bug #1954925 “RabbitMQ redeploy fails with emulator Discarding m...” : Bugs : kolla-ansible

Mark Goddard (mgoddard) on 2021-12-15

Changed in kolla-ansible:
importance:	Undecided → High

Revision history for this message

Mark Goddard (mgoddard) wrote on 2021-12-15:

#1

The broken node crashed, lots of these:

2021-12-14 17:27:13.394 [error] <0.15156.2> CRASH REPORT Process <0.15156.2> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 769

Then when it starts up we see many of these per second, continuing indefinitely:

2021-12-14 17:27:28.205 [error] <0.10727.0> Discarding message {'$gen_call',{<0.10727.0>,#Ref<0.3559579180.2146172930.215375>},stat} from <0.10727.0> to <0.6502.1> in an old incarnation (1639495131) of this node (1639502843)

Revision history for this message

Mark Goddard (mgoddard) wrote on 2021-12-15:

#2

Workaround:

Stop all nodes in the cluster:

kolla-ansible stop -t rabbitmq

or:

docker stop rabbitmq

Start all nodes in the cluster one by one:

kolla-ansible deploy -t rabbitmq

or:

docker start rabbitmq

Revision history for this message

Mark Goddard (mgoddard) wrote on 2021-12-15:

#3

jovial linked me to this bug report, which could be relevant: https://github.com/rabbitmq/rabbitmq-server/issues/2045

Revision history for this message

John Garbutt (johngarbutt) wrote on 2021-12-15:

#4

I also think we are using a bad HA setting, we should think about:
{"ha-mode":"exactly","ha-params":2}

The reference for that is this:
https://www.rabbitmq.com/ha.html#replication-factor

My theory being, that means the transient queues we create for the rpc call response queues are less likely to be an issue, as we will have less rabbitmq load.

Interesting, openstack-ansible does this:
https://github.com/openstack/openstack-ansible-rabbitmq_server/blob/34819a10ace4f800d20c2d36035bbfca3ab9671e/defaults/main.yml#L275
rabbitmq_openstack_policies:
  - name: "HA"
    pattern: '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'
    tags: "ha-mode=all"

And tripple-o does:
https://github.com/openstack/puppet-tripleo/blob/fdca31a2009a0aaf3f3ee9c5e30083ac59bf067f/manifests/profile/pacemaker/rabbitmq_bundle.pp#L344

ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0

I think following openstack-ansible is a good idea here, more on what they are doing here:
https://github.com/openstack/openstack-ansible-rabbitmq_server/commit/52ad552129afc715dc978c61edf881090fcf48c0
Not just because the commit came from one of the creators of rabbitmq :)

The fix was raised in an oslo meeting it turns out.

Revision history for this message

John Garbutt (johngarbutt) wrote on 2021-12-15:

#5

I also wonder if enable_cancel_on_failover oslo_messaging setting is related, to when ha failover happens, stop clients retrying all the time, which might be related to why rabbit is so busy logging these errors, but its a total guess:
https://github.com/openstack/oslo.messaging/commit/196fa877a90d7eb0f82ec9e1c194eef3f98fc0b1

I have seen issues with neutron agents, when rabbit fails as described above, eventually they go inactive. It feels a bit related.

Mark Goddard (mgoddard) on 2021-12-16

Changed in kolla-ansible:
status:	New → Triaged

Revision history for this message

Thierry (golvanig) wrote on 2021-12-17:

#6

Thanks for sharing all this.
You saved my life :)
I am running Kolla-Ansible (3 controllers, 6 Computes) with the latest Victoria, and got this annoying issue.
I just tried the new pattern in "definitions.json.j2" and it works!!
Did also changed the 'rabbitmq_server_additional_erl_args', but the 'pattern' was the final fix.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-12-17: Fix proposed to kolla-ansible (master)

#7

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/822132

Changed in kolla-ansible:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-12-17: Related fix proposed to kolla-ansible (master)

#8

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/822135

Revision history for this message

John Garbutt (johngarbutt) wrote on 2021-12-17:

#9

Just to check @golvanig, which specific changes helped you the most? If you could comment on my related patches, that would be briliant.

Revision history for this message

Thierry (golvanig) wrote on 2021-12-17:

#10

I add less crash by increasing the number of cores in 'rabbitmq_server_additional_erl_args'
But the ends of crashes was the template.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-12-17:

#11

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/822187

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-02-23: Fix merged to kolla-ansible (master)

#12

Download full text (3.7 KiB)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/824994
Committed: https://opendev.org/openstack/kolla-ansible/commit/6bfe1927f0e10eb0f0a92a8d3451757a46ccdd33
Submitter: "Zuul (22348)"
Branch: master

commit 6bfe1927f0e10eb0f0a92a8d3451757a46ccdd33
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000

Remove classic queue mirroring for internal RabbitMQ

    When OpenStack is deployed with Kolla-Ansible, by default there
    are no durable queues or exchanges created by the OpenStack
    services in RabbitMQ. In Rabbit terminology, not being durable
    is referred to as `transient`, and this means that the queue
    is generally held in memory.

    Whether OpenStack services create durable or transient queues is
    traditionally controlled by the Oslo Notification config option:
    `amqp_durable_queues`. In Kolla-Ansible, this remains set to
    the default of `False` in all services. The only `durable`
    objects are the `amq*` exchanges which are internal to RabbitMQ.

    More recently, Oslo Notification has introduced support for
    Quorum queues [7]. These are a successor to durable classic
    queues, however it isn't yet clear if they are a good fit for
    OpenStack in general [8].

    For clustered RabbitMQ deployments, Kolla-Ansible configures all
    queues as `replicated` [1]. Replication occurs over all nodes
    in the cluster. RabbitMQ refers to this as 'mirroring of classic
    queues'.

    In summary, this means that a multi-node Kolla-Ansible deployment
    will end up with a large number of transient, mirrored queues
    and exchanges. However, the RabbitMQ documentation warns against
    this, stating that 'For replicated queues, the only reasonable
    option is to use durable queues: [2]`. This is discussed
    further in the following bug report: [3].

    Whilst we could try enabling the `amqp_durable_queues` option
    for each service (this is suggested in [4]), there are
    a number of complexities with this approach, not limited to:

    1) RabbitMQ is planning to remove classic queue mirroring in
       favor of 'Quorum queues' in a forthcoming release [5].
    2) Durable queues will be written to disk, which may cause
       performance problems at scale. Note that this includes
       Quorum queues which are always durable.
    3) Potential for race conditions and other complexity
       discussed recently on the mailing list under:
       `[ops] [kolla] RabbitMQ High Availability`

    The remaining option, proposed here, is to use classic
    non-mirrored queues everywhere, and rely on services to recover
    if the node hosting a queue or exchange they are using fails.
    There is some discussion of this approach in [6]. The downside
    of potential message loss needs to be weighed against the real
    upsides of increasing the performance of RabbitMQ, and moving
    to a configuration which is officially supported and hopefully
    more stable. In the future, we can then consider promoting
    specific queues to quorum queues, in cases where message loss
    can result in failure states which are hard to recover fro...

Reviewed:  https://review.opendev.org/c/openstack/kolla-ansible/+/824994
Committed: https://opendev.org/openstack/kolla-ansible/commit/6bfe1927f0e10eb0f0a92a8d3451757a46ccdd33
Submitter: "Zuul (22348)"
Branch:    master

commit 6bfe1927f0e10eb0f0a92a8d3451757a46ccdd33
Author: Doug Szumski <doug@stackhpc.com>
Date:   Mon Jan 17 15:15:07 2022 +0000

Remove classic queue mirroring for internal RabbitMQ
    
    When OpenStack is deployed with Kolla-Ansible, by default there
    are no durable queues or exchanges created by the OpenStack
    services in RabbitMQ. In Rabbit terminology, not being durable
    is referred to as `transient`, and this means that the queue
    is generally held in memory.
    
    Whether OpenStack services create durable or transient queues is
    traditionally controlled by the Oslo Notification config option:
    `amqp_durable_queues`. In Kolla-Ansible, this remains set to
    the default of `False` in all services. The only `durable`
    objects are the `amq*` exchanges which are internal to RabbitMQ.
    
    More recently, Oslo Notification has introduced support for
    Quorum queues [7]. These are a successor to durable classic
    queues, however it isn't yet clear if they are a good fit for
    OpenStack in general [8].
    
    For clustered RabbitMQ deployments, Kolla-Ansible configures all
    queues as `replicated` [1]. Replication occurs over all nodes
    in the cluster. RabbitMQ refers to this as 'mirroring of classic
    queues'.
    
    In summary, this means that a multi-node Kolla-Ansible deployment
    will end up with a large number of transient, mirrored queues
    and exchanges. However, the RabbitMQ documentation warns against
    this, stating that 'For replicated queues, the only reasonable
    option is to use durable queues: [2]`. This is discussed
    further in the following bug report: [3].
    
    Whilst we could try enabling the `amqp_durable_queues` option
    for each service (this is suggested in [4]), there are
    a number of complexities with this approach, not limited to:
    
    1) RabbitMQ is planning to remove classic queue mirroring in
       favor of 'Quorum queues' in a forthcoming release [5].
    2) Durable queues will be written to disk, which may cause
       performance problems at scale. Note that this includes
       Quorum queues which are always durable.
    3) Potential for race conditions and other complexity
       discussed recently on the mailing list under:
       `[ops] [kolla] RabbitMQ High Availability`
    
    The remaining option, proposed here, is to use classic
    non-mirrored queues everywhere, and rely on services to recover
    if the node hosting a queue or exchange they are using fails.
    There is some discussion of this approach in [6]. The downside
    of potential message loss needs to be weighed against the real
    upsides of increasing the performance of RabbitMQ, and moving
    to a configuration which is officially supported and hopefully
    more stable. In the future, we can then consider promoting
    specific queues to quorum queues, in cases where message loss
    can result in failure states which are hard to recover from.
    
    [1] https://www.rabbitmq.com/ha.html
    [2] https://www.rabbitmq.com/queues.html
    [3] https://github.com/rabbitmq/rabbitmq-server/issues/2045
    [4] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
    [5] https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/
    [6] https://fuel-ccp.readthedocs.io/en/latest/design/ref_arch_1000_nodes.html#replication
    [7] https://bugs.launchpad.net/oslo.messaging/+bug/1942933
    [8] https://www.rabbitmq.com/quorum-queues.html#use-cases
    
    Partial-Bug: #1954925
    Change-Id: I91d0e23b22319cf3fdb7603f5401d24e3b76a56e

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-03-10: Fix proposed to kolla-ansible (stable/xena)

#13

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/833043

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-03-29: Fix merged to kolla-ansible (stable/xena)

#14

Download full text (4.2 KiB)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/833043
Committed: https://opendev.org/openstack/kolla-ansible/commit/425ead5792661dc4616e30a9a0af5ec506f9bdfb
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 425ead5792661dc4616e30a9a0af5ec506f9bdfb
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000

Allow removal of classic queue mirroring for internal RabbitMQ

    Backport note: This patch has been updated to retain the existing
    behaviour by default. A temporary variable,
    rabbitmq_remove_ha_all_policy, has been added which may be set to true
    in order to remove the ha-all policy. In order to support changing the
    policy without upgrading, the the ha-all policy is removed on deploys,
    in addition to upgrades.

    When OpenStack is deployed with Kolla-Ansible, by default there
    are no durable queues or exchanges created by the OpenStack
    services in RabbitMQ. In Rabbit terminology, not being durable
    is referred to as `transient`, and this means that the queue
    is generally held in memory.

    Whether OpenStack services create durable or transient queues is
    traditionally controlled by the Oslo Notification config option:
    `amqp_durable_queues`. In Kolla-Ansible, this remains set to
    the default of `False` in all services. The only `durable`
    objects are the `amq*` exchanges which are internal to RabbitMQ.

    More recently, Oslo Notification has introduced support for
    Quorum queues [7]. These are a successor to durable classic
    queues, however it isn't yet clear if they are a good fit for
    OpenStack in general [8].

    For clustered RabbitMQ deployments, Kolla-Ansible configures all
    queues as `replicated` [1]. Replication occurs over all nodes
    in the cluster. RabbitMQ refers to this as 'mirroring of classic
    queues'.

    In summary, this means that a multi-node Kolla-Ansible deployment
    will end up with a large number of transient, mirrored queues
    and exchanges. However, the RabbitMQ documentation warns against
    this, stating that 'For replicated queues, the only reasonable
    option is to use durable queues: [2]`. This is discussed
    further in the following bug report: [3].

    Whilst we could try enabling the `amqp_durable_queues` option
    for each service (this is suggested in [4]), there are
    a number of complexities with this approach, not limited to:

    1) RabbitMQ is planning to remove classic queue mirroring in
       favor of 'Quorum queues' in a forthcoming release [5].
    2) Durable queues will be written to disk, which may cause
       performance problems at scale. Note that this includes
       Quorum queues which are always durable.
    3) Potential for race conditions and other complexity
       discussed recently on the mailing list under:
       `[ops] [kolla] RabbitMQ High Availability`

    The remaining option, proposed here, is to use classic
    non-mirrored queues everywhere, and rely on services to recover
    if the node hosting a queue or exchange they are using fails.
    There is some discussion of this approach in [6]. The downside...

Reviewed:  https://review.opendev.org/c/openstack/kolla-ansible/+/833043
Committed: https://opendev.org/openstack/kolla-ansible/commit/425ead5792661dc4616e30a9a0af5ec506f9bdfb
Submitter: "Zuul (22348)"
Branch:    stable/xena

commit 425ead5792661dc4616e30a9a0af5ec506f9bdfb
Author: Doug Szumski <doug@stackhpc.com>
Date:   Mon Jan 17 15:15:07 2022 +0000

Allow removal of classic queue mirroring for internal RabbitMQ
    
    Backport note: This patch has been updated to retain the existing
    behaviour by default. A temporary variable,
    rabbitmq_remove_ha_all_policy, has been added which may be set to true
    in order to remove the ha-all policy. In order to support changing the
    policy without upgrading, the the ha-all policy is removed on deploys,
    in addition to upgrades.
    
    When OpenStack is deployed with Kolla-Ansible, by default there
    are no durable queues or exchanges created by the OpenStack
    services in RabbitMQ. In Rabbit terminology, not being durable
    is referred to as `transient`, and this means that the queue
    is generally held in memory.
    
    Whether OpenStack services create durable or transient queues is
    traditionally controlled by the Oslo Notification config option:
    `amqp_durable_queues`. In Kolla-Ansible, this remains set to
    the default of `False` in all services. The only `durable`
    objects are the `amq*` exchanges which are internal to RabbitMQ.
    
    More recently, Oslo Notification has introduced support for
    Quorum queues [7]. These are a successor to durable classic
    queues, however it isn't yet clear if they are a good fit for
    OpenStack in general [8].
    
    For clustered RabbitMQ deployments, Kolla-Ansible configures all
    queues as `replicated` [1]. Replication occurs over all nodes
    in the cluster. RabbitMQ refers to this as 'mirroring of classic
    queues'.
    
    In summary, this means that a multi-node Kolla-Ansible deployment
    will end up with a large number of transient, mirrored queues
    and exchanges. However, the RabbitMQ documentation warns against
    this, stating that 'For replicated queues, the only reasonable
    option is to use durable queues: [2]`. This is discussed
    further in the following bug report: [3].
    
    Whilst we could try enabling the `amqp_durable_queues` option
    for each service (this is suggested in [4]), there are
    a number of complexities with this approach, not limited to:
    
    1) RabbitMQ is planning to remove classic queue mirroring in
       favor of 'Quorum queues' in a forthcoming release [5].
    2) Durable queues will be written to disk, which may cause
       performance problems at scale. Note that this includes
       Quorum queues which are always durable.
    3) Potential for race conditions and other complexity
       discussed recently on the mailing list under:
       `[ops] [kolla] RabbitMQ High Availability`
    
    The remaining option, proposed here, is to use classic
    non-mirrored queues everywhere, and rely on services to recover
    if the node hosting a queue or exchange they are using fails.
    There is some discussion of this approach in [6]. The downside
    of potential message loss needs to be weighed against the real
    upsides of increasing the performance of RabbitMQ, and moving
    to a configuration which is officially supported and hopefully
    more stable. In the future, we can then consider promoting
    specific queues to quorum queues, in cases where message loss
    can result in failure states which are hard to recover from.
    
    [1] https://www.rabbitmq.com/ha.html
    [2] https://www.rabbitmq.com/queues.html
    [3] https://github.com/rabbitmq/rabbitmq-server/issues/2045
    [4] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
    [5] https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/
    [6] https://fuel-ccp.readthedocs.io/en/latest/design/ref_arch_1000_nodes.html#replication
    [7] https://bugs.launchpad.net/oslo.messaging/+bug/1942933
    [8] https://www.rabbitmq.com/quorum-queues.html#use-cases
    
    Partial-Bug: #1954925
    Change-Id: I91d0e23b22319cf3fdb7603f5401d24e3b76a56e
    (cherry picked from commit 6bfe1927f0e10eb0f0a92a8d3451757a46ccdd33)

tags:

added: in-stable-xena

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-03-29: Fix proposed to kolla-ansible (stable/wallaby)

#15

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/835501

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-03-29: Fix proposed to kolla-ansible (stable/victoria)

#16

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/835502

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-03-30: Fix merged to kolla-ansible (stable/wallaby)

#17

Download full text (4.3 KiB)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/835501
Committed: https://opendev.org/openstack/kolla-ansible/commit/8e1c98d987e73f7ff12f814c8b8f6215f2e55bbc
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 8e1c98d987e73f7ff12f814c8b8f6215f2e55bbc
Author: Doug Szumski <email address hidden>
Date: Mon Jan 17 15:15:07 2022 +0000