Controller rolling restart triggers RabbitMQ DuplicateMessageError and some services do not recover

Bug #1823305 reported by Michele Baldessari
Affects:     tripleo
Status:      Fix Released
Importance:  High
Assigned to: Damien Ciabrini
Milestone:   train-2

Bug Description

In our controller architecture, rabbitmq message recipients are configured as HA queues mirrored across controllers (for failover), but not durable (not persisted to disk, so they do not survive a broker restart). One logical recipient typically has one master queue and at least one mirror queue.
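
To make the setup concrete, here is a sketch of such a mirrored-but-not-durable policy applied through RabbitMQ's management HTTP API (host, credentials, and the policy name are illustrative assumptions, not the exact TripleO configuration):

    import requests

    # Placeholder endpoint/credentials; TripleO configures this via puppet instead.
    api = "http://controller-0:15672/api/policies/%2F/ha-all"
    policy = {
        "pattern": "^(?!amq\\.).*",      # apply to all non-internal queues
        "apply-to": "queues",
        "definition": {
            "ha-mode": "all",            # mirror each queue on every node
            # Nothing in this policy makes queues durable; the clients
            # declare them transient, so they do not survive a restart.
        },
    }
    requests.put(api, json=policy, auth=("guest", "guest")).raise_for_status()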

When a master queue becomes unavailable, a mirror automatically takes over the master role, as long as it is synchronized (i.e. it has received all the messages that the previous master held).

Publishing a message to a recipient consists of pushing the message to a rabbitmq "exchange", which has "bindings" (i.e. routes) to the master and mirror queues. The publish is acknowledged once all connected queues acknowledge receipt of the message.
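
A minimal sketch of that publish/acknowledge handshake using the pika client with publisher confirms (host and exchange names are made up for the example):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="controller-0"))
    ch = conn.channel()
    ch.confirm_delivery()  # ask the broker for publisher confirms

    ch.exchange_declare(exchange="demo_fanout", exchange_type="fanout")

    # basic_publish() now blocks until the broker confirms the publish,
    # i.e. until the master queue and its connected mirrors have accepted
    # the message; a nack surfaces as an exception.
    ch.basic_publish(exchange="demo_fanout", routing_key="", body=b"ping")
    conn.close()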

If one queue replica disappears (e.g. rabbitmq stops, a controller node reboots) and reconnects _after_ some messages have been queued but not yet consumed on the remaining replicas, the reconnecting replica becomes an "unsynchronized" mirror. As such, it cannot take over the master role automatically if a master failover happens.
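
The synchronization state of each mirror can be inspected per queue; a sketch using the management API (endpoint and credentials are assumptions, and the field names follow RabbitMQ 3.x classic mirrored queues):

    import requests

    # List queues and compare mirror placement with synchronized mirrors.
    queues = requests.get("http://controller-0:15672/api/queues",
                          auth=("guest", "guest")).json()
    for q in queues:
        mirrors = set(q.get("slave_nodes", []))
        synced = set(q.get("synchronised_slave_nodes", []))
        if mirrors - synced:
            # These replicas cannot take over as master under the default policy.
            print(q["name"], "unsynchronized mirrors:", mirrors - synced)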

So during a rolling restart of all rabbitmq servers, it can happen that all master queues disconnect sequentially and reconnect as unsynchronized mirrors. In that case, when the last master disconnects, no mirror can take over the master role, and RabbitMQ deletes all the queues for the logical recipient. The important detail is that RabbitMQ _does not_ delete the "bindings" to those queues [1].

At this point, when a message is published to the original logical recipient, rabbitmq still receives it in the exchange, tries to push it to the nonexistent queues via the leftover "bindings", and never acknowledges the publish, because there is no queue left to deliver to.
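
In principle this stranded state is observable through the management API: the exchange still lists bindings whose destination queues no longer exist. A sketch (vhost, exchange, and credentials are placeholders):

    import requests

    auth = ("guest", "guest")
    base = "http://controller-0:15672/api"

    # Bindings where this exchange is the source, vs. queues that still exist.
    bindings = requests.get(f"{base}/exchanges/%2F/demo_fanout/bindings/source",
                            auth=auth).json()
    live = {q["name"] for q in requests.get(f"{base}/queues", auth=auth).json()}
    for b in bindings:
        if b["destination_type"] == "queue" and b["destination"] not in live:
            print("stale binding to deleted queue:", b["destination"])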

Going back to our OpenStack context: an OpenStack client/service can send a "notification" to many consumers at once (the pub/sub idiom) via a "fanout" exchange. Each registered consumer has its own HA queue, which means its own master and mirror queues.
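
Schematically, each consumer's private queue bound to the fanout exchange looks like this in pika (the queue is server-named; host and exchange names are assumptions):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="controller-0"))
    ch = conn.channel()
    ch.exchange_declare(exchange="demo_fanout", exchange_type="fanout")

    # Each consumer declares its own queue and binds it to the fanout
    # exchange; under the HA policy it gets its own master and mirrors.
    q = ch.queue_declare(queue="", exclusive=True).method.queue
    ch.queue_bind(exchange="demo_fanout", queue=q)
    ch.basic_consume(queue=q, auto_ack=True,
                     on_message_callback=lambda c, m, p, body: print(body))
    ch.start_consuming()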

As described above, a rolling restart of the controller nodes can delete all queues for a consumer. If that consumer never comes back online, the bindings to its queues linger in the fanout exchange, the publish to that particular consumer is never acknowledged, and consequently the exchange can never acknowledge delivery of the message to the OpenStack client.

The OpenStack client is unaware of that condition, so it retries publishing the same message to the fanout exchange. By then, some consumers have already received and acknowledged the original message, which ultimately results in the DuplicateMessageError that we experience.
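
oslo.messaging tags each message with a unique id, and consumers reject ids they have already processed. A simplified model of that check (not the actual oslo.messaging code; the `_unique_id` field name and the unbounded set are assumptions for illustration):

    import uuid

    class DuplicateMessageError(Exception):
        pass

    _seen = set()  # simplified stand-in for oslo.messaging's bounded id cache

    def on_message(msg):
        # Reject any message whose unique id was already processed.
        msg_id = msg["_unique_id"]
        if msg_id in _seen:
            raise DuplicateMessageError(msg_id)
        _seen.add(msg_id)
        # ... dispatch to the consumer's endpoint ...

    msg = {"_unique_id": uuid.uuid4().hex, "payload": "notification"}
    on_message(msg)    # first delivery is processed
    # on_message(msg)  # a retried publish of the same id would raise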

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
assignee: Michele Baldessari (michele) → Damien Ciabrini (dciabrin)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/649689
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=610c8d8d41cd4b6bfd228ce1012416e424db625d
Submitter: Zuul
Branch: master

commit 610c8d8d41cd4b6bfd228ce1012416e424db625d
Author: Michele Baldessari <email address hidden>
Date: Wed Apr 3 20:15:19 2019 +0200

    RabbitMQ: always allow promotion on HA queue during failover

    When RabbitMQ experiences a rolling restart of its peers, the
    master of an HA queue fails over from one replica to another.

    If messages are sent to the HA queue while some rabbit
    nodes are restarting, the latter will reconnect as unsynchronized
    slaves. It can happen that during a rolling restart all rabbit
    nodes reconnect as unsynchronized, which prevents RabbitMQ from
    automatically electing a new master for failover. This has other
    side effects on fanout queues and may prevent OpenStack
    notifications from being consumed properly.

    Change the HA policy to always allow a promotion even when all
    replicas are unsynchronized. When such a rare condition happens,
    rely on OpenStack clients to retry RPCs if they need to.

    Closes-Bug: #1823305
    Co-Authored-By: Damien Ciabrini <email address hidden>
    Change-Id: Id9bdd36aa0ee81424212e3a89185311817a15aee
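
RabbitMQ exposes this behavior through the "ha-promote-on-failure" policy key (default "when-synced"). A minimal sketch of an equivalent policy applied through the management HTTP API (endpoint, credentials, and policy name are placeholders; TripleO itself applies the change via puppet):

    import requests

    policy = {
        "pattern": "^(?!amq\\.).*",
        "apply-to": "queues",
        "definition": {
            "ha-mode": "all",
            # Promote an unsynchronized mirror instead of deleting the
            # queue when no synchronized replica is left; messages the
            # mirror never received are lost and must be retried.
            "ha-promote-on-failure": "always",
        },
    }
    requests.put("http://controller-0:15672/api/policies/%2F/ha-all",
                 json=policy, auth=("guest", "guest")).raise_for_status()

The trade-off the commit accepts is that promoting an unsynchronized mirror drops the messages it never received, relying on the OpenStack clients' retry logic to resend them.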

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/665473

tags: added: stein-backport-potential
tags: added: queens-backport-potential rocky-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/stein)

Reviewed: https://review.opendev.org/665473
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=af8b1cfbf78e87840b1d6200974069b2a2d38051
Submitter: Zuul
Branch: stable/stein

commit af8b1cfbf78e87840b1d6200974069b2a2d38051
Author: Michele Baldessari <email address hidden>
Date: Wed Apr 3 20:15:19 2019 +0200

    RabbitMQ: always allow promotion on HA queue during failover

    When RabbitMQ experiences a rolling restart of its peers, the
    master of an HA queue fails over from one replica to another.

    If messages are sent to the HA queue while some rabbit
    nodes are restarting, the latter will reconnect as unsynchronized
    slaves. It can happen that during a rolling restart all rabbit
    nodes reconnect as unsynchronized, which prevents RabbitMQ from
    automatically electing a new master for failover. This has other
    side effects on fanout queues and may prevent OpenStack
    notifications from being consumed properly.

    Change the HA policy to always allow a promotion even when all
    replicas are unsynchronized. When such a rare condition happens,
    rely on OpenStack clients to retry RPCs if they need to.

    Closes-Bug: #1823305
    Co-Authored-By: Damien Ciabrini <email address hidden>
    Change-Id: Id9bdd36aa0ee81424212e3a89185311817a15aee
    (cherry picked from commit 610c8d8d41cd4b6bfd228ce1012416e424db625d)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/666153

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.opendev.org/666153
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=76101c8971f8e66e59bf43ce636a09029a080a6d
Submitter: Zuul
Branch: stable/rocky

commit 76101c8971f8e66e59bf43ce636a09029a080a6d
Author: Michele Baldessari <email address hidden>
Date: Wed Apr 3 20:15:19 2019 +0200

    RabbitMQ: always allow promotion on HA queue during failover

    When RabbitMQ experiences a rolling restart of its peers, the
    master of an HA queue fails over from one replica to another.

    If messages are sent to the HA queue while some rabbit
    nodes are restarting, the latter will reconnect as unsynchronized
    slaves. It can happen that during a rolling restart all rabbit
    nodes reconnect as unsynchronized, which prevents RabbitMQ from
    automatically electing a new master for failover. This has other
    side effects on fanout queues and may prevent OpenStack
    notifications from being consumed properly.

    Change the HA policy to always allow a promotion even when all
    replicas are unsynchronized. When such a rare condition happens,
    rely on OpenStack clients to retry RPCs if they need to.

    Closes-Bug: #1823305
    Co-Authored-By: Damien Ciabrini <email address hidden>
    Change-Id: Id9bdd36aa0ee81424212e3a89185311817a15aee
    (cherry picked from commit 610c8d8d41cd4b6bfd228ce1012416e424db625d)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/666401

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.opendev.org/666401
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=a00b779cd30afb03656f6ceea0cc11d2c2e50faa
Submitter: Zuul
Branch: stable/queens

commit a00b779cd30afb03656f6ceea0cc11d2c2e50faa
Author: Michele Baldessari <email address hidden>
Date: Wed Apr 3 20:15:19 2019 +0200

    RabbitMQ: always allow promotion on HA queue during failover

    When RabbitMQ experiences a rolling restart of its peers, the
    master of an HA queue fails over from one replica to another.

    If messages are sent to the HA queue while some rabbit
    nodes are restarting, the latter will reconnect as unsynchronized
    slaves. It can happen that during a rolling restart all rabbit
    nodes reconnect as unsynchronized, which prevents RabbitMQ from
    automatically electing a new master for failover. This has other
    side effects on fanout queues and may prevent OpenStack
    notifications from being consumed properly.

    Change the HA policy to always allow a promotion even when all
    replicas are unsynchronized. When such a rare condition happens,
    rely on OpenStack clients to retry RPCs if they need to.

    Closes-Bug: #1823305
    Co-Authored-By: Damien Ciabrini <email address hidden>
    Change-Id: Id9bdd36aa0ee81424212e3a89185311817a15aee
    (cherry picked from commit 610c8d8d41cd4b6bfd228ce1012416e424db625d)
    (Resolved conflicts manually as cherry pick didn't apply cleanly)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.5.0

This issue was fixed in the openstack/puppet-tripleo 10.5.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.1.0

This issue was fixed in the openstack/puppet-tripleo 11.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.5.1

This issue was fixed in the openstack/puppet-tripleo 9.5.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.5.1

This issue was fixed in the openstack/puppet-tripleo 8.5.1 release.
