Quorum queues stuck on rabbit issue

Bug #2028384 reported by Arnaud Morin
This bug affects 2 people
Affects: oslo.messaging
Status: Fix Released
Importance: Undecided
Assigned to: Arnaud Morin
Milestone: none

Bug Description

When using quorum queues, if the queue declaration fails on the RabbitMQ side, the queue can exist but be in a bad state, like this:

$ rabbitmq-queues quorum_status reply_36dcaa363be04d2d953c69f39b5719d3

┌────────────────┬────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@rabbit5 │ noproc │ │ │ │ │ │
├────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@rabbit4 │ noproc │ │ │ │ │ │
├────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@rabbit6 │ noproc │ │ │ │ │ │
└────────────────┴────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘

In such a situation, the only way to fix it is to delete the queue, as stated in the documentation [1]:
"If a quorum of nodes cannot be recovered (say if 2 out of 3 RabbitMQ nodes are permanently lost) the queue is permanently unavailable and will need to be force deleted and recreated."
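
For reference, a minimal sketch of this manual workaround over AMQP, using kombu (the broker URL and queue name below are illustrative, substitute your own):

from kombu import Connection

BROKER = 'amqp://guest:guest@rabbit4:5672//'      # illustrative URL
QUEUE = 'reply_36dcaa363be04d2d953c69f39b5719d3'  # illustrative name

with Connection(BROKER) as conn:
    # queue.delete removes the broken quorum queue; the next
    # queue.declare (for example from the client's retry logic)
    # recreates it from scratch.
    conn.default_channel.queue_delete(queue=QUEUE)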

It would be nice if oslo.messaging were able to recover from such a situation automatically.

[1] https://www.rabbitmq.com/quorum-queues.html#availability

Changed in oslo.messaging:
assignee: nobody → Arnaud Morin (arnaud-morin)
Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/889313
Committed: https://opendev.org/openstack/oslo.messaging/commit/8e3c523fd74257a78ceb384063f81db2e92a2ebd
Submitter: "Zuul (22348)"
Branch: master

commit 8e3c523fd74257a78ceb384063f81db2e92a2ebd
Author: Arnaud Morin <email address hidden>
Date: Fri Jul 21 16:51:51 2023 +0200

    Auto-delete the failed quorum rabbit queues

    When rabbit is failing for a specific quorum queue, the only thing to
    do is to delete the queue (as per the rabbit doc, see [1]).

    So, to avoid the RPC service being broken until an operator eventually
    does a manual fix, catch any INTERNAL ERROR (code 541) and trigger the
    deletion of the failed queues under those conditions. On the next
    queue declare (triggered from various retries), the queue will be
    created again and the service will recover by itself.

    Closes-Bug: #2028384
    Related-bug: #2031497

    [1] https://www.rabbitmq.com/quorum-queues.html#availability

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: Ib8dba833542973091a4e0bf23bb593aca89c5905
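
In essence, the change catches the AMQP INTERNAL_ERROR (reply code 541) raised by queue.declare on a broken quorum queue, deletes the queue, and lets the following declare recreate it. A simplified sketch of that pattern with py-amqp (not the actual oslo.messaging code; connect is a hypothetical helper returning a fresh, opened amqp.Connection):

from amqp.exceptions import InternalError

QUORUM_ARGS = {'x-queue-type': 'quorum'}

def declare_with_recovery(connect, queue_name):
    channel = connect().channel()
    try:
        channel.queue_declare(queue=queue_name, durable=True,
                              auto_delete=False, arguments=QUORUM_ARGS)
    except InternalError:
        # Reply code 541: the queue exists but its Raft members are
        # gone ("noproc"). 541 is a connection-level error, so the
        # broker closed the connection; reconnect, force-delete the
        # queue, then declare it again so it is recreated.
        channel = connect().channel()
        channel.queue_delete(queue=queue_name)
        channel.queue_declare(queue=queue_name, durable=True,
                              auto_delete=False, arguments=QUORUM_ARGS)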

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/900891

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/900891
Committed: https://opendev.org/openstack/oslo.messaging/commit/34260a40358b17980f151dd6f4c4145533fba799
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 34260a40358b17980f151dd6f4c4145533fba799
Author: Arnaud Morin <email address hidden>
Date: Fri Jul 21 16:51:51 2023 +0200

    Auto-delete the failed quorum rabbit queues

    When rabbit is failing for a specific quorum queue, the only thing to
    do is to delete the queue (as per the rabbit doc, see [1]).

    So, to avoid the RPC service being broken until an operator eventually
    does a manual fix, catch any INTERNAL ERROR (code 541) and trigger the
    deletion of the failed queues under those conditions. On the next
    queue declare (triggered from various retries), the queue will be
    created again and the service will recover by itself.

    Closes-Bug: #2028384
    Related-bug: #2031497

    [1] https://www.rabbitmq.com/quorum-queues.html#availability

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: Ib8dba833542973091a4e0bf23bb593aca89c5905
    (cherry picked from commit 8e3c523fd74257a78ceb384063f81db2e92a2ebd)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 14.6.0

This issue was fixed in the openstack/oslo.messaging 14.6.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/905531

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/905531
Committed: https://opendev.org/openstack/oslo.messaging/commit/dcad86be5f7262f0f0323144dd47aea0b2dd1db2
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit dcad86be5f7262f0f0323144dd47aea0b2dd1db2
Author: Arnaud Morin <email address hidden>
Date: Fri Jul 21 16:51:51 2023 +0200

    Auto-delete the failed quorum rabbit queues

    When rabbit is failing for a specific quorum queue, the only thing to
    do is to delete the queue (as per the rabbit doc, see [1]).

    So, to avoid the RPC service being broken until an operator eventually
    does a manual fix, catch any INTERNAL ERROR (code 541) and trigger the
    deletion of the failed queues under those conditions. On the next
    queue declare (triggered from various retries), the queue will be
    created again and the service will recover by itself.

    Closes-Bug: #2028384
    Related-bug: #2031497

    [1] https://www.rabbitmq.com/quorum-queues.html#availability

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: Ib8dba833542973091a4e0bf23bb593aca89c5905
    (cherry picked from commit 8e3c523fd74257a78ceb384063f81db2e92a2ebd)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 14.2.4

This issue was fixed in the openstack/oslo.messaging 14.2.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/907267

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/zed)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/zed
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/907267
Reason: stable/zed branch of openstack/oslo.messaging is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/zed if you want to further work on this patch.
