INTERNAL_ERROR - Cannot declare a queue during RabbitMQ start

Bug #1822778 reported by Gabriele Santomaggio
This bug affects 1 person
Affects: oslo.messaging
Status: Fix Released
Importance: Undecided
Assigned to: Gabriele Santomaggio
Milestone: (none)

Bug Description

Versions:
1. oslo.messaging==9.5.0
2. Tested with different RabbitMQ versions 3.6.16 and 3.7.13/14

When a RabbitMQ cluster node starts, there is a window in which its AMQP socket is ready but the store is not yet available.

This is usually not a problem, but when `queue_master_locator = min-masters` is set, RabbitMQ tries to create the queue on the node hosting the fewest queue masters.

So even if the oslo.messaging client is connected to a running node, RabbitMQ tries to create the queues on the node that is still starting.
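
The placement effect can be illustrated with a toy model. This is only a sketch of the selection idea, not RabbitMQ's implementation; the `pick_min_masters` helper, the node names, and the master counts are hypothetical, and it assumes the restarting node currently hosts the fewest masters.

```python
def pick_min_masters(master_counts):
    """Return the node with the fewest queue masters (ties broken by node name)."""
    return min(sorted(master_counts), key=lambda node: master_counts[node])


# Hypothetical three-node cluster. node1 is restarting and is assumed to
# host no queue masters yet, so a min-masters style policy selects it even
# though the client is connected to node0.
counts = {"rabbit@node0": 700, "rabbit@node1": 0, "rabbit@node2": 650}
target = pick_min_masters(counts)
print(target)
```

If the selected node has an open AMQP socket but is not fully started, the declare lands on a node that cannot serve it yet.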

This "rare" condition causes this error:

```
Calling echo ({'arg1': 'test_n_20', 'arg2': 'test_2_20'}) on server=None exchange=my-exchange topic=my-topic namespace=None fanout=False cast=False
2019-04-02 14:14:30.193 31178 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to declare consumer for topic 'reply_c576784af01a47b989ce34416aa94fe1': Queue.declare: (541) INTERNAL_ERROR - Cannot declare a queue 'queue 'reply_c576784af01a47b989ce34416aa94fe1' in vhost '/'' on node 'rabbit@node1': {vhost_supervisor_not_running,<<"/">>}: amqp.exceptions.InternalError: Queue.declare: (541) INTERNAL_ERROR - Cannot declare a queue 'queue 'reply_c576784af01a47b989ce34416aa94fe1' in vhost '/'' on node 'rabbit@node1': {vhost_supervisor_not_running,<<"/">>}
```

and the message can be lost.

To reproduce the error:
1- Create a RabbitMQ cluster (you can use my ready-made Vagrant configuration [1]). I used RabbitMQ version 3.7.14.
2- Populate the cluster with 1000 to 2000 queues.
3- Use Ken Giusti's example clients:
     git clone <email address hidden>:kgiusti/oslo-messaging-clients.git
     ./rpc-server --url rabbit://test:test@10.0.0.10:5672 --name Server02
     for i in {1..20}; do ./rpc-client --method echo --kwargs "arg1=test_n_$i arg2=test_2_$i" --url rabbit://test:test@10.0.0.10:5672 ; done
4- During the test, restart the second node, the one the client is not connected to.
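
Step 2 can be scripted. The following is a minimal sketch, assuming a channel object that exposes `queue_declare(queue=..., durable=..., auto_delete=...)`, as pika and py-amqp channels do. The `populate_queues` helper, the `load_test_` prefix, and the `RecordingChannel` stand-in are hypothetical names used only for illustration.

```python
def populate_queues(channel, count, prefix="load_test_"):
    """Declare `count` durable queues on `channel` and return their names."""
    names = []
    for i in range(count):
        name = f"{prefix}{i}"
        channel.queue_declare(queue=name, durable=True, auto_delete=False)
        names.append(name)
    return names


class RecordingChannel:
    """Stand-in channel for a dry run; it only records the declared names."""

    def __init__(self):
        self.declared = []

    def queue_declare(self, queue, durable=False, auto_delete=False):
        self.declared.append(queue)


# Dry run without a broker. Against a real cluster, pass an open channel
# from your AMQP client library instead of RecordingChannel().
dry_run = RecordingChannel()
created = populate_queues(dry_run, 1000)
print(len(created), created[0], created[-1])
```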

You will see that some messages get lost, even though one or more RabbitMQ nodes are running.
There is a thread on the RabbitMQ users group [2] about this, and a related RabbitMQ server issue [3].

I am looking at how to make the queue.declare function more tolerant.
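
One way to make the declare more tolerant is to retry it after a delay when the broker answers with internal error 541. The sketch below shows that idea only and is not the oslo.messaging patch. `declare_with_retry`, its parameters, and `flaky_declare` are hypothetical, and the local `InternalError` class stands in for `amqp.exceptions.InternalError` so the example runs without a broker.

```python
import time


class InternalError(Exception):
    """Local stand-in for amqp.exceptions.InternalError (AMQP reply code 541)."""

    code = 541


def declare_with_retry(declare, retries=5, delay=1.0, sleep=time.sleep):
    """Call `declare()`; on InternalError wait `delay` seconds and try again.

    Re-raises the last error once `retries` attempts have been made.
    """
    for attempt in range(1, retries + 1):
        try:
            return declare()
        except InternalError:
            if attempt == retries:
                raise
            sleep(delay)


# Simulated broker: the first two declares fail because the vhost is not
# ready, the third one succeeds.
calls = {"n": 0}


def flaky_declare():
    calls["n"] += 1
    if calls["n"] < 3:
        raise InternalError("Cannot declare a queue: vhost_supervisor_not_running")
    return "declared"


waits = []  # record the delays instead of actually sleeping
result = declare_with_retry(flaky_declare, retries=5, delay=1.0, sleep=waits.append)
print(result, calls["n"], waits)
```

Bounding the number of attempts keeps a client from blocking forever if the virtual host never becomes available.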

Regards
Gabriele Santomaggio
Developer @SUSE

[1] https://github.com/Gsantomaggio/rabbitmq-utils/tree/master/rabbitmq-suse/vagrant_cluster
[2] https://groups.google.com/d/msg/rabbitmq-users/xEWZCmPXI-Q/yDYlPBC5EwAJ
[3] https://github.com/rabbitmq/rabbitmq-server/issues/1869

Changed in oslo.messaging:
assignee: nobody → Gabriele Santomaggio (gsantomaggio)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/649989
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=4d2787227b00b973973554f7387e621d2664c0d8
Submitter: Zuul
Branch: master

commit 4d2787227b00b973973554f7387e621d2664c0d8
Author: Gabriele <email address hidden>
Date: Thu Apr 4 14:56:25 2019 +0200

    Retry to declare a queue after internal error

    Without this commit, the client can lose messages because it
    does not handle the AMQP internal error 541;
    see [2] for details.
    The fix retries the queue declaration after a delay.
    Once the virtual host is ready, the declare no longer fails.
    This is a rare condition; please read the bug [1] for details.

    Closes-Bug: #1822778

    [1] https://bugs.launchpad.net/oslo.messaging/+bug/1822778
    [2] https://www.rabbitmq.com/amqp-0-9-1-reference.html

    Change-Id: I7ab1f9d21ebb807285bf1422bc14cc6e07dcd32a

Changed in oslo.messaging:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/654817

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/654818

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/654819

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/654820

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/rocky)

Reviewed: https://review.opendev.org/654819
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=d75eba02c74d309cd769a8257d2d20f735c60ad1
Submitter: Zuul
Branch: stable/rocky

commit d75eba02c74d309cd769a8257d2d20f735c60ad1
Author: Gabriele <email address hidden>
Date: Thu Apr 4 14:56:25 2019 +0200

    Retry to declare a queue after internal error

    Without this commit, the client can lose messages because it
    does not handle the AMQP internal error 541;
    see [2] for details.
    The fix retries the queue declaration after a delay.
    Once the virtual host is ready, the declare no longer fails.
    This is a rare condition; please read the bug [1] for details.

    Closes-Bug: #1822778

    [1] https://bugs.launchpad.net/oslo.messaging/+bug/1822778
    [2] https://www.rabbitmq.com/amqp-0-9-1-reference.html

    Change-Id: I7ab1f9d21ebb807285bf1422bc14cc6e07dcd32a
    (cherry picked from commit 4d2787227b00b973973554f7387e621d2664c0d8)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/queens)

Reviewed: https://review.opendev.org/654818
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=6ea1bb5fad84079d07d8ad48d2c2a71aa29cdda4
Submitter: Zuul
Branch: stable/queens

commit 6ea1bb5fad84079d07d8ad48d2c2a71aa29cdda4
Author: Gabriele <email address hidden>
Date: Thu Apr 4 14:56:25 2019 +0200

    Retry to declare a queue after internal error

    Without this commit, the client can lose messages because it
    does not handle the AMQP internal error 541;
    see [2] for details.
    The fix retries the queue declaration after a delay.
    Once the virtual host is ready, the declare no longer fails.
    This is a rare condition; please read the bug [1] for details.

    Closes-Bug: #1822778

    [1] https://bugs.launchpad.net/oslo.messaging/+bug/1822778
    [2] https://www.rabbitmq.com/amqp-0-9-1-reference.html

    Change-Id: I7ab1f9d21ebb807285bf1422bc14cc6e07dcd32a
    (cherry picked from commit 4d2787227b00b973973554f7387e621d2664c0d8)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 8.1.3

This issue was fixed in the openstack/oslo.messaging 8.1.3 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 9.7.0

This issue was fixed in the openstack/oslo.messaging 9.7.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/stein)

Reviewed: https://review.opendev.org/654820
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=99b77c07bee7dc482af03d13b2c454593f153f0c
Submitter: Zuul
Branch: stable/stein

commit 99b77c07bee7dc482af03d13b2c454593f153f0c
Author: Gabriele <email address hidden>
Date: Thu Apr 4 14:56:25 2019 +0200

    Retry to declare a queue after internal error

    Without this commit, the client can lose messages because it
    does not handle the AMQP internal error 541;
    see [2] for details.
    The fix retries the queue declaration after a delay.
    Once the virtual host is ready, the declare no longer fails.
    This is a rare condition; please read the bug [1] for details.

    Closes-Bug: #1822778

    [1] https://bugs.launchpad.net/oslo.messaging/+bug/1822778
    [2] https://www.rabbitmq.com/amqp-0-9-1-reference.html

    Change-Id: I7ab1f9d21ebb807285bf1422bc14cc6e07dcd32a
    (cherry picked from commit 4d2787227b00b973973554f7387e621d2664c0d8)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.35.5

This issue was fixed in the openstack/oslo.messaging 5.35.5 release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/pike)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: stable/pike
Review: https://review.opendev.org/654817
Reason: Pike is in extended maintenance now so this won't ever be released. I'm going to close this as it doesn't seem worth the effort to merge now

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 9.5.1

This issue was fixed in the openstack/oslo.messaging 9.5.1 release.
