RPC sporadically raises "queue not found" when a RabbitMQ cluster is used

Bug #1441298 reported by Alexey Khivin
This bug affects 2 people
Affects: oslo.messaging
Status: Invalid
Importance: Undecided
Assigned to: Alexey Khivin

Bug Description

OpenStack environment with a three-node RabbitMQ cluster for messaging.

RabbitMQ version: 3.3.5

Nova scheduler loses a rabbit queue while booting a new instance. A queue with that name exists (according to list_queues). Restarting nova-scheduler solves the problem.

The same is true for other services (cinder, neutron), so this is not nova-specific.

In the logs:
http://paste.openstack.org/show/199695/
http://paste.openstack.org/show/199696/

I did a little investigation and can see that the queue was created:
http://paste.openstack.org/show/197823/

but it seems that the queue was created on one of the RabbitMQ nodes, while the "queue not found" exception occurs on another node.

So my theory is:
1) oslo gets a connection from the pool and creates a temporary queue
2) oslo tries to consume from the newly created queue and gets another connection from the connection pool
Sometimes these two connections are connected to different servers, and the new queue has not yet been replicated to the server from which the ReplyWaiter tries to consume, so the exception is raised. After the exception occurs, the queue is created on that server by the replication mechanism.
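The race above can be sketched with a toy model of two cluster nodes. All names here (FakeNode, replicate, the queue name) are illustrative stand-ins, not oslo.messaging or RabbitMQ internals:

```python
class QueueNotFound(Exception):
    pass

class FakeNode:
    """Toy stand-in for one RabbitMQ cluster node."""
    def __init__(self):
        self.queues = set()

    def declare(self, name):
        self.queues.add(name)

    def consume(self, name):
        if name not in self.queues:
            raise QueueNotFound(name)
        return "consuming from %s" % name

def replicate(src, dst):
    """Toy replication: copy queue metadata between nodes."""
    dst.queues |= src.queues

# Two nodes of the cluster; the connection pool may hand out
# connections pointing at either one.
node_a, node_b = FakeNode(), FakeNode()

# Step 1: one pooled connection declares the reply queue on node A.
node_a.declare("reply_cfeb6683")

# Step 2: another pooled connection consumes on node B *before*
# replication has happened -> "queue not found".
try:
    node_b.consume("reply_cfeb6683")
except QueueNotFound:
    print("NOT_FOUND raised on node B")

# Replication later makes the queue visible on node B as well.
replicate(node_a, node_b)
print(node_b.consume("reply_cfeb6683"))
```

In this model the consumer fails only because it raced ahead of replication; retrying after a short delay would have succeeded.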

Alexey Khivin (akhivin)
Changed in oslo.messaging:
assignee: nobody → Alex Khivin (akhivin)
Revision history for this message
Alexey Khivin (akhivin) wrote :

I tried to reproduce this error with a standalone script, but it seems to work perfectly with an empty cluster, one RPC server, and one RPC client.

It is easy to patch the consumer to ignore the NotFound exception until the timeout, but it would be better to be sure about the reasons for this behaviour.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

@akhivin, can you please propose the change as a review?

Revision history for this message
Mehdi Abaakouk (sileht) wrote :

Perhaps this is normal; I know this can occur:
* when you restart rabbit, the publisher can send a message before the consumer has effectively reconnected to the broker. Before Kilo, the sent message went nowhere and the publisher got a timeout error.
* if you shut down a compute node at the moment someone has asked it for something through RPC.
* and also in the situation you describe.

Since Kilo (https://review.openstack.org/#/c/109373/ ), we log "The exchange to reply to %s doesn't exist yet, retrying..." and retry, raising the error only if the consumer doesn't come back within 60 seconds (i.e. the queue never appears).

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

@akhivin, Can you please confirm if the fix mentioned is in your version/copy of oslo.messaging?

Revision history for this message
Alexey Khivin (akhivin) wrote :

@sileht, @dims,

This great patch is not in my oslo.messaging version, but I should note that this issue has different symptoms: the "not found" exception is raised by the consumer, not by the publisher.

Like this
<179>Jan 29 13:04:45 node-2 nova-conductor Failed to consume message from queue: Basic.consume: (404) NOT_FOUND - no queue 'reply_cfeb6683351949ecae09082f682b15c9

So, after this exception the ReplyWaiter dies.

I have seen this error in different environments, and it seems this particular issue is not related to restarting the rabbit cluster, as I thought before.

I think the ReplyWaiter should retry consuming until the timeout, even if the queue is not found during the first attempts.
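The proposed fix could look roughly like the following. This is a simplified sketch of the retry-until-timeout idea, not the actual oslo.messaging ReplyWaiter code; ConsumeNotFound stands in for the AMQP 404 raised by Basic.consume:

```python
import time

class ConsumeNotFound(Exception):
    """Stand-in for the AMQP 404 raised by Basic.consume."""

def consume_with_retry(start_consuming, timeout=60.0, interval=0.5,
                       sleep=time.sleep):
    """Retry Basic.consume on 404 until the queue appears or the
    timeout expires, instead of dying on the first failure."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return start_consuming()
        except ConsumeNotFound:
            if time.monotonic() >= deadline:
                raise  # queue never appeared; give up as before
            # Wait for cluster replication to catch up, then retry.
            sleep(interval)
```

With this shape, the transient race between queue declaration and cluster replication is absorbed by a few retries, while a genuinely missing queue still surfaces as an error after the timeout.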

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/175441

Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by Alex Khivin (<email address hidden>) on branch: master
Review: https://review.openstack.org/175441
Reason: patch seems useless and issue should be investigated more

Alexey Khivin (akhivin)
Changed in oslo.messaging:
status: In Progress → Invalid