RPC occasionally raises "queue not found" when a RabbitMQ cluster is used
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
oslo.messaging | Invalid | Undecided | Alexey Khivin |
Bug Description
OpenStack environment with a three-node RabbitMQ cluster for messaging.
RabbitMQ version: 3.3.5
The nova scheduler loses its RabbitMQ queue while booting a new instance. A queue with that name exists (according to list queues). Restarting nova-scheduler solves the problem.
The same is true for other services (cinder, neutron), so this is not nova-specific.
In the logs
http://
http://
I did a little investigation and can see that the queue was created:
http://
However, it seems that the queue was created on one of the RabbitMQ nodes, while the "Queue not found" exception occurs on another node.
So my theory is:
1) oslo.messaging gets a connection from the pool and creates a temporary queue.
2) oslo.messaging then tries to consume from the newly created queue and gets another connection from the connection pool.
Sometimes these two connections are connected to different servers, and the new queue has not yet been replicated to the server from which ReplyWaiter is trying to consume, so the exception is raised.
After the exception occurs, the queue is created on that server by the replication mechanism.
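The theory above can be sketched as a toy model. This is only an illustration of the hypothesis, not of how RabbitMQ actually replicates queue metadata; ToyCluster, QueueNotFound, the node names and the queue name "reply_q" are all made up for the example.

```python
class QueueNotFound(Exception):
    """Hypothetical stand-in for the broker's "queue not found" error."""


class ToyCluster:
    """Toy model: each node only knows the queues it has seen so far."""

    def __init__(self, node_names):
        self.queues = {name: set() for name in node_names}

    def declare(self, node, queue):
        # The queue becomes visible on the declaring node first.
        self.queues[node].add(queue)

    def replicate(self):
        # Assumed delayed replication: every node learns every queue.
        all_queues = set().union(*self.queues.values())
        for known in self.queues.values():
            known.update(all_queues)

    def consume(self, node, queue):
        if queue not in self.queues[node]:
            raise QueueNotFound(f"{queue} not found on {node}")
        return f"consuming {queue} on {node}"


def run_scenario():
    cluster = ToyCluster(["rabbit1", "rabbit2", "rabbit3"])
    # Step 1: the first pooled connection (to rabbit1) declares the queue.
    cluster.declare("rabbit1", "reply_q")
    # Step 2: a second pooled connection (to rabbit2) tries to consume.
    try:
        cluster.consume("rabbit2", "reply_q")
        first = "ok"
    except QueueNotFound:
        first = "not found"
    # Replication catches up only after the failed attempt.
    cluster.replicate()
    second = cluster.consume("rabbit2", "reply_q")
    return first, second
```

In this model the first consume attempt fails and the second succeeds, which matches the observation that the queue is listed as existing after the error.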
Changed in oslo.messaging: | |
assignee: | nobody → Alex Khivin (akhivin) |
Changed in oslo.messaging: | |
status: | In Progress → Invalid |
I tried to reproduce this error with the embedded script, but it seems to work perfectly with an empty cluster, one RPC server and one RPC client.
It is easy to patch the consumer to ignore the NotFound exception until a timeout, but it would be better to understand the reasons for this behaviour first.
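A minimal sketch of the "ignore NotFound until timeout" workaround might look like this. It is not the oslo.messaging code: QueueNotFound, consume_with_retry and the declare_consumer callable are hypothetical names, and a real patch would need to catch whatever exception the driver raises.

```python
import time


class QueueNotFound(Exception):
    """Hypothetical stand-in for the driver's NotFound exception."""


def consume_with_retry(declare_consumer, timeout=5.0, interval=0.1):
    """Call declare_consumer() until it succeeds or the timeout expires.

    declare_consumer is any callable that raises QueueNotFound while the
    queue is not yet visible and returns normally once it is.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return declare_consumer()
        except QueueNotFound:
            # Give up once another sleep would overrun the deadline.
            if time.monotonic() + interval > deadline:
                raise
            time.sleep(interval)
```

Such a retry would only hide the symptom. If the queue never appears, the original exception is still raised once the timeout expires.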