Hi Again,

We have tracked down this error, which happened a few times in our cluster, and we found that it only occurs when you're using RabbitMQ clustering (I don't know if it's the same with Qpid), and that it starts when one of the cluster nodes goes down. Basically, when a cluster node goes down, Neutron will try to reconnect to another RabbitMQ node and then re-create everything from scratch, i.e. exchanges, queues, bindings and so on. Because RabbitMQ clustering makes sure that everything is replicated, most of this re-creation ends up being a no-op on the cluster side. Up to here everything is fine, except for exchanges with the auto-delete flag set.

A bit of background first. As some of you may already know, the auto-delete flag on an exchange tells the RabbitMQ server to delete the exchange when no more queues are using it [check footer]. The same goes for queues: when auto-delete is set, it tells the RabbitMQ server to delete the queue when the last consumer disconnects from it.

Now on the Neutron side, we should state that all RPC queues and exchanges are created with auto-delete, and that only RPC queues/exchanges exhibit this problem, which means that from now on we will only be discussing the RPC case. Also, for the purpose of this example, let's say that we have a 3-node RabbitMQ cluster (node1, node2, node3) and that some Neutron agent X is using node1; by that I mean that the connections are made from neutron-X-agent to node1. We should also state that we will be talking about agents only, because they are the ones that create the RPC queues/exchanges.

Now when node1 goes down [step 1], the queue consumers will be removed because the connection from agent X to node1 is broken, which means that the queue will be deleted, and when queues are deleted the exchanges will also be deleted. This all happens **eventually** in all cluster nodes that are still alive [step 2]. **At the same time**, when the Neutron agents detect that node1 is down, each agent will try to reconnect to another node and re-create the same RPC queues and exchanges [step 3].

As you may have guessed, there is a race condition here over the queues and exchanges: on the one hand, in [step 2], the cluster is trying to delete the queues and exchanges, and on the other hand Neutron is trying to create them.

In detail, this is what happens on the Neutron side:

N1. Connect to node2.
N2. Create exchange X.
N3. Create queue Q.
N4. Create binding from Q to X.

NOTE: (N2, N3, N4) are all part of kombu.entity.Queue.declare() (see the sketch at the end of this mail).

And on the cluster side, in each node that is still alive:

R1. Delete the consumer.
R2. Delete queue Q (the binding is deleted along with it).
R3. Delete exchange X.

So there is actually a good chance that R3 has already been executed before N4 is done, which means that N4 will fail with an exchange NOT_FOUND error (check the first traceback)!

As for what happens after that: while the agent can still send RPC calls to the Neutron server, the Neutron server cannot send replies back because there is no exchange to send them to, which means that the replies will be dropped by the RabbitMQ server and the agent will wait until the timeout is reached (check the second traceback).

To generalize: if one agent is using a specific RabbitMQ node, then all agents are using the same one, as long as the **rabbit_hosts** option is in the same order on all nodes, which means that when this problem happens, it impacts all agents at the same time.

As for how severe this is: well, if the agents can't get RPC calls through to the server, the agents are useless, i.e. they will not be able to do anything until they are restarted.
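To make the race concrete, here is a minimal kombu sketch of that declare sequence. The exchange/queue names and the broker URL are made up for the example; the auto_delete flags are the important part:

    from kombu import Connection, Exchange, Queue

    # Both entities carry auto_delete, like Neutron's RPC entities:
    # the queue dies with its last consumer, and the exchange dies
    # with its last queue.
    exchange = Exchange('neutron', type='topic', auto_delete=True)
    queue = Queue('dhcp_agent.host-1', exchange=exchange,
                  routing_key='dhcp_agent.host-1', auto_delete=True)

    # The agent reconnects to a surviving node (node2) after node1
    # went down [step 3].
    with Connection('amqp://guest:guest@node2:5672//') as conn:
        # declare() performs the (N2, N3, N4) sequence:
        #   N2. declare the exchange (a no-op while it still exists)
        #   N3. declare the queue    (re-creates it)
        #   N4. bind the queue       (fails with 404 NOT_FOUND if the
        #       cluster executed R3 between N2 and N4)
        queue(conn.default_channel).declare()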
A patch with our fix is coming soon ...
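In the meantime, and purely as an illustration of one possible workaround (I'm not saying this is what our patch does), the declare can be made resilient by retrying it on a fresh channel when the broker reports a channel error; re-running declare() re-creates the exchange first, so a retry succeeds once the cluster has finished its cleanup:

    import time

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('neutron', type='topic', auto_delete=True)
    queue = Queue('dhcp_agent.host-1', exchange=exchange,
                  routing_key='dhcp_agent.host-1', auto_delete=True)

    def declare_with_retry(conn, queue, retries=3, delay=1.0):
        # Illustrative only: redo the full (N2, N3, N4) sequence on a
        # fresh channel each time, because a 404 NOT_FOUND closes the
        # channel it happened on.
        for attempt in range(retries + 1):
            try:
                queue(conn.channel()).declare()
                return
            except conn.channel_errors:
                if attempt == retries:
                    raise  # still losing the race, give up
                time.sleep(delay)

    with Connection('amqp://guest:guest@node2:5672//') as conn:
        declare_with_retry(conn, queue)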