Comment 8 for bug 1463802

Roman Podoliaka (rpodolyaka) wrote : Re: RPC clients do not recreate a reply queue after restart of the last RabbitMQ server in the cluster

@Alexey

As I tried to explain in the bug description, the point here is that, for some reason, the reply queue *has not been* recreated after the disconnect, which caused *all* subsequent RPC calls to fail (i.e. the queue wasn't recreated on the next call either).

>>> This message about the lost queue appears every time you reboot the RabbitMQ cluster during an rpc call.

Well, what we see in our local environments and in the oslo.messaging code is that queues and exchanges are redeclared on reconnect (neither is durable, so they do not survive a RabbitMQ server restart).

>>> If the client has started an rpc-call and is awaiting the answer, and the queue disappears during the rpc-call, the client will wait for the reply until the timeout, and if the server replies in this period then the server recreates the reply queue

I'm aware of that, but this particular queue *had been lost for hours* by the time we saw the "queue not found" error in the nova-conductor logs. So the cause of the TimeoutError is that the reply queue has never been recreated.

>>> reply queue will be recreated by client at the moment of new rpc-call start

Technically, it's recreated on reconnect https://github.com/openstack/oslo.messaging/blob/stable/juno/oslo/messaging/_drivers/impl_rabbit.py#L157-L162

What we see when testing this in a local environment: all the queues are recreated after a RabbitMQ restart, even without RPC calls. I'm not sure why that didn't happen in the bug reporter's environment; we are probably hitting some edge case here.
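To make the "recreated on reconnect" behaviour concrete, here is a minimal sketch. It is a toy in-memory model, not the oslo.messaging or kombu code: `ToyBroker`, `ToyConnection` and the queue name `reply_abc123` are hypothetical names made up for illustration.

```python
class ToyBroker:
    """Hypothetical in-memory stand-in for a RabbitMQ node."""

    def __init__(self):
        self.queues = {}

    def declare_queue(self, name):
        # Declaring is idempotent: an existing queue is left untouched.
        self.queues.setdefault(name, [])

    def restart(self):
        # Non-durable queues do not survive a broker restart.
        self.queues.clear()


class ToyConnection:
    """Hypothetical client that remembers what it has declared."""

    def __init__(self, broker):
        self.broker = broker
        self.declared = []  # names of the queues this client consumes from

    def declare_consumer(self, name):
        self.declared.append(name)
        self.broker.declare_queue(name)

    def reconnect(self):
        # The step that matters: replay every declaration after reconnecting,
        # whether or not a new RPC call is in progress.
        for name in self.declared:
            self.broker.declare_queue(name)


broker = ToyBroker()
conn = ToyConnection(broker)
conn.declare_consumer("reply_abc123")

broker.restart()
lost = "reply_abc123" not in broker.queues   # queue is gone after the restart

conn.reconnect()
restored = "reply_abc123" in broker.queues   # queue is back after reconnect
```

If the replay in `reconnect()` is skipped for any reason, the queue stays missing until something else declares it, which is what the logs in this bug look like.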

>>> So, the real issue in this case is the lost reply message, or the server not replying at all. To avoid continuous service interruption, Nova should handle the Timeout exception raised by oslo.messaging, and we should investigate why the reply message may be lost

I respectfully disagree with you here: as you can see in the logs, this particular reply queue has been missing for hours (the first message in the RabbitMQ logs appeared at "9-Jun-2015::21:47:51" and the nova-conductor error occurred at "2015-06-10 08:22:55.914"). For some reason oslo.messaging never redeclared the queue, so we just missed the reply.
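For reference, the gap between those two timestamps can be checked with a few lines of Python. This assumes both logs use the same timezone, which I have not verified; the variable names are only illustrative.

```python
from datetime import datetime

# Timestamps copied from the RabbitMQ and nova-conductor logs quoted above.
# Assumption: both are in the same timezone.
first_missing = datetime.strptime("9-Jun-2015::21:47:51", "%d-%b-%Y::%H:%M:%S")
conductor_error = datetime.strptime("2015-06-10 08:22:55.914", "%Y-%m-%d %H:%M:%S.%f")

gap = conductor_error - first_missing
print(gap)  # a bit over ten and a half hours
```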

Handling Timeout errors is a totally different question, and it won't help here, as we haven't provided a queue to receive the reply from.
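A rough sketch of why retrying on timeout does not help while the reply queue is missing. Again this is a toy model with hypothetical names (`ToyBroker`, `rpc_call`, `call_with_retries`, `ReplyTimeout`), not the real oslo.messaging API; the server side is collapsed into a single `publish` call.

```python
class ReplyTimeout(Exception):
    """Hypothetical stand-in for the timeout error the caller sees."""


class ToyBroker:
    def __init__(self):
        self.queues = {}

    def publish(self, queue, message):
        if queue not in self.queues:
            return False  # reply is dropped: "queue not found"
        self.queues[queue].append(message)
        return True


def rpc_call(broker, reply_queue):
    # The server processes the request and publishes its reply.
    broker.publish(reply_queue, "result")
    # The client then looks for the reply on its reply queue.
    pending = broker.queues.get(reply_queue, [])
    if not pending:
        raise ReplyTimeout("no reply on %s" % reply_queue)
    return pending.pop(0)


def call_with_retries(broker, reply_queue, attempts):
    timeouts = 0
    for _ in range(attempts):
        try:
            return rpc_call(broker, reply_queue), timeouts
        except ReplyTimeout:
            timeouts += 1  # handled, but the queue is still missing
    return None, timeouts


broker = ToyBroker()
missing = call_with_retries(broker, "reply_abc123", attempts=3)

broker.queues["reply_abc123"] = []  # the queue is finally (re)declared
present = call_with_retries(broker, "reply_abc123", attempts=3)
```

In this model every attempt times out until the queue is declared again, so catching the exception only changes how the failure is reported.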