Comment 13 for bug 1463433

Bogdan Dobrelya (bogdando) wrote : Re: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked because virt memory got exhausted by publishers

Deeper investigation has shown this is a classic deadlock: the rabbit node has no free
resources left and the cluster blocks *all* publishing, by design. The
app thinks "let's wait until the publish block is lifted" and cannot
recover, hence it keeps running, continuously reporting "timed out waiting
for reply" errors and keeping the blocked connections open forever.

As Roman Podolyaka and I discussed today:
- we *can* apply the w/a at the AMQP cluster monitoring control plane
(OCF), which amounts to "monitor and restart if something looks really
bad"
- we *cannot* apply a w/a or fix at the app layer, as it is not clear what
Oslo should do when the rabbit cluster has blocked all publishing
because some node(s) have exhausted their memory resources. Roman thinks
the app should just wait for the unblock, as it does now, and I almost agree
- we also *cannot* be sure why exactly the rabbit node's memory high
watermark gets exceeded: either due to some app-side issue or due to a
rabbit-side memory leak. The good news is that this is *not* important if
we only apply the w/a to the control plane, which is to additionally
monitor whether 'rabbitmqctl list_queues' hangs and restart the rabbit
node the same way as we do now for the cases when 'rabbitmqctl
list_channels' hangs (see the sketch after this list).
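
A very rough sketch of what that extra control-plane check could look like (Python only for illustration; the real check would live in the OCF shell agent, and the timeout value and exit codes here are placeholders, not the actual RA contract):

#!/usr/bin/env python
# Sketch of the proposed extra monitor step: run 'rabbitmqctl list_queues'
# under a timeout and treat a hang as a failed health check, so that the
# control plane (OCF/Pacemaker) restarts the rabbit node.
import subprocess
import sys

LIST_QUEUES_TIMEOUT = 30  # seconds; placeholder, tune as needed

def list_queues_responds():
    try:
        subprocess.run(
            ["rabbitmqctl", "list_queues", "name"],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            timeout=LIST_QUEUES_TIMEOUT, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False  # list_queues hangs: node is effectively stuck
    except subprocess.CalledProcessError:
        return False  # command failed outright, also unhealthy

if __name__ == "__main__":
    if list_queues_responds():
        sys.exit(0)  # healthy
    # Hanging list_queues: report failure so the node gets restarted,
    # the same way we already handle a hanging list_channels.
    sys.exit(1)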