Comment 28 for bug 1463433

Dan Hata (dhata) wrote : Re: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node

Requested by Eugene Bogdanov

Clear steps to reproduce and expected result vs actual result
During a Shaker network test, where we measure VM-to-VM network throughput on all hosts, the memory used by the RabbitMQ process grows until RabbitMQ eventually blocks all incoming requests.

Running Shaker on a 50-node CentOS environment has reproduced this problem once.
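
To make the blocking behaviour above concrete, here is a minimal sketch of the memory check involved. RabbitMQ blocks publishing connections once its memory use exceeds the vm_memory_high_watermark fraction of RAM. The function name, the sample sizes, and the 0.4 fraction are assumptions for illustration; they are not values taken from the affected environment.

```python
# Illustrative sketch only: a hypothetical helper, not RabbitMQ code.
# Assumes the watermark is a fraction of total RAM; 0.4 is used here
# as an assumed setting, not one read from the failing nodes.

def publishers_blocked(mem_used_bytes: int,
                       total_ram_bytes: int,
                       watermark: float = 0.4) -> bool:
    """Return True if memory use exceeds the high watermark limit."""
    limit_bytes = int(total_ram_bytes * watermark)
    return mem_used_bytes > limit_bytes


if __name__ == "__main__":
    gib = 2 ** 30
    # Hypothetical node with 8 GiB RAM: the limit is about 3.2 GiB.
    print(publishers_blocked(2 * gib, 8 * gib))
    print(publishers_blocked(5 * gib, 8 * gib))
```

Once the check is true, publishers stay blocked until memory drops back under the limit, which matches the "AMQP publish got blocked" symptom in the bug title.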

Rough estimate of the probability of user facing the issue
We have seen this problem once. After we applied a fix, the root cause did not manifest again, so we have hit it 1 out of 2 times.

What is the real user facing impact / severity and is there a workaround available?

IMPACT: Data plane and control plane requests will fail
WORKAROUND: Restart RabbitMQ

Can we deliver the fix later and apply it easily on a running env?
Yes, the logic to restart RabbitMQ is already in place.

However, we have now seen RabbitMQ die for a different reason.

rabbitmqctl never hung on the nodes, but it reported several nodedown errors, which OCF treats as a resource failure and responds to by restarting the rabbit node.
- the complete list of single rabbit node failures during the test runs is http://pastebin.com/y12DDEx6
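
The decision described above, where a nodedown error from rabbitmqctl counts as a resource failure and triggers a restart, can be sketched roughly as follows. This is a hypothetical simplification for discussion, not the actual OCF resource agent; the function name, the returned action strings, and the sample output line are all assumptions.

```python
# Illustrative sketch only: a hypothetical classifier, not the real
# OCF script. It maps a rabbitmqctl result to the action taken.

def classify_rabbitmqctl_result(exit_code: int, output: str) -> str:
    """Return 'ok', 'restart', or 'error' for a rabbitmqctl call."""
    if exit_code == 0:
        return "ok"
    if "nodedown" in output:
        # Treated as a resource failure, so the node gets restarted.
        return "restart"
    return "error"


if __name__ == "__main__":
    # Sample text is an assumed example of a nodedown message.
    sample = "Error: unable to connect to node rabbit@node-1: nodedown"
    print(classify_rabbitmqctl_result(2, sample))
```

Under this reading, a transient nodedown reply is enough to restart an otherwise running node, which is consistent with the single-node failures listed above.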

Some logs and additional rabbit stats, collected by this script (http://pastebin.com/sX3DPyRG), are attached.

The fix for this second failure is unclear and is under investigation.