Clear steps to reproduce and expected result vs actual result
During a Shaker network test, in which we measure VM-to-VM network throughput on all hosts, the memory used by the RabbitMQ process grows until it eventually blocks all incoming requests.
Running Shaker on a 50-node CentOS environment has reproduced this problem once.
Rough estimate of the probability of user facing the issue
We have seen this problem once. We issued a fix, and the root cause did not manifest afterwards, so we have seen it 1 out of 2 times.
What is the real user facing impact / severity and is there a workaround available?
IMPACT: Data plane and control plane requests will fail
WORKAROUND: Restart RabbitMQ
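As a rough illustration of when the workaround might be applied, the sketch below flags a node whose memory use is close to its configured limit. It assumes the blocking comes from RabbitMQ's memory alarm, which this report does not confirm. The function name, the 0.9 margin and the byte values are hypothetical, and how the two numbers are collected from the node is left out.

```python
def needs_restart(mem_used_bytes: int, mem_limit_bytes: int, margin: float = 0.9) -> bool:
    """Return True when memory use has reached `margin` of the limit.

    Hypothetical helper: the inputs are assumed to come from whatever
    stats collection is already in place on the node.
    """
    if mem_limit_bytes <= 0:
        raise ValueError("mem_limit_bytes must be positive")
    return mem_used_bytes >= margin * mem_limit_bytes


# Example with made-up numbers: 950 MB used against a 1000 MB limit.
if needs_restart(950 * 1024**2, 1000 * 1024**2):
    print("memory close to limit, consider restarting RabbitMQ")
```

A pure function like this keeps the decision separate from the restart itself, which in this environment is already handled by the existing restart logic.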
Can we deliver the fix later and apply it easily on a running env?
Yes, the logic to restart RabbitMQ is currently in place.
However, we have now seen RabbitMQ die for a different reason.
rabbitmqctl never hung on the nodes, but it reported several nodedown errors, which OCF treats as a resource failure and responds to by restarting the rabbit node.
- the complete list of single rabbit node failures during the test runs is http://pastebin.com/y12DDEx6
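The behaviour described above, where a nodedown error is treated as a resource failure, can be sketched roughly as follows. This is not the logic of the actual OCF script, only a minimal illustration under that assumption. The function name and the sample output string are hypothetical; the two return codes are the standard OCF values for success and generic error.

```python
OCF_SUCCESS = 0      # standard OCF return code: resource is healthy
OCF_ERR_GENERIC = 1  # standard OCF return code: generic failure


def monitor_result(rabbitmqctl_output: str, exit_code: int) -> int:
    """Map one rabbitmqctl call to an OCF-style monitor result.

    Hypothetical sketch: any non-zero exit code or a 'nodedown' message
    is reported as a failure, which would lead to a restart of the node.
    """
    if exit_code != 0 or "nodedown" in rabbitmqctl_output:
        return OCF_ERR_GENERIC
    return OCF_SUCCESS


# Made-up sample output, for illustration only.
sample = "Error: unable to connect to node rabbit@node-1: nodedown"
print(monitor_result(sample, 2))
```

With a check this strict, a transient nodedown report is enough to trigger a restart even when rabbitmqctl itself did not hang.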
Requested by Eugene Bogdanov
Some logs and additional RabbitMQ stats, collected by this script (http://pastebin.com/sX3DPyRG), are attached.
The fix for this is unclear and under investigation.