Comment 27 for bug 1460762

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There was a single node failover from 2015-06-08T23:47:01 to 2015-06-08T23:49:20 because of partitions detected and healed.
Failover events looks correct http://pastebin.com/44CjYmbA, and this seems not the reproduce of the original bug.
In originally reported bug, all nodes got list_channels hanged and restarted, there were no autoheal attempts, which means rabbit failed to detect partitions.
For this new case, I can see it had detected and healed just fine and there was non messages about hanged rabbitmqctl ion lrmd logs.
There was only 1 restart of the single rabbit@node-49 and this node was unavailable from 2015-06-08T23:47:01 to 2015-06-08T23:49:20
and this was caused by autohealing partitions.
I can also see numerous events like Node 'rabbitmqctlxxxx@foo' not responding but only one real partition case was detected and recovered, so this looks OK and these notices may be ignored.
@Dina: now please could you and Leontiy clarify how do you detect and define failure criterion of your test case?