Comment 2 for bug 1573030

Darren Birkett (darren-birkett) wrote :

So I managed to prove the theory - joining nodes, when left to all join at the same time (as in the current implementation), often hit a situation where they try to sync from each other when neither is actually synced yet. This causes an Mnesia DB issue, and the join hangs. RabbitMQ doesn't seem to have anything similar to a Galera cluster, where a 'donor' node has to be fully synced before it can act as a donor. It seems that a joining node can pick another joining node to sync from, even though that other node is not itself synced yet.

To prove this I added serial=1 to the playbook that calls the role, and moved the testing out into a separate play (otherwise it would get called at the end of each serial run of the playbook/role, and would fail). After running a loop of destroying the containers and then running the tests, over and over again on 10 separate nodes for 4 days straight, I have not seen a single instance of a cluster join hang.
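
For illustration, the playbook layout described above might look roughly like the sketch below. This is not the actual change: the host group name (rabbitmq_all) and role name (rabbitmq_server) are assumptions, and the check in the second play is only a stand-in for the real tests. The point is just that the first play carries the serial setting, so nodes join one at a time, and the tests live in their own play so they run once after all nodes have joined.

```yaml
# Sketch only: the group name "rabbitmq_all" and role name "rabbitmq_server"
# are hypothetical and stand in for whatever the real playbook uses.
- name: Install RabbitMQ and join the cluster, one node at a time
  hosts: rabbitmq_all
  serial: 1            # finish the whole play on one host before starting the next
  roles:
    - role: rabbitmq_server

# Separate play without "serial", so it is not re-run after each
# single-host batch of the play above.
- name: Test the cluster once every node has joined
  hosts: rabbitmq_all
  tasks:
    - name: Check cluster status (stand-in for the real tests)
      command: rabbitmqctl cluster_status
      changed_when: false
```

With serial set on the first play, each node finishes joining before the next one starts, so a joining node should only ever find already-synced nodes to sync from.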

I'll post a review up soon and the implementation can be debated (it's a bit dirty, in the test plays at least, because I needed to hardcode the path to the role default vars).