Comment 2 for bug 1249805

Teemu Ollakka (teemu-ollakka) wrote :

Log file analysis:

Initially four nodes were online, xdb2 through xdb5. Node xdb1 then started the join process and requested SST from xdb2. Due to network conditions, xdb5 dropped from the group briefly at least once (xdb5 log, 2:24:15 and onwards). Due to lp:1232747, nodes xdb1 and xdb5 crashed during group renegotiation. The remaining nodes xdb2 through xdb4 then formed singleton groups, because the earlier failed attempts had driven the evs install_timeout_count_ counter to its maximum value. At this point the primary component was lost and could not be re-established, because some nodes (probably xdb1 and xdb5) from the previously known primary component were not present.

So, there are at least two issues to be addressed:
* The obvious one, lp:1232747, which should be fixed.
* Either raise the maximum value of the evs install_timeout_count_ counter, or mark other nodes invalid one by one, to avoid losing too many nodes from the group at once. One way would be to set the maximum to the size of the last known group and to mark a node invalid only if it fails to reach consensus within the install timeout period.

It might also make sense to isolate constantly failing nodes from the group for longer periods of time to avoid causing too much turbulence for the group.
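
The second proposal in the list above can be illustrated with a small model. This is only a sketch of the idea, not Galera's evs code: the class InstallTimeoutTracker and its methods are hypothetical names, and the tie-break (lowest node name first) is an assumption made to keep the example deterministic. It shows the counter maximum being derived from the size of the last known group, and at most one node being marked invalid per expired install timeout.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, Optional, Set


@dataclass
class InstallTimeoutTracker:
    """Hypothetical model of the proposed policy; not Galera's actual code."""

    last_known_group: FrozenSet[str]
    timeouts: int = 0
    invalid: Set[str] = field(default_factory=set)

    @property
    def max_timeouts(self) -> int:
        # Proposed: maximum counter value equals the size of the last known group.
        return len(self.last_known_group)

    def on_install_timeout(self, not_in_consensus: Iterable[str]) -> Optional[str]:
        """Handle one expired install timeout period.

        not_in_consensus: nodes that failed to reach consensus in this period.
        Marks at most one of them invalid and returns its name, or None.
        """
        self.timeouts += 1
        candidates = sorted(
            (set(not_in_consensus) & self.last_known_group) - self.invalid
        )
        if not candidates:
            return None
        node = candidates[0]  # assumed tie-break: lowest name first
        self.invalid.add(node)
        return node

    def remaining(self) -> Set[str]:
        # Nodes still considered valid for the next install attempt.
        return set(self.last_known_group) - self.invalid

    def exhausted(self) -> bool:
        # True once the counter has reached its maximum.
        return self.timeouts >= self.max_timeouts


# Scenario from the logs, seen from a surviving node: xdb1 and xdb5 have crashed.
tracker = InstallTimeoutTracker(
    frozenset({"xdb1", "xdb2", "xdb3", "xdb4", "xdb5"})
)
first = tracker.on_install_timeout({"xdb1", "xdb5"})
second = tracker.on_install_timeout({"xdb5"})
```

In this model two timeouts are enough to exclude xdb1 and xdb5, leaving xdb2, xdb3 and xdb4 as the candidate group while the counter is still below its maximum of 5, so none of them is pushed into a singleton group.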