Comment 5 for bug 1182367

Teemu Ollakka (teemu-ollakka) wrote :

Hi Victor,

Your patch will certainly get rid of the exception, but I doubt it is the right thing to do. In that code branch we are comparing the states of two nodes that both claim to come from the primary component, so they should have seen exactly the same set of EVS views and messages and therefore have equal states. Getting this exception indicates that something is wrong with the state messages, and continuing might cause data corruption. In most cases things will probably go fine, but that cannot be guaranteed, so use it at your own risk.

It looks like one problem here is that, when a partition occurs, the nodes remaining in the primary component mark the partitioned nodes as non-primary too late (after the message exchange). If a partitioned node rejoins the group before the new primary component has finished forming, the state messages may not match.

A fix for this would be to mark partitioned nodes as non-primary right after quorum computation, but that might require changes that are not backwards compatible.
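
To make the ordering concrete, here is a rough toy model of the idea, not the actual Galera code. All names (NodeInfo, has_quorum, mark_partitioned, handle_view_change) are made up for illustration, and the quorum rule is assumed to be a plain majority of the previous primary component, which is simpler than what the real implementation does. The only point is that partitioned nodes are demoted in the same step as the quorum computation, before any state messages are exchanged.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    """Toy per-node record; the field names are hypothetical."""
    uuid: str
    prim: bool = True


def has_quorum(current_view, previous_prim):
    """Assumed rule: strictly more than half of the previous primary
    component's members are still present in the current view."""
    remaining = current_view & previous_prim
    return 2 * len(remaining) > len(previous_prim)


def mark_partitioned(nodes, current_view):
    """Mark every known node that is missing from the view as non-primary."""
    for uuid, info in nodes.items():
        if uuid not in current_view:
            info.prim = False


def handle_view_change(nodes, current_view, previous_prim):
    """Compute quorum and, if it is kept, demote partitioned nodes
    immediately, i.e. before any state message exchange starts."""
    quorum = has_quorum(current_view, previous_prim)
    if quorum:
        mark_partitioned(nodes, current_view)
    return quorum


# Example: three-node cluster, node "c" gets partitioned away.
nodes = {u: NodeInfo(u) for u in ("a", "b", "c")}
kept = handle_view_change(nodes, {"a", "b"}, {"a", "b", "c"})
```

With this ordering, if "c" rejoins before the new primary component has finished forming, the remaining nodes already consider it non-primary.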

Another thing worth investigating is whether the check should be done for this state message at all. It is possible that the failing case involves a state message coming from a previously partitioned node (a node coming from non-prim). In that case the fix would simply be to skip state validation.
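
As a sketch of what that second option could look like, again a toy model with invented names (StateMessage, validate_states and its fields are not Galera identifiers): only state messages from nodes that claim to come from the primary component are compared against each other, and messages from non-prim nodes are left out of the consistency check. Whether that is actually safe is exactly the open question above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateMessage:
    """Toy state message; all fields are hypothetical."""
    source: str
    prim: bool          # does the sender claim to come from the primary component?
    prim_view_id: int   # id of the last primary view the sender saw
    last_seq: int       # last sequence number the sender saw


def validate_states(messages):
    """Return the sources whose state disagrees with the first
    primary-claiming message. Non-prim senders are not validated."""
    prim_msgs = [m for m in messages if m.prim]
    if not prim_msgs:
        return []
    ref = prim_msgs[0]
    return [
        m.source
        for m in prim_msgs[1:]
        if (m.prim_view_id, m.last_seq) != (ref.prim_view_id, ref.last_seq)
    ]


# Node "c" was partitioned and reports an older state as non-prim,
# so it is skipped instead of triggering a mismatch.
msgs = [
    StateMessage("a", True, 3, 10),
    StateMessage("b", True, 3, 10),
    StateMessage("c", False, 2, 7),
]
mismatches = validate_states(msgs)
```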