Lowest group communication layer (evs) fails to handle the situation properly when a large number of nodes suddenly start to see each other

Bug #1271918 reported by Miguel Angel Nieto
This bug affects 1 person
Affects                 Status          Importance  Assigned to  Milestone
Galera (status tracked in 3.x)
  2.x                   Fix Committed   Undecided   Unassigned
  3.x                   Fix Committed   Undecided   Unassigned
Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC; status tracked in 5.6)
  5.5                   Fix Released    Medium      Unassigned
  5.6                   Fix Released    Medium      Unassigned

Bug Description

We have a 9-node cluster. Suddenly the nodes stopped seeing each other:

140122 9:57:38 [Note] WSREP: view(view_id(NON_PRIM,378576e2-82be-11e3-b36b-96b118ad9ea1,10428) memb {
        5c773ef3-82be-11e3-ab13-4ec5e0489f56,
} joined {
} left {
} partitioned {
        378576e2-82be-11e3-b36b-96b118ad9ea1,
        49ce39e7-82be-11e3-a6da-e3fdac1aff99,
        4d4cbe47-5379-11e3-9597-437084d45b0f,
        79ae5df7-82be-11e3-af7a-6fad1d747d02,
        9dbbcf2a-82be-11e3-9c79-367e9eb841fb,
        b7a4648a-82bd-11e3-9b24-33ceb08ce291,
        d682b20c-82bd-11e3-9955-477180b12d21,
        fa3ce7e9-82bd-11e3-92ee-969e5429ffde,
})
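
For triage, a view block like the one above can be summarized mechanically. The following is a minimal sketch, assuming the log layout is exactly as in this excerpt (it is not taken from Galera documentation); the function name `parse_view` and the returned dictionary keys are hypothetical.

```python
import re

# Assumed layout, inferred from the excerpt in this report only.
VIEW_RE = re.compile(
    r"view\(view_id\((?P<type>\w+),(?P<uuid>[0-9a-f-]{36}),(?P<seq>\d+)\)"
)
SECTION_RE = re.compile(r"(memb|joined|left|partitioned)\s*\{")
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def parse_view(text):
    """Return the view type, sequence number and node lists, or None."""
    header = VIEW_RE.search(text)
    if header is None:
        return None
    view = {
        "type": header.group("type"),
        "seq": int(header.group("seq")),
        "memb": [],
        "joined": [],
        "left": [],
        "partitioned": [],
    }
    section = None
    # Walk the lines after the view_id(...) header and track which
    # section (memb, joined, left, partitioned) is currently open.
    for line in text[header.end():].splitlines():
        opened = SECTION_RE.search(line)
        if opened:
            section = opened.group(1)
            line = line[opened.end():]
        if section:
            view[section].extend(UUID_RE.findall(line))
    return view
```

Applied to the view above, this would report one entry under `memb` and eight under `partitioned`, which is the signature of a node that has lost sight of the rest of the cluster.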

Later the problem was resolved, but the nodes could not reconnect:

140122 9:58:38 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
140122 9:58:38 [Note] WSREP: Flow-control interval: [16, 16]
140122 9:58:38 [Note] WSREP: Received NON-PRIMARY.
140122 9:58:38 [Note] WSREP: New cluster view: global state: 840ae537-bb36-11e2-0800-55dad0151e6b:47649869, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
140122 9:58:38 [Warning] WSREP: evs::proto(5c773ef3-82be-11e3-ab13-4ec5e0489f56, GATHER, view_id(REG,5c773ef3-82be-11e3-ab13-4ec5e0489f56,10430)) source 49ce39e7-82be-11e3-a6da-e3fdac1aff99 is not supposed to be representative
140122 9:58:39 [Warning] WSREP: evs::proto(5c773ef3-82be-11e3-ab13-4ec5e0489f56, GATHER, view_id(REG,5c773ef3-82be-11e3-ab13-4ec5e0489f56,10430)) source 49ce39e7-82be-11e3-a6da-e3fdac1aff99 is not supposed to be representative
140122 9:58:40 [Warning] WSREP: evs::proto(5c773ef3-82be-11e3-ab13-4ec5e0489f56, GATHER, view_id(REG,5c773ef3-82be-11e3-ab13-4ec5e0489f56,10430)) source 49ce39e7-82be-11e3-a6da-e3fdac1aff99 is not supposed to be representative
140122 9:58:41 [Warning] WSREP: evs::proto(5c773ef3-82be-11e3-ab13-4ec5e0489f56, GATHER, view_id(REG,5c773ef3-82be-11e3-ab13-4ec5e0489f56,10430)) source 49ce39e7-82be-11e3-a6da-e3fdac1aff99 is not supposed to be representative
140122 9:58:42 [Warning] WSREP: evs::proto(5c773ef3-82be-11e3-ab13-4ec5e0489f56, GATHER, view_id(REG,5c773ef3-82be-11e3-ab13-4ec5e0489f56,10430)) source 49ce39e7-82be-11e3-a6da-e3fdac1aff99 is not supposed to be representative

Similar messages appear on all nodes.
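
To check how widespread the warning is across node logs, the repeated lines can be counted per reporting node and source. This is a rough sketch that assumes the message format shown in the excerpt above; `count_representative_warnings` is a hypothetical helper name, not part of Galera or any Percona tool.

```python
import re
from collections import Counter

# Pattern inferred from the warning lines quoted in this report.
WARN_RE = re.compile(
    r"evs::proto\((?P<local>[0-9a-f-]{36}), (?P<state>\w+), "
    r"view_id\((?P<vtype>\w+),(?P<vuuid>[0-9a-f-]{36}),(?P<seq>\d+)\)\) "
    r"source (?P<source>[0-9a-f-]{36}) is not supposed to be representative"
)


def count_representative_warnings(lines):
    """Count warnings keyed by (local node, evs state, source node)."""
    counts = Counter()
    for line in lines:
        m = WARN_RE.search(line)
        if m:
            counts[(m.group("local"), m.group("state"), m.group("source"))] += 1
    return counts
```

A count that keeps growing for the same node pair while the state stays `GATHER`, as in the excerpt, suggests the node is stuck trying to form a new group rather than hitting a one-off message.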

Jervin R (revin) wrote :

Miguel, what is the Galera version? This looks similar, at least in behavior, to https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1269236

Teemu Ollakka (teemu-ollakka) wrote :

This is a bit different from lp:1269236. The message "... is not supposed to be representative" indicates that there were problems forming a new group after the nodes reconnected. In lp:1269236 the nodes ended up in non-primary because one of them crashed while the cluster was fully partitioned.

Teemu Ollakka (teemu-ollakka) wrote :

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1096
