Originally reported in: https://groups.google.com/forum/?fromgroups#!topic/percona-discussion/fLL1eDJ34Ts
After several partitionings and re-merges, an exception was thrown from pc::Proto::validate_state_msgs(). Relevant part of the log:
130517 18:21:36 [Note] WSREP: declaring 07015197-bbbe-11e2-0800-3e7a13126565 stable
130517 18:21:36 [Note] WSREP: Node 07015197-bbbe-11e2-0800-3e7a13126565 state prim
130517 18:21:36 [Note] WSREP: view(view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68) memb {
07015197-bbbe-11e2-0800-3e7a13126565,
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,
} joined {
} left {
} partitioned {
7ac28eaf-bbb9-11e2-0800-1c760b64ad99,
})
130517 18:21:36 [Note] WSREP: forgetting 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567)
130517 18:21:36 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
130517 18:21:36 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130517 18:21:36 [Note] WSREP: STATE EXCHANGE: sent state msg: 828d5335-bef0-11e2-0800-30030a942a4d
130517 18:21:36 [Note] WSREP: STATE EXCHANGE: got state msg: 828d5335-bef0-11e2-0800-30030a942a4d from 0 (db2)
130517 18:21:36 [Note] WSREP: STATE EXCHANGE: got state msg: 828d5335-bef0-11e2-0800-30030a942a4d from 1 (dbserver1)
130517 18:21:36 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 23,
members = 2/2 (joined/total),
act_id = 5703326,
last_appl. = 5703275,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 85b364e5-bb0c-11e2-0800-aa1ab3b9ca31
130517 18:21:36 [Note] WSREP: Flow-control interval: [23, 23]
130517 18:21:36 [Note] WSREP: New cluster view: global state: 85b364e5-bb0c-11e2-0800-aa1ab3b9ca31:5703326, view# 24: Primary, number of nodes: 2, my index: 1, protocol version 2
130517 18:21:36 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130517 18:21:36 [Note] WSREP: Assign initial position for certification: 5703326, protocol version: 2
130517 18:21:41 [Note] WSREP: cleaning up 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567)
130517 18:21:52 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://xx.xx..30.34:4567
130517 18:21:53 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') reconnecting to 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567), attempt 0
130517 18:21:54 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting off
130517 18:21:58 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://xx.xx..30.34:4567
130517 18:21:59 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') reconnecting to 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567), attempt 0
130517 18:22:18 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting off
130517 18:22:25 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://xx.xx..30.34:4567
130517 18:22:26 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') reconnecting to 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567), attempt 0
130517 18:22:29 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting off
130517 18:22:30 [Note] WSREP: declaring 07015197-bbbe-11e2-0800-3e7a13126565 stable
130517 18:22:30 [Note] WSREP: declaring 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 stable
130517 18:22:35 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=ec71ed3f-bbc1-11e2-0800-3622c3a697f8,start_prim=0,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=279,checksum=1,instances=
07015197-bbbe-11e2-0800-3e7a13126565,prim=1,un=1,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
7ac28eaf-bbb9-11e2-0800-1c760b64ad99,prim=0,un=0,last_seq=1327602,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,23),to_seq=5872395,weight=1
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,prim=1,un=1,last_seq=279,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
,state_msgs=
07015197-bbbe-11e2-0800-3e7a13126565,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 07015197-bbbe-11e2-0800-3e7a13126565,prim=1,un=0,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,prim=1,un=0,last_seq=279,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
}}
7ac28eaf-bbb9-11e2-0800-1c760b64ad99,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 07015197-bbbe-11e2-0800-3e7a13126565,prim=1,un=1,last_seq=1297805,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,23),to_seq=5872395,weight=1
7ac28eaf-bbb9-11e2-0800-1c760b64ad99,prim=0,un=0,last_seq=1327602,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,23),to_seq=5872395,weight=1
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,prim=1,un=1,last_seq=1375500,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,23),to_seq=5872395,weight=1
}}
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 07015197-bbbe-11e2-0800-3e7a13126565,prim=1,un=0,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,prim=1,un=0,last_seq=279,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
}}
,current_view=view(view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78) memb {
07015197-bbbe-11e2-0800-3e7a13126565,
7ac28eaf-bbb9-11e2-0800-1c760b64ad99,
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,
} joined {
7ac28eaf-bbb9-11e2-0800-1c760b64ad99,
} left {
} partitioned {
}),pc_view=view(view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68) memb {
07015197-bbbe-11e2-0800-3e7a13126565,
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,
} joined {
} left {
} partitioned {
}),mtu=32636}
130517 18:22:35 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=ec71ed3f-bbc1-11e2-0800-3622c3a697f8,source_view_id=view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8372711,node_list=()
} 116
130517 18:22:35 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=0,flags=4,source=7ac28eaf-bbb9-11e2-0800-1c760b64ad99,source_view_id=view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8659774,node_list=()
}
130517 18:22:35 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(ec71ed3f-bbc1-11e2-0800-3622c3a697f8, OPERATIONAL, view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78)), OPERATIONAL) {
current_view=view(view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78) memb {
07015197-bbbe-11e2-0800-3e7a13126565,
7ac28eaf-bbb9-11e2-0800-1c760b64ad99,
ec71ed3f-bbc1-11e2-0800-3622c3a697f8,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=3,safe_seq=0,node_index=node: {idx=0,range=[4,3],safe_seq=3} node: {idx=1,range=[4,3],safe_seq=0} node: {idx=2,range=[4,3],safe_seq=3} ,msg_index= (2,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=ec71ed3f-bbc1-11e2-0800-3622c3a697f8,source_view_id=view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8372711,node_list=()
}
(0,1),evs::msg{version=0,type=1,user_type=255,order=0,seq=1,seq_range=0,aru_seq=0,flags=4,source=07015197-bbbe-11e2-0800-3e7a13
130517 18:22:35 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state == local_state: ec71ed3f-bbc1-11e2-0800-3622c3a697f8 node 07015197-bbbe-11e2-0800-3e7a13126565 prim state message and local states not consistent: msg node prim=1,un=0,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1 local state prim=1,un=1,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1 (FATAL)
at gcomm/src/pc_proto.cpp:validate_state_msgs():606
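The reason for this crash is that the message state is compared to the local state using the == operator, which also compares the un status flag. That flag is not expected to stay consistent across partitionings: in the failure above, the state message carries un=0 for node 07015197-bbbe-11e2-0800-3e7a13126565 while the local state has un=1, and all other fields are identical. The comparison should be changed to cover only those parts of the state that must stay globally consistent.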
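The proposed change can be illustrated with a minimal sketch. It is not the actual gcomm code: the ViewId and NodeState structs and the function names strict_equal() and globally_consistent() are hypothetical, and the assumption that every field except un must match across nodes is mine, taken only from the fields printed in the state dump above.

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical stand-in for a view identifier such as
// view_id(PRIM,07015197-...,68); not the real gcomm class.
struct ViewId {
    std::string   type;  // "PRIM", "REG", ...
    std::string   uuid;  // UUID of the view's representative
    std::uint32_t seq;   // view sequence number
};

inline bool operator==(const ViewId& a, const ViewId& b) {
    return std::tie(a.type, a.uuid, a.seq) == std::tie(b.type, b.uuid, b.seq);
}

// Hypothetical per-node state, mirroring the fields printed in the log:
// prim, un, last_seq, last_prim, to_seq, weight.
struct NodeState {
    bool         prim;
    bool         un;         // local flag, may differ between nodes
    std::int64_t last_seq;
    ViewId       last_prim;
    std::int64_t to_seq;
    int          weight;
};

// Compares only the fields assumed to be globally consistent.
// The un flag is deliberately left out.
inline bool globally_consistent(const NodeState& a, const NodeState& b) {
    return std::tie(a.prim, a.last_seq, a.last_prim, a.to_seq, a.weight) ==
           std::tie(b.prim, b.last_seq, b.last_prim, b.to_seq, b.weight);
}

// Field-by-field equality including un, i.e. the kind of comparison
// that fails for the states shown in the log.
inline bool strict_equal(const NodeState& a, const NodeState& b) {
    return a.un == b.un && globally_consistent(a, b);
}
```

With the two states from the fatal message (identical except for un=0 in the message and un=1 locally), strict_equal() returns false while globally_consistent() returns true, so a validation based on the latter would not throw in this case.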
Another report at https://groups.google.com/d/msg/codership-team/CqmJVHylz4M/PjkovPEbW5cJ, logs attached. Note that the node clocks are not synced: according to the reporter, node2 is 176 seconds off and node3 is 120 seconds off.