Comment 2 for bug 1182367

Victor Teichert (victor-2) wrote :

Hello,

I had a cluster lock up after one node was partitioned from the rest of the cluster. It looks like the same issue described in this bug report.

On the server that was partitioned off:
140120 19:49:25 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.11956S), skipping check
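
My reading, which may be off: PT1.5S is an ISO-8601 duration, and the warning means the group-communication layer on that node went roughly seven seconds without running its inactivity check, i.e. the node itself was stalled. I believe the relevant Galera provider options and their stock defaults are roughly the following, though I have not verified them against this exact version:

  evs.inactive_check_period = PT0.5S  (how often peer inactivity is checked)
  evs.keepalive_period      = PT1S    (how often keepalives are sent when idle)
  evs.suspect_timeout       = PT5S    (silence before a peer is suspected dead)
  evs.inactive_timeout      = PT15S   (silence before a peer is declared dead)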

This resulted in all of the other nodes failing with a message like the following:

140120 19:50:12 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:

I have snipped this section and put the full set of logs in the attached file.
...
...
...

140120 19:50:13 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=2,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=b30e1037-74a5-11e3-b724-1f38224e5841,source_view_id=view_id(REG,3497e712-74a6-11e3-afef-63feb09277a0,400),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=30493889,node_list=()
 }140120 19:50:13 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state == local_state: 97ea3a0a-6298-11e3-975a-07bbeda5973c node 97ea3a0a-6298-11e3-975a-07bbeda5973c prim state message and local states not consistent: msg node prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,97ea3a0a-6298-11e3-975a-07bbeda5973c,399),to_seq=2145834,weight=1 local state prim=1,un=1,last_seq=2,last_prim=view_id(PRIM,97ea3a0a-6298-11e3-975a-07bbeda5973c,399),to_seq=2145834,weight=1 (FATAL)

What caused the entire cluster to lock up? Is there anything that can be done to prevent this issue from recurring?
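
The only mitigation I can think of is loosening the EVS timeouts so that a briefly stalled node is evicted cleanly instead of rejoining in the middle of the primary-component recalculation. A minimal sketch of the my.cnf change I am considering, with guessed values I have not tested:

  [mysqld]
  # Untested sketch: give a stalled node more headroom before eviction.
  # evs.inactive_timeout must stay comfortably above evs.suspect_timeout.
  wsrep_provider_options="evs.suspect_timeout=PT10S;evs.inactive_timeout=PT30S;evs.keepalive_period=PT1S"

I would put the same values on every node and restart them one at a time. Is that a reasonable direction, or is this an exception that no amount of timeout tuning can avoid?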

I can provide the logs from the other servers if needed.