I had a cluster lock up after one node was partitioned from the cluster. It looks like it is the same issue described here:
On the server that was partitioned:
140120 19:49:25 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.11956S), skipping check
This resulted in all of the other nodes failing with a message like the following:
140120 19:50:12 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
I have snipped this section and put the full logs in the attached file:
...
...
...
140120 19:50:13 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=2,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=b30e1037-74a5-11e3-b724-1f38224e5841,source_view_id=view_id(REG,3497e712-74a6-11e3-afef-63feb09277a0,400),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=30493889,node_list=()
}140120 19:50:13 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state == local_state: 97ea3a0a-6298-11e3-975a-07bbeda5973c node 97ea3a0a-6298-11e3-975a-07bbeda5973c prim state message and local states not consistent: msg node prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,97ea3a0a-6298-11e3-975a-07bbeda5973c,399),to_seq=2145834,weight=1 local state prim=1,un=1,last_seq=2,last_prim=view_id(PRIM,97ea3a0a-6298-11e3-975a-07bbeda5973c,399),to_seq=2145834,weight=1 (FATAL)
What caused the entire cluster to lock up? Is there anything that can be done to try to prevent this issue from recurring?
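If transient network stalls triggered the partition, would loosening the EVS timeouts in wsrep_provider_options be a reasonable mitigation? Something like the following sketch (the values are guesses on my part, not tested, and would need tuning for our network):

```ini
# my.cnf fragment - illustrative values only, not a verified fix
[mysqld]
wsrep_provider_options="evs.keepalive_period=PT1S;evs.inactive_check_period=PT1S;evs.suspect_timeout=PT10S;evs.inactive_timeout=PT30S;evs.install_timeout=PT30S"
```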
I can provide the logs from the other servers if needed.