I was also able to reproduce the problem on a smaller, 4-node cluster.
Again, only the 1st node had a bad network. There was no load traffic at all during the test. New logs are attached as logs3.tgz.
After running for some time, nodes 2, 3 and 4 eventually went down with a status like this:
percona2 mysql> show status like 'ws%';
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
| wsrep_local_state_uuid     | eb4b0cbb-88ea-11e3-bcab-160cab62cdb7 |
| wsrep_protocol_version     | 4                                    |
| wsrep_last_committed       | 0                                    |
| wsrep_replicated           | 0                                    |
| wsrep_replicated_bytes     | 0                                    |
| wsrep_received             | 44                                   |
| wsrep_received_bytes       | 10425                                |
| wsrep_local_commits        | 0                                    |
| wsrep_local_cert_failures  | 0                                    |
| wsrep_local_replays        | 0                                    |
| wsrep_local_send_queue     | 0                                    |
| wsrep_local_send_queue_avg | 0.000000                             |
| wsrep_local_recv_queue     | 0                                    |
| wsrep_local_recv_queue_avg | 0.000000                             |
| wsrep_flow_control_paused  | 0.000000                             |
| wsrep_flow_control_sent    | 0                                    |
| wsrep_flow_control_recv    | 0                                    |
| wsrep_cert_deps_distance   | 0.000000                             |
| wsrep_apply_oooe           | 0.000000                             |
| wsrep_apply_oool           | 0.000000                             |
| wsrep_apply_window         | 0.000000                             |
| wsrep_commit_oooe          | 0.000000                             |
| wsrep_commit_oool          | 0.000000                             |
| wsrep_commit_window        | 0.000000                             |
| wsrep_local_state          | 0                                    |
| wsrep_local_state_comment  | Initialized                          |
| wsrep_cert_index_size      | 0                                    |
| wsrep_causal_reads         | 0                                    |
| wsrep_incoming_addresses   |                                      |
| wsrep_cluster_conf_id      | 18446744073709551615                 |
| wsrep_cluster_size         | 0                                    |
| wsrep_cluster_state_uuid   | eb4b0cbb-88ea-11e3-bcab-160cab62cdb7 |
| wsrep_cluster_status       | non-Primary                          |
| wsrep_connected            | ON                                   |
| wsrep_local_bf_aborts      | 0                                    |
| wsrep_local_index          | 18446744073709551615                 |
| wsrep_provider_name        | Galera                               |
| wsrep_provider_vendor      | Codership Oy <email address hidden>  |
| wsrep_provider_version     | 2.8(r165)                            |
| wsrep_ready                | OFF                                  |
+----------------------------+--------------------------------------+
40 rows in set (0.00 sec)
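For reference, this is roughly how I check whether a node has dropped to non-Primary. The pc.bootstrap step at the end is only a possible workaround based on the documented Galera provider option for forcing a new primary component; I have not verified it against this particular failure:

mysql> SHOW STATUS LIKE 'wsrep_cluster_status';  -- shows non-Primary on the stuck nodes
mysql> SHOW STATUS LIKE 'wsrep_ready';           -- OFF, so the node refuses queries
-- untested workaround: force a new primary component on one surviving node
mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=true';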
The last error log entries before nodes dropped out of cluster:
140129 21:10:16 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=f760e81e-891e-11e3-b212-97b65328917c,source_view_id=view_id(REG,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,137),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=5075,node_list=()
}
140129 21:10:16 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=9,flags=6,source=a0ca33ea-891d-11e3-981b-47f09eb33df9,source_view_id=view_id(REG,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,137),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8269,node_list=()
}
state after handling message: evs::proto(evs::proto(a4edf89e-891d-11e3-995f-bb38ec056175, OPERATIONAL, view_id(REG,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,137)), OPERATIONAL) {
current_view=view(view_id(REG,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,137) memb {
6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,
a0ca33ea-891d-11e3-981b-47f09eb33df9,
a4edf89e-891d-11e3-995f-bb38ec056175,
f760e81e-891e-11e3-b212-97b65328917c,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[18,17],safe_seq=1} node: {idx=1,range=[1,10],safe_seq=9} node: {idx=2,range=[18,17],safe_seq=0} node: {idx=3,range=[18,17],safe_seq=12} },
fifo_seq=8187,
last_sent=17,
known={
6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,evs::node{operational=1,suspected=0,installed=1,fifo_seq=8404,}
a0ca33ea-891d-11e3-981b-47f09eb33df9,evs::node{operational=1,suspected=0,installed=1,fifo_seq=8290,}
a4edf89e-891d-11e3-995f-bb38ec056175,evs::node{operational=1,suspected=0,installed=1,fifo_seq=-1,}
f760e81e-891e-11e3-b212-97b65328917c,evs::node{operational=1,suspected=0,installed=1,fifo_seq=5106,}
}
}
140129 21:10:16 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state == local_state: a4edf89e-891d-11e3-995f-bb38ec056175 node 6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3 prim state message and local states not consistent: msg node prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,136),to_seq=141,weight=1 local state prim=1,un=1,last_seq=2,last_prim=view_id(PRIM,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,136),to_seq=141,weight=1 (FATAL)
 at gcomm/src/pc_proto.cpp:validate_state_msgs():606
140129 21:10:16 [Note] WSREP: Received self-leave message.
140129 21:10:16 [Note] WSREP: Flow-control interval: [0, 0]
140129 21:10:16 [Note] WSREP: Received SELF-LEAVE. Closing connection.
140129 21:10:16 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 0)
140129 21:10:16 [Note] WSREP: RECV thread exiting 0: Success
140129 21:10:16 [Note] WSREP: New cluster view: global state: eb4b0cbb-88ea-11e3-bcab-160cab62cdb7:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
140129 21:10:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140129 21:10:16 [Note] WSREP: applier thread exiting (code:0)
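In case it helps anyone hitting the same thing: since the error says the backend must be restarted, before restarting mysqld on the non-Primary nodes it is probably worth comparing their replication positions, for example with the standard wsrep status counters below, so the most advanced node can be chosen for bootstrapping:

-- run on every node; the cluster UUID plus last committed seqno shows how far each node got
mysql> SHOW STATUS LIKE 'wsrep_cluster_state_uuid';
mysql> SHOW STATUS LIKE 'wsrep_last_committed';
mysql> SHOW STATUS LIKE 'wsrep_local_state_comment';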