Comment 3 for bug 1274192

Przemek (pmalkowski) wrote:

I was also able to reproduce the problem on a smaller, 4-node cluster.
Again, only the 1st node had a bad network connection, and there was no load traffic at all during the test. New logs are attached as logs3.tgz.
After running for some time, nodes 2, 3 and 4 eventually went down with a status like this:

percona2 mysql> show status like 'ws%';
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
| wsrep_local_state_uuid     | eb4b0cbb-88ea-11e3-bcab-160cab62cdb7 |
| wsrep_protocol_version     | 4                                    |
| wsrep_last_committed       | 0                                    |
| wsrep_replicated           | 0                                    |
| wsrep_replicated_bytes     | 0                                    |
| wsrep_received             | 44                                   |
| wsrep_received_bytes       | 10425                                |
| wsrep_local_commits        | 0                                    |
| wsrep_local_cert_failures  | 0                                    |
| wsrep_local_replays        | 0                                    |
| wsrep_local_send_queue     | 0                                    |
| wsrep_local_send_queue_avg | 0.000000                             |
| wsrep_local_recv_queue     | 0                                    |
| wsrep_local_recv_queue_avg | 0.000000                             |
| wsrep_flow_control_paused  | 0.000000                             |
| wsrep_flow_control_sent    | 0                                    |
| wsrep_flow_control_recv    | 0                                    |
| wsrep_cert_deps_distance   | 0.000000                             |
| wsrep_apply_oooe           | 0.000000                             |
| wsrep_apply_oool           | 0.000000                             |
| wsrep_apply_window         | 0.000000                             |
| wsrep_commit_oooe          | 0.000000                             |
| wsrep_commit_oool          | 0.000000                             |
| wsrep_commit_window        | 0.000000                             |
| wsrep_local_state          | 0                                    |
| wsrep_local_state_comment  | Initialized                          |
| wsrep_cert_index_size      | 0                                    |
| wsrep_causal_reads         | 0                                    |
| wsrep_incoming_addresses   |                                      |
| wsrep_cluster_conf_id      | 18446744073709551615                 |
| wsrep_cluster_size         | 0                                    |
| wsrep_cluster_state_uuid   | eb4b0cbb-88ea-11e3-bcab-160cab62cdb7 |
| wsrep_cluster_status       | non-Primary                          |
| wsrep_connected            | ON                                   |
| wsrep_local_bf_aborts      | 0                                    |
| wsrep_local_index          | 18446744073709551615                 |
| wsrep_provider_name        | Galera                               |
| wsrep_provider_vendor      | Codership Oy <email address hidden>  |
| wsrep_provider_version     | 2.8(r165)                            |
| wsrep_ready                | OFF                                  |
+----------------------------+--------------------------------------+
40 rows in set (0.00 sec)
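
Note that wsrep_cluster_conf_id and wsrep_local_index showing 18446744073709551615 is just -1 printed as an unsigned 64-bit value (2^64 - 1), which matches the "my index: -1" line in the error log below: the node no longer belongs to any cluster configuration. To watch just the relevant variables on each node, a plain SHOW STATUS filter like the one below is enough (nothing version-specific here, and the variable list is only an illustration):

mysql> SHOW GLOBAL STATUS WHERE Variable_name IN
    -> ('wsrep_ready', 'wsrep_cluster_status', 'wsrep_cluster_size', 'wsrep_local_state_comment');

On a healthy node this returns ON, Primary, 4 (for this cluster) and Synced; the dropped nodes show OFF, non-Primary, 0 and Initialized, as above.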

The last error log entries before the nodes dropped out of the cluster:

140129 21:10:16 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=f760e81e-891e-11e3-b212-97b65328917c,source_view_id=view_id(REG,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,137),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=5075,node_list=()
} 168
140129 21:10:16 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=9,flags=6,source=a0ca33ea-891d-11e3-981b-47f09eb33df9,source_view_id=view_id(REG,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,137),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8269,node_list=()
}
 state after handling message: evs::proto(evs::proto(a4edf89e-891d-11e3-995f-bb38ec056175, OPERATIONAL, view_id(REG,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,137)), OPERATIONAL) {
current_view=view(view_id(REG,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,137) memb {
        6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,
        a0ca33ea-891d-11e3-981b-47f09eb33df9,
        a4edf89e-891d-11e3-995f-bb38ec056175,
        f760e81e-891e-11e3-b212-97b65328917c,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[18,17],safe_seq=1} node: {idx=1,range=[1,10],safe_seq=9} node: {idx=2,range=[18,17],safe_seq=0} node: {idx=3,range=[18,17],safe_seq=12} },
fifo_seq=8187,
last_sent=17,
known={
        6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,evs::node{operational=1,suspected=0,installed=1,fifo_seq=8404,}
        a0ca33ea-891d-11e3-981b-47f09eb33df9,evs::node{operational=1,suspected=0,installed=1,fifo_seq=8290,}
        a4edf89e-891d-11e3-995f-bb38ec056175,evs::node{operational=1,suspected=0,installed=1,fifo_seq=-1,}
        f760e81e-891e-11e3-b212-97b65328917c,evs::node{operational=1,suspected=0,installed=1,fifo_seq=5106,}
 }
 }140129 21:10:16 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state == local_state: a4edf89e-891d-11e3-995f-bb38ec056175 node 6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3 prim state message and local states not consistent: msg node prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,136),to_seq=141,weight=1 local state prim=1,un=1,last_seq=2,last_prim=view_id(PRIM,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,136),to_seq=141,weight=1 (FATAL)
         at gcomm/src/pc_proto.cpp:validate_state_msgs():606
140129 21:10:16 [Note] WSREP: Received self-leave message.
140129 21:10:16 [Note] WSREP: Flow-control interval: [0, 0]
140129 21:10:16 [Note] WSREP: Received SELF-LEAVE. Closing connection.
140129 21:10:16 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 0)
140129 21:10:16 [Note] WSREP: RECV thread exiting 0: Success
140129 21:10:16 [Note] WSREP: New cluster view: global state: eb4b0cbb-88ea-11e3-bcab-160cab62cdb7:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
140129 21:10:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140129 21:10:16 [Note] WSREP: applier thread exiting (code:0)
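
For completeness: with wsrep_ready = OFF a node rejects regular statements until mysqld is restarted (as the gcomm error above demands), so on nodes 2-4 any application query would be failing with the standard wsrep error. The statement and table name below are only an illustration, not output captured during this test:

percona2 mysql> select * from test.t1 limit 1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use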