Comment 7 for bug 1274192

Muhammad Irfan (muhammad-irfan) wrote :

I reproduced this problem on a 4-node cluster (percona1 through percona4) running PXC 5.5.37 with wsrep 2.10(r175).
I introduced a bad network, with packet loss and delay, on percona1.

1) After some time the cluster started malfunctioning: all nodes went into wsrep_local_state_comment = Initialized and wsrep_cluster_status = non-Primary.

2) I made percona2 the primary and all the other nodes joined the cluster, but percona1 (the node with network issues) kept trying to connect and eventually the entire cluster went down. wsrep stopped working and I had to issue kill -9 on all nodes to bring the cluster up again.

[root@percona2 ~]# mysql
mysql> show status like 'wsrep%';
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
| wsrep_local_state_uuid     | 9f581f39-eb03-11e3-8eb4-97664aaec97d |
| wsrep_protocol_version     | 4                                    |
| wsrep_last_committed       | 0                                    |
| wsrep_replicated           | 0                                    |
| wsrep_replicated_bytes     | 0                                    |
| wsrep_received             | 151                                  |
| wsrep_received_bytes       | 34204                                |
| wsrep_local_commits        | 0                                    |
| wsrep_local_cert_failures  | 0                                    |
| wsrep_local_replays        | 0                                    |
| wsrep_local_send_queue     | 0                                    |
| wsrep_local_send_queue_avg | 0.000000                             |
| wsrep_local_recv_queue     | 0                                    |
| wsrep_local_recv_queue_avg | 0.000000                             |
| wsrep_flow_control_paused  | 0.000000                             |
| wsrep_flow_control_sent    | 0                                    |
| wsrep_flow_control_recv    | 0                                    |
| wsrep_cert_deps_distance   | 0.000000                             |
| wsrep_apply_oooe           | 0.000000                             |
| wsrep_apply_oool           | 0.000000                             |
| wsrep_apply_window         | 0.000000                             |
| wsrep_commit_oooe          | 0.000000                             |
| wsrep_commit_oool          | 0.000000                             |
| wsrep_commit_window        | 0.000000                             |
| wsrep_local_state          | 0                                    |
| wsrep_local_state_comment  | Initialized                          |
| wsrep_cert_index_size      | 0                                    |
| wsrep_causal_reads         | 0                                    |
| wsrep_incoming_addresses   |                                      |
| wsrep_cluster_conf_id      | 18446744073709551615                 |
| wsrep_cluster_size         | 0                                    |
| wsrep_cluster_state_uuid   | 9f581f39-eb03-11e3-8eb4-97664aaec97d |
| wsrep_cluster_status       | non-Primary                          |
| wsrep_connected            | ON                                   |
| wsrep_local_bf_aborts      | 0                                    |
| wsrep_local_index          | 18446744073709551615                 |
| wsrep_provider_name        | Galera                               |
| wsrep_provider_vendor      | Codership Oy <email address hidden>  |
| wsrep_provider_version     | 2.10(r175)                           |
| wsrep_ready                | OFF                                  |
+----------------------------+--------------------------------------+

mysql> select 1;
ERROR 1047 (08S01): Unknown command
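
For anyone triaging a node in this state, the three values that explain the ERROR 1047 above are wsrep_ready, wsrep_cluster_status and wsrep_local_state_comment. The following is a minimal sketch, not part of the original report, that parses pasted `show status like 'wsrep%'` output and lists which of them are off. The function names and the SAMPLE text are hypothetical, and treating only "Synced" as healthy is a simplifying assumption (a donor node, for example, reports a different state comment).

```python
def parse_status(text):
    """Parse the mysql client's table output into a {name: value} dict."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, prompts and blank lines
        cells = [c.strip() for c in line.split("|")[1:-1]]
        if len(cells) == 2 and cells[0] != "Variable_name":
            status[cells[0]] = cells[1]
    return status


def node_problems(status):
    """Return the reasons the node is not serving queries (empty if none)."""
    # Assumption: a healthy node reports exactly these three values.
    expected = {
        "wsrep_ready": "ON",
        "wsrep_cluster_status": "Primary",
        "wsrep_local_state_comment": "Synced",
    }
    return [
        f"{name} = {status.get(name)!r}, expected {want!r}"
        for name, want in expected.items()
        if status.get(name) != want
    ]


# A few rows copied from the output above, used only as sample input.
SAMPLE = """\
| wsrep_local_state_comment | Initialized |
| wsrep_incoming_addresses | |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_ready | OFF |
"""

if __name__ == "__main__":
    for problem in node_problems(parse_status(SAMPLE)):
        print(problem)
```

Note that wsrep_connected = ON alone says little here: the node can be connected to the group layer and still refuse queries because it is not part of a primary component.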

[root@percona2 ~]# tail -f /var/log/mysqld.log
140605 10:08:25 [ERROR] WSREP: exception from gcomm, backend must be restarted:8f61539e-ec8c-11e3-ae76-6b64399ac83dInstall message self state does not match, message state: prim=0,un=0,last_seq=2,last_prim=view_id(PRIM,8f61539e-ec8c-11e3-ae76-6b64399ac83d,1165),to_seq=445,weight=1, local state: prim=0,un=1,last_seq=2,last_prim=view_id(PRIM,8f61539e-ec8c-11e3-ae76-6b64399ac83d,1165),to_seq=445,weight=1 (FATAL)
         at gcomm/src/pc_proto.cpp:handle_install():1047
140605 10:08:25 [Note] WSREP: Received self-leave message.
140605 10:08:25 [Note] WSREP: Flow-control interval: [0, 0]
140605 10:08:25 [Note] WSREP: Received SELF-LEAVE. Closing connection.
140605 10:08:25 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 0)
140605 10:08:25 [Note] WSREP: RECV thread exiting 0: Success
140605 10:08:25 [Note] WSREP: New cluster view: global state: 9f581f39-eb03-11e3-8eb4-97664aaec97d:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
140605 10:08:25 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140605 10:08:25 [Note] WSREP: applier thread exiting (code:0)
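
The FATAL line above prints two nearly identical state strings, so the mismatch is easy to miss by eye. The sketch below, which is an illustration and not part of the original report, splits both strings into fields and reports only the ones that differ. The helper names are hypothetical, and the regular expression assumes the exact "message state: ..., local state: ... (FATAL)" wording seen in this log.

```python
import re


def split_top_level(s):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_state(s):
    """Turn 'prim=0,un=1,...' into a dict, keeping view_id(...) in one piece."""
    return dict(p.strip().split("=", 1) for p in split_top_level(s) if p.strip())


def state_diff(log_line):
    """Return {field: (message_value, local_value)} for fields that differ."""
    m = re.search(r"message state: (.*), local state: (.*) \(FATAL\)", log_line)
    if m is None:
        return {}
    msg, local = parse_state(m.group(1)), parse_state(m.group(2))
    return {
        key: (msg.get(key), local.get(key))
        for key in sorted(set(msg) | set(local))
        if msg.get(key) != local.get(key)
    }


# The error line from the log above, copied verbatim.
LOG_LINE = (
    "140605 10:08:25 [ERROR] WSREP: exception from gcomm, backend must be "
    "restarted:8f61539e-ec8c-11e3-ae76-6b64399ac83dInstall message self state "
    "does not match, message state: prim=0,un=0,last_seq=2,"
    "last_prim=view_id(PRIM,8f61539e-ec8c-11e3-ae76-6b64399ac83d,1165),"
    "to_seq=445,weight=1, local state: prim=0,un=1,last_seq=2,"
    "last_prim=view_id(PRIM,8f61539e-ec8c-11e3-ae76-6b64399ac83d,1165),"
    "to_seq=445,weight=1 (FATAL)"
)

if __name__ == "__main__":
    print(state_diff(LOG_LINE))
```

For this log line the only differing field is `un` (0 in the install message, 1 in the local state); every other field, including last_prim and to_seq, matches.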