We started with a three-node cluster: a default install of MariaDB 10.0.14-1 with Galera, except for the settings noted below. The config files and logs from each node are in the attached zip file. I have been able to reproduce this on 5.5 and 10.0, and also with wsrep_sst_method set to mysqldump and to rsync.
Settings that differ from the defaults:
wsrep_node_address=192.168.56.23x
wsrep_node_incoming_address=192.168.56.23x
wsrep_sst_method=xtrabackup
wsrep_sst_auth=sstuser:mariadb
On node2, I added the following firewall rules to drop traffic to and from node1:
[root@mdbc10_node2 mysql]# iptables -I OUTPUT -d 192.168.56.231 -j DROP
[root@mdbc10_node2 mysql]# iptables -I INPUT -d 192.168.56.231 -j DROP
This simulates a network outage between node1 and node2, while both of them can still talk to node3.
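The connectivity this is meant to produce can be sketched as a small model. It is only an illustration: the `reachable_peers` helper is hypothetical, and it assumes the rules cut the node1/node2 link in both directions, as intended above.

```python
NODES = ["node1", "node2", "node3"]

# Links the firewall rules are meant to cut (assumed to be symmetric).
BLOCKED = {frozenset(("node1", "node2"))}


def reachable_peers(node, nodes=NODES, blocked=BLOCKED):
    """Return the peers that `node` can still talk to directly."""
    return [p for p in nodes if p != node and frozenset((node, p)) not in blocked]


if __name__ == "__main__":
    for n in NODES:
        print(n, "->", reachable_peers(n))
```

In this model node3 is the only node that still sees both of the others.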
I then stopped mysql on node2:
[root@mdbc10_node2 mysql]# service mysql stop
Shutting down MySQL..... SUCCESS!
After that, mysql on node2 fails to start:
[root@mdbc10_node2 mysql]# service mysql start
Starting MySQL................................. ERROR!
At this point both node1 and node3 become non-primary (non-PC) and renegotiate quorum.
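To watch this state change while reproducing the problem, the wsrep status variables can be checked on node1 and node3. A minimal sketch, assuming the text comes from something like `mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_%'"` (tab-separated name/value pairs). The function names and the sample text are made up for illustration and are not taken from the attached logs.

```python
def parse_status(text):
    """Turn tab-separated 'name<TAB>value' lines into a dict."""
    status = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("\t")
        status[name.strip()] = value.strip()
    return status


def is_primary(status):
    """True only when the node reports the value 'Primary'."""
    return status.get("wsrep_cluster_status") == "Primary"


# Hypothetical sample; the exact spelling of the non-primary value may differ.
SAMPLE = "wsrep_cluster_size\t2\nwsrep_cluster_status\tnon-Primary\n"

if __name__ == "__main__":
    print(is_primary(parse_status(SAMPLE)))
```

Comparing only against `Primary` keeps the check independent of how the other states are spelled.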
If I start node2 with wsrep_cluster_address set to node3, it starts:
[root@mdbc10_node2 mysql]# service mysql start --wsrep_cluster_address=gcomm://192.168.56.233
Starting MySQL.. SUCCESS!
Joffrey's notes from trying to replicate the issue:
Three-node cluster running as Primary: A, B, C.
wsrep_cluster_address in cfg = A,B,C.
I cut the connection between A and B (iptables -j REJECT or -j DROP, both ways). The cluster is still alive.
If I try to restart node A with --wsrep_cluster_address=C, and wsrep_sst_donor = C : Works
If I try to restart node A with --wsrep_cluster_address=C, and wsrep_sst_donor = B : Fails
Note that I tried this because, without wsrep_sst_donor specified, it tried to contact node B (I understood that the SST donor was chosen as the "1st one who replies" ... maybe I was wrong).
If I try to restart node A with --wsrep_cluster_address=B (whatever wsrep_sst_donor is): Fails. This makes sense.
Now ....
If I try to restart node A with --wsrep_cluster_address=B,C: it fails, but it also puts C in non-primary for at least 1 minute. Whatever donor I set, it fails to start.
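The four cases above can be lined up against a naive expectation to show which one is surprising. This is a sketch, not a description of how Galera actually decides: it assumes a restart of A should succeed when at least one address in wsrep_cluster_address is reachable and the chosen donor is reachable. The `naive_expectation` helper and the table layout are hypothetical. The observed results are copied from the notes, with donor C picked for the cases that fail whatever the donor is.

```python
# A cannot reach B (link cut both ways); A can still reach C.
REACHABLE_FROM_A = {"C"}

# (wsrep_cluster_address, wsrep_sst_donor, observed result from the notes)
CASES = [
    (["C"], "C", "works"),
    (["C"], "B", "fails"),
    (["B"], "C", "fails"),       # fails whatever the donor is
    (["B", "C"], "C", "fails"),  # fails whatever the donor is
]


def naive_expectation(addresses, donor, reachable=REACHABLE_FROM_A):
    """Assumed rule: one reachable address plus a reachable donor is enough."""
    joins = any(a in reachable for a in addresses)
    return "works" if joins and donor in reachable else "fails"


def surprising_cases(cases=CASES):
    """Return the cases where the observation differs from the naive rule."""
    return [c for c in cases if naive_expectation(c[0], c[1]) != c[2]]


if __name__ == "__main__":
    for addresses, donor, observed in surprising_cases():
        print(",".join(addresses), "donor", donor, "observed:", observed)
```

Under that assumption only the B,C case stands out, which is also the one that pushed C into non-primary.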
Is this a known situation?