Galera

Bug #1399005
Comment #0

Ben Stillman (bstillman-i) wrote on 2014-12-03:

We started with a three-node cluster: a default install of MariaDB 10.0.14-1 with Galera, except for the settings noted below. The config files and logs from each node are in the attached zip file. I have been able to reproduce this on 5.5 and 10.0, and also with wsrep_sst_method set to mysqldump and rsync.

Settings that differ from the defaults (x is the node number):
wsrep_node_address=192.168.56.23x
wsrep_node_incoming_address=192.168.56.23x
wsrep_sst_method=xtrabackup
wsrep_sst_auth=sstuser:mariadb

On node2, I added the following firewall rules to drop traffic to and from node1:
[root@mdbc10_node2 mysql]# iptables -I OUTPUT -d 192.168.56.231 -j DROP
[root@mdbc10_node2 mysql]# iptables -I INPUT -d 192.168.56.231 -j DROP

This simulates a network outage between node1 and node2, while each of them can still talk to node3.
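
For anyone reproducing this, a minimal sketch of a rule pair that blocks both directions is given below. It relies on the fact that packets arriving from node1 on the INPUT chain carry node1's address as their source, so that rule matches on -s rather than -d. The helper name isolate_peer_rules is hypothetical, and it only prints the commands instead of running them.

```shell
# Hypothetical helper: print (do not execute) the iptables rules that cut
# this host off from one peer in both directions.
# $1 = IP address of the peer to isolate from.
isolate_peer_rules() {
  peer="$1"
  # Outgoing packets: the peer is the destination.
  echo "iptables -I OUTPUT -d $peer -j DROP"
  # Incoming packets: the peer is the source.
  echo "iptables -I INPUT -s $peer -j DROP"
}

isolate_peer_rules 192.168.56.231
```

The printed lines can be reviewed and then run as root on node2.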

I then stopped mysql on node2:
[root@mdbc10_node2 mysql]# service mysql stop
Shutting down MySQL..... SUCCESS!

And now mysql on node2 cannot start:
[root@mdbc10_node2 mysql]# service mysql start
Starting MySQL................................. ERROR!

At this point both node1 and node3 become non-Primary and have to re-establish quorum.
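
The loss of Primary status can be confirmed on node1 and node3 with the wsrep status variables. The sketch below assumes the mysql client can log in without extra options (for example via ~/.my.cnf); the function name wsrep_quorum_status is hypothetical.

```shell
# Hypothetical helper: print the wsrep status values most relevant to quorum
# on the local node, one "name<TAB>value" pair per line.
# Assumes the mysql client can connect without further options.
wsrep_quorum_status() {
  # -N suppresses the column header, -B gives tab-separated batch output.
  mysql -N -B -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_status', 'wsrep_cluster_size', 'wsrep_local_state_comment')"
}
```

Run wsrep_quorum_status on each surviving node; wsrep_cluster_status reads Primary on a node that is part of the primary component and non-Primary otherwise.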

If I start node2 with wsrep_cluster_address set to node3, it starts:
[root@mdbc10_node2 mysql]# service mysql start --wsrep_cluster_address=gcomm://192.168.56.233
Starting MySQL.. SUCCESS!

Joffrey's notes from trying to replicate the issue:

A running three-node Primary cluster: A, B, C.
wsrep_cluster_address in the config = A,B,C.
Blocking connections between A and B (iptables -j REJECT or -j DROP, both ways). The cluster stays alive.

If I try to restart node A with --wsrep_cluster_address=C and wsrep_sst_donor=C: works.
If I try to restart node A with --wsrep_cluster_address=C and wsrep_sst_donor=B: fails.
Note that I tried this because, without wsrep_sst_donor specified, it tried to contact node B (I understood that the SST donor was chosen as the "1st one who replies" ... maybe I was wrong).
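
The working combination above (join through C, donor C) can be sketched as a single start command. wsrep_sst_donor takes the donor's wsrep_node_name, so the name node3 used here is an assumption about how the node is named, and start_via_peer_cmd is a hypothetical helper that only prints the command.

```shell
# Hypothetical helper: print the start command for a node that should join
# through one specific peer and request SST from a named donor.
# $1 = IP address of the peer to join through
# $2 = wsrep_node_name of the donor (assumed value, check the donor's config)
start_via_peer_cmd() {
  echo "service mysql start --wsrep_cluster_address=gcomm://$1 --wsrep_sst_donor=$2"
}

start_via_peer_cmd 192.168.56.233 node3
```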

If I try to restart node A with --wsrep_cluster_address=B (regardless of wsrep_sst_donor): fails. This makes sense.

Now ....
If I try to restart node A with --wsrep_cluster_address=B,C, it fails, and it also puts C into non-Primary for at least a minute. Whichever donor I specify, it fails to start.

Is this a known situation?