Simulated network outage between 2 of 3 nodes, node cannot start

Bug #1399005 reported by Ben Stillman on 2014-12-03
This bug affects 1 person
Affects: Galera
Importance: Undecided
Assigned to: Unassigned

Bug Description

We started with a three-node cluster: a default install of MariaDB 10.0.14 with Galera, except for the configs noted below. The config files and logs from each node are in the attached zip file. I have been able to replicate this on both 5.5 and 10.0.

Configs that differ from the defaults:
wsrep_node_address=192.168.56.23x
wsrep_node_incoming_address=192.168.56.23x
wsrep_sst_method=xtrabackup
wsrep_sst_auth=sstuser:mariadb

On node2, I added the following firewall rules to drop traffic to and from node1:
[root@mdbc10_node2 mysql]# iptables -I OUTPUT -d 192.168.56.231 -j DROP
[root@mdbc10_node2 mysql]# iptables -I INPUT -d 192.168.56.231 -j DROP

This simulates a network outage between node1 and node2 only: node1 and node2 can each still talk to node3.
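For comparison, a symmetric block of a single peer is usually written with a destination match on OUTPUT and a source match on INPUT. The INPUT rule above matches on destination (-d), which only matches packets addressed to 192.168.56.231; assuming that address belongs to node1 and not to node2, inbound traffic from node1 may not actually have been dropped by it. The sketch below only prints the two rules rather than applying them. block_peer_rules is a hypothetical helper name, not part of iptables.

```shell
# Sketch only: prints the rules instead of applying them.
# block_peer_rules is a hypothetical helper, not an iptables command.
block_peer_rules() {
    peer="$1"
    # Outbound packets to the peer are matched by destination address.
    echo "iptables -I OUTPUT -d $peer -j DROP"
    # Inbound packets from the peer are matched by source address.
    echo "iptables -I INPUT -s $peer -j DROP"
}

block_peer_rules 192.168.56.231
```

Printing the commands first makes it easy to review them before running them as root on node2.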

I then stopped mysql on node2:
[root@mdbc10_node2 mysql]# service mysql stop
Shutting down MySQL..... SUCCESS!

And now mysql on node2 cannot start:
[root@mdbc10_node2 mysql]# service mysql start
Starting MySQL................................. ERROR!

At this point both node1 and node3 become non-Primary and have to re-establish quorum.
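One way to observe this state change is to read the wsrep_cluster_status status variable on node1 and node3 while node2 is trying to start. The sketch below is a minimal example, not the procedure used in this report: cluster_status is a hypothetical helper, and the host and credentials in the commented usage line are assumptions.

```shell
# cluster_status is a hypothetical helper. It reads the tab-separated output of
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_status'"
# on stdin and prints only the value column (for example Primary or non-Primary).
cluster_status() {
    awk -F'\t' '$1 == "wsrep_cluster_status" { print $2 }'
}

# Example against a live node (host and credentials are assumptions):
# mysql -h 192.168.56.231 -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_status'" | cluster_status
```

Running the commented line in a loop against node1 and node3 would show how long each node stays out of the Primary component.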

If I start node2 with wsrep_cluster_address set to node3, it starts:
[root@mdbc10_node2 mysql]# service mysql start --wsrep_cluster_address=gcomm://192.168.56.233
Starting MySQL.. SUCCESS!

Joffrey's notes from trying to replicate the issue:

A three-node cluster running in Primary state: A, B, C.
wsrep_cluster_address in the config = A,B,C.
Blocking connections between A and B (iptables -j REJECT or -j DROP, both ways). The cluster is still alive.

If I try to restart node A with --wsrep_cluster_address=C and wsrep_sst_donor=C: works.
If I try to restart node A with --wsrep_cluster_address=C and wsrep_sst_donor=B: fails.
Note that I tried this because, without wsrep_sst_donor specified, it tried to contact node B (I had understood that the SST donor was chosen as the "first one who replies"; maybe I was wrong).

If I try to restart node A with --wsrep_cluster_address=B (whatever the wsrep_sst_donor): fails. This makes sense.

Now ....
If I try to restart node A with --wsrep_cluster_address=B,C, it fails, and it also puts C into non-Primary for at least one minute. Whichever donor I specify, it fails to start.

Is this a known situation?

Ben Stillman (bstillman-i) wrote :

Adding logs from 3.6 testing and linking to the GitHub issue.
