Support multi-network clusters

Bug #1204500 reported by Bernhard Schmidt
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: New
Importance: Wishlist
Assigned to: Unassigned

Bug Description

As far as I can tell, Percona/Galera only supports a single network to be used for replication traffic. If that network happens to be down the cluster will split.

Most HA systems (heartbeat, corosync) support and even recommend more than one path between the nodes. They are used simultaneously, and the cluster will split only when all paths are down.

Revision history for this message
Jay Janssen (jay-janssen) wrote :

This is an interesting idea. On the one hand, you can handle this at a lower level with interface bonding, but at a higher level this should in theory make for a more robust cluster.

Such a feature would allow internal clusters with "bridge" nodes that would act as relays to external networks (say a DR site). Galera can already do this kind of relaying, but I don't know what complexity multiple gcomm networks would introduce (potentially non-trivial).

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

It would be nice to see a concrete example of a problem, and of the solution that "multi-network" would bring. This is to see
- how important this feature is
- how exactly it is supposed to work

By all means, we too recommend multiple paths between the nodes. In fact it is an implicit requirement for any HA system, so it goes without saying. But that is a matter of infrastructure; it is not clear what is required of Galera here.

Revision history for this message
Bernhard Schmidt (berni) wrote :

Okay, one example.

We run a cluster of three nodes. They have normal network connectivity to a switch where user data is exchanged and the application is running. They also have direct connections to two separate switches with a dedicated subnet each, say 192.168.1.0/24 and 192.168.2.0/24.

Corosync/Heartbeat/Pacemaker support cluster heartbeat over both dedicated switches as well as the normal network connectivity. Any of these can fail or reboot whenever needed, because a single working connection is sufficient for the cluster. As long as the nodes still see each other over one path they remain in sync and the normal quorum rules do not apply. This also allows the cluster to detect weird errors (e.g. A can talk to B and C, but B and C cannot talk to each other). If a node loses its uplink connectivity (thus affecting the application), corosync can detect that and react, i.e. _gracefully_ migrate the application away.
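For comparison, corosync's redundant-ring setup is configured roughly like this (a sketch only; the interface numbers, ports, and subnets are illustrative, matching the two dedicated switches described above):

```
# /etc/corosync/corosync.conf (fragment, illustrative)
totem {
    version: 2
    rrp_mode: passive          # use ring 1 only when ring 0 fails; "active" uses both
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0   # first dedicated heartbeat subnet
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.2.0   # second dedicated heartbeat subnet
        mcastport: 5407
    }
}
```

With this in place, losing either dedicated switch does not partition the corosync membership.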

Percona, however, is different here. I can only connect to another Percona node via one IP address. If that connection is down, the cluster splits and the side without quorum goes down. It doesn't matter whether there are two other working connections to that node; if the primary IP that Percona is using becomes unreachable, the cluster splits hard.
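The single-path limitation shows up directly in the configuration: each peer is listed by exactly one address in the gcomm:// cluster address (the addresses below are illustrative):

```
# /etc/my.cnf (fragment, illustrative)
# Each node appears exactly once, with a single IP; there is no syntax
# for listing a fallback address per node on another subnet.
wsrep_cluster_address = gcomm://192.168.1.1,192.168.1.2,192.168.1.3
```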

I think this needs to be a Galera feature (supporting multiple IP connections to a single neighbour node with graceful failover). But maybe I'm getting the architecture wrong.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Ok, I can't verify it right away, but I believe that if you use the same subnet on all three interfaces, kernel routing will do the trick for you, because Linux responds to ARP requests on all interfaces regardless of the interface address. E.g. in case of

node1                     node2
192.168.0.1 <--- X --->   192.168.0.2
192.168.0.3 <--------->   192.168.0.4

you should still be able to reach 192.168.0.2 from 192.168.0.1.

Notice that you can assign multiple IP addresses to a single interface, so you can still use those dedicated subnets for applications that need them.
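Concretely, putting both dedicated interfaces on one shared subnet while keeping the per-switch subnets as secondary addresses could look like this on node1 (a host-network sketch; interface names and addresses are illustrative):

```
# node1: both dedicated interfaces in the same subnet, so Linux will
# answer ARP for either local address on whichever interface still works
ip addr add 192.168.0.1/24 dev eth1
ip addr add 192.168.0.3/24 dev eth2

# the original per-switch subnets remain available as extra addresses
# for applications that expect them
ip addr add 192.168.1.1/24 dev eth1
ip addr add 192.168.2.1/24 dev eth2
```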

Other options to consider are network interface bonding or bridging.
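A bonding setup in active-backup mode would hide the two switches behind one logical interface with a single IP, which is the usual way to give Galera a redundant path below the application layer (a Debian-style sketch; interface names and the address are illustrative and it assumes the ifenslave package):

```
# /etc/network/interfaces (fragment, illustrative)
auto bond0
iface bond0 inet static
    address 192.168.1.1
    netmask 255.255.255.0
    bond-slaves eth1 eth2
    bond-mode active-backup   # fail over between the two switches
    bond-miimon 100           # link-check interval in ms
```

Galera then only ever sees the bond0 address, and switch failover happens transparently underneath it.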

Changed in percona-xtradb-cluster:
importance: Undecided → Wishlist
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1167
