Comment 7 for bug 1285380

Ryan Gordon (ryan-5) wrote :

Hi Alex,

The main issue we're trying to tackle is the graceful interaction between a regularly scheduled xtrabackup process on a particular node and the cluster's use of that same node as the xtrabackup IST/SST donor when another node requests it. Because this node is isolated from any live traffic (it only does replication), it reduces the risk of complications during the FTWRL taken in the backup/SST process.

Again, the problem we're having is the graceful interaction between the two: how do we prioritize an SST request over a regularly scheduled backup? Currently that isn't possible, because once a backup puts the node into a desynced state there is no way for us to detect that another node has requested an SST, cancel the backup, and return the node to a synced state so the SST can proceed. Right now, the only thing the joiner node does is "wait" for the donor (and eventually time out) while it is desynced. There's no functionality to gracefully stop a backup and resync the donor node when the donor gets an SST request. Having that functionality would be the "ideal situation."
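To make the desired behaviour concrete, here's a rough sketch of the kind of wrapper we'd want: it desyncs the node only for the duration of the backup and can be told to abort and resync (say, by an SST hook sending it USR1). This is our idea, not anything Galera provides today; `mysql_exec` is a stub standing in for a real `mysql -e` call, and `sleep 1 &` stands in for the actual xtrabackup process:

```shell
#!/usr/bin/env bash
# Sketch only: desync the node for a backup, but abort the backup and
# resync immediately if an SST request arrives as a USR1 signal.
LOG=""
mysql_exec() { LOG="${LOG}${1}; "; echo "SQL: $1"; }   # stub for a real mysql client call

abort_for_sst() {
    # An SST was requested: kill the backup and resync so we can donate.
    kill "$backup_pid" 2>/dev/null
    mysql_exec "SET GLOBAL wsrep_desync = OFF"
    exit 0
}
trap abort_for_sst USR1

mysql_exec "SET GLOBAL wsrep_desync = ON"    # take the node out of sync for the backup
sleep 1 &                                    # stand-in for the xtrabackup run
backup_pid=$!
wait "$backup_pid"
mysql_exec "SET GLOBAL wsrep_desync = OFF"   # resync after a normal backup
```

The missing piece, of course, is the hook that would let the donor detect an incoming SST request and deliver the signal; that is exactly the functionality we're asking for.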

1) It would be great if we could control the timeout (maybe this is already possible?).
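For what it's worth, some versions of the xtrabackup SST script appear to accept a joiner-side timeout via the `[sst]` section of my.cnf; the option name below is taken from the PXC documentation and should be verified against the installed `wsrep_sst_xtrabackup-v2` version:

```ini
# Hypothetical my.cnf fragment -- verify the option name and default
# against your wsrep_sst_xtrabackup-v2 version before relying on it.
[sst]
# Seconds the joiner waits for the first SST data before giving up
# (the script's default is reportedly 100).
sst-initial-timeout=300
```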

2) It's great that the joiner node produces an error message (although it's rather cryptic; at first glance it looks like it's describing a memory issue), but the donor node doesn't log anything, which makes it really frustrating to figure out what the donor is doing and why the SST isn't working. I would much prefer that some logging be added on the donor side for this situation.

3) Can you explain this a little more? Even if the joiner requests an SST instead of an IST, the donor node will still be in a desynced state due to the backup, and the request will still eventually time out.

4) That is the million-dollar question. The cluster only contains 3 nodes, so by making the backup node the donor for the cluster we reduce the risk of bringing down the entire cluster. If one node is trying to join via SST while the backup node is already desynced, the joiner fails over to the last remaining node that is actively handling traffic; at that point we've effectively compromised the entire cluster, which defeats the purpose of having a cluster for HA.
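For context, pinning the backup node as the preferred donor is just the standard `wsrep_sst_donor` setting on the joining nodes; something like the following, where `backup-node` is a placeholder for the backup node's `wsrep_node_name` and the trailing comma lets Galera fall back to any other available donor:

```ini
# Illustrative my.cnf fragment on the joining nodes; "backup-node" is a
# placeholder for the wsrep_node_name of the dedicated backup node.
[mysqld]
# Prefer backup-node as SST/IST donor; the trailing comma permits
# falling back to another donor if it is unavailable.
wsrep_sst_donor=backup-node,
```

It's precisely that fallback that worries us: when the backup node is desynced, the SST lands on a traffic-serving node instead.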

5) I'm not sure how that would work; it seems like it would take a pretty complicated mess of custom modifications and bash scripts to get something like that running. We're actually in the process of replacing our current xtrabackup backup solution with a ZFS snapshot solution, so the node only has to enter the desynced state for a few seconds per hour, which greatly reduces the chance of an issue occurring during an SST. We will likely replace the xtrabackup SST solution with the ZFS snapshot solution as well, given the current complexity/instability of an xtrabackup SST interacting with a regularly scheduled xtrabackup backup process.