Donor in desynced state makes the joiner wait indefinitely

Bug #1285380 reported by Ovais Tariq on 2014-02-26
This bug affects 3 people
Affects: Galera (Undecided, Unassigned)
Affects: Percona XtraDB Cluster, status tracked in 5.6
  5.5: Undecided, Unassigned
  5.6: Undecided, Unassigned

Bug Description

If a particular node is set up to be the donor via wsrep_sst_donor and the donor node is in a desynced state for some reason, then the joiner simply waits for the donor to become synced again. Maybe the joiner should error out, or at least log a message to the error log.
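
For context, the donor is pinned with a setting like the following in my.cnf (the hostname is purely illustrative):

```ini
[mysqld]
# Pin SST/IST donor selection to node1 only.
# A trailing comma ("node1,") would instead permit failover to any
# other synced node when node1 is unavailable.
wsrep_sst_donor=node1
```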

a) Where is the wait: in IST, SST, or somewhere else in galera?

b) Here, too, can you provide the logs of donor and joiner?

Alex Yurchenko (ayurchen) wrote :

Raghu, this is by design. A non-synced node can't be used for state transfer.

Ovais, is there any serious use case for that? We can make the joiner abort in this case, but then there will be another complaint that the joiner aborts instead of waiting for the donor to become available. We can provide only so much flexibility with a simple interface like this (or we need to add more options).

Changed in percona-xtradb-cluster:
status: New → Confirmed
Ovais Tariq (ovais-tariq) wrote :

Hi Alex,

Perhaps we could have an option to control when the state transfer times out. Since wsrep_sst_donor now accepts multiple hostnames, timing out the state transfer request on one node would allow PXC to proceed with the SST using the second node. Or does it already do it that way?

I have a customer who experienced the long waits when the node specified in the wsrep_sst_donor option is in a desynced state. The node that is specified as the donor is also used to take backups, and it is put in a desynced state while the backup runs. Now, if an SST is needed at the same time the backup is running, the joiner simply waits.

On 2014-02-28 15:11, Ovais Tariq wrote:
> Hi Alex,
>
> Perhaps we could have an option so that we can control when the state
> transfer times out. Since wsrep_sst_donor now accepts multiple
> hostnames, so timing out the state transfer request on one node, would
> allow PXC to proceed with the SST using the second node. Or does it
> already do it that way?

Yes.

> I have a customer who experienced the long waits when the node
> specified
> for the wsrep_sst_donor option is in desynced state. The node that is
> specified as the donor is also used to do backups and the node is put
> in
> a desynced state when running the backup. Now if a SST is needed at the
> same time that the backup is running, the joiner simply waits.

What was the exact wsrep_sst_donor value? Did it imply failover to some
other node? Otherwise, if that node was the only one in the list and
there was no trailing comma, then it is treated as a precise instruction
to use ONLY THAT node for SST, and since that node is busy, what other
option is there but to wait?


--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Ovais Tariq (ovais-tariq) wrote :

Hi Alex,

On Fri, Feb 28, 2014 at 6:48 PM, Alex Yurchenko <email address hidden> wrote:

> On 2014-02-28 15:11, Ovais Tariq wrote:
> > Hi Alex,
> >
> > Perhaps we could have an option so that we can control when the state
> > transfer times out. Since wsrep_sst_donor now accepts multiple
> > hostnames, so timing out the state transfer request on one node, would
> > allow PXC to proceed with the SST using the second node. Or does it
> > already do it that way?
>
> Yes.
>
> > I have a customer who experienced the long waits when the node
> > specified
> > for the wsrep_sst_donor option is in desynced state. The node that is
> > specified as the donor is also used to do backups and the node is put
> > in
> > a desynced state when running the backup. Now if a SST is needed at the
> > same time that the backup is running, the joiner simply waits.
>
> What was the exact wsrep_sst_donor value? Did it imply failover to some
> other node? Otherwise, if that node was the only one in the list and
> there was no trailing comma, then it is treated as precise instruction
> to use ONLY THAT node for SST and, since that node is busy, what other
> option is there but wait?
>

wsrep_sst_donor only had a single hostname specified. So yes, that should be
the ONLY node used for SST, but perhaps there should be a timeout of some
sort so that the DBA is notified about the SST failure and can bring the
donor node out of the desynced state.


--

Ovais Tariq, Consultant, Percona

http://www.percona.com | http://www.mysql...


Alex Yurchenko (ayurchen) wrote :

On 2014-02-28 15:58, Ovais Tariq wrote:
> wsrep_sst_donor only had a single hostname specified. So yes that
> should be
> the ONLY node used for SST, but perhaps there should be a timeout of
> some
> sort so that the DBA is notified about SST failure and he can possibly
> bring out the other node from desynced state.

Timeout is possible, however:

1) It is not exactly a failure; in fact, it is not a failure at all. The
node was instructed to use only one donor for SST and hence is patiently
waiting for it to become available. What would be better, aborting?

2) When a joiner is waiting for the donor to be available, you can see
the following in the error log:
140228 18:54:41 [Note] WSREP: Requesting state transfer failed:
-11(Resource temporarily unavailable). Will keep retrying every 1
second(s)
so it is not exactly silent.

3) A joiner is by definition in an undefined state. If it is aborted at
that moment, next time it will have to request an SST again (unless you
know what you're doing).

4) In that particular case, are they willing to decrease the availability
of their cluster even further by allowing another node to become an SST donor?

5) Moreover, since there is already a backup going on, it would be
considerably faster to use that backup (when it's finished) to bootstrap
the joiner rather than do it all over again via SST from another node.

Ryan Gordon (ryan-5) wrote :

Hi Alex,

The main issue we're trying to tackle is the graceful interaction between a regularly scheduled xtrabackup backup process on a particular node and configuring the cluster to use that same node for xtrabackup ISTs/SSTs when another node requests one. Because this node is isolated from any live traffic (it only does replication), it reduces the risk of complications during the FTWRL in the backup/SST process. Again, the problem we're having is the graceful interaction between the two: how do we give an SST priority, when it is requested, over a regularly scheduled backup? Currently that is not possible, because once a backup puts the node into a desynced state, there is no way for us to tell when a node requests an SST, cancel the backup, and put the node back into a synced state so the SST can continue. Right now the only thing the joiner node does is "wait" for the node (and eventually time out) while it is in a desynced state. There's no functionality to gracefully stop a backup and put the donor node back into a synced state when the donor gets an SST request. Having that functionality would be the ideal situation.

1) It would be great if we could control the timeout (maybe this is already possible)?

2) It is great that the joiner node has an error message (although it is kind of cryptic to understand what that message really means; it looks like it's describing a memory issue at first glance), but the donor node does not produce any logging messages, which can be really frustrating when trying to figure out what the donor node is doing/why the SST isn't working. I would much prefer it if some logging were added to the donor node for this situation.

3) Can you explain this a little more? Even if it requests an SST instead of an IST, the donor node will still be in a desynced state due to the backup, and the request will still eventually time out.

4) That is the million dollar question. The cluster only contains 3 nodes, so by making the backup node the donor node for the cluster we reduce the risk of bringing down our entire cluster. If we have one node trying to join and SST, the backup node already in a desynced state, and the joiner then failing over to the last available node actively handling traffic, we've effectively compromised our entire cluster, which defeats the purpose of having a cluster for HA.

5) I'm not sure how that would work; it seems like it would take a pretty complicated mess of custom modifications and bash scripts to get something like that going. We're actually in the process of replacing our current xtrabackup backup solution with a ZFS snapshot solution so the node only has to enter the desynced state for a few seconds per hour, which greatly reduces the chance of an issue occurring during an SST. We will likely replace the xtrabackup SST solution with this ZFS snapshot solution too, due to the current complexity/instability of an xtrabackup SST and its interaction with a regularly scheduled xtrabackup backup process.

Alex Yurchenko (ayurchen) wrote :

On 2014-02-28 23:31, Ryan Gordon wrote:
> 1) It would be great if we could control the timeout (maybe this is
> already possible)?

It is not implemented at the moment, but should be possible to
implement.

However, from what I understand, in this case the only purpose of the
timeout is to be a sort of notification for the operator that the donor
node is currently unavailable because it is doing the backup. Which the
operator may already be aware of, since the backup is a scheduled
operation. Given that during the joining phase one can't connect to
mysqld (by mysqld design), the only source of feedback is the error log.
So it is only natural to expect the operator to check the log for the
progress of the operation, and that is another place to learn that the
donor is unavailable.

At the same time, a timeout will result in aborting the joiner operation
(what else?), which is exactly the opposite of what you want: bringing
up the joiner as fast as possible. Ideally the joiner should keep
working and waiting for the donor to become available, and the operator
should simply go to the donor and make sure it is not busy with
something else.

So I'm not exactly sure a timeout is what you really want. It doesn't
even save you from looking at the error log - you still have to do it
after the joiner aborts. It just aborts a normally progressing operation.

> 2) It is great that the joiner node has an error message (although it
> is
> kind of cryptic to understand what that message really means, it looks
> like its trying to describe a memory issue at first glance)

Really? Doesn't it say that it temporarily cannot access a "resource"
(the prescribed donor) and will keep on retrying? It is kind of generic,
but at the level where it is logged this (the error code) is all the
information that is available.

> but the
> donor node does not produce any logging messages which can be really
> frustrating in trying to figure out what the donor node is doing/why
> the
> SST isn't working. I would much prefer it if some logging was added to
> the donor node for this situation.

Are you sure you would prefer a log message on the donor? It is the
joiner that is having problems, not the donor, so why would you look
into the donor log before checking the joiner? What sort of message
could the donor produce that would not contain redundant information?

Besides, it does not work the way you think it does. The cluster selects
a donor based on the information provided by the joiner. If the donor is
unavailable, the joiner receives an error, but the donor does not
receive anything. It is unaware that the joiner is having problems. One
reason for this is that there can be several potential donors.

But suppose we go as far as seeing that there is only one possible
donor and then pass the state transfer request to it. What can the donor
do? It knows that it is in a desynced state and that something serious
is going on. But it does not know:

1) what is going on;
2) whether it is interruptible;
3) how to interrupt it.

In this particular case, it is an xtrabackup started externally. For
mysqld to know something about it and be able to interrupt it, we'd
have to integrate the backup process with mysqld - which (besides being
a quest...


Ryan Gordon (ryan-5) wrote :

Hi Alex,

> It is not implemented at the moment, but should be possible to
> implement.
>
> However, from what I understand, in this case the only purpose of the
> timeout is to be a sort of notification for the operator that the donor
> node is currently unavailable doing the backup. Which operator may be
> already aware of, since the backup is a scheduled operation. Given that
> during the joining phase one can't connect to mysqld (by mysqld design),
> the only source of feedback is the error log. So it is only natural to
> expect the operator to check the log for the progress of the operation.
> So this is another source to get the information that the donor is
> unavailable.
>
> At the same time timeout will result in aborting the joiner operation
> (what else?), which is exactly the opposite to what you want: bringing
> up the joiner as fast as possible. Ideally the joiner should keep on
> working and waiting for donor to become available and operator should
> simply go to donor and make sure it is not busy with something else.
>
> So I'm not exactly sure if timeout is what you really want. It doesn't
> even save you from looking at the error log - you still have to do it
> after joiner aborts. It just aborts the normally progressing operation.

You're correct that a timeout isn't exactly the problem I'm trying to solve, but with respect to the original ticket it is still a very helpful feature, because in the event that the cluster is in a compromised state and an SST is not responding, DBAs can tune, or at least know what to expect from, the default SST timeout. Right now I have no idea how long it would be before the joiner node gives up.

> Really? doesn't it say that it temporarily cannot access a "resource"
> (prescribed donor) and will keep on retrying? It is kinda generic, but
> at the level it is logged this is all information that it has (error
> code).

I do still contend that the message is not very clear: "140228 18:54:41 [Note] WSREP: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)"

"Resource temporarily unavailable" doesn't really tell me anything specific and I can't find a list of PXC error codes anywhere on the percona website. Why not "[Note]: WSREP: Requesting state transfer failed: (Error Code: -11 Donor already in DESYNCED state, must be SYNCED first)."? That would be a lot clearer to me

> Are you sure you would prefer a log message on donor? It is the joiner
> who is having problems, not donor, so why would you look into donor log
> before checking the joiner? What sort of message should the donor
> produce that would not contain redundant information?

Yes. In the situations I've experienced, the donor is having a problem (it is already in a desynced state and thus unable to provide a state transfer), not the joiner (perhaps the joiner is trying to IST after a scheduled restart, which would be a perfectly nominal scenario). So in these situations it is useful to have debug information on the donor saying "[Note] WSREP: Joiner node XYZ requesting state transfer" and subsequently "[Note] WSREP: Unable to provide state transfer to node XYZ, already in D...



So, this is how it looks:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    do {
        end = strchr(begin, ',');

        int len;

        if (NULL == end) {
            len = str_len - (begin - str);
        }
        else {
            len = end - begin;
        }

        assert (len >= 0);

        int const idx = len > 0 ? /* consider empty name as "any" */
            group_find_node_by_name (group, joiner_idx, begin, len, status) :
            /* err == -EAGAIN here means that at least one of the nodes in the
             * list will be available later, so don't try others. */
            (err == -EAGAIN ?
             err : group_find_node_by_state(group, joiner_idx, status));

        if (idx >= 0) return idx;

        /* once we hit -EAGAIN, don't try to change error code: this means
         * that at least one of the nodes in the list will become available. */
        if (-EAGAIN != err) err = idx;

        begin = end + 1; /* skip comma */

    } while (end != NULL);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Based on my tests, when wsrep_sst_donor='A1,A2,' and A1 is
unavailable (non-SYNCED), A3 does SST from A2 without any issues.

*However*, if wsrep_sst_donor='A1,' and A1 is unavailable, then
it keeps looping without bound (there is no strict limit on the
number of retries):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    do
    {
        tries++;

        gcs_seqno_t seqno_l;

        ret = gcs_.request_state_transfer(req->req(), req->len(), sst_donor_,
                                          &seqno_l);

        if (ret < 0)
        {
            if (!retry_str(ret))
            {
                log_error << "Requesting state transfer failed: "
                          << ret << "(" << strerror(-ret) << ")";
            }
            else if (1 == tries)
            {
                log_info << "Requesting state transfer failed: "
                         << ret << "(" << strerror(-ret) << "). "
                         << "Will keep retrying every " << sst_retry_sec_
                         << " second(s)";
            }
        }

        if (seqno_l != GCS_SEQNO_ILL)
        {
            /* Check that we're not running out of space in monitor. */
            if (local_monitor_.would_block(seqno_l))
            {
                long const seconds = sst_retry_sec_ * tries;
                log_error << "We ran out of resources, seemingly because "
                          << "we've been unsuccessfully requesting state "
                          << "transfer for over " << seconds << " seconds. "
                          << "Please check that there is "
                          << "at least one fully synced member in the group. "
                          << "Application must be restarted.";
                ret = -EDEADLK;
            }
            else
            {
                // we are already holding local monitor
                LocalOrder lo(seqno_l);
                local_monitor_.self_cancel(lo);
            }
        }
    }

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

From log:

140309 0:01:32 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Fre...

Read more...

With the following change, it works as described in the last comment:

=== modified file 'gcs/src/gcs_group.c'
--- gcs/src/gcs_group.c 2014-02-28 02:15:22 +0000
+++ gcs/src/gcs_group.c 2014-03-08 20:20:23 +0000
@@ -894,6 +894,10 @@
     const char* begin = str;
     const char* end;
     int err = -EHOSTDOWN; /* worst error */
+    bool dcomma = false;  /* dangling comma */
+
+    if (!strcmp(str + strlen(str) - 1, ","))
+        dcomma = true;
 
     do {
         end = strchr(begin, ',');
@@ -912,8 +916,10 @@
         int const idx = len > 0 ? /* consider empty name as "any" */
             group_find_node_by_name (group, joiner_idx, begin, len, status) :
             /* err == -EAGAIN here means that at least one of the nodes in the
-             * list will be available later, so don't try others. */
-            (err == -EAGAIN ?
+             * list will be available later, so don't try others.
+             * Also, the dangling comma is checked to see whether fallback
+             * is attempted or not */
+            ((err == -EAGAIN && !dcomma) ?
              err : group_find_node_by_state(group, joiner_idx, status));
 
         if (idx >= 0) return idx;

i.e. with a dangling comma it falls back to provider discovery of
nodes based on their states; without it, it sticks strictly to the
listed nodes.

One more thing:

because of this condition:

            else if (node->status >= GCS_NODE_STATE_JOINER) {
                /* will eventually become SYNCED */
                return -EAGAIN;
            }

a node which is itself a joiner (other than this joiner) may end
up being chosen as the donor, which can lead to really long waits
for that donor to become available.

This should be fixed in upstream galera3 (not 3.5 but HEAD) too (since the dangling comma is accounted for there now).
