Donor in desynced state makes the joiner wait indefinitely

Bug #1285380 reported by Ovais Tariq on 2014-02-26
This bug affects 3 people
Affects: Galera (Undecided, Unassigned)
Affects: Percona XtraDB Cluster, status tracked in 5.6
  5.5: Undecided, Unassigned
  5.6: Undecided, Unassigned

Bug Description

If a particular node is set up to be the donor via wsrep_sst_donor and the donor node is in a desynced state for some reason, then the joiner simply waits for the donor to become synced again. Maybe the joiner should error out, or at least log a message to the error log.
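
For context, the donor is pinned with a setting like the following in my.cnf (the hostname is purely illustrative):

```ini
[mysqld]
# Pin SST/IST donor selection to node1 only.
# A trailing comma ("node1,") would instead permit failover to any
# other synced node when node1 is unavailable.
wsrep_sst_donor=node1
```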

a) Where is the wait: in IST, SST, or somewhere else in galera?

b) Here, too, can you provide the logs of donor and joiner?

Alex Yurchenko (ayurchen) wrote :

Raghu, this is by design. A non-synced node can't be used for state transfer.

Ovais, is there any serious use case for that? We can make the joiner abort in this case, but then there will be another complaint that the joiner aborts instead of waiting for the donor to become available. We can provide only so much flexibility with a simple interface like this (or we need to add more options).

Changed in percona-xtradb-cluster:
status: New → Confirmed
Ovais Tariq (ovais-tariq) wrote :

Hi Alex,

Perhaps we could have an option to control when the state transfer times out. Since wsrep_sst_donor now accepts multiple hostnames, timing out the state transfer request on one node would allow PXC to proceed with the SST using the second node. Or does it already do it that way?

I have a customer who experienced the long waits when the node specified in the wsrep_sst_donor option is in a desynced state. The node that is specified as the donor is also used to take backups, and it is put in a desynced state while the backup runs. Now, if an SST is needed at the same time the backup is running, the joiner simply waits.

On 2014-02-28 15:11, Ovais Tariq wrote:
> Hi Alex,
>
> Perhaps we could have an option so that we can control when the state
> transfer times out. Since wsrep_sst_donor now accepts multiple
> hostnames, so timing out the state transfer request on one node, would
> allow PXC to proceed with the SST using the second node. Or does it
> already do it that way?

Yes.

> I have a customer who experienced the long waits when the node
> specified
> for the wsrep_sst_donor option is in desynced state. The node that is
> specified as the donor is also used to do backups and the node is put
> in
> a desynced state when running the backup. Now if a SST is needed at the
> same time that the backup is running, the joiner simply waits.

What was the exact wsrep_sst_donor value? Did it imply failover to some
other node? Otherwise, if that node was the only one in the list and
there was no trailing comma, then it is treated as a precise instruction
to use ONLY THAT node for SST, and since that node is busy, what other
option is there but to wait?


--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Ovais Tariq (ovais-tariq) wrote :

Hi Alex,

On Fri, Feb 28, 2014 at 6:48 PM, Alex Yurchenko <email address hidden> wrote:

> On 2014-02-28 15:11, Ovais Tariq wrote:
> > Hi Alex,
> >
> > Perhaps we could have an option so that we can control when the state
> > transfer times out. Since wsrep_sst_donor now accepts multiple
> > hostnames, so timing out the state transfer request on one node, would
> > allow PXC to proceed with the SST using the second node. Or does it
> > already do it that way?
>
> Yes.
>
> > I have a customer who experienced the long waits when the node
> > specified
> > for the wsrep_sst_donor option is in desynced state. The node that is
> > specified as the donor is also used to do backups and the node is put
> > in
> > a desynced state when running the backup. Now if a SST is needed at the
> > same time that the backup is running, the joiner simply waits.
>
> What was the exact wsrep_sst_donor value? Did it imply failover to some
> other node? Otherwise, if that node was the only one in the list and
> there was no trailing comma, then it is treated as precise instruction
> to use ONLY THAT node for SST and, since that node is busy, what other
> option is there but wait?
>

wsrep_sst_donor only had a single hostname specified. So yes, that should be
the ONLY node used for SST, but perhaps there should be a timeout of some
sort so that the DBA is notified about the SST failure and can bring the
donor node out of the desynced state.


--

Ovais Tariq, Consultant, Percona

http://www.percona.com | http://www.mysql...


Alex Yurchenko (ayurchen) wrote :

On 2014-02-28 15:58, Ovais Tariq wrote:
> wsrep_sst_donor only had a single hostname specified. So yes that
> should be
> the ONLY node used for SST, but perhaps there should be a timeout of
> some
> sort so that the DBA is notified about SST failure and he can possibly
> bring out the other node from desynced state.

Timeout is possible, however:

1) It is not exactly a failure; in fact, it is not a failure at all. The
node was instructed to use only one donor for SST and hence is patiently
waiting for it to become available. What would be better, aborting?

2) When a joiner is waiting for the donor to be available, you can see
the following in the error log:
140228 18:54:41 [Note] WSREP: Requesting state transfer failed:
-11(Resource temporarily unavailable). Will keep retrying every 1
second(s)
so it is not exactly silent.

3) A joiner is by definition in an undefined state. If it is aborted at
that moment, next time it will have to request an SST again (unless you
know what you're doing).

4) In that particular case, are they willing to decrease the availability
of their cluster even further by allowing another node to become an SST donor?

5) Moreover, since there is already a backup going on, it would be
considerably faster to use that backup (when it's finished) to bootstrap
the joiner rather than do it all over again via SST from another node.

Ryan Gordon (ryan-5) wrote :

Hi Alex,

The main issue we're trying to tackle is the graceful interaction between a regularly scheduled xtrabackup backup process on a particular node and configuring the cluster to use that same node for xtrabackup ISTs/SSTs when another node requests one. Because this node is isolated from any live traffic (it only does replication), it reduces the risk of complications during the FTWRL in the backup/SST process. Again, the problem we're having is the graceful interaction between the two: how do we give an SST priority, when it is requested, over a regularly scheduled backup? Currently that is not possible, because once a backup puts the node into a desynced state, there is no way for us to tell when a node requests an SST, cancel the backup, and put the node back into a synced state so the SST can continue. Right now the only thing the joiner node does is "wait" for the node (and eventually time out) while it is in a desynced state. There's no functionality to gracefully stop a backup and put the donor node back into a synced state when the donor gets an SST request. Having that functionality would be the ideal situation.

1) It would be great if we could control the timeout (maybe this is already possible)?

2) It is great that the joiner node has an error message (although it is kind of cryptic to understand what that message really means; it looks like it's describing a memory issue at first glance), but the donor node does not produce any logging messages, which can be really frustrating when trying to figure out what the donor node is doing/why the SST isn't working. I would much prefer it if some logging were added to the donor node for this situation.

3) Can you explain this a little more? Even if it requests an SST instead of an IST, the donor node will still be in a desynced state due to the backup, and the request will still eventually time out.

4) That is the million dollar question. The cluster only contains 3 nodes, so by making the backup node the donor node for the cluster we reduce the risk of bringing down our entire cluster. If we have one node trying to join and SST, the backup node already in a desynced state, and the joiner then failing over to the last available node actively handling traffic, we've effectively compromised our entire cluster, which defeats the purpose of having a cluster for HA.

5) I'm not sure how that would work; it seems like it would take a pretty complicated mess of custom modifications and bash scripts to get something like that going. We're actually in the process of replacing our current xtrabackup backup solution with a ZFS snapshot solution so the node only has to enter the desynced state for a few seconds per hour, which greatly reduces the chance of an issue occurring during an SST. We will likely replace the xtrabackup SST solution with this ZFS snapshot solution too, due to the current complexity/instability of an xtrabackup SST and its interaction with a regularly scheduled xtrabackup backup process.

Alex Yurchenko (ayurchen) wrote :

On 2014-02-28 23:31, Ryan Gordon wrote:
> 1) It would be great if we could control the timeout (maybe this is
> already possible)?

It is not implemented at the moment, but should be possible to
implement.

However, from what I understand, in this case the only purpose of the
timeout is to be a sort of notification for the operator that the donor
node is currently unavailable because it is doing the backup. Which the
operator may already be aware of, since the backup is a scheduled
operation. Given that during the joining phase one can't connect to
mysqld (by mysqld design), the only source of feedback is the error log.
So it is only natural to expect the operator to check the log for the
progress of the operation, and that is another place to learn that the
donor is unavailable.

At the same time, a timeout will result in aborting the joiner operation
(what else?), which is exactly the opposite of what you want: bringing
up the joiner as fast as possible. Ideally the joiner should keep
working and waiting for the donor to become available, and the operator
should simply go to the donor and make sure it is not busy with
something else.

So I'm not exactly sure a timeout is what you really want. It doesn't
even save you from looking at the error log - you still have to do it
after the joiner aborts. It just aborts a normally progressing operation.

> 2) It is great that the joiner node has an error message (although it
> is
> kind of cryptic to understand what that message really means, it looks
> like its trying to describe a memory issue at first glance)

Really? Doesn't it say that it temporarily cannot access a "resource"
(the prescribed donor) and will keep on retrying? It is kind of generic,
but at the level where it is logged this (the error code) is all the
information that is available.

> but the
> donor node does not produce any logging messages which can be really
> frustrating in trying to figure out what the donor node is doing/why
> the
> SST isn't working. I would much prefer it if some logging was added to
> the donor node for this situation.

Are you sure you would prefer a log message on the donor? It is the
joiner that is having problems, not the donor, so why would you look
into the donor log before checking the joiner? What sort of message
could the donor produce that would not contain redundant information?

Besides, it does not work the way you think it does. The cluster selects
a donor based on the information provided by the joiner. If the donor is
unavailable, the joiner receives an error, but the donor does not
receive anything. It is unaware that the joiner is having problems. One
reason for this is that there can be several potential donors.

But suppose we go as far as seeing that there is only one possible
donor and then pass the state transfer request to it. What can the donor
do? It knows that it is in a desynced state and that something serious
is going on. But it does not know:

1) what is going on;
2) whether it is interruptible;
3) how to interrupt it.

In this particular case, it is an xtrabackup started externally. For
mysqld to know something about it and be able to interrupt it, we'd
have to integrate the backup process with mysqld - which (besides being
a quest...


Ryan Gordon (ryan-5) wrote :

Hi Alex,

> It is not implemented at the moment, but should be possible to
> implement.
>
> However, from what I understand, in this case the only purpose of the
> timeout is to be a sort of notification for the operator that the donor
> node is currently unavailable doing the backup. Which operator may be
> already aware of, since the backup is a scheduled operation. Given that
> during the joining phase one can't connect to mysqld (by mysqld design),
> the only source of feedback is the error log. So it is only natural to
> expect the operator to check the log for the progress of the operation.
> So this is another source to get the information that the donor is
> unavailable.
>
> At the same time timeout will result in aborting the joiner operation
> (what else?), which is exactly the opposite to what you want: bringing
> up the joiner as fast as possible. Ideally the joiner should keep on
> working and waiting for donor to become available and operator should
> simply go to donor and make sure it is not busy with something else.
>
> So I'm not exactly sure if timeout is what you really want. It doesn't
> even save you from looking at the error log - you still have to do it
> after joiner aborts. It just aborts the normally progressing operation.

You're correct that a timeout isn't exactly the problem I'm trying to solve, but with respect to the original ticket it is still a very helpful feature, because in the event that the cluster is in a compromised state and an SST is not responding, DBAs can tune, or at least know what to expect from, the default SST timeout. Right now I have no idea how long it would be before the joiner node gives up.

> Really? doesn't it say that it temporarily cannot access a "resource"
> (prescribed donor) and will keep on retrying? It is kinda generic, but
> at the level it is logged this is all information that it has (error
> code).

I do still contend that the message is not very clear: "140228 18:54:41 [Note] WSREP: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)"

"Resource temporarily unavailable" doesn't really tell me anything specific and I can't find a list of PXC error codes anywhere on the percona website. Why not "[Note]: WSREP: Requesting state transfer failed: (Error Code: -11 Donor already in DESYNCED state, must be SYNCED first)."? That would be a lot clearer to me

> Are you sure you would prefer a log message on donor? It is the joiner
> who is having problems, not donor, so why would you look into donor log
> before checking the joiner? What sort of message should the donor
> produce that would not contain redundant information?

Yes. In the situations I've experienced, the donor is having a problem (it is already in a desynced state and thus unable to provide a state transfer), not the joiner (perhaps the joiner is trying to IST after a scheduled restart, which would be a perfectly nominal scenario). So in these situations it is useful to have debug information on the donor saying "[Note] WSREP: Joiner node XYZ requesting state transfer" and subsequently "[Note] WSREP: Unable to provide state transfer to node XYZ, already in D...



So, this is how it looks:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    do {
        end = strchr(begin, ',');

        int len;

        if (NULL == end) {
            len = str_len - (begin - str);
        }
        else {
            len = end - begin;
        }

        assert (len >= 0);

        int const idx = len > 0 ? /* consider empty name as "any" */
            group_find_node_by_name (group, joiner_idx, begin, len, status) :
            /* err == -EAGAIN here means that at least one of the nodes in the
             * list will be available later, so don't try others. */
            (err == -EAGAIN ?
             err : group_find_node_by_state(group, joiner_idx, status));

        if (idx >= 0) return idx;

        /* once we hit -EAGAIN, don't try to change error code: this means
         * that at least one of the nodes in the list will become available. */
        if (-EAGAIN != err) err = idx;

        begin = end + 1; /* skip comma */

    } while (end != NULL);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Based on my tests, when wsrep_sst_donor='A1,A2,' and A1 is
unavailable (non-SYNCED), A3 does SST from A2 without any issues.

*However*, if wsrep_sst_donor='A1,' and A1 is unavailable, then
it keeps looping without bound (there is no strict limit on the
number of retries):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    do
    {
        tries++;

        gcs_seqno_t seqno_l;

        ret = gcs_.request_state_transfer(req->req(), req->len(), sst_donor_,
                                          &seqno_l);

        if (ret < 0)
        {
            if (!retry_str(ret))
            {
                log_error << "Requesting state transfer failed: "
                          << ret << "(" << strerror(-ret) << ")";
            }
            else if (1 == tries)
            {
                log_info << "Requesting state transfer failed: "
                         << ret << "(" << strerror(-ret) << "). "
                         << "Will keep retrying every " << sst_retry_sec_
                         << " second(s)";
            }
        }

        if (seqno_l != GCS_SEQNO_ILL)
        {
            /* Check that we're not running out of space in monitor. */
            if (local_monitor_.would_block(seqno_l))
            {
                long const seconds = sst_retry_sec_ * tries;
                log_error << "We ran out of resources, seemingly because "
                          << "we've been unsuccessfully requesting state "
                          << "transfer for over " << seconds << " seconds. "
                          << "Please check that there is "
                          << "at least one fully synced member in the group. "
                          << "Application must be restarted.";
                ret = -EDEADLK;
            }
            else
            {
                // we are already holding local monitor
                LocalOrder lo(seqno_l);
                local_monitor_.self_cancel(lo);
            }
        }
    }

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

From log:

140309 0:01:32 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Fre...

Read more...

With the following change, it works as described in the last comment:

=== modified file 'gcs/src/gcs_group.c'
--- gcs/src/gcs_group.c 2014-02-28 02:15:22 +0000
+++ gcs/src/gcs_group.c 2014-03-08 20:20:23 +0000
@@ -894,6 +894,10 @@
     const char* begin = str;
     const char* end;
     int err = -EHOSTDOWN; /* worst error */
+    bool dcomma = false;  /* dangling comma */
+
+    if (!strcmp(str + strlen(str) - 1, ","))
+        dcomma = true;
 
     do {
         end = strchr(begin, ',');
@@ -912,8 +916,10 @@
         int const idx = len > 0 ? /* consider empty name as "any" */
             group_find_node_by_name (group, joiner_idx, begin, len, status) :
             /* err == -EAGAIN here means that at least one of the nodes in the
-             * list will be available later, so don't try others. */
-            (err == -EAGAIN ?
+             * list will be available later, so don't try others.
+             * Also, the dangling comma is checked to see whether fallback
+             * is attempted or not */
+            ((err == -EAGAIN && !dcomma) ?
              err : group_find_node_by_state(group, joiner_idx, status));
 
         if (idx >= 0) return idx;

i.e. with a dangling comma it falls back to provider discovery of
nodes based on their states; without it, it sticks strictly to the
listed nodes.

One more thing:

because of this condition:

            else if (node->status >= GCS_NODE_STATE_JOINER) {
                /* will eventually become SYNCED */
                return -EAGAIN;
            }

a node which is itself a joiner (other than this joiner) may end
up being chosen as the donor, which can lead to really long waits
for that donor to become available.

This should be fixed in upstream galera3 (not 3.5 but HEAD) too (since the dangling comma is accounted for there now).
