Galera doesn't detect that a node has diverged, same grastate.dat on nodes having different data

Bug #843752 reported by Henrik Ingo
This bug affects 3 people
Affects | Status    | Importance | Assigned to | Milestone
Galera  | Confirmed | Wishlist   | Unassigned  |

Bug Description

Hi

Here's something to scratch your heads on:

Summary:

It is possible to create a situation where disconnected Galera nodes contain different datasets but the same grastate.dat. When such nodes join together as a cluster, they believe they are identical even though they are not, and the nodes keep different contents.

Steps to repeat:

 - shutdown all nodes
 - one by one: start node, insert one unique row, shutdown node
   (the bug probably happens on any combination of equal amount of DML on each node)
 - start cluster

Note: On normal network partitioning Galera correctly chooses a quorum and prevents split brain from happening. In this test we specifically start nodes one by one in a state that allows them to diverge. The test is to see what happens when they re-join.

On all nodes we have the same state:

> select * from test.t;
+---+------------+----------+-----------+
| k | h          | t        | v         |
+---+------------+----------+-----------+
| 1 | cluster130 | 12:38:52 | first row |
+---+------------+----------+-----------+
1 row in set (0.00 sec)

./mysql-galera stop

One node at a time:

./mysql-galera -g gcomm:// start

>insert into t values (2, @@hostname, curtime(), 'second row - inserted in single node mode');
Query OK, 1 row affected (0.00 sec)

>select * from t;
+---+------------+----------+-------------------------------------------+
| k | h          | t        | v                                         |
+---+------------+----------+-------------------------------------------+
| 1 | cluster130 | 12:38:52 | first row                                 |
| 2 | cluster130 | 12:48:42 | second row - inserted in single node mode |
+---+------------+----------+-------------------------------------------+
2 rows in set (0.00 sec)

./mysql-galera -g gcomm:// stop

(The other nodes have different h and t values.)

On all nodes:

cat mysql/var/grastate.dat
# GALERA saved state, version: 0.8, date: (todo)
uuid: 80f3cf22-d3ca-11e0-0800-9edae8cb665c
seqno: 3479
cert_index:

(All nodes have the same uuid and seqno.)

Now start all nodes as a cluster again (gcomm://nodename)

>insert into t values (3, @@hostname, curtime(), 'third row - cluster rejoined');
Query OK, 1 row affected (0.01 sec)

>select * from t;
+---+------------+----------+-------------------------------------------+
| k | h          | t        | v                                         |
+---+------------+----------+-------------------------------------------+
| 1 | cluster130 | 12:38:52 | first row                                 |
| 2 | cluster130 | 12:48:42 | second row - inserted in single node mode |
| 3 | cluster130 | 13:02:48 | third row - cluster rejoined              |
+---+------------+----------+-------------------------------------------+
3 rows in set (0.00 sec)

>select * from test.t;
+---+------------+----------+-------------------------------------------+
| k | h          | t        | v                                         |
+---+------------+----------+-------------------------------------------+
| 1 | cluster130 | 12:38:52 | first row                                 |
| 2 | cluster129 | 12:51:14 | second row - inserted in single node mode |
| 3 | cluster130 | 13:02:48 | third row - cluster rejoined              |
+---+------------+----------+-------------------------------------------+
3 rows in set (0.00 sec)

> select * from test.t;
+---+------------+----------+-------------------------------------------+
| k | h          | t        | v                                         |
+---+------------+----------+-------------------------------------------+
| 1 | cluster130 | 12:38:52 | first row                                 |
| 2 | cluster128 | 12:53:59 | second row - inserted in single node mode |
| 3 | cluster130 | 13:02:48 | third row - cluster rejoined              |
+---+------------+----------+-------------------------------------------+
3 rows in set (0.00 sec)

(Cluster is rejoined, new data is replicated, but split brain has occurred and nodes have diverged.)

Expected result:

Nodes should discover that they have different states and ask for a copy from the first node (cluster130).

Actual result:

Nodes do not discover that their data has diverged, and form a cluster where each node has a different dataset.
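
To make the failure concrete, here is a minimal Python sketch of why a state ID built only from uuid and seqno cannot tell these nodes apart. The `NodeState` type and the `needs_state_transfer` check are hypothetical illustrations, not Galera code; the uuid and seqno are the values from the grastate.dat shown above, and the rows are reduced to (k, h) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """Hypothetical model: what grastate.dat records, plus the actual rows."""
    uuid: str
    seqno: int
    rows: tuple  # table contents; NOT part of the saved state ID


def needs_state_transfer(joiner: NodeState, donor: NodeState) -> bool:
    # Assumed join check: only the (uuid, seqno) pair is compared.
    return (joiner.uuid, joiner.seqno) != (donor.uuid, donor.seqno)


UUID = "80f3cf22-d3ca-11e0-0800-9edae8cb665c"
node130 = NodeState(UUID, 3479, ((1, "cluster130"), (2, "cluster130")))
node129 = NodeState(UUID, 3479, ((1, "cluster130"), (2, "cluster129")))

print(needs_state_transfer(node129, node130))  # False: IDs match
print(node129.rows == node130.rows)            # False: data differs
```

Each node applied one local write, so both advanced to the same seqno under the same uuid, and the comparison has nothing left to distinguish them.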

Suggested fix:

I have not thought about this too much, but...

It seems to me you need to change the uuid in grastate.dat when:
 - one or more nodes leave the cluster
 - seqno is increased

Another way of saying the same thing is that you need to change uuid when cluster configuration changes, except that nodes joining a cluster are ok. New UUID needs only be saved to grastate.dat when seqno would be increased.

Note that this is not a critical bug. This state can only be entered by deliberate user action / user error. But it would be nice if Galera can detect the diverged state and discard the diverged replica.

Henrik Ingo (hingo) wrote :

More thought on suggested fix:

It is probably desirable to minimize the number of times a UUID change is needed. For instance, if you will later implement some incremental SST method, it must be possible for a node to disconnect and rejoin and discover the same uuid - this means it can ask for incremental SST and not full dump.

A possible tweak to my proposed solution would be for the selected quorum to keep the same UUID, but the non-quorum partition would need to change UUID (if seqno increases).

Also, it may be a simple approach to not implement a solution for this, rather provide command line options for the user to affect whether SST should happen:

--sst-force : Discard the grastate.dat on this node and do SST from the primary component.
--sst-disable : Don't do SST. If grastate.dat does not match with the primary component being joined, report error and shut down, but don't delete my data.
--sst-ondemand or --sst-initial : Do SST if needed when joining cluster.

(Currently the last one is default behavior.)
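
The three proposed behaviours can be sketched as a small decision function. This is only an illustration in Python: the switches do not exist, and `decide_join_action`, the mode names and the return values are hypothetical. A `local` value of `None` models a node without a grastate.dat.

```python
from typing import Optional, Tuple

StateId = Tuple[str, int]  # (uuid, seqno) as saved in grastate.dat

MODES = ("force", "disable", "ondemand")  # hypothetical switch values


def decide_join_action(mode: str, local: Optional[StateId], primary: StateId) -> str:
    """Return "join", "sst" or "abort" for a node joining a primary component."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "force":
        return "sst"    # discard local state, always copy from the cluster
    if local == primary:
        return "join"   # state IDs match, no transfer needed
    if mode == "disable":
        return "abort"  # report the mismatch, leave local data untouched
    return "sst"        # ondemand: transfer because the IDs differ


primary = ("80f3cf22-d3ca-11e0-0800-9edae8cb665c", 3480)
stale = ("80f3cf22-d3ca-11e0-0800-9edae8cb665c", 3479)
for mode in MODES:
    print(mode, decide_join_action(mode, stale, primary))
```

All three modes still trust the state ID, so none of them would catch the diverged nodes from this report on its own.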

Alex Yurchenko (ayurchen) wrote :

Henrik,

I think you're gradually arriving at the understanding we have now. And
this is good, because what you have reported as a bug is a useful
feature which you have masterfully abused ;)

On Wed, 07 Sep 2011 11:01:30 -0000, Henrik Ingo wrote:
> More thought on suggested fix:
>
> It is probably desirable to minimize the number of times a UUID change
> is needed. For instance, if you will later implement some incremental
> SST method, it must be possible for a node to disconnect and rejoin and
> discover the same uuid - this means it can ask for incremental SST and
> not full dump.

And not only that. For example, you want to avoid state transfer
altogether if you shut down and then restart an idle cluster (if you're
doing a bulk upgrade).

Conceptually, UUID:seqno pair is a database state identifier and
_ideally_ should be persistently stored in the database, grastate file
is just a workaround. And if you shut down the server at a given
uuid:seqno, then when you start it, like it or not, it has the same
state, and, therefore, should have the same ID. (We're not considering
modifying database files outside of the server process, cause that's
cheating and would lead us nowhere)

In the case that you provided you had three servers with the same
initial database state and, naturally, when you started them they had
the same initial uuid:seqno - that's where they left off. Starting one
node, changing its database (making it most updated) and then shutting
it down and starting another without prior synchronization with the most
updated node is a grave misuse of a database cluster, and it is not
specific to Galera cluster.

While there might be a way to protect against such a misuse (like
incremental digest), we don't think we want to focus on it at the
moment. In any case database state ID is a state ID, and changing it
arbitrarily would defeat its purpose.

> A possible tweak to my proposed solution would be for the selected
> quorum to keep the same UUID, but the non-quorum partition would need
> to change UUID (if seqno increases).

Note that by supplying a trivial cluster address you implicitly tell
the node that it has the quorum (no other cluster members). Also, seqno
cannot increase in a non-quorum partition.

> Also, it may be a simple approach to not implement a solution for
> this, rather provide command line options for the user to affect
> whether SST should happen:
>
> --sst-force : Discard the grastate.dat on this node and do SST from
> the primary component.
> --sst-disable : Don't do SST. If grastate.dat does not match with the
> primary component being joined, report error and shut down, but don't
> delete my data.
> --sst-ondemand or --sst-initial : Do SST if needed when joining
> cluster.
>
> (Currently the last one is default behavior.)

So I guess the last switch is not needed then.

You can obviously achieve the --sst-force functionality by simply
removing grastate.dat. --sst-disable does not seem to have a compelling
use case, since you normally can't expect states to match on a working
cluster. But we might consider implementing such switches if there is
some real need for this.

Regard...


Henrik Ingo (hingo) wrote :

On Wed, Sep 7, 2011 at 4:51 PM, Alex Yurchenko
<email address hidden> wrote:
> I think you're gradually arriving at the understanding we have now. And
> this is good, because what you have reported as a bug is a useful
> feature which you have masterfully abused ;)

No, I understand the feature but this is still a bug.

> And not only that. For example, you want to avoid state transfer
> altogether if you shut down and then restart an idle cluster (if you're
> doing a bulk upgrade).

My proposal does that. In all cases the UUID must only be changed if a
node is not idle.

> Conceptually, UUID:seqno pair is a database state identifier and
> _ideally_ should be persistently stored in the database, grastate file
> is just a workaround.

Actually, as far as I'm concerned, if you store it in mysql/var and I
can access the value with SQL, it is part of the database. Since you
zero it out when the server is running, and write to it only on clean
shutdown, there are also no crash safety issues afaics.

> And if you shut down the server at a given
> uuid:seqno, then when you start it, like it or not, it has the same
> state, and, therefore, should have the same ID.

This is correct.

> In the case that you provided you had three servers with the same
> initial database state and, naturally, when you started them they had
> the same initial uuid:seqno - that's where they left off. Starting one
> node, changing its database (making it most updated) and then shutting
> it down and starting another without prior synchronization with the most
> updated node is a grave misuse of a database cluster, and it is not
> specific to Galera cluster.

I can think of realistic ways this could happen. For
instance, from a cleanly shut down cluster, you accidentally copy and paste
"-g gcomm://" to all nodes when you really wanted to start a cluster.
Whether you then issue the exact same number of transactions on
each node is up to chance, but it can happen.

I think I managed to create this kind of state early on in my testing,
which is how I found grastate.dat and then came up with this as a test
case.

> While there might be a way to protect against such a misuse (like
> incremental digest), we don't think we want to focus on it at the
> moment. In any case database state ID is a state ID, and changing it
> arbitrarily would defeat its purpose.

I agree that this is not a high priority bug as it rarely happens and
is due to user action. It is just that UUID:seqno does not correctly
identify two diverged states in all cases, even if this is the intent.

Since you already calculate a checksum for each transaction that is
replicated, you could actually just XOR those into a checksum saved in
grastate.dat. It could replace or be used alongside seqno. This is
surely a much more correct approach than trying to detect changes in
cluster configuration and changing UUID as I proposed first. In fact,
it would even work correctly in the opposite situation: Suppose I have
two nodes in the same state, but not connected. I run the same
mysqldump against both. Then I connect them into each other to form a
cluster - since both seqno and the checksum match, Galera would
*correctl...


Alex Yurchenko (ayurchen) wrote :

On Thu, 08 Sep 2011 09:32:39 -0000, Henrik Ingo wrote:
> I agree that this is not a high priority bug as it rarely happens and
> is due to user action. It is just that UUID:seqno does not correctly
> identify two diverged states in all cases, even if this is the intent.

Well, it definitely was not intended to be fool-proof, but rather just
a bare minimum to identify a state.

> Since you already calculate a checksum for each transaction that is
> replicated, you could actually just XOR those into a checksum saved in
> grastate.dat. It could replace or be used alongside seqno. This is

That's what I meant by "incremental digest".

> surely a much more correct approach than trying to detect changes in
> cluster configuration and changing UUID as I proposed first. In fact,
> it would even work correctly in the opposite situation: Suppose I have
> two nodes in the same state, but not connected. I run the same
> mysqldump against both. Then I connect them into each other to form a
> cluster - since both seqno and the checksum match, Galera would
> *correctly* note that the databases are in the same state.

It is indeed a more general and reliable solution. And I guess
eventually we'll go with something like it.

>
> I now realize my style is wrong above. MySQL style of course would be
> to have a variable, such as:
> sst-on-startup=force
> sst-on-startup=disable
> sst-on-startup=ondemand
>
> (ondemand is a bit misleading, perhaps more correct is "ifneeded".
> ondemand is also used for query cache setting - but also there the
> meaning is different, the demand is provided by user, not a state.)
>
>> You can obviously achieve the --sst-force functionality by simply
>> removing grastate.dat. --sst-disable does not seem to have a compelling
>> use case, since you normally can't expect states to match on a working
>> cluster. But we might consider implementing such switches if there is
>> some real need for this.
>
> I'm able to do that. Just thought if it happens commonly, then having
> a command line switch is more user friendly.

Agreed

Regards,
Alex

Henrik Ingo (hingo) wrote :

On Thu, Sep 8, 2011 at 1:57 PM, Alex Yurchenko
<email address hidden> wrote:
> That's what I meant by "incremental digest".

Ah, right. I was so proud of having thought of it "myself" :-)

> It is indeed a more general and reliable solution. And I guess
> eventually we'll go with something like it.

Actually it's kind of cool that it is possible to digest the database
state, including the whole history of transactions that went into it,
into a single checksum.
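
A minimal Python sketch of the "incremental digest" idea discussed above. It is an illustration only: the `StateDigest` class is hypothetical, and a truncated SHA-256 stands in for whatever per-transaction checksum Galera computes, which the thread does not specify.

```python
import hashlib


def writeset_checksum(writeset: bytes) -> int:
    # Stand-in checksum (assumption): first 8 bytes of SHA-256 as an integer.
    return int.from_bytes(hashlib.sha256(writeset).digest()[:8], "big")


class StateDigest:
    """Hypothetical state ID: uuid, seqno and an XOR of all applied checksums."""

    def __init__(self, uuid: str, seqno: int = 0, digest: int = 0) -> None:
        self.uuid = uuid
        self.seqno = seqno
        self.digest = digest

    def apply(self, writeset: bytes) -> None:
        # Every applied transaction bumps seqno and is folded into the digest.
        self.seqno += 1
        self.digest ^= writeset_checksum(writeset)

    def state_id(self) -> tuple:
        return (self.uuid, self.seqno, self.digest)


UUID = "80f3cf22-d3ca-11e0-0800-9edae8cb665c"
node_a = StateDigest(UUID, seqno=3478)
node_b = StateDigest(UUID, seqno=3478)

# Each node applies a different transaction while running alone.
node_a.apply(b"insert row 2 on cluster130")
node_b.apply(b"insert row 2 on cluster129")

print(node_a.seqno == node_b.seqno)            # True
print(node_a.state_id() == node_b.state_id())  # False
```

One property worth keeping in mind: plain XOR is order-insensitive, and folding in the same checksum twice cancels it out. Whether that matters depends on what goes into the per-transaction checksum; including the seqno in the checksummed bytes would be one way to make the digest depend on position in the history.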

Henrik Ingo (hingo) wrote :

As discussed on the mailing list, with the introduction of IST this bug is now more relevant than before. Inconsistency can also be introduced if JOINER seqno < JOINED seqno. In the mailing-list discussion this happened 50% of the time in a 2-node pc.ignore_sb setup.

Alex Yurchenko (ayurchen) wrote :

Henrik, please consider that in the mailing-list discussion the example involved different seqnos for JOINER and DONOR and "incremental digest" is not applicable here since we can't possibly remember digests for every seqno out there.

Also note that pc.ignore_sb setting simply should not be used when the user can't guarantee strictly master-slave operation.

On a related note, I'm thinking of dumping the idea of "incremental digest" (however cool it is) and forcing a new UUID every time a new PC is initialized. The problem here will be with a whole cluster restart, when the nodes won't be able to recognize that they come from the same view. I see two possible strategies here:

1) ignore the problem as it should be relatively rare in production (it will kill our internal testing though :) )
2) keep a history of 1-2 previous PCs and the last seqno applied in each PC. That would allow a very simple check of whether IST/no transfer is applicable.

My main problem with this approach is that the UUID stops being a cluster/history identifier. I.e. if you restart an idle cluster 3 times, it will change UUID three times. I can't see many technical issues with that (except that you can run out of remembered UUIDs), but intuitively it does not feel right.
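
Option 2 could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `Node` class, the `start_new_pc` and `transfer_kind` names, the history depth of 2, and the idea that seqno keeps counting across PCs. It is one reading of the proposal, not Galera's implementation.

```python
from collections import deque
from typing import Deque, Tuple

StateId = Tuple[str, int]  # (PC uuid, last seqno applied)


class Node:
    def __init__(self, uuid: str, seqno: int, history_depth: int = 2) -> None:
        self.uuid = uuid
        self.seqno = seqno
        # Remember only the last few PCs; older entries fall off the end.
        self.history: Deque[StateId] = deque(maxlen=history_depth)

    def start_new_pc(self, new_uuid: str) -> None:
        # Record where the previous PC ended, then adopt the new UUID.
        self.history.appendleft((self.uuid, self.seqno))
        self.uuid = new_uuid


def transfer_kind(joiner: StateId, donor: Node) -> str:
    """Return "none", "ist" or "sst" for a joiner with the given state ID."""
    j_uuid, j_seqno = joiner
    if j_uuid == donor.uuid:
        known = j_seqno <= donor.seqno   # same PC, joiner is not ahead
    else:
        known = joiner in donor.history  # joiner stopped where an old PC ended
    if not known:
        return "sst"
    return "none" if j_seqno == donor.seqno else "ist"


donor = Node("uuid-A", 100)
donor.start_new_pc("uuid-B")                  # idle cluster restarted
print(transfer_kind(("uuid-A", 100), donor))  # none
donor.seqno = 105                             # donor applied more writes
print(transfer_kind(("uuid-A", 100), donor))  # ist
print(transfer_kind(("uuid-C", 101), donor))  # sst
```

With a depth of 2, three restarts of an idle cluster are enough to push the oldest UUID out of the history, at which point a node still on that UUID falls back to a full transfer in this sketch.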

Henrik Ingo (hingo) wrote :

In the scenario where JOINER and DONOR have diverged but still have the same cluster UUID, and JOINER seqno < DONOR seqno, you are right that at this point we cannot detect that JOINER state is different from the corresponding historical DONOR state. However, if IST then proceeds, and JOINER continues to XOR a checksum of each transaction into its state information, by the time the seqno is equal on both sides, the checksum is not, and we can reliably detect split brain.
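
The argument can be checked with a small Python simulation. It is a sketch under stated assumptions: the `checksum` and `fold` helpers are hypothetical, a truncated SHA-256 stands in for the real per-transaction checksum, and IST is modelled as simply replaying the donor's writesets after the joiner's seqno.

```python
import hashlib


def checksum(writeset: bytes) -> int:
    # Stand-in checksum (assumption): first 8 bytes of SHA-256 as an integer.
    return int.from_bytes(hashlib.sha256(writeset).digest()[:8], "big")


def fold(digest: int, writesets: list) -> int:
    # XOR the checksum of every writeset into the running digest.
    for ws in writesets:
        digest ^= checksum(ws)
    return digest


shared = [b"row 1"]
donor_history = shared + [b"row 2 from donor", b"row 3", b"row 4"]
joiner_history = shared + [b"row 2 from joiner"]  # diverged at seqno 2

joiner_seqno = len(joiner_history)   # 2, behind the donor's 4
ist = donor_history[joiner_seqno:]   # IST replays seqno 3 and 4 only

joiner_digest = fold(fold(0, joiner_history), ist)
joiner_seqno += len(ist)
donor_digest = fold(0, donor_history)

print(joiner_seqno == len(donor_history))  # True: seqnos now equal
print(joiner_digest == donor_digest)       # False: divergence is visible
```

The replayed writesets contribute the same checksums on both sides, so the only remaining difference is the one diverged transaction, which is exactly what the final comparison exposes.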

Alex Yurchenko (ayurchen) wrote :

You, Sir, are a gentleman and a scholar.

Changed in galera:
status: New → Confirmed
Changed in galera:
importance: Undecided → Wishlist