Percona XtraDB Cluster - HA scalable solution for MySQL

Able to query table under-going an RSU while node is desyned from Cluster

Reported by Patrick Zoblisein on 2012-03-27
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Undecided
Unassigned
Percona XtraDB Cluster
Medium
Hrvoje Matijakovic

Bug Description

As per the Percona FAQ:
http://www.percona.com/doc/percona-xtradb-cluster/faq.html#q-how-can-i-check-the-galera-node-health
“Unknown error (node is online but Galera is not connected/synced with the cluster)”

1) Wsrep_OSU_method=RSU on node1/2/3

2) Node1/2/3: mysql –u root –p –e’show create table t1’
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1 | CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `hostname` varchar(64) NOT NULL,
  `port` int(11) NOT NULL,
  `instime` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=51875647 DEFAULT CHARSET=latin1 |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

3) Beginning row counts of t1 on all three nodes:
+----------+----------------------------+---------------------+
| count(*) | @@hostname | now() |
+----------+----------------------------+---------------------+
| 17291893 | Node1 | 2012-03-26 10:45:31 |
+----------+----------------------------+---------------------+
+----------+----------------------------+---------------------+
| count(*) | @@hostname | now() |
+----------+----------------------------+---------------------+
| 17291893 | node2 | 2012-03-26 10:45:32 |
+---------+----------+----------------------------+---------------------+
+---------+----------+----------------------------+---------------------+
 | count(*) | @@hostname | now() |
+---------+----------+----------------------------+---------------------+
| 17291893 | node3 | 2012-03-26 10:45:31 |
+---------+----------+----------------------------+---------------------+

4) Add a column to table t1 on node1:

mysql> select now();
+---------------------+
| now() |
+---------------------+
| 2012-03-26 10:46:15 |
+---------------------+
1 row in set (0.00 sec)

mysql> ALTER TABLE repl.t1 add column10 int;
Query OK, 17291893 rows affected (2 min 55.97 sec)
Records: 17291893 Duplicates: 0 Warnings: 0

mysql> select now();
+---------------------+
| now() |
+---------------------+
| 2012-03-26 10:49:11 |
+---------------------+
1 row in set (0.00 sec)

5) Login to node1 - able to select a row from the table under-going RSU (node1) - was expecting to get an "Unknown Command" message based on the FAQ linked above.

mysql> select *,now() from t1 where port=9999; 
+----------+----------+------+---------------------+---------------------+
| id | hostname | port | instime | now() |
+----------+----------+------+---------------------+---------------------+
| 51875649 | RSU-test | 9999 | 2012-03-26 10:44:19 | 2012-03-26 10:46:57 |
+----------+----------+------+---------------------+---------------------+
1 row in set (7.04 sec)

6) Log messages from all three nodes:

[root@node1]# cat node1.err
120326 10:46:15 [Note] WSREP: Node 1 (node1) desyncs itself from group
120326 10:46:15 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 28807712)
120326 10:46:15 [Note] WSREP: Provider paused at ea8d8c3e-5bba-11e1-0800-0bdfe5c74f3a:28807712
120326 10:49:11 [Note] WSREP: Provider resumed.
120326 10:49:11 [Note] WSREP: Node 1 (node1) resyncs itself to group
120326 10:49:11 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 28807712)
120326 10:49:11 [Note] WSREP: Member 1 (node1) synced with group.
120326 10:49:11 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 28807712)
120326 10:49:11 [Note] WSREP: Synchronized with group, ready for connections

[root@node2]# cat node2.err
120326 10:46:16 [Note] WSREP: Node 1 (node1) desyncs itself from group
120326 10:49:12 [Note] WSREP: Node 1 (node1) resyncs itself to group
120326 10:49:12 [Note] WSREP: Member 1 (node1) synced with group.

[root@node3]# cat node3.err
120326 10:46:14 [Note] WSREP: Node 1 (node1) desyncs itself from group
120326 10:49:10 [Note] WSREP: Node 1 (node1) resyncs itself to group
120326 10:49:10 [Note] WSREP: Member 1 (node1) synced with group.

7) show create table t1

node1 (node that underwent RSU):
CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `hostname` varchar(64) NOT NULL,
  `port` int(11) NOT NULL,
  `instime` datetime NOT NULL,
  `column10` int(11) DEFAULT NULL,  New column has been added.
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=51875652 DEFAULT CHARSET=latin1

node2
CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `hostname` varchar(64) NOT NULL,
  `port` int(11) NOT NULL,
  `instime` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=51875650 DEFAULT CHARSET=latin1

node3
CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `hostname` varchar(64) NOT NULL,
  `port` int(11) NOT NULL,
  `instime` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=51875650 DEFAULT CHARSET=latin1

Thanks
Patrick

Tags: doc Edit Tag help
Vadim Tkachenko (vadim-tk) wrote :

FAQ needs to be updated. In addition to that check we need to add:
"show status like 'wsrep_local_state'"

If this is not equal 4 ( SYNC state), then Node is not synced.

Changed in percona-xtradb-cluster:
assignee: nobody → Hrvoje Matijakovic (hrvojem)
importance: Undecided → Medium
status: New → Confirmed
Henrik Ingo (hingo) wrote :

Vadim: I don't agree with the proposed fix. As I understand Galera, it gives assurance that a node will never let you write to or read from a node that is not synced - this is the contract it gives. So as I see it the FAQ is correct and this is a Galera bug. Let's ask Codership guys for advice. (Patrick, did your mail get through to the list already?)

Henrik Ingo (hingo) wrote :

To continue with the above: notice that checking wsrep_local_state really doesn't help since it is equally broken. It is possible that a node is in DONOR state yet it is fully in sync because SST method is xtrabackup or it is donor to an IST operation. So is you conclude that a node is not healthy because wsrep_local_state=DONOR then you will get a lot of false positives.

Alex Yurchenko (ayurchen) wrote :

Henrik,

The actual Galera contract is that it will never commit a transaction that won't be committed on other members of the cluster unless they fail (in other words it guarantees eventual consistency), and thus is more relaxed than what you think. In fact the guarantee that you mention is impossible to enforce without some sort of a global mutex that would synchronise Galera IO ops and state changes and at best the performance of that thing will be abysmal (if ever possible).

Technically you may try to commit to Galera node in any state and as soon as it gets replicated and certified, it gets committed. States like SYNCED, JOINED, etc. is only our best effort to provide temporal synchrony from virtual synchrony. As an example consider the situation when the node looses primary component. Until this event propagates to the top layer, it is still in SYNCED state and allows writes and reads. What Galera guarantees is that in this case COMMIT will fail - or maybe not - if by the time it is processed the node is back with primary component.

This is not to say that we can't guarantee 100% synchronous reads. But in this case I doubt that this is a bug. After all when you're performing RSU you're supposed to know what you're doing - and you remove all load from that node. Of course it is cool when the software is totally unbreakable - but at what cost? Personally, at the moment I'd rather spend effort on improving Galera functionality than protecting it from deliberately unreasonable user actions.

Henrik Ingo (hingo) wrote :

Alex

I still think in the case of RSU if node is not able to apply replication events, it can be handled as any other de-synced case. However, I realize now that this issue doesn't yet clearly demonstrate whether there is a bug or not. Our expectation was that the node undergoing RSU would always be de-synced and we were surprised to be able to read from it at all. Now we know that actually it may stay in sync. I will ask Patrick to re-do the test so that we take into account this new information.

Henrik Ingo (hingo) wrote :

Btw, I agree that the exact interpretation of "not reading stale data" is controversial, however in my comments here I'm refering to the fact that a node that has been disconnected from the primary component does not allow you to read data. (The fact that any node may always be milliseconds behind some other node is not important here.)

Alex Yurchenko (ayurchen) wrote :

Well,

1) "desynced" and "disconnected from primary component" are different things. In the case of RSU the node is still a part of the primary component and it has full state - except for the schema change it is undergoing, but doing RSU makes an implicit contract that "it does not matter" - either through using STATEMENT-based replication events or by not touching the table at all. The important part is that the node has all cluster data (in slave queue) and "desynced" state is just a helper to explain the application that the database may be _temporarily_ "behind the schedule".

2) If you want strictly synchronous reads, then setting wsrep_causal_reads will give you that, and in the RSU case SELECT will block until _at least_ all events that it can possibly depend on are applied - that's the guarantee. However, provided the slave queue is empty, it may return way before DDL concludes, but that will be a native server behaviour then.

3) We can prohibit any selects in desynced state as we do for non-primary configuration. Note however, that it is not as strict as you might be hoping for. It is impossible to do without races (as I explained above) and race interval can be arbitrarily long. E.g. in the lab we could easily have situation that the upper layer of Galera stack (the database) was several configuration changes behind the bottom layer - which essentially defines prim/non_prim. I.e. when the node becomes non-primary, the database does not learn about it until it has processed all of the preceding slave queue events - and theoretically that could be days. It is _virtual synchrony_ - events happen in the same order, but not at the same time.

Conclusion: want a strong guarantee not to read stale data? - Use wsrep_causal_reads. Everything else is weak.

tags: added: doc
Henrik Ingo (hingo) wrote :

I've entered a new bug with a different test case: https://bugs.launchpad.net/codership-mysql/+bug/1002714

That one is simpler and more clearly demonstrates the problem.

Changed in percona-xtradb-cluster:
status: Confirmed → Triaged
Changed in percona-xtradb-cluster:
milestone: none → 5.5.31-25
milestone: 5.5.31-25 → 5.5.31-24.8
Changed in percona-xtradb-cluster:
milestone: 5.5.31-23.7.5 → none
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers