Donor accepts query, which then blocks

Bug #1002714 reported by Henrik Ingo on 2012-05-22
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
High
Alex Yurchenko
5.1
High
Alex Yurchenko
5.5
High
Alex Yurchenko

Bug Description

Hi

This is the same issue we discussed in Santa Clara, and also reported in this bug: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/966679
But the test case is new.

I'm testing the loadbalancing feature of the MySQL JDBC driver. I created the following java program which inserts a single row each second:

******************************************************************************************
import java.sql.*;

public class MysqlTest
{
    public static void main (String[] args)
    {
        Connection conn = null;

        while(true)
        {

            // Connect to MySQL
            try
            {
                String userName = "root";
                String password = "password";
                String hosts = "esitbi125lab,esitbi126lab,esitbi127lab";
                String url = "jdbc:mysql:loadbalance://" + hosts + "/test";
                Class.forName ("com.mysql.jdbc.Driver").newInstance ();
                conn = DriverManager.getConnection (url, userName, password);
                System.out.println ("Database connection established");
            }
            catch (Exception e)
            {
                System.err.println ("Cannot connect to database server: " + e);
            }

            try
            {
                Statement s = conn.createStatement ();
                int count;
                count = s.executeUpdate("INSERT INTO jdbctests (t, h, d) VALUES ('Inserted from Java code.', @@hostname, now());");
                s.close ();
                System.out.println (count + " rows were inserted");
            }
            catch (SQLException e)
            {
                System.err.println ("Error when executing SQL statement: " + e);
            }

            // Close connection
            if (conn != null)
            {
                try
                {
                    conn.close ();
                    System.out.println ("Database connection terminated");
                }
                catch (Exception e)
                {
                    System.err.println( "Error when closing database connection: " + e );
                }
            }
            try { Thread.sleep(1000); } catch( Exception e ) { }
        }
    }
}
******************************************************************************************

I run this with: java -cp mysql-connector-java-5.1.20-bin.jar:. MysqlTest

When all 3 nodes are up, this results in following entries in the table:

| 31 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:36:45 |
| 34 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:36:46 |
| 37 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:36:47 |
| 39 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:36:50 |
| 43 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:36:50 |
| 44 | Inserted from Java code. | esitbi127lab.ntc.nokia.com | 2012-05-22 10:36:52 |
| 45 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:36:53 |
| 49 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:36:53 |
| 51 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:36:55 |
| 55 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:36:55 |
| 56 | Inserted from Java code. | esitbi127lab.ntc.nokia.com | 2012-05-22 10:36:57 |
| 57 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:36:58 |

Now, I stop node 127, delete grastate.dat, start node 127:

[root@esitbi127lab sqlclust]# /etc/init.d/mysql stop
Shutting down MySQL (Percona XtraDB Cluster)....... [ OK ]
[root@esitbi127lab sqlclust]# rm /data/pxc/grastate.dat
rm: remove regular file `/data/pxc/grastate.dat'? y
[root@esitbi127lab sqlclust]# /etc/init.d/mysql start
Starting MySQL (Percona XtraDB Cluster).... [ OK ]
[root@esitbi127lab sqlclust]#

Node 125 during this time produces following log:

******************************************************************************
120522 10:37:18 [Note] WSREP: declaring fa705874-a34b-11e1-0800-5abce767b90e stable
120522 10:37:18 [Note] WSREP: (050341a6-a348-11e1-0800-aea4c68a9053, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.21.212.24:4567
120522 10:37:18 [Note] WSREP: view(view_id(PRIM,050341a6-a348-11e1-0800-aea4c68a9053,22) memb {
 050341a6-a348-11e1-0800-aea4c68a9053,
 fa705874-a34b-11e1-0800-5abce767b90e,
} joined {
} left {
} partitioned {
 8b472f9c-a3df-11e1-0800-1691c818f1bb,
})
120522 10:37:18 [Note] WSREP: forgetting 8b472f9c-a3df-11e1-0800-1691c818f1bb (tcp://172.21.212.24:4567)
120522 10:37:18 [Note] WSREP: (050341a6-a348-11e1-0800-aea4c68a9053, 'tcp://0.0.0.0:4567') turning message relay requesting off
120522 10:37:18 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
120522 10:37:18 [Note] WSREP: STATE_EXCHANGE: sent state UUID: f5b2c462-a3e0-11e1-0800-df3e566cd019
120522 10:37:18 [Note] WSREP: STATE EXCHANGE: sent state msg: f5b2c462-a3e0-11e1-0800-df3e566cd019
120522 10:37:18 [Note] WSREP: STATE EXCHANGE: got state msg: f5b2c462-a3e0-11e1-0800-df3e566cd019 from 0 (esitbi125lab.ntc.nokia.com)
120522 10:37:18 [Note] WSREP: STATE EXCHANGE: got state msg: f5b2c462-a3e0-11e1-0800-df3e566cd019 from 1 (esitbi126lab.ntc.nokia.com)
120522 10:37:18 [Note] WSREP: Quorum results:
 version = 2,
 component = PRIMARY,
 conf_id = 21,
 members = 2/2 (joined/total),
 act_id = 920,
 last_appl. = 0,
 protocols = 0/3/1 (gcs/repl/appl),
 group UUID = ebeb1502-a344-11e1-0800-777510efcf73
120522 10:37:18 [Note] WSREP: Flow-control interval: [12, 23]
120522 10:37:18 [Note] WSREP: New cluster view: global state: ebeb1502-a344-11e1-0800-777510efcf73:920, view# 22: Primary, number of nodes: 2, my index: 0, protocol version 1
120522 10:37:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120522 10:37:18 [Note] WSREP: Assign initial position for certification: 920, protocol version: 2
120522 10:37:24 [Note] WSREP: cleaning up 8b472f9c-a3df-11e1-0800-1691c818f1bb (tcp://172.21.212.24:4567)
120522 10:37:42 [Note] WSREP: (050341a6-a348-11e1-0800-aea4c68a9053, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.21.212.24:4567
120522 10:37:42 [Note] WSREP: (050341a6-a348-11e1-0800-aea4c68a9053, 'tcp://0.0.0.0:4567') turning message relay requesting off
120522 10:37:42 [Note] WSREP: declaring 04e8d7a0-a3e1-11e1-0800-17e807e2c66b stable
120522 10:37:42 [Note] WSREP: declaring fa705874-a34b-11e1-0800-5abce767b90e stable
120522 10:37:42 [Note] WSREP: view(view_id(PRIM,04e8d7a0-a3e1-11e1-0800-17e807e2c66b,23) memb {
 04e8d7a0-a3e1-11e1-0800-17e807e2c66b,
 050341a6-a348-11e1-0800-aea4c68a9053,
 fa705874-a34b-11e1-0800-5abce767b90e,
} joined {
} left {
} partitioned {
})
120522 10:37:42 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
120522 10:37:42 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
120522 10:37:42 [Note] WSREP: STATE EXCHANGE: sent state msg: 05359ea0-a3e1-11e1-0800-18babe5c0c4c
120522 10:37:42 [Note] WSREP: STATE EXCHANGE: got state msg: 05359ea0-a3e1-11e1-0800-18babe5c0c4c from 0 (esitbi127lab.ntc.nokia.com)
120522 10:37:42 [Note] WSREP: STATE EXCHANGE: got state msg: 05359ea0-a3e1-11e1-0800-18babe5c0c4c from 1 (esitbi125lab.ntc.nokia.com)
120522 10:37:42 [Note] WSREP: STATE EXCHANGE: got state msg: 05359ea0-a3e1-11e1-0800-18babe5c0c4c from 2 (esitbi126lab.ntc.nokia.com)
120522 10:37:42 [Note] WSREP: Quorum results:
 version = 2,
 component = PRIMARY,
 conf_id = 22,
 members = 2/3 (joined/total),
 act_id = 943,
 last_appl. = 0,
 protocols = 0/3/1 (gcs/repl/appl),
 group UUID = ebeb1502-a344-11e1-0800-777510efcf73
120522 10:37:42 [Note] WSREP: Flow-control interval: [14, 28]
120522 10:37:42 [Note] WSREP: New cluster view: global state: ebeb1502-a344-11e1-0800-777510efcf73:943, view# 23: Primary, number of nodes: 3, my index: 1, protocol version 1
120522 10:37:42 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120522 10:37:42 [Note] WSREP: Assign initial position for certification: 943, protocol version: 2
120522 10:37:44 [Note] WSREP: Node 0 (esitbi127lab.ntc.nokia.com) requested state transfer from '*any*'. Selected 1 (esitbi125lab.ntc.nokia.com)(SYNCED) as donor.
120522 10:37:44 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 945)
120522 10:37:44 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120522 10:40:27 [Note] WSREP: 0 (esitbi127lab.ntc.nokia.com): State transfer from 1 (esitbi125lab.ntc.nokia.com) complete.
120522 10:40:27 [Note] WSREP: Member 0 (esitbi127lab.ntc.nokia.com) synced with group.
120522 10:40:27 [Note] WSREP: 1 (esitbi125lab.ntc.nokia.com): State transfer to 0 (esitbi127lab.ntc.nokia.com) complete.
120522 10:40:27 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 946)
120522 10:40:27 [Note] WSREP: Member 1 (esitbi125lab.ntc.nokia.com) synced with group.
120522 10:40:27 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 946)
120522 10:40:27 [Note] WSREP: Synchronized with group, ready for connections
120522 10:40:27 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

***************************************************************************************************

The JDBC driver correctly detects node 127 as gone, and uses only 125 and 126. However, when node 127 is joining and node 125 becomes donor, the insert on node 125 blocks for the duration of the SST.

| 151 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:37:37 |
| 152 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:37:40 |
| 155 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:37:39 |
| 157 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:37:40 |
| 159 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:37:41 |
| 162 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:37:44 |
| 165 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:37:45 |
| 167 | Inserted from Java code. | esitbi125lab.ntc.nokia.com | 2012-05-22 10:37:45 |
| 168 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:40:29 |
| 171 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:40:30 |
| 174 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:40:31 |
| 175 | Inserted from Java code. | esitbi127lab.ntc.nokia.com | 2012-05-22 10:40:33 |
| 177 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:40:33 |
| 180 | Inserted from Java code. | esitbi126lab.ntc.nokia.com | 2012-05-22 10:40:34 |

As you can see above, as long as the jdbc loadbalancer is hitting node 126, everything is fine, but an insert on 125 blocks for 3 minutes at row id 167. This is also seen in SHOW PROCESSLIST:

| 17 | system user | | NULL | Sleep | 32 | committed 933 | NULL
| 18 | system user | | NULL | Sleep | 21 | committed 944 | NULL
| 405 | root | esitbi128lab.ntc.nokia.com:52286 | test | Query | 20 | query end | INSERT INTO jdbctests (t, h, d)
| 406 | root | localhost:48432 | test | Query | 20 | Writing to net | SELECT /*!40001 SQL_NO_CACHE */
| 407 | root | esitbi126lab.ntc.nokia.com:56303 | NULL | Query | 0 | sleeping | show processlist

The blocked state is always "query end".

The sst method was mysqldump.

The desired behavior here is that a node in donor state, and using an sst method that is blocking (ie anything else than xtrabackup) should not accept incoming queries but should return an error (ie you typically return the Unknown error). This will allow the load balancer to detect the error and failover to another node. If the query is blocked, then the application/thread is stuck and can't do anything about it.

Alex Yurchenko (ayurchen) wrote :

Henrik,

1) what about ongoing transactions, shall they be aborted? there is an option to wait for them to end, but it may last forever.
2) shall existing connections to donor be terminated? Or only errors should be returned?
3) shall this be a default behaviour?

Henrik Ingo (hingo) wrote :

Good questions!

I'll start with 2)
It seems to me idle connections are not an issue, the problem is when client tries to execute a query which will hang. Only then should the Unknown error be returned. If user wants to connect to a node and do show status, that should definitively be allowed.

By this we get to 1)
Also here, if someone has left a transaction uncommitted, but is not currently executing a query, isn't this ok? As a rule of thumb, you should only return an error (Unknown error) for operations that are actually blocked by the SST.

But for this I have another question: Have you tested how this actually works in MySQL? What if I do:

BEGIN;
UPDATE t SET x=1 WHERE y > 0 AND y < 1000;
SELECT SLEEP(60);
UPDATE t SET x=1 WHERE y > 2000 AND y < 3000;
COMMIT;

Suppose someone starts a mysqldump at SLEEP, what will actually happen? Will the mysqldump block, or this transaction? For rsync SST I believe you do LOCK TABLES and it will block until this transaction is finished? So there is nothing Galera needs to do? (...or there might, but nothing relevant to the context of this bug.)

3) I'm not supporting the idea that an SST would override ongoing transactions - except for some corner cases (infinitely long transaction). Other than that, I'm of course for this behavior being default.

Alex Yurchenko (ayurchen) wrote :

Great. So just return an error only if the client attempts a query, don't touch connections otherwise?

With mysqldump we don't lock anything, everything happens inside donor callback which blocks all commits.

Henrik Ingo (hingo) wrote :

I know *you* don't take any special locks for mysqldump, my point was just that I don't know without testing what actually happens when mysqldump is started and there are other transactions ongoing.

Alex Yurchenko (ayurchen) wrote :

- other transactions just block on commit. So mysqldump has a consistent view.

BTW, there is a "slight" problem with mysqldump SST (exactly the one that you're using in this ticket): once we start rejecting client queries, mysqldump queries get rejected too. Unfortunately there seems to be no canonical way to distinguish mysqldump connection from regular client connections. So essentially mysqldump SST is impossible with this paradigm.

Download full text (5.5 KiB)

On Mon, Jun 18, 2012 at 12:13 AM, Alex Yurchenko
<email address hidden> wrote:
> - other transactions just block on commit. So mysqldump has a consistent
> view.

Ok, that's what I see in this bug report too then.

> BTW, there is a "slight" problem with mysqldump SST (exactly the one
> that you're using in this ticket): once we start rejecting client
> queries, mysqldump queries get rejected too. Unfortunately there seems
> to be no canonical way to distinguish mysqldump connection from regular
> client connections. So essentially mysqldump SST is impossible with this
> paradigm.

Good point :-)

I gave it a few hours of thought, and I'm inclined to accept that
mysqldump cannot be fixed. (You could probably fix it by issuing "SET
SESSION wsrep_on=OFF" at the beginning of the dump, but this would
require modifying mysqldump itself which seems to me like going to
far.)

So let's focus on Rsync SST then. (I have not tested, but I will
assume for now that xtrabackup sst works correctly since it is
non-blocking.)

I do the same test with rsync sst and when a node joins, I see the
following with SHOW PROCESSLIST:

mysql> show processlist;
+-------+----------------------+----------------------------------+------+---------+------+------------------------+------------------+-----------+---------------+-----------+
| Id | User | Host | db
  | Command | Time | State | Info |
Rows_sent | Rows_examined | Rows_read |
+-------+----------------------+----------------------------------+------+---------+------+------------------------+------------------+-----------+---------------+-----------+
| 1 | system user | |
NULL | Sleep | 36 | committed 35342 | NULL |
     0 | 0 | 1 |
| 2 | system user | |
NULL | Sleep | 3525 | wsrep aborter idle | NULL |
     0 | 0 | 1 |
| 3 | event_scheduler | localhost |
NULL | Daemon | 3522 | Waiting on empty queue | NULL |
     0 | 0 | 1 |
| 4 | system user | |
NULL | Sleep | 128 | committed 35321 | NULL |
     0 | 0 | 1 |
| 5 | system user | |
NULL | Sleep | 128 | committed 35327 | NULL |
     0 | 0 | 1 |
| 6 | system user | |
NULL | Sleep | 48 | committed 35330 | NULL |
     0 | 0 | 1 |
| 7 | system user | |
NULL | Sleep | 128 | committed 35319 | NULL |
     0 | 0 | 1 |
| 8 | system user | |
NULL | Sleep | 32 | committed 35345 | NULL |
     0 | 0 | 1 |
| 9 | system user | |
NULL | Sleep | 118 | committed 35329 | NULL ...

Read more...

Henrik Ingo (hingo) wrote :

Hi Alex

I was going to add here a copy paste from another test, where the incoming query is stuck executing the insert with "waiting for acquiring global read lock" (with rsync sst). But then IT department decided to force reboot my workstation, so now all you will get is this anecdotal account.

Alex Yurchenko (ayurchen) wrote :

Yes, rsync and all other scripted SSTs besides mysqld have an option (and generally should) request FTWRL before proceeding. So that what you may easily see. I'm not sure how to explain the processlist you posted above - could not reproduce it. Maybe because I'm running with the fix already.

So I'm going to push the fix after a bit more testing. A short description how it will look:
- the variable will be called wsrep_sst_donor_rejects_queries - so that it pretty much explains what it does.
- it won't close any connections, it will just reject queries with UNKNOWN COMMAND error code.
- it will be off by default since it may break mysqldump SST.
- it will have effect whenever the node has to donate SST regardless of SST method - I felt that this will be the only way to satisfy most of possible scenarios. The user may want to reject queries during xtrabackup as well. So basically this is a property of a node, not of the SST method.
- it will have no effect during IST.

Alex Yurchenko (ayurchen) wrote :

One more thing: should this variable affect mysqldump SST at all?
Plus: then there is always an option to hack mysqldump to add a --set option to set additional variables for the dump.
Minus: users can (and that means they will) misconfigure things and break SST.

Henrik Ingo (hingo) wrote :

Alex,

"- it will have effect whenever the node has to donate SST regardless of SST method - I felt that this will be the only way to satisfy most of possible scenarios. The user may want to reject queries during xtrabackup as well. So basically this is a property of a node, not of the SST method."

I agree with this. Also the idea with SST is that users can script anything, so you cannot know what to do. Either the script has to be able to command this state, or the user via my.cnf.

"One more thing: should this variable affect mysqldump SST at all?"

I think the simple case is that this variable is off by default and agnostic to the sst method. Of course, you could also implement it so that it is on by default but it is ignored if wsrep_sst_method=mysqldump (which is also default). That would in a way be a user friendly default, but very confusing since behavior controlled by one variable would change when that variable is constant and another is changing.

By the way, and sorry for bringing this up now that the fix is already there: When digging deeper and understanding this better, I'm beginning to think MySQL maybe has the answer to this already. I'm now going to run a test with rsync sst and using http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_lock_wait_timeout and other timeout paramaters. I'll let you know how it goes.

Even then, I think introducing a galera specific variable (since you've already implemented it) is great, because it's
a) friendlier than having to set n different timeout variables (some on the jdbc side)
b) specific to sst, I may want to have the default long/infinite timeouts in normal operation.

Alex Yurchenko (ayurchen) wrote :

Indeed, it is specific to SST and
1) if I understand correctly, FTWRL does a read lock - i.e. it still allows reads from client - and if client requires causality, then such read will block (for 30 seconds and it is not controlled by lock_wait_timeout)
2) with regards to mysqldump lock_wait_timeout still does not really help: some (if not all) clients will block in commit.

Otherwise I guess everything is good, I'll push it after some more testing.

Henrik Ingo (hingo) wrote :

Ok, with the lock timeouts lowered it kinda works, at least better:

Database connection established
Error when executing SQL statement: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
Database connection terminated
Database connection established
Error when executing SQL statement: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
Database connection terminated
Database connection established
Error when executing SQL statement: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
Database connection terminated
Database connection established
Error when executing SQL statement: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
Database connection terminated
Database connection established
Error when executing SQL statement: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction

Until the timeout triggers, show processlist is as usual:

| 18 | system user | | NULL | Sleep | 478 | NULL | NULL | 0 | 0 | 1 |
| 26 | root | esitbi128lab.ntc.nokia.com:58301 | test | Query | 10 | Waiting for global read lock | INSERT INTO jdbctests (t, h, d) VALUES ('Inserted from Java code.', @@hostname, now()) | 0 | 0 | 1 |
| 27 | root | localhost | NULL | Query | 0 | NULL | show processlist | 0 | 0 | 1 |

Returning Unknown error will still be a big improvement since it is a) immediate and b) handled better by the jdbc driver. At least I think b is true - I never saw an Unknown error propagate back to the output, it may be that it silently reconnects and retries such transactions. (If it is the first query in a transaction, of course.)

Anyway, these timeouts would probably work for mysqldump too and seem to be a "better than nothing" solution if anyone needs it. (I doubt in practice anyone will use mysqldump if they are also using state of the art load balancing :-)

Henrik Ingo (hingo) wrote :

For the record, here are the timeouts in my.cnf:

[mysqld]
# testing lock timeouts for sst and loadbalancer behavior
innodb_lock_wait_timeout = 10
lock_wait_timeout = 10
# inactivity timeouts
wait_timeout = 60
interactive_timeout = 60

Alex Yurchenko (ayurchen) wrote :

fix committed in 5.1 r3228

Alex Yurchenko (ayurchen) wrote :

fix committed in 5.5 r3761

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers