Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Multi-row inserts into a table with auto_increment column when PXC node is in async replication and binlog_format is STATEMENT, causes data inconsistencies

Bug #1168291 reported by Ovais Tariq on 2013-04-12

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.5	Won't Fix	High	Unassigned
5.6	Fix Released	High	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.6.29-25.15

Bug Description

Suppose we have a PXC 3 node cluster which is replicating from an async master. Now if the binlog_format is STATEMENT and multi-row inserts are executed on a table with an auto_increment column such that values are automatically generated by MySQL, then the PXC node generates wrong auto_increment values which are different from what was generated on the async master.

The test case follows:

master> use test
Database changed
master> drop table t1;
Query OK, 0 rows affected (0.14 sec)

master> CREATE TABLE `t1` (
    -> `i` int(11) NOT NULL AUTO_INCREMENT,
    -> `c` char(32) DEFAULT 'dummy_text',
    -> PRIMARY KEY (`i`)
    -> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.37 sec)

master> select * from t1;
Empty set (0.00 sec)

master> insert into t1(i) values(null);
Query OK, 1 row affected (0.27 sec)

master> select * from t1;
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
+---+------------+
1 row in set (0.00 sec)

pxc_node1> select * from t1; -- all good here because only a single value was inserted
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
+---+------------+
1 row in set (0.00 sec)

master> insert into t1(i) values(null), (null), (null); -- OK we now insert 3 values such that all 3 have to be automatically generated
Query OK, 3 rows affected (0.51 sec)
Records: 3 Duplicates: 0 Warnings: 0

master> select * from t1;
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
| 2 | dummy_text |
| 3 | dummy_text |
| 4 | dummy_text |
+---+------------+
4 rows in set (0.00 sec)

pxc_node1> select * from t1; -- and there is an inconsistency for the auto_inc value generated for the fourth row
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
| 2 | dummy_text |
| 3 | dummy_text |
| 6 | dummy_text |
+---+------------+
4 rows in set (0.00 sec)

The variable wsrep_auto_increment_control is enabled:
pxc_node1> show variables like '%increment%';
+------------------------------+-------+
| Variable_name | Value |
+------------------------------+-------+
| auto_increment_increment | 3 |
| auto_increment_offset | 3 |
| div_precision_increment | 4 |
| innodb_autoextend_increment | 8 |
| wsrep_auto_increment_control | ON |
+------------------------------+-------+

The problem can be resolved by using binlog_format=ROW on the async master or by disabling wsrep_auto_increment_control. But suppose the async master is 5.0 then RBR is not available, and as such this is a serious issue.

It looks to me as though PXC node does not completely obey all the contextual information that is logged in case of SBR. For example, when I look at the relay log on the PXC node it has the following relevant variables set:
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;

However, either that is not applied correctly to the slave_sql_thread or the variable 'wsrep_auto_increment_control' overrides that.

Tags:

Ovais Tariq (ovais-tariq) on 2013-04-12

tags:

added: i30888

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-04-12:

Looks similar to lp:587170 in terms of workaround.

Also, there seem to be issues with STATEMENT binlog_format as
reported here:

lp:1161060
lp:1160854

Revision history for this message

Ovais Tariq (ovais-tariq) wrote on 2013-04-12:

Hi Raghu,

Yes looks more similar to 587170

Revision history for this message

Ovais Tariq (ovais-tariq) wrote on 2013-04-12:

I think if PXC node can correctly make use of the contextual information added when SBR is used then this issue can be fixed. For example in normal async replication, the slave thread sets session variables according to what it sees in the relay-log, in this case and in the case of lp:587170 it appears that those variables are not properly set:
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-04-15

tags:

added: auto-inc

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-14:

The problem here is that, you have two set of values:

a) auto-inc from master
b) auto-inc determined from cluster - determined from wsrep_auto_increment_control

It would be incorrect to take a) because if
wsrep_auto_increment_control is enabled, the cluster auto-inc
values get inconsistent with what master has provided.

Now, it may be possible to account for master while calculating
the auto-inc, however, the problem remains in that master is not
aware of this.

It may be better to make this combination fail, ie. STMT +
wsrep_auto_increment_control than introducing inconsistency in
the cluster.

Is there any other way this inconsistency can be resolved - ie.
taking into account both the value derived from
wsrep_auto_increment_control and the value from
master?

Revision history for this message

Ovais Tariq (ovais-tariq) wrote on 2013-05-14:

Hi Raghu,

For events that come in via replication, its perfectly valid to use the auto_increment* variables as set on the master. I do not see why we shouldn't be doing that, this is how its managed in async replication, the master also injects its auto_increment* variables and the slave SQL_threads sets those for its session, I would think of that as the correct way in the context of PXC cluster as well. Any SET SESSION .... in the binary log should be honoured and should not be overridden by the global variable if the variable is allowed to be changed dynamically because that's how variable scoping works in MySQL :)

Revision history for this message

Nilnandan Joshi (nilnandan-joshi) wrote on 2014-08-12:

Verified with PXC 5.5.37.

On aync master:

mysql> use test;
Database changed
mysql>
mysql> CREATE TABLE `t1` (
    -> `i` int(11) NOT NULL AUTO_INCREMENT,
    -> `c` char(32) DEFAULT 'dummy_text',
    -> PRIMARY KEY (`i`)
    -> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.06 sec)

mysql> select * from t1;
Empty set (0.00 sec)

mysql> insert into t1(i) values(null);
Query OK, 1 row affected (0.05 sec)

mysql> select * from t1;
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
+---+------------+
1 row in set (0.00 sec)

mysql> insert into t1(i) values(null), (null), (null);
Query OK, 3 rows affected (0.03 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> select * from t1;
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
| 2 | dummy_text |
| 3 | dummy_text |
| 4 | dummy_text |
+---+------------+
4 rows in set (0.00 sec)

on slave (PXC clusternode1/2/3)

mysql> use test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql>
mysql>
mysql> select * from t1;
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
| 2 | dummy_text |
| 4 | dummy_text |
| 7 | dummy_text |
+---+------------+
4 rows in set (0.00 sec)

mysql>

Revision history for this message

Nilnandan Joshi (nilnandan-joshi) wrote on 2014-08-12:

Verified same with PXC 5.6.19

On async master:

mysql> use test
Database changed
mysql>
mysql> CREATE TABLE `t1` (
    -> `i` int(11) NOT NULL AUTO_INCREMENT,
    -> `c` char(32) DEFAULT 'dummy_text',
    -> PRIMARY KEY (`i`)
    -> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.33 sec)

mysql> select * from t1;
Empty set (0.00 sec)

mysql> insert into t1(i) values(null);
Query OK, 1 row affected (0.02 sec)

mysql> select * from t1;
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
+---+------------+
1 row in set (0.00 sec)

mysql> insert into t1(i) values(null), (null), (null);
Query OK, 3 rows affected (0.03 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> select * from t1;
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
| 2 | dummy_text |
| 3 | dummy_text |
| 4 | dummy_text |
+---+------------+
4 rows in set (0.00 sec)

mysql>

on slave (PXC clusternode1/2/3)

mysql> use test
Database changed
mysql> select * from t1;
+---+------------+
| i | c |
+---+------------+
| 1 | dummy_text |
| 2 | dummy_text |
| 4 | dummy_text |
| 6 | dummy_text |
+---+------------+
4 rows in set (0.00 sec)

Revision history for this message

Krunal Bauskar (krunal-bauskar) wrote on 2016-02-17:

commit f3e541aa0355c6e074c2be8782fbb6c907997140
Merge: 1e75388 3bf804b
Author: Krunal Bauskar <email address hidden>
Date: Tue Feb 9 10:08:54 2016 +0530

Merge pull request #86 from natsys/5.6-pxc-474

- #PXC-474: Async replication + auto-increment + log-format (STMT) = inconsistent cluster node

commit 3bf804b48eb6860509608c17eb822e9b77f9adec
Author: Julius Goryavsky <email address hidden>
Date: Wed Feb 3 21:28:25 2016 -0500

The causes and fixes:

1. We must ignore wsrep_auto_increment_control when the operator is executed as the result of the replication, including, if it is received from the slave thread. Currently execution from ‘classic’ slave thread treated as local execution and controlled by the automatic increment step adjustment. It is wrong. To correct this error, I changed one conditional operator in the THD::reset_for_next_command() function.

2. If wsrep auto_increment_control switched on during operation of the node, then we should immediately update the auto_increment_increment and auto_increment_offset global variables - without waiting of the next invocation of the wsrep_view_handler_cb() callback. Currently these variables retain its initial values if wsrep_auto_increment_control is switched on during operation of the node, which leads to inconsistent results on the different nodes (in some scenarios).

3. If wsrep auto_increment_control switched off during operation of the node, then we must return the original values of the auto_increment_increment and auto_increment_offset global variables, as the user has set. To make this possible, I created a "shadow copies" of these variables (which stores the latest values set by the user).

commit f3e541aa0355c6e074c2be8782fbb6c907997140
Merge: 1e75388 3bf804b
Author: Krunal Bauskar <krunal.bauskar@percona.com>
Date:   Tue Feb 9 10:08:54 2016 +0530

Merge pull request #86 from natsys/5.6-pxc-474
    
    - #PXC-474: Async replication + auto-increment + log-format (STMT) = inconsistent cluster node

commit 3bf804b48eb6860509608c17eb822e9b77f9adec
Author: Julius Goryavsky <sysprg@gmail.com>
Date:   Wed Feb 3 21:28:25 2016 -0500

This is patch for the PXC-474 issue: suppose we have a PXC 2+ node cluster which is replicating from an async master. Now if the binlog_format is STATEMENT and multi-row inserts are executed on a table with an auto_increment column such that values are automatically generated by MySQL, then the PXC node generates wrong auto_increment values which are different from what was generated on the async master.
    
    The causes and fixes:
    
    1. We must ignore wsrep_auto_increment_control when the operator is executed as the result of the replication, including, if it is received from the slave thread. Currently execution from ‘classic’ slave thread treated as local execution and controlled by the automatic increment step adjustment. It is wrong. To correct this error, I changed one conditional operator in the THD::reset_for_next_command() function.
    
    2. If wsrep auto_increment_control switched on during operation of the node, then we should immediately update the auto_increment_increment and auto_increment_offset global variables - without waiting of the next invocation of the wsrep_view_handler_cb() callback. Currently these variables retain its initial values if wsrep_auto_increment_control is switched on during operation of the node, which leads to inconsistent results on the different nodes (in some scenarios).
    
    3. If wsrep auto_increment_control  switched off during operation of the node, then we must return the original values of the auto_increment_increment and auto_increment_offset global variables, as the user has set. To make this possible, I created a "shadow copies" of these variables (which stores the latest values set by the user).

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-955

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.