unavoidable wsrep certification failures with character unique key

Bug #1314854 reported by Swany
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Status tracked in 5.6
5.5
Fix Committed
Undecided
Seppo Jaakola
5.6
Fix Released
Undecided
Seppo Jaakola
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
Fix Released
High
Unassigned
5.6
Fix Released
High
Unassigned

Bug Description

Given table:
CREATE TABLE reg_0 (
reg_name varchar(127) NOT NULL,
reg_key varchar(64) NOT NULL DEFAULT '',
reg_secret varchar(128) NOT NULL,
description varchar(255) DEFAULT NULL,
reg_status varchar(16) NOT NULL,
user_id varchar(255) NOT NULL,
owner_id varchar(64) DEFAULT NULL,
category varchar(64) DEFAULT NULL,
creation_time bigint(20) DEFAULT NULL,
last_modification_time bigint(20) DEFAULT NULL,
app_platform varchar(64) DEFAULT NULL,
company_name varchar(64) DEFAULT NULL,
company_website varchar(256) DEFAULT NULL,
customer_lead_id varchar(36) DEFAULT NULL,
usage_type varchar(36) DEFAULT NULL,
country char(3) DEFAULT NULL,
state char(2) DEFAULT NULL,
customer_accounting_id varchar(20) DEFAULT NULL,
has_note tinyint(4) DEFAULT NULL,
UNIQUE KEY unq_reg_key (reg_key),
PRIMARY KEY (reg_key),
KEY reg_key (reg_key),
KEY user_id (user_id),
KEY reg_status (reg_status),
KEY owner_id (owner_id),
KEY category (category),
KEY last_modification_time (last_modification_time)
);

Multiple clients can not insert into this table without getting wsrep certification failures (reported as deadlock at commit) even though the inserts do no conflict for the given unique key. This problem ONLY occurs with alphanumeric unique keys. An auto_increment works properly.

Here is a script to demonstrate the problem:
------Insert Script foo.sh-----
#!/bin/bash
min=$1
max=$2
#!/bin/bash
min=$1
max=$2
database=test
increment=$3
prefix=`hostname -a`
counter=$min
table=reg_D
# for table in reg_0 reg_A reg_B reg_C ;do
#!/bin/bash
min=$1
max=$2
database=test
increment=$3
prefix=`hostname -a`
counter=$min
table=reg_F
# for table in reg_0 reg_A reg_B reg_C ;do
logfile="ins.$prefix.$database.$table.`date +%F-%H-%M`.log"

while [ ${counter} -lt ${max} ] ;do

mysql --debug-info=true -vvvvvvvvvv $database <
-- SET GLOBAL AUTOCOMMIT=off;

INSERT INTO
${table} (reg_name
, reg_key
, reg_secret
, description
, reg_status
, user_id
, owner_id)
VALUES ("${prefix}_name_${counter}"
,"${prefix}_key_${counter}"
,"${prefix}_pw_${counter}"
,"${prefix}_descr_${counter}"
,"active"
,"uid_dbs_${counter}"
,"A_accessorLoadtestCk_1") ;

-- INSERT INTO ${table} (id,c1,c2) values ("$counter}","${prefix}_${counter}","${counter}")
-- DELETE from ${table} where reg_key="${prefix}_DBS_${counter}";
-- truncate table $table ;

SQL
let counter=$counter+$increment
done >$logfile 2>&1
------------end------------

There are wsrep certification failures happen when min <= 99.

Tags: i41678
Revision history for this message
Swany (greenlion) wrote :

Just to be clear, there are no duplicate key violations being generated by the script, but wsrep certification fails regardless.

I will try to come up with a simplified test case.

description: updated
Revision history for this message
Swany (greenlion) wrote :

140430 17:01:22 [Note] WSREP: commit failed for reason: 3
140430 17:01:22 [Note] WSREP: conflict state: 0
140430 17:01:22 [Note] WSREP: cluster conflict due to certification failure for threads:
140430 17:01:22 [Note] WSREP: Victim thread:
THD: 622, mode: local, state: executing, conflict: cert failure, seqno: 128137
SQL: INSERT INTO
reg_Y (reg_name
, reg_key
, reg_secret
, description
, reg_status
, user_id
, owner_id)
VALUES ("akdrgdb001_name_15"
,"akdrgdb001_key_15"
,"akdrgdb001_pw_15"
,"akdrgdb001_descr_15"
,"active"
,"uid_dbs_15"
,"A_accessorLoadtestCk_1")
140430 17:01:22 [Note] WSREP: cert failure, thd: 622 is_AC: 1, retry: 1 - 1 SQL: INSERT INTO
reg_Y (reg_name
, reg_key
, reg_secret
, description
, reg_status
, user_id
, owner_id)
VALUES ("akdrgdb001_name_15"
,"akdrgdb001_key_15"
,"akdrgdb001_pw_15"
,"akdrgdb001_descr_15"
,"active"
,"uid_dbs_15"
,"A_accessorLoadtestCk_1")
140430 17:01:22 [Note] WSREP: releasing retry_query: conf 0 sent 0 kill 0 errno 1213 SQL INSERT INTO
reg_Y (reg_name
, reg_key
, reg_secret
, description
, reg_status
, user_id
, owner_id)
VALUES ("akdrgdb001_name_15"
,"akdrgdb001_key_15"
,"akdrgdb001_pw_15"
,"akdrgdb001_descr_15"
,"active"
,"uid_dbs_15"
,"A_accessorLoadtestCk_1")
140430 17:01:22 [Note] WSREP: commit failed for reason: 3
140430 17:01:22 [Note] WSREP: conflict state: 0
140430 17:01:22 [Note] WSREP: cluster conflict due to certification failure for threads:
140430 17:01:22 [Note] WSREP: Victim thread:

Revision history for this message
Celest Pantophlet (celestp) wrote :

Looks like some one else had the same issue:
https://bugs.launchpad.net/galera/+bug/1259977

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Few bugs have been fixed here, is it possible to test with PXC in CentOS testing/debian experimental?

Revision history for this message
Celest Pantophlet (celestp) wrote :

Please clarify .

Revision history for this message
Celest Pantophlet (celestp) wrote :
Download full text (12.5 KiB)

Can you explain what a certfication really means?
I ran inserts on one node and saw the following (No Deadlocks)

------------------------------------

140502 18:49:53 [Note] WSREP: trx conflict for key 0 6f7669617069 726567697374726174696f6e 000041004b004400520047004400 : source: f6fe4654-d146-11e3-98f6-07e3fb548fd3 version: 2 local: 1 state: CERTIFYING flags: 1 conn_id: 55000 trx_id: 9240293 seqnos (l: 12226, g: 157125, s: 157122, d: 156995, ts: 1399056593553436324) <--X--> source: 771d5f4a-d086-11e3-b81d-e62da7bc8426 version: 2 local: 0 state: COMMITTED flags: 1 conn_id: 104292 trx_id: 8692927 seqnos (l: 12225, g: 157124, s: 157123, d: 157123, ts: 1399056593163923614)
140502 18:49:53 [Note] WSREP: commit failed for reason: 3
140502 18:49:53 [Note] WSREP: conflict state: 0
140502 18:49:53 [Note] WSREP: cluster conflict due to certification failure for threads:
140502 18:49:53 [Note] WSREP: Victim thread:
   THD: 55000, mode: local, state: executing, conflict: cert failure, seqno: 157125
   SQL: INSERT INTO
        registration (reg_name
                , reg_key
                , reg_secret
                , description
                , reg_status
                , user_id
                , owner_id
                )
        VALUES ("MultiMasterInsert-186_akdrgdb003"
                ,"akdrgdb003186"
                ,"nopassword_186_akdrgdb003"
                ,"Testing multimaster insert"
                ,"active"
                ,"DBSCelestP186"
                ,"A_accessorLoadtestCk_1"
                )
140502 18:49:53 [Note] WSREP: wsrep retrying AC query: INSERT INTO
        registration (reg_name
                , reg_key
                , reg_secret
                , description
                , reg_status
                , user_id
                , owner_id
                )
        VALUES ("MultiMasterInsert-186_akdrgdb003"
                ,"akdrgdb003186"
                ,"nopassword_186_akdrgdb003"
                ,"Testing multimaster insert"
                ,"active"
                ,"DBSCelestP186"
                ,"A_accessorLoadtestCk_1"
                )
140502 18:49:54 [Note] WSREP: releasing retry_query: conf 0 sent 0 kill 0 errno 0 SQL INSERT INTO
        registration (reg_name
                , reg_key
                , reg_secret
                , description
                , reg_status
                , user_id
                , owner_id
                )
        VALUES ("MultiMasterInsert-186_akdrgdb003"
                ,"akdrgdb003186"
                ,"nopassword_186_akdrgdb003"
                ,"Testing multimaster insert"
                ,"active"
                ,"DBSCelestP186"
                ,"A_accessorLoadtestCk_1"
                )
140502 18:49:55 [Note] WSREP: trx conflict for key 0 6f7669617069 726567697374726174696f6e 000041004b004400520047004400 : source: 1f655260-d083-11e3-9c99-274f9aafca92 version: 2 local: 0 state: CERTIFYING flags: 1 conn_id: 103775 trx_id: 8718593 seqnos (l: 12231, g: 157130, s: 157128, d: 156995, ts: 1399056595415520640) <--X--> source: f6fe4654-d146-11e3-98f6-07e3fb548fd3 version: 2 local: 1 state: COMMITTED flags: 1 conn_id: 55002 trx_id: 9240298 seqnos (l: 12230, g: 1...

Revision history for this message
Przemek (pmalkowski) wrote :
Download full text (4.0 KiB)

I can confirm the issue on PXC 5.5.34 + Galera 2.8 and PXC 5.5.37 + Galera 2.10.

This is how I tested (thanks to Celest for test case):

- created 3 nodes PXC cluster using Vagrant and these wonderful pack: https://github.com/lefred/percona-cluster
- installed parallel-ssh on vagrant host
- created sample table:

CREATE TABLE `tbl_0` (
  `reg_name` varchar(127) NOT NULL,
  `reg_key` varchar(64) NOT NULL DEFAULT '',
  `reg_secret` varchar(128) NOT NULL,
  `description` varchar(255) DEFAULT NULL,
  `reg_status` varchar(16) NOT NULL,
  `user_id` varchar(255) NOT NULL,
  `owner_id` varchar(64) DEFAULT NULL,
  PRIMARY KEY (`reg_key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

- enabled debugging on one of the nodes:
SET GLOBAL wsrep_log_conflicts="ON";
SET GLOBAL wsrep_debug="ON";

- deployed attached bash script on each node
- ran it in parallel on each node with:

parallel-ssh -P -h cluster1.pssh -x "-i /home/przemyslaw.malkowski/.vagrant.d/insecure_private_key -oStrictHostKeyChecking=no" '/home/vagrant/test.sh 20 100 1'
[1] 08:43:10 [SUCCESS] vagrant@127.0.0.1:2211
[2] 08:43:11 [SUCCESS] vagrant@127.0.0.1:2205
[3] 08:43:12 [SUCCESS] vagrant@127.0.0.1:2204

percona3 mysql> select count(*) total from tbl_0; select count(*) node1 from tbl_0 where reg_key like 'percona1%'; select count(*) node2 from tbl_0 where reg_key like 'percona2%'; select count(*) node3 from tbl_0 where reg_key like 'percona3%';
+-------+
| total |
+-------+
| 178 |
+-------+
1 row in set (0.00 sec)

+-------+
| node1 |
+-------+
| 43 |
+-------+
1 row in set (0.00 sec)

+-------+
| node2 |
+-------+
| 63 |
+-------+
1 row in set (0.00 sec)

+-------+
| node3 |
+-------+
| 72 |
+-------+
1 row in set (0.00 sec)

Debug info example:

140505 14:43:11 [Note] WSREP: trx conflict for key 0 74657374 74626c5f30 010050004500520043004f004e004100 : source: 5a67fd6a-cd2c-11e3-9ec4-de1070c656ba version: 2 local: 1 state: CERTIFYING flags: 1 conn_id: 7900 trx_id: 38507 seqnos (l: 24976, g: 24850, s: 24848, d: 24429, ts: 1399293791018546985) <--X--> source: d067275c-d44d-11e3-b054-036e4c0d59ea version: 2 local: 0 state: COMMITTED flags: 1 conn_id: 186 trx_id: 31963 seqnos (l: 24975, g: 24849, s: 24848, d: 24847, ts: 1399293791095596655)
140505 14:43:11 [Note] WSREP: commit failed for reason: 3
140505 14:43:11 [Note] WSREP: conflict state: 0
140505 14:43:11 [Note] WSREP: cluster conflict due to certification failure for threads:
140505 14:43:11 [Note] WSREP: Victim thread:
   THD: 7900, mode: local, state: executing, conflict: cert failure, seqno: 24850
   SQL: INSERT INTO
        tbl_0 (reg_name
                , reg_key
                , reg_secret
                , description
                , reg_status
                , user_id
                , owner_id)
        VALUES ("percona1_name_75"
                ,"percona1_key_75"
                ,"percona1_pw_75"
                ,"percona1_descr_75"
                ,"active"
                ,"uid_dbs_75"
                ,"A_accessorLoadtestCk_1")
140505 14:43:11 [Note] WSREP: cert failure, thd: 7900 is_AC: 1, retry: 1 - 1 SQL: INSERT INTO
        tbl_0 (reg_name
                , reg_key
                , reg_secret
                ...

Read more...

Revision history for this message
Przemek (pmalkowski) wrote :

Cannot reproduce using PXC 5.6.15 + Galera 3.5:

parallel-ssh -P -h 3nodes56.pssh -x "-i /home/przemyslaw.malkowski/.vagrant.d/insecure_private_key -oStrictHostKeyChecking=no" 'sudo /root/test.sh 1 100 1'
[1] 08:50:33 [SUCCESS] vagrant@127.0.0.1:2222
[2] 08:50:33 [SUCCESS] vagrant@127.0.0.1:2203
[3] 08:50:33 [SUCCESS] vagrant@127.0.0.1:2202

pxc2>select count(*) total from tbl_0; select count(*) node1 from tbl_0 where reg_key like 'pxc1%'; select count(*) node2 from tbl_0 where reg_key like 'pxc2%'; select count(*) node3 from tbl_0 where reg_key like 'pxc3%';
+-------+
| total |
+-------+
| 297 |
+-------+
1 row in set (0.00 sec)

+-------+
| node1 |
+-------+
| 99 |
+-------+
1 row in set (0.00 sec)

+-------+
| node2 |
+-------+
| 99 |
+-------+
1 row in set (0.00 sec)

+-------+
| node3 |
+-------+
| 99 |
+-------+
1 row in set (0.00 sec)

No deadlocks noted.

tags: added: i41678
Revision history for this message
Przemek (pmalkowski) wrote :

This problem happens for a tables with single varchar PK or UK, or PK+UK, but does not happen when there is single int auto_increment PK. Happens again with auto_increment PK + UK on another varchar column, for example:

CREATE TABLE `tbl_0` (
  `reg_name` varchar(127) NOT NULL,
  `reg_key` varchar(64) NOT NULL DEFAULT '',
  `reg_secret` varchar(128) NOT NULL,
  `description` varchar(255) DEFAULT NULL,
  `reg_status` varchar(16) NOT NULL,
  `user_id` varchar(255) NOT NULL,
  `owner_id` varchar(64) DEFAULT NULL,
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
-- no deadlocks

CREATE TABLE `tbl_0` (
  `reg_name` varchar(127) NOT NULL,
  `reg_key` varchar(64) NOT NULL DEFAULT '',
  `reg_secret` varchar(128) NOT NULL,
  `description` varchar(255) DEFAULT NULL,
  `reg_status` varchar(16) NOT NULL,
  `user_id` varchar(255) NOT NULL,
  `owner_id` varchar(64) DEFAULT NULL,
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`),
  UNIQUE KEY `reg_key` (`reg_key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
-- deadlocks

Revision history for this message
Przemek (pmalkowski) wrote :

I am able to reproduce the bug also on PXC 5.5 + Galera 3.5

percona3 mysql> select @@version,@@version_comment; show status like 'wsrep_provider_version';
+-----------+------------------------------------------------+
| @@version | @@version_comment |
+-----------+------------------------------------------------+
| 5.5.34-55 | Percona XtraDB Cluster (GPL), wsrep_25.9.r3928 |
+-----------+------------------------------------------------+
1 row in set (0.00 sec)

+------------------------+-----------+
| Variable_name | Value |
+------------------------+-----------+
| wsrep_provider_version | 3.5(r178) |
+------------------------+-----------+
1 row in set (0.00 sec)

Revision history for this message
Celest Pantophlet (celestp) wrote :

Is it possible that we are running into the the 'gap lock ' situation?

Revision history for this message
Celest Pantophlet (celestp) wrote :

When I disable innodb_unsafe_for_binlogs , and ran the test I see the following?
-------------
140430 23:30:47 [Note] WSREP: cluster conflict due to high priority abort for threads:
140430 23:30:47 [Note] WSREP: Winning thread:
   THD: 43095, mode: total order, state: executing, conflict: no conflict, seqno: 140251
   SQL: truncate table tbl_base
140430 23:30:47 [Note] WSREP: Victim thread:
   THD: 50048, mode: local, state: committing, conflict: must abort, seqno: -1
   SQL: INSERT INTO
        tbl_base (reg_name
                , reg_key
                , reg_secret
                , description
                , reg_status
                , user_id
                , owner_id)
        VALUES ("1618_akdrgdb003_name"
                ,"0031618"
                ,"1618_akdrgdb003_pw"
                ,"1618-akdrgdb003"
                ,"active"
                ,"1618_akdrgdb003_id"
                ,"A_accessorLoadtestCk_1")
140430 23:30:47 [Note] WSREP: BF kill (1, seqno: 140251), victim: (50048) trx: 9185454
140430 23:30:47 [Note] WSREP: Aborting query: INSERT INTO
        tbl_base (reg_name
                , reg_key
                , reg_secret
                , description
                , reg_status
                , user_id
                , owner_id)
        VALUES ("1618_akdrgdb003_name"
                ,"0031618"
                ,"1618_akdrgdb003_pw"
                ,"1618-akdrgdb003"
                ,"active"
                ,"1618_akdrgdb003_id"
                ,"A_accessorLoadtestCk_1")
140430 23:30:47 [Note] WSREP: victim 9185454 in MUST ABORT state
140430 23:30:47 [Note] WSREP: MDL victim killed at add_ticket
140430 23:30:47 [Note] WSREP: trx conflict for key 0 74657374 74626c5f62617365 : source: 4a6c8c29-cff5-11e3-bf2a-33f8b3f78fde versi
on: 2 local: 1 state: MUST_ABORT flags: 1 conn_id: 50048 trx_id: 9185454 seqnos (l: 16735, g: 140252, s: 140250, d: 139798, ts: 139
8900647665050495) <--X--> source: 4a6c8c29-cff5-11e3-bf2a-33f8b3f78fde version: 2 local: 1 state: APPLYING flags: 65 conn_id: 43095
 trx_id: -1 seqnos (l: 16734, g: 140251, s: 140249, d: 140250, ts: 1398900647649096183)
140430 23:30:47 [Note] WSREP: commit failed for reason: 3
--------------------------

Revision history for this message
Przemek (pmalkowski) wrote :

Confirmed for PXC 5.5.34 + Galera 2.8, 2.10 and 3.5

Revision history for this message
Przemek (pmalkowski) wrote :

Does not happen on PXC 5.6.15 + Galera 3.5

Changed in codership-mysql:
assignee: nobody → Seppo Jaakola (seppo-jaakola)
status: New → In Progress
milestone: none → 5.6.19-25.6
Revision history for this message
Teemu Ollakka (teemu-ollakka) wrote :
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Przemek,

Can you test with PXC in experimental (CentOS: http://jenkins.percona.com/experimental/CENTOS/percona-experimental.repo) and let us know if fixed.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

This fix caused regression for multi-part foreign keys having multi-byte varchar column as first element in the key. Fixes for this regression were pushed in:
wsrep-5.5: http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/4005
wsrep-5.6: http://bazaar.launchpad.net/~codership/codership-mysql/5.6/revision/4106

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

one more problem arose for 5.5 version due to the fix. Problem is that 5.5 strnxfrm will always pad the full destination string buffer. We don't know upfront how many characters will be converted, so allocation a good destination buffer is best guess, using safe default in the fix in revision:

http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/4006

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

A regression emerged for wsrep-5.5 when using single-byte varchar keys, fix pushed in revision:
http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/4007

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1001

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.