HA_ERR_KEY_NOT_FOUND

Bug #1261836 reported by Chriss
This bug affects 4 people
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Status: New | Importance: Undecided | Assigned to: Unassigned | Milestone: (none)

Bug Description

I'm having trouble with a multi-master setup using 3 nodes (all with the same my.cnf plus node-specific changes); perhaps you could help me.

node1 (started with bootstrap)
node2
node3 web-app via php-fpm

Table def.:

CREATE TABLE IF NOT EXISTS `cache_path` (
  `cid` varchar(255) NOT NULL DEFAULT '' COMMENT 'Primary Key: Unique cache ID.',
  `data` longblob COMMENT 'A collection of data to cache.',
  `expire` int(11) NOT NULL DEFAULT '0' COMMENT 'A Unix timestamp indicating when the cache entry
should expire, or 0 for never.',
  `created` int(11) NOT NULL DEFAULT '0' COMMENT 'A Unix timestamp indicating when the cache entry
was created.',
  `serialized` smallint(6) NOT NULL DEFAULT '0' COMMENT 'A flag to indicate whether content is
serialized (1) or not (0).',
  PRIMARY KEY (`cid`),
  KEY `expire` (`expire`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Cache table for path alias lookup.';

If I run this DELETE query:

DELETE FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708')

node1/node2 go down with this error message:

2013-12-17 13:43:51 14824 [Note] WSREP: Member 1 (ISP01-Node03-web01)
synced with group.
2013-12-17 13:45:36 14824 [ERROR] Slave SQL: Could not execute Delete_rows
event on table c9promente.cache_path; Can't find record in 'cache_path',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master
log FIRST, end_log_pos 352, Error_code: 1032
2013-12-17 13:45:36 14824 [Warning] WSREP: RBR event 3 Delete_rows apply
warning: 120, 8
2013-12-17 13:45:36 14824 [Warning] WSREP: Failed to apply app buffer:
seqno: 8, status: 1
          at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2013-12-17 13:45:36 14824 [ERROR] Slave SQL: Could not execute Delete_rows
event on table c9promente.cache_path; Can't find record in 'cache_path',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master
log FIRST, end_log_pos 352, Error_code: 1032
2013-12-17 13:45:36 14824 [Warning] WSREP: RBR event 3 Delete_rows apply
warning: 120, 8
2013-12-17 13:45:36 14824 [Warning] WSREP: Failed to apply app buffer:
seqno: 8, status: 1
          at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2013-12-17 13:45:36 14824 [ERROR] Slave SQL: Could not execute Delete_rows
event on table c9promente.cache_path; Can't find record in 'cache_path',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master
log FIRST, end_log_pos 352, Error_code: 1032
2013-12-17 13:45:36 14824 [Warning] WSREP: RBR event 3 Delete_rows apply
warning: 120, 8
2013-12-17 13:45:36 14824 [Warning] WSREP: Failed to apply app buffer:
seqno: 8, status: 1
          at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2013-12-17 13:45:36 14824 [ERROR] Slave SQL: Could not execute Delete_rows
event on table c9promente.cache_path; Can't find record in 'cache_path',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master
log FIRST, end_log_pos 352, Error_code: 1032
2013-12-17 13:45:36 14824 [Warning] WSREP: RBR event 3 Delete_rows apply
warning: 120, 8
2013-12-17 13:45:36 14824 [ERROR] WSREP: Failed to apply trx: source:
cadcc688-6718-11e3-b498-276df7845977 version: 3 local: 0 state: APPLYING
flags: 129 conn_id: 4 trx_id: 228264 seqnos (l: 14, g: 8, s: 7, d: 7, ts:
9153877911069)
2013-12-17 13:45:36 14824 [ERROR] WSREP: Failed to apply trx 8 4 times
2013-12-17 13:45:36 14824 [ERROR] WSREP: Node consistency compromized,
aborting...

Revision history for this message
Chriss (bst2002) wrote :
description: updated
Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

I wonder what is the output of:

explain DELETE FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708');
explain SELECT * FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708');
select count(*) FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708');

on node3 (where it worked) vs. node1 (where there was error during replication).
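In the same diagnostic spirit, a quick way to check whether the nodes have actually diverged is to compute a content checksum of the table on each node and compare the results (an editor's sketch for illustration, using the `cache_path` columns from the report; `CHECKSUM TABLE ... EXTENDED` and the `BIT_XOR`/`CRC32` digest are standard MySQL constructs):

```sql
-- Run the same statements on every node and compare the outputs.
CHECKSUM TABLE cache_path EXTENDED;

-- An order-independent content digest of the rows:
SELECT COUNT(*) AS row_count,
       BIT_XOR(CRC32(CONCAT_WS('#', cid, expire, created, serialized))) AS digest
FROM cache_path;
```

If the digests differ between nodes, the replication error is a symptom of pre-existing inconsistency rather than its cause.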

Changed in percona-xtradb-cluster:
status: New → Incomplete
Revision history for this message
Chriss (bst2002) wrote :

explain DELETE FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708');
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | cache_path | ALL  | expire        | NULL | NULL    | NULL |   36 | Using where |
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+

Changed in percona-xtradb-cluster:
status: Incomplete → New
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Chriss,

a) Can you list the PXC/Galera packages you have installed?

b) Are the table and the query sufficient to bring down the node?
(I tried with an empty table and that query, and could not reproduce it.) Can you
provide a minimal test SQL to reproduce this?

Revision history for this message
Chriss (bst2002) wrote :

@ raghavendra-prabhu

a) These were the packages installed at the time of failure:

CentOS release 6.5 (Final)
kernel 2.6.32-431.el6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.14-25.1.571.rhel6.x86_64
Percona-XtraDB-Cluster-test-56-5.6.14-25.1.571.rhel6.x86_64
percona-xtrabackup-2.1.6-702.rhel6.x86_64
Percona-XtraDB-Cluster-galera-56-3.1-1.169.rhel6.x86_64
Percona-XtraDB-Cluster-server-56-5.6.14-25.1.571.rhel6.x86_64
percona-release-0.0-1.x86_64
percona-toolkit-2.2.5-2.noarch
Percona-XtraDB-Cluster-client-56-5.6.14-25.1.571.rhel6.x86_64

b) Yes, with the above query I was able to reproduce it.

I don't know whether thread concurrency (wsrep_slave_threads=2 or 4) was a factor.

After upgrading (17 Dec 2013) to the newly released packages (out since today):

CentOS release 6.5 (Final) with latest updates
kernel 2.6.32-431.3.1.el6.x86_64
percona-xtrabackup-2.1.6-702.rhel6.x86_64
Percona-XtraDB-Cluster-server-56-5.6.15-25.2.645.rhel6.x86_64
Percona-XtraDB-Cluster-galera-3-3.2-1.189.rhel6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.15-25.2.645.rhel6.x86_64
Percona-XtraDB-Cluster-client-56-5.6.15-25.2.645.rhel6.x86_64
percona-release-0.0-1.x86_64

With 3 nodes on the same packages and the same config (except node-specific IPs etc.), it seems to work (see attached config and table), but only with wsrep_slave_threads=1; if I set this higher, it fails.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Valerii has asked the right questions: this is most likely an inconsistency, and most likely it is user-induced.

Revision history for this message
Dmitry Gribov (grib-d) wrote :

We are experiencing the same thing.
After SST (there is no way a user could put the cluster into an inconsistent state without cluster errors being involved), we eventually see log records like:
Slave SQL: Could not execute Delete_rows event on table fbhub.rating_books_global; Can't find record in 'rating_books_global', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1098, Error_code: 1032

and then the cluster stalls. Two of three nodes hit this HA_ERR_KEY_NOT_FOUND and die; the survivor (which was distributing this delete, I would guess) hangs in a "too many users" state. Bah.

My first question is: how could the cluster get inconsistent in such a way?
My second question is: why does deleting an already nonexistent row cause the node to die instead of emitting a simple warning? If a node ignored such an error, the cluster would stay up, at least for this issue, so why react so drastically? Warn and drop the delete request.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

First answer: of course, there might be a bug. But I would first carefully scrutinize all of your and your clients' actions, starting with the first node start. Were the nodes ever restarted? Are there any root users? Was wsrep_on=OFF or wsrep_OSU_method=RSU ever used? Which node was the donor of the SST you mention?

Second answer: yes, this probably could be relaxed. However, even if a DELETE of a non-existing row may seem safe to ignore, it is most likely not the only inconsistency in the database, and an operation that can't be ignored will soon follow. That may result in wrong data being sent to a client. Generally, the sooner you deal with inconsistency, the better.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

I tried the test script from Chriss against fresh development trunk builds of wsrep-5.5 (MySQL 5.5.35) and wsrep-5.6 (MySQL 5.6.15), both using the latest Galera plugin 3.2, and in both cases the test passes with no crashes.

Revision history for this message
Dmitry Gribov (grib-d) wrote :

We never play with wsrep_on=OFF or wsrep_OSU_method=RSU (why would anyone?).
We do not use root users (but we have some with full access to *.*, which is about the same thing, I would guess).
The donor in such a case is a single node: we restart one node, then SST to another, then the third node takes SST from one of these. If there is no issue with full access on *.* failing to replicate properly by design, then there is surely a way to compromise a cluster with a legal SQL statement.

> Generally, the sooner you deal with inconsistency - the better.
I admit this sounds right, and I could even agree. But:
1. I have reasons to believe this inconsistency is freshly made. For example, the missing row was just deleted on node A via node C, and then node B comes along and requests the deletion. Probably not that simple, but something like this.
2. The surviving node is dead anyway, so its consistency is of no use. At the very least, the neighbors could shut down without bringing the whole cluster down.

Affected table looks like this:
CREATE TABLE `rating_books_global` (
  `art` int(10) unsigned NOT NULL,
  `face` smallint(5) unsigned NOT NULL,
  `period` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `atype` tinyint(3) unsigned NOT NULL,
  `genre` smallint(5) unsigned NOT NULL,
  `sales` int(10) unsigned NOT NULL DEFAULT '0',
  UNIQUE KEY `fpga` (`art`,`genre`,`face`,`period`),
  KEY `r_genre` (`genre`,`face`,`period`,`atype`,`sales`,`art`),
  KEY `r_face` (`face`,`period`,`art`,`sales`),
  KEY `fpsa` (`face`,`period`,`sales`,`art`),
  KEY `fpasa` (`face`,`period`,`atype`,`sales`,`art`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

It's kind of ugly in terms of indexes, but at least it has a unique key.

There is one more specific thing I should mention: we delete from rating_books_global like this:
DELETE r
        FROM rating_books_global AS r
          JOIN lib_faces AS lf ON lf.id=r.face AND lf.lib = ?
          LEFT JOIN temp.rb_all_books AS a_tmp ON a_tmp.art = r.art
            AND a_tmp.face = r.face
            AND a_tmp.genre = r.genre
        WHERE a_tmp.art IS NULL
          AND r.period = ?

where temp.rb_all_books is a temporary table.

Revision history for this message
Jervin R (revin) wrote :

Another report of a possibly similar incident:

node1 starts after clean shutdown:
140302 08:00:06 mysqld_safe Skipping wsrep-recover for d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:267471147 pair

node2 starts after a clean shutdown:
140302 09:32:48 mysqld_safe Skipping wsrep-recover for d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:267471147 pair

node2 requests an IST from node1; it looks like the writesets from the IST are causing the errors:
2014-03-02 09:28:27 35258 [Note] WSREP: State transfer required:
        Group state: d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:269525631
        Local state: d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:267471147
...
2014-03-02 09:31:18 35258 [Note] WSREP: SST received: d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:267471147
2014-03-02 09:31:18 35258 [Note] WSREP: Receiving IST: 2054484 writesets, seqnos 267471147-269525631

2014-03-02 09:31:18 35258 [Note] /mysql/bin/mysqld: ready for connections.
Version: '5.6.15-log' socket: '/tmp/mysql.sock' port: 3306 Source distribution, wsrep_25.4.rXXXX
2014-03-02 09:31:18 35258 [ERROR] Slave SQL: Could not execute Update_rows event on table xxxdb.xxxtbl; Can't find record in 'xxxtbl', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 251, Error_code: 1032
2014-03-02 09:31:18 35258 [Warning] WSREP: RBR event 4 Update_rows apply warning: 120, 267471157
2014-03-02 09:31:18 35258 [ERROR] Slave SQL: Could not execute Update_rows event on table xxxdb.xxxtbl; Can't find record in 'xxxtbl', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 251, Error_code: 1032
2014-03-02 09:31:18 35258 [Warning] WSREP: RBR event 4 Update_rows apply warning: 120, 267471162
2014-03-02 09:31:18 35258 [Warning] WSREP: Failed to apply app buffer: seqno: 267471157, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2014-03-02 09:31:18 35258 [Warning] WSREP: Failed to apply app buffer: seqno: 267471162, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 2th time

Revision history for this message
Sergey (cyber-neo) wrote :

I've got the same error:
2014-03-29 20:11:43 23819 [ERROR] Slave SQL: Could not execute Update_rows event on table smproduction.b_sm_store_rest; Can't find record in 'b_sm_store_rest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 203, Error_code: 1032
2014-03-29 20:11:43 23819 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 50196216
2014-03-29 20:11:43 23819 [Note] WSREP: (c62bcfa7-ae73-11e3-8763-6ea9625e6a5d, 'tcp://0.0.0.0:4567') address 'tcp://10.74.184.39:4567' pointing to uuid c62bcfa7-ae73-11e3-8763-6ea9625e6a5d is blacklisted, skipping
2014-03-29 20:11:43 23819 [Note] WSREP: (c62bcfa7-ae73-11e3-8763-6ea9625e6a5d, 'tcp://0.0.0.0:4567') address 'tcp://10.74.184.39:4567' pointing to uuid c62bcfa7-ae73-11e3-8763-6ea9625e6a5d is blacklisted, skipping
2014-03-29 20:11:43 23819 [Warning] WSREP: Failed to apply app buffer: seqno: 50196216, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2014-03-29 20:11:43 23819 [ERROR] Slave SQL: Could not execute Update_rows event on table smproduction.b_sm_store_rest; Can't find record in 'b_sm_store_rest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 203, Error_code: 1032
2014-03-29 20:11:43 23819 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 50196216
2014-03-29 20:11:43 23819 [Warning] WSREP: Failed to apply app buffer: seqno: 50196216, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2014-03-29 20:11:43 23819 [ERROR] Slave SQL: Could not execute Update_rows event on table smproduction.b_sm_store_rest; Can't find record in 'b_sm_store_rest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 203, Error_code: 1032
2014-03-29 20:11:43 23819 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 50196216
2014-03-29 20:11:43 23819 [Warning] WSREP: Failed to apply app buffer: seqno: 50196216, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2014-03-29 20:11:43 23819 [ERROR] Slave SQL: Could not execute Update_rows event on table smproduction.b_sm_store_rest; Can't find record in 'b_sm_store_rest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 203, Error_code: 1032
2014-03-29 20:11:43 23819 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 50196216
2014-03-29 20:11:43 23819 [Note] WSREP: (c62bcfa7-ae73-11e3-8763-6ea9625e6a5d, 'tcp://0.0.0.0:4567') address 'tcp://10.74.184.39:4567' pointing to uuid c62bcfa7-ae73-11e3-8763-6ea9625e6a5d is blacklisted, skipping
2014-03-29 20:11:43 23819 [ERROR] WSREP: Failed to apply trx: source: 55c6879a-addb-11e3-8088-72173dd9b136 version: 3 local: 0 state: APPLYING flags: 129 conn_id: 39346883 trx_id: 270863489 seqnos (l: 22323196, g: 50196216, s: 50196215, d: 50196215, ts: 11665052428770158)
2014-03-29 20:11:43 23819 [ERROR] WSREP: Failed to apply trx 50196216 4 times
2014-03-29 20:11:43 23819 [ERROR] WSREP: Node consistency compromized, aborting...
2014-03-29 20:11:43 23819 [Note] WSREP: Closing send monitor...
2014-03-29 20:11:43 23819 [Note] WSREP: Close...


Revision history for this message
Dmitry Gribov (grib-d) wrote :

We have faced the same problem on another table, with no temporary table involved. There seem to be two concurrent delete requests coming from the same node using different keys, both of the form "DELETE ... LIMIT": say, one request is "delete from rating_books_global where period = 1 limit 500" and another is "delete from rating_books_global where genre = 16 limit 500".
We have also altered the first table to add an implicit primary bigint key; it did not help, the problem persists.
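For clarity, the two concurrent statements quoted above, written out as SQL:

```sql
-- Two DELETE ... LIMIT statements issued concurrently from the same node,
-- matching overlapping row sets via different keys:
DELETE FROM rating_books_global WHERE period = 1 LIMIT 500;
DELETE FROM rating_books_global WHERE genre = 16 LIMIT 500;
```

Worth noting: `DELETE ... LIMIT` without `ORDER BY` picks an unspecified subset of matching rows, which is documented as unsafe for statement-based replication; Galera replicates row events (as the "RBR event" log lines show), so the writesets themselves are deterministic, but the two statements can still touch overlapping rows.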

Revision history for this message
Dmitry Gribov (grib-d) wrote :

Or maybe one of the requests is not using LIMIT; it's hard to tell exactly, we can only guess. We are going to try the latest trunk to see if this has been fixed (doubtful).

Revision history for this message
Dmitry Gribov (grib-d) wrote :

The latest testing build added a new wrinkle:
140410 15:45:41 [ERROR] Slave SQL: Could not execute Write_rows event on table lib_area_100.users; Duplicate entry '<censored>' for key 'login', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 403, Error_code: 1062
140410 15:45:41 [Warning] WSREP: RBR event 2 Write_rows apply warning: 121, 7824565248
140410 15:45:41 [Warning] WSREP: Failed to apply app buffer: seqno: 7824565248, status: 1
         at galera/src/replicator_smm.cpp:apply_wscoll():57

This repeats quite often. I don't know whether this is the same issue or a different one.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Dmitry, does it happen with wsrep_slave_threads=1 ?

Revision history for this message
Dmitry Gribov (grib-d) wrote :

wsrep_slave_threads=250

And, unlike HA_ERR_KEY_NOT_FOUND, which strikes all nodes but one, this one hits a single node while leaving the others intact. This is much better, given that the cluster does not stall, but still annoying.

P.S. We have had no stall with HA_ERR_KEY_NOT_FOUND since migrating to the latest build. Not sure this means anything, though; sometimes it takes a month to happen, so we are prepared for the worst, but it is worth mentioning.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

OK, so this may not be a real inconsistency but a parallel-applying bug. Are there foreign keys involved? What is the definition of lib_area_100.users? Any references to that table?

Revision history for this message
Dmitry Gribov (grib-d) wrote :

I switched to wsrep_slave_threads=1 to see what happens (no significant slowdown so far).
But this setup worked well with several previous releases (except for the HA_ERR_KEY_NOT_FOUND).

Revision history for this message
Dmitry Gribov (grib-d) wrote :

CREATE TABLE `users` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `login` varchar(100) DEFAULT NULL,
  `pwd` varchar(100) DEFAULT NULL,
  `s_mail` varchar(50) DEFAULT NULL,
  `s_www` varchar(255) DEFAULT NULL,
  `s_inn` varchar(50) DEFAULT NULL,
  `s_descr` text,
  `s_phone` varchar(100) DEFAULT NULL,
  `offert_accepted` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `s_full_name` varchar(255) DEFAULT NULL,
  `s_first_name` varchar(100) DEFAULT NULL,
  `s_middle_name` varchar(100) DEFAULT NULL,
  `s_last_name` varchar(100) DEFAULT NULL,
  `s_city` varchar(100) DEFAULT NULL,
  `s_address` varchar(255) DEFAULT NULL,
  `last_used` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `last_host_id` smallint(5) unsigned DEFAULT NULL,
  `mail_confirmed` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `msisdn_confirmed` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `msisdn_confirm_code` varchar(4) DEFAULT NULL,
  `msisdn_req_time` datetime DEFAULT NULL,
  `java_phone_model` varchar(30) DEFAULT NULL,
  `java_font` varchar(30) DEFAULT NULL,
  `java_safe_mode` tinyint(4) NOT NULL DEFAULT '0',
  `user_pic` varchar(16) DEFAULT NULL,
  `denied_libs` varchar(255) DEFAULT NULL,
  `recenser_type` int(11) DEFAULT NULL,
  `partner_id` int(11) DEFAULT NULL,
  `creat_date` datetime DEFAULT NULL,
  `partner` int(10) unsigned DEFAULT NULL,
  `partner_valid_till` date DEFAULT NULL,
  `partner_pin` varchar(32) DEFAULT NULL,
  `account` decimal(9,2) NOT NULL DEFAULT '0.00',
  `abonement_start` datetime DEFAULT NULL,
  `abonement_expires` date NOT NULL DEFAULT '2006-01-01',
  `abonement_period` smallint(6) DEFAULT NULL,
  `abonement_delay` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_max_price` decimal(6,2) DEFAULT NULL,
  `abonement_downloads` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_left_clicks` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_left_summ` decimal(6,2) DEFAULT NULL,
  `user_pic_height` tinyint(3) unsigned DEFAULT NULL,
  `user_pic_width` tinyint(3) unsigned DEFAULT NULL,
  `show_pay_btn` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `s_puid` varchar(255) DEFAULT NULL,
  `discount` decimal(4,4) NOT NULL DEFAULT '0.0000',
  `money_bonus` decimal(9,2) NOT NULL DEFAULT '0.00',
  `subscr_last_reminded` datetime DEFAULT NULL,
  `subscr_free_arts_given` datetime DEFAULT NULL,
  `subscr_type` tinyint(3) unsigned NOT NULL DEFAULT '2',
  `subscr_period` tinyint(3) unsigned DEFAULT '1',
  `subscr_content` tinyint(3) unsigned DEFAULT '2',
  `subscr_genres` text,
  `s_subscr_text_authors` text,
  `subscr_date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `s_subscr_text_pattern` text,
  `subscr_languages` text,
  `prefered_currency` char(3) DEFAULT NULL,
  `last_paymethod` tinyint(3) unsigned DEFAULT NULL,
  `utc_offset` char(6) DEFAULT NULL,
  `last_ip` varchar(15) DEFAULT NULL,
  `socnet_last_reminded` datetime DEFAULT '0000-01-01 00:00:00',
  `moved_from` smallint(5) unsigned DEFAULT NULL,
  `moved_to` smallint(5) unsigned DEFAULT NULL,
  `subscribe_new_buys` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `login` (`login`),
  KEY `s_mail` (`s_mail`),
  KEY `partner` (`partner`),
  KEY `partner_valid_till` (`partn...


Revision history for this message
Dmitry Gribov (grib-d) wrote :

The table is referenced from, well... from anywhere :)
New records are added in a transaction: "SELECT login FROM users WHERE login=? FOR UPDATE", then an insert into users, then an insert into groupusers (which references users), then commit.
Perhaps "SELECT login FROM users WHERE login=? FOR UPDATE" does something it should not.

Revision history for this message
Dmitry Gribov (grib-d) wrote :

It's all the same with wsrep_slave_threads=1; no improvement:
140416 10:00:55 [ERROR] Slave SQL: Could not execute Write_rows event on table lib_area_100.users; Duplicate entry '<censored>' for key 'login', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 393, Error_code: 1062
140416 10:00:55 [Warning] WSREP: RBR event 2 Write_rows apply warning: 121, 7970604452
140416 10:00:55 [Warning] WSREP: Failed to apply app buffer: seqno: 7970604452, status: 1

Revision history for this message
Dmitry Gribov (grib-d) wrote :

P.S. After setting wsrep_slave_threads=1, all servers were restarted, so it was definitely 1 when the duplicate entry happened.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Thanks Dmitry, this rules out parallel applying from the suspect list. Applier threads are now dynamic: you can step appliers up and down online. The number of active applier threads is visible in SHOW PROCESSLIST output.

This problem is probably due to unique key handling, and this bug may be related: https://bugs.launchpad.net/codership-mysql/+bug/1299116

Revision history for this message
Dmitry Gribov (grib-d) wrote :

#1299116 looks quite possible

Revision history for this message
Naresh (naresh-gartner) wrote :

We're running a 3-node Percona XtraDB Cluster (5.6.24-72.2-56-log), using Percona XtraBackup as our SST method, and we are seeing repeated crashes on one node referencing error code 1032.
We've had 11 crashes on a double node in the last 20 days or so. Each time, the MySQL error log reports an error that always starts with error code 1032 or 1610 and looks like this:

Node -2
=======

2015-10-09 13:42:57 14456 [ERROR] Slave SQL: Could not execute Delete_rows event on table gartner1.nosql_mysql; Can't find record in 'nosql_mysql', Error_code: 1032; handler error

HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 205, Error_code: 1032
2015-10-09 13:42:57 14456 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 2070
2015-10-09 13:42:57 14456 [Warning] WSREP: Failed to apply app buffer: seqno: 2070, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
2015-10-09 13:42:57 14456 [ERROR] Slave SQL: Could not execute Delete_rows event on table gartner1.nosql_mysql; Can't find record in 'nosql_mysql', Error_code: 1032; handler error

HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 205, Error_code: 1032
2015-10-09 13:42:57 14456 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 2070
2015-10-09 13:42:57 14456 [Warning] WSREP: Failed to apply app buffer: seqno: 2070, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-10-09 13:42:57 14456 [ERROR] Slave SQL: Could not execute Delete_rows event on table gartner1.nosql_mysql; Can't find record in 'nosql_mysql', Error_code: 1032; handler error

HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 205, Error_code: 1032
2015-10-09 13:42:57 14456 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 2070
2015-10-09 13:42:57 14456 [Warning] WSREP: Failed to apply app buffer: seqno: 2070, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2015-10-09 13:42:57 14456 [ERROR] Slave SQL: Could not execute Delete_rows event on table gartner1.nosql_mysql; Can't find record in 'nosql_mysql', Error_code: 1032; handler error

HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 205, Error_code: 1032
2015-10-09 13:42:57 14456 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 2070
2015-10-09 13:42:57 14456 [ERROR] WSREP: Failed to apply trx: source: a8283cf6-518d-11e5-94c0-eafb6f6639d1 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 89827 trx_id: 61225252 seqnos (l:

11, g: 2070, s: 2069, d: 2067, ts: 13408212760377206)
2015-10-09 13:42:57 14456 [ERROR] WSREP: Failed to apply trx 2070 4 times
2015-10-09 13:42:57 14456 [ERROR] WSREP: Node consistency compromized, aborting...
2015-10-09 13:42:57 14456 [Note] WSREP: Closing send monitor...
2015-10-09 13:42:57 14456 [Note] WSREP: Closed send monitor.
2015-10-09 13:42:57 14456 [Note] WSREP: gcomm: terminating thread
2015-10-09 13:42:57 14456 [Note] WSREP: gcomm: joining thread
2015-10-09 13:42:57 14456 [Note] WSREP: gcomm: closing backend

Node - 3
========

2015-10-08 22:42:01 1235 [Note] WSREP: (2751e6a3, 'tcp://0.0.0.0:4567') turning message relay requesting off
2015-10-09 13:42:57 1235 ...

