HA_ERR_KEY_NOT_FOUND

Bug #1261836 reported by Chriss
This bug affects 4 people
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Status: New | Importance: Undecided | Assigned to: Unassigned | Milestone: (none)

Bug Description

I'm having trouble with a multi-master setup using 3 nodes (all with the same my.cnf plus node-specific changes); perhaps you could help me.

node1 (started with bootstrap)
node2
node3 web-app via php-fpm

Table def.:

CREATE TABLE IF NOT EXISTS `cache_path` (
  `cid` varchar(255) NOT NULL DEFAULT '' COMMENT 'Primary Key: Unique cache ID.',
  `data` longblob COMMENT 'A collection of data to cache.',
  `expire` int(11) NOT NULL DEFAULT '0' COMMENT 'A Unix timestamp indicating when the cache entry
should expire, or 0 for never.',
  `created` int(11) NOT NULL DEFAULT '0' COMMENT 'A Unix timestamp indicating when the cache entry
was created.',
  `serialized` smallint(6) NOT NULL DEFAULT '0' COMMENT 'A flag to indicate whether content is
serialized (1) or not (0).',
  PRIMARY KEY (`cid`),
  KEY `expire` (`expire`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Cache table for path alias lookup.';

If I run this DELETE query:

DELETE FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708')

node1/node2 go down with this error message:

2013-12-17 13:43:51 14824 [Note] WSREP: Member 1 (ISP01-Node03-web01)
synced with group.
2013-12-17 13:45:36 14824 [ERROR] Slave SQL: Could not execute Delete_rows
event on table c9promente.cache_path; Can't find record in 'cache_path',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master
log FIRST, end_log_pos 352, Error_code: 1032
2013-12-17 13:45:36 14824 [Warning] WSREP: RBR event 3 Delete_rows apply
warning: 120, 8
2013-12-17 13:45:36 14824 [Warning] WSREP: Failed to apply app buffer:
seqno: 8, status: 1
          at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2013-12-17 13:45:36 14824 [ERROR] Slave SQL: Could not execute Delete_rows
event on table c9promente.cache_path; Can't find record in 'cache_path',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master
log FIRST, end_log_pos 352, Error_code: 1032
2013-12-17 13:45:36 14824 [Warning] WSREP: RBR event 3 Delete_rows apply
warning: 120, 8
2013-12-17 13:45:36 14824 [Warning] WSREP: Failed to apply app buffer:
seqno: 8, status: 1
          at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2013-12-17 13:45:36 14824 [ERROR] Slave SQL: Could not execute Delete_rows
event on table c9promente.cache_path; Can't find record in 'cache_path',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master
log FIRST, end_log_pos 352, Error_code: 1032
2013-12-17 13:45:36 14824 [Warning] WSREP: RBR event 3 Delete_rows apply
warning: 120, 8
2013-12-17 13:45:36 14824 [Warning] WSREP: Failed to apply app buffer:
seqno: 8, status: 1
          at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2013-12-17 13:45:36 14824 [ERROR] Slave SQL: Could not execute Delete_rows
event on table c9promente.cache_path; Can't find record in 'cache_path',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master
log FIRST, end_log_pos 352, Error_code: 1032
2013-12-17 13:45:36 14824 [Warning] WSREP: RBR event 3 Delete_rows apply
warning: 120, 8
2013-12-17 13:45:36 14824 [ERROR] WSREP: Failed to apply trx: source:
cadcc688-6718-11e3-b498-276df7845977 version: 3 local: 0 state: APPLYING
flags: 129 conn_id: 4 trx_id: 228264 seqnos (l: 14, g: 8, s: 7, d: 7, ts:
9153877911069)
2013-12-17 13:45:36 14824 [ERROR] WSREP: Failed to apply trx 8 4 times
2013-12-17 13:45:36 14824 [ERROR] WSREP: Node consistency compromized,
aborting...

Revision history for this message
Chriss (bst2002) wrote :
description: updated
Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

I wonder what is the output of:

explain DELETE FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708');
explain SELECT * FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708');
select count(*) FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708');

on node3 (where it worked) vs. node1 (where there was error during replication).
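In the same diagnostic spirit, a quick way to check whether the nodes have actually diverged is to compute a content checksum of the table on each node and compare the results (an editor's sketch for illustration, using the `cache_path` columns from the report; `CHECKSUM TABLE ... EXTENDED` and the `BIT_XOR`/`CRC32` digest are standard MySQL constructs):

```sql
-- Run the same statements on every node and compare the outputs.
CHECKSUM TABLE cache_path EXTENDED;

-- An order-independent content digest of the rows:
SELECT COUNT(*) AS row_count,
       BIT_XOR(CRC32(CONCAT_WS('#', cid, expire, created, serialized))) AS digest
FROM cache_path;
```

If the digests differ between nodes, the replication error is a symptom of pre-existing inconsistency rather than its cause.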

Changed in percona-xtradb-cluster:
status: New → Incomplete
Revision history for this message
Chriss (bst2002) wrote :

explain DELETE FROM cache_path WHERE (expire <> '0') AND (expire < '1387189708');
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | cache_path | ALL  | expire        | NULL | NULL    | NULL |   36 | Using where |
+----+-------------+------------+------+---------------+------+---------+------+------+-------------+

Changed in percona-xtradb-cluster:
status: Incomplete → New
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Chriss,

a) Can you list the PXC/Galera packages you have installed?

b) Are the table and the query sufficient to bring down the node?
(I tried with an empty table and that query, and could not reproduce it.) Can you
provide a minimal test SQL to reproduce this?

Revision history for this message
Chriss (bst2002) wrote :

@ raghavendra-prabhu

a) These were the packages installed at the time of failure:

CentOS release 6.5 (Final)
kernel 2.6.32-431.el6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.14-25.1.571.rhel6.x86_64
Percona-XtraDB-Cluster-test-56-5.6.14-25.1.571.rhel6.x86_64
percona-xtrabackup-2.1.6-702.rhel6.x86_64
Percona-XtraDB-Cluster-galera-56-3.1-1.169.rhel6.x86_64
Percona-XtraDB-Cluster-server-56-5.6.14-25.1.571.rhel6.x86_64
percona-release-0.0-1.x86_64
percona-toolkit-2.2.5-2.noarch
Percona-XtraDB-Cluster-client-56-5.6.14-25.1.571.rhel6.x86_64

b) Yes, with the above query I was able to reproduce it.

I don't know whether thread concurrency (wsrep_slave_threads=2 or 4) was a factor.

After upgrading (17 Dec 2013) to the newly released packages (out since today):

CentOS release 6.5 (Final) with latest updates
kernel 2.6.32-431.3.1.el6.x86_64
percona-xtrabackup-2.1.6-702.rhel6.x86_64
Percona-XtraDB-Cluster-server-56-5.6.15-25.2.645.rhel6.x86_64
Percona-XtraDB-Cluster-galera-3-3.2-1.189.rhel6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.15-25.2.645.rhel6.x86_64
Percona-XtraDB-Cluster-client-56-5.6.15-25.2.645.rhel6.x86_64
percona-release-0.0-1.x86_64

With 3 nodes on the same packages and the same config (except node-specific IPs etc.), it seems to work (see attached config and table), but only with wsrep_slave_threads=1; if I set this higher, it fails.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Valerii has asked the right questions: this is most likely an inconsistency, and most likely it is user-induced.

Revision history for this message
Dmitry Gribov (grib-d) wrote :

We are experiencing the same thing.
After SST (there is no way a user could put the cluster into an inconsistent state without cluster errors being involved), we eventually see log records like:
Slave SQL: Could not execute Delete_rows event on table fbhub.rating_books_global; Can't find record in 'rating_books_global', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1098, Error_code: 1032

and then the cluster stalls. Two of three nodes hit this HA_ERR_KEY_NOT_FOUND and die; the survivor (which was distributing this delete, I would guess) hangs in a "too many users" state. Bah.

My first question is: how could the cluster get inconsistent in such a way?
My second question is: why does deleting an already nonexistent row cause the node to die instead of emitting a simple warning? If a node ignored such an error, the cluster would stay up, at least for this issue, so why react so drastically? Warn and drop the delete request.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

First answer: of course, there might be a bug. But I would first carefully scrutinize all of your and your clients' actions, starting with the first node start. Were the nodes ever restarted? Are there any root users? Was wsrep_on=OFF or wsrep_OSU_method=RSU ever used? Which node was the donor of the SST you mention?

Second answer: yes, this probably could be relaxed. However, even if a DELETE of a non-existing row may seem safe to ignore, it is most likely not the only inconsistency in the database, and an operation that can't be ignored will soon follow. That may result in wrong data being sent to a client. Generally, the sooner you deal with inconsistency, the better.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

I tried the test script from Chriss against fresh development trunk builds of wsrep-5.5 (MySQL 5.5.35) and wsrep-5.6 (MySQL 5.6.15), both using the latest Galera plugin 3.2, and in both cases the test passes with no crashes.

Revision history for this message
Dmitry Gribov (grib-d) wrote :

We never play with wsrep_on=OFF or wsrep_OSU_method=RSU (why would anyone?).
We do not use root users (but we have some with full access to *.*, which is about the same thing, I would guess).
The donor in such a case is a single node: we restart one node, then SST to another, then the third node takes SST from one of these. If there is no issue with full access on *.* failing to replicate properly by design, then there is surely a way to compromise a cluster with a legal SQL statement.

> Generally, the sooner you deal with inconsistency - the better.
I admit this sounds right, and I could even agree. But:
1. I have reasons to believe this inconsistency is freshly made. For example, the missing row was just deleted on node A via node C, and then node B comes along and requests the deletion. Probably not that simple, but something like this.
2. The surviving node is dead anyway, so its consistency is of no use. At the very least, the neighbors could shut down without bringing the whole cluster down.

Affected table looks like this:
CREATE TABLE `rating_books_global` (
  `art` int(10) unsigned NOT NULL,
  `face` smallint(5) unsigned NOT NULL,
  `period` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `atype` tinyint(3) unsigned NOT NULL,
  `genre` smallint(5) unsigned NOT NULL,
  `sales` int(10) unsigned NOT NULL DEFAULT '0',
  UNIQUE KEY `fpga` (`art`,`genre`,`face`,`period`),
  KEY `r_genre` (`genre`,`face`,`period`,`atype`,`sales`,`art`),
  KEY `r_face` (`face`,`period`,`art`,`sales`),
  KEY `fpsa` (`face`,`period`,`sales`,`art`),
  KEY `fpasa` (`face`,`period`,`atype`,`sales`,`art`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

It's kind of ugly in terms of indexes, but at least it has a unique key.

There is one more specific thing I should mention: we delete from rating_books_global like this:
DELETE r
        FROM rating_books_global AS r
          JOIN lib_faces AS lf ON lf.id=r.face AND lf.lib = ?
          LEFT JOIN temp.rb_all_books AS a_tmp ON a_tmp.art = r.art
            AND a_tmp.face = r.face
            AND a_tmp.genre = r.genre
        WHERE a_tmp.art IS NULL
          AND r.period = ?

where temp.rb_all_books is a temporary table.

Revision history for this message
Jervin R (revin) wrote :

Another report of a possibly similar incident:

node1 starts after clean shutdown:
140302 08:00:06 mysqld_safe Skipping wsrep-recover for d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:267471147 pair

node2 starts after a clean shutdown:
140302 09:32:48 mysqld_safe Skipping wsrep-recover for d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:267471147 pair

node2 requests an IST from node1; it looks like the writesets from the IST are causing the errors:
2014-03-02 09:28:27 35258 [Note] WSREP: State transfer required:
        Group state: d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:269525631
        Local state: d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:267471147
...
2014-03-02 09:31:18 35258 [Note] WSREP: SST received: d24a83b1-9b3d-11e3-a7ff-fe26fb202da8:267471147
2014-03-02 09:31:18 35258 [Note] WSREP: Receiving IST: 2054484 writesets, seqnos 267471147-269525631

2014-03-02 09:31:18 35258 [Note] /mysql/bin/mysqld: ready for connections.
Version: '5.6.15-log' socket: '/tmp/mysql.sock' port: 3306 Source distribution, wsrep_25.4.rXXXX
2014-03-02 09:31:18 35258 [ERROR] Slave SQL: Could not execute Update_rows event on table xxxdb.xxxtbl; Can't find record in 'xxxtbl', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 251, Error_code: 1032
2014-03-02 09:31:18 35258 [Warning] WSREP: RBR event 4 Update_rows apply warning: 120, 267471157
2014-03-02 09:31:18 35258 [ERROR] Slave SQL: Could not execute Update_rows event on table xxxdb.xxxtbl; Can't find record in 'xxxtbl', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 251, Error_code: 1032
2014-03-02 09:31:18 35258 [Warning] WSREP: RBR event 4 Update_rows apply warning: 120, 267471162
2014-03-02 09:31:18 35258 [Warning] WSREP: Failed to apply app buffer: seqno: 267471157, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2014-03-02 09:31:18 35258 [Warning] WSREP: Failed to apply app buffer: seqno: 267471162, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 2th time

Revision history for this message
Sergey (cyber-neo) wrote :

I've got the same error:
2014-03-29 20:11:43 23819 [ERROR] Slave SQL: Could not execute Update_rows event on table smproduction.b_sm_store_rest; Can't find record in 'b_sm_store_rest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 203, Error_code: 1032
2014-03-29 20:11:43 23819 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 50196216
2014-03-29 20:11:43 23819 [Note] WSREP: (c62bcfa7-ae73-11e3-8763-6ea9625e6a5d, 'tcp://0.0.0.0:4567') address 'tcp://10.74.184.39:4567' pointing to uuid c62bcfa7-ae73-11e3-8763-6ea9625e6a5d is blacklisted, skipping
2014-03-29 20:11:43 23819 [Note] WSREP: (c62bcfa7-ae73-11e3-8763-6ea9625e6a5d, 'tcp://0.0.0.0:4567') address 'tcp://10.74.184.39:4567' pointing to uuid c62bcfa7-ae73-11e3-8763-6ea9625e6a5d is blacklisted, skipping
2014-03-29 20:11:43 23819 [Warning] WSREP: Failed to apply app buffer: seqno: 50196216, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2014-03-29 20:11:43 23819 [ERROR] Slave SQL: Could not execute Update_rows event on table smproduction.b_sm_store_rest; Can't find record in 'b_sm_store_rest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 203, Error_code: 1032
2014-03-29 20:11:43 23819 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 50196216
2014-03-29 20:11:43 23819 [Warning] WSREP: Failed to apply app buffer: seqno: 50196216, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2014-03-29 20:11:43 23819 [ERROR] Slave SQL: Could not execute Update_rows event on table smproduction.b_sm_store_rest; Can't find record in 'b_sm_store_rest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 203, Error_code: 1032
2014-03-29 20:11:43 23819 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 50196216
2014-03-29 20:11:43 23819 [Warning] WSREP: Failed to apply app buffer: seqno: 50196216, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2014-03-29 20:11:43 23819 [ERROR] Slave SQL: Could not execute Update_rows event on table smproduction.b_sm_store_rest; Can't find record in 'b_sm_store_rest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 203, Error_code: 1032
2014-03-29 20:11:43 23819 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 50196216
2014-03-29 20:11:43 23819 [Note] WSREP: (c62bcfa7-ae73-11e3-8763-6ea9625e6a5d, 'tcp://0.0.0.0:4567') address 'tcp://10.74.184.39:4567' pointing to uuid c62bcfa7-ae73-11e3-8763-6ea9625e6a5d is blacklisted, skipping
2014-03-29 20:11:43 23819 [ERROR] WSREP: Failed to apply trx: source: 55c6879a-addb-11e3-8088-72173dd9b136 version: 3 local: 0 state: APPLYING flags: 129 conn_id: 39346883 trx_id: 270863489 seqnos (l: 22323196, g: 50196216, s: 50196215, d: 50196215, ts: 11665052428770158)
2014-03-29 20:11:43 23819 [ERROR] WSREP: Failed to apply trx 50196216 4 times
2014-03-29 20:11:43 23819 [ERROR] WSREP: Node consistency compromized, aborting...
2014-03-29 20:11:43 23819 [Note] WSREP: Closing send monitor...
2014-03-29 20:11:43 23819 [Note] WSREP: Close...


Revision history for this message
Dmitry Gribov (grib-d) wrote :

We have faced the same problem on another table, with no temporary table involved. There seem to be two concurrent delete requests coming from the same node using different keys, both of the form "DELETE ... LIMIT": say, one request is "delete from rating_books_global where period = 1 limit 500" and another is "delete from rating_books_global where genre = 16 limit 500".
We have also altered the first table to add an implicit primary bigint key; it did not help, the problem persists.
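For clarity, the two concurrent statements quoted above, written out as SQL:

```sql
-- Two DELETE ... LIMIT statements issued concurrently from the same node,
-- matching overlapping row sets via different keys:
DELETE FROM rating_books_global WHERE period = 1 LIMIT 500;
DELETE FROM rating_books_global WHERE genre = 16 LIMIT 500;
```

Worth noting: `DELETE ... LIMIT` without `ORDER BY` picks an unspecified subset of matching rows, which is documented as unsafe for statement-based replication; Galera replicates row events (as the "RBR event" log lines show), so the writesets themselves are deterministic, but the two statements can still touch overlapping rows.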

Revision history for this message
Dmitry Gribov (grib-d) wrote :

Or maybe one of the requests is not using LIMIT; it's hard to tell exactly, we can only guess. We are going to try the latest trunk to see if this has been fixed (doubtful).

Revision history for this message
Dmitry Gribov (grib-d) wrote :

The latest testing build added a new wrinkle:
140410 15:45:41 [ERROR] Slave SQL: Could not execute Write_rows event on table lib_area_100.users; Duplicate entry '<censored>' for key 'login', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 403, Error_code: 1062
140410 15:45:41 [Warning] WSREP: RBR event 2 Write_rows apply warning: 121, 7824565248
140410 15:45:41 [Warning] WSREP: Failed to apply app buffer: seqno: 7824565248, status: 1
         at galera/src/replicator_smm.cpp:apply_wscoll():57

This repeats quite often. I don't know whether this is the same issue or a different one.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Dmitry, does it happen with wsrep_slave_threads=1 ?

Revision history for this message
Dmitry Gribov (grib-d) wrote :

wsrep_slave_threads=250

And, unlike HA_ERR_KEY_NOT_FOUND, which strikes all nodes but one, this one hits a single node while leaving the others intact. This is much better, given that the cluster does not stall, but still annoying.

P.S. We have had no stall with HA_ERR_KEY_NOT_FOUND since migrating to the latest build. Not sure this means anything, though; sometimes it takes a month to happen, so we are prepared for the worst, but it is worth mentioning.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

OK, so this may not be a real inconsistency but a parallel-applying bug. Are there foreign keys involved? What is the definition of lib_area_100.users? Any references to that table?

Revision history for this message
Dmitry Gribov (grib-d) wrote :

I switched to wsrep_slave_threads=1 to see what happens (no significant slowdown so far).
But this setup worked well with several previous releases (except for the HA_ERR_KEY_NOT_FOUND).

Revision history for this message
Dmitry Gribov (grib-d) wrote :

CREATE TABLE `users` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `login` varchar(100) DEFAULT NULL,
  `pwd` varchar(100) DEFAULT NULL,
  `s_mail` varchar(50) DEFAULT NULL,
  `s_www` varchar(255) DEFAULT NULL,
  `s_inn` varchar(50) DEFAULT NULL,
  `s_descr` text,
  `s_phone` varchar(100) DEFAULT NULL,
  `offert_accepted` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `s_full_name` varchar(255) DEFAULT NULL,
  `s_first_name` varchar(100) DEFAULT NULL,
  `s_middle_name` varchar(100) DEFAULT NULL,
  `s_last_name` varchar(100) DEFAULT NULL,
  `s_city` varchar(100) DEFAULT NULL,
  `s_address` varchar(255) DEFAULT NULL,
  `last_used` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `last_host_id` smallint(5) unsigned DEFAULT NULL,
  `mail_confirmed` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `msisdn_confirmed` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `msisdn_confirm_code` varchar(4) DEFAULT NULL,
  `msisdn_req_time` datetime DEFAULT NULL,
  `java_phone_model` varchar(30) DEFAULT NULL,
  `java_font` varchar(30) DEFAULT NULL,
  `java_safe_mode` tinyint(4) NOT NULL DEFAULT '0',
  `user_pic` varchar(16) DEFAULT NULL,
  `denied_libs` varchar(255) DEFAULT NULL,
  `recenser_type` int(11) DEFAULT NULL,
  `partner_id` int(11) DEFAULT NULL,
  `creat_date` datetime DEFAULT NULL,
  `partner` int(10) unsigned DEFAULT NULL,
  `partner_valid_till` date DEFAULT NULL,
  `partner_pin` varchar(32) DEFAULT NULL,
  `account` decimal(9,2) NOT NULL DEFAULT '0.00',
  `abonement_start` datetime DEFAULT NULL,
  `abonement_expires` date NOT NULL DEFAULT '2006-01-01',
  `abonement_period` smallint(6) DEFAULT NULL,
  `abonement_delay` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_max_price` decimal(6,2) DEFAULT NULL,
  `abonement_downloads` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_left_clicks` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_left_summ` decimal(6,2) DEFAULT NULL,
  `user_pic_height` tinyint(3) unsigned DEFAULT NULL,
  `user_pic_width` tinyint(3) unsigned DEFAULT NULL,
  `show_pay_btn` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `s_puid` varchar(255) DEFAULT NULL,
  `discount` decimal(4,4) NOT NULL DEFAULT '0.0000',
  `money_bonus` decimal(9,2) NOT NULL DEFAULT '0.00',
  `subscr_last_reminded` datetime DEFAULT NULL,
  `subscr_free_arts_given` datetime DEFAULT NULL,
  `subscr_type` tinyint(3) unsigned NOT NULL DEFAULT '2',
  `subscr_period` tinyint(3) unsigned DEFAULT '1',
  `subscr_content` tinyint(3) unsigned DEFAULT '2',
  `subscr_genres` text,
  `s_subscr_text_authors` text,
  `subscr_date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `s_subscr_text_pattern` text,
  `subscr_languages` text,
  `prefered_currency` char(3) DEFAULT NULL,
  `last_paymethod` tinyint(3) unsigned DEFAULT NULL,
  `utc_offset` char(6) DEFAULT NULL,
  `last_ip` varchar(15) DEFAULT NULL,
  `socnet_last_reminded` datetime DEFAULT '0000-01-01 00:00:00',
  `moved_from` smallint(5) unsigned DEFAULT NULL,
  `moved_to` smallint(5) unsigned DEFAULT NULL,
  `subscribe_new_buys` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `login` (`login`),
  KEY `s_mail` (`s_mail`),
  KEY `partner` (`partner`),
  KEY `partner_valid_till` (`partn...


Revision history for this message
Dmitry Gribov (grib-d) wrote :

The table is referenced from, well... from anywhere :)
New records are added in a transaction: "SELECT login FROM users WHERE login=? FOR UPDATE", then an insert into users, then an insert into groupusers (which references users), then commit.
Perhaps "SELECT login FROM users WHERE login=? FOR UPDATE" does something it should not.

Revision history for this message
Dmitry Gribov (grib-d) wrote :

It's all the same with wsrep_slave_threads=1; no improvement:
140416 10:00:55 [ERROR] Slave SQL: Could not execute Write_rows event on table lib_area_100.users; Duplicate entry '<censored>' for key 'login', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 393, Error_code: 1062
140416 10:00:55 [Warning] WSREP: RBR event 2 Write_rows apply warning: 121, 7970604452
140416 10:00:55 [Warning] WSREP: Failed to apply app buffer: seqno: 7970604452, status: 1

Revision history for this message
Dmitry Gribov (grib-d) wrote :

P.S. After setting wsrep_slave_threads=1, all servers were restarted, so it was definitely 1 when the duplicate entry happened.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Thanks Dmitry, this rules out parallel applying from the suspect list. Applier threads are now dynamic: you can step appliers up and down online. The number of active applier threads is visible in SHOW PROCESSLIST output.

This problem is probably due to unique key handling, and this bug may be related: https://bugs.launchpad.net/codership-mysql/+bug/1299116

Revision history for this message
Dmitry Gribov (grib-d) wrote :

#1299116 looks quite possible

Revision history for this message
Naresh (naresh-gartner) wrote :

We're running a 3-node Percona XtraDB Cluster (5.6.24-72.2-56-log), using Percona XtraBackup as our SST method, and we are seeing repeated crashes on one node referencing error code 1032.
We've had 11 crashes on a double node in the last 20 days or so. Each time, the MySQL error log reports an error that always starts with error code 1032 or 1610 and looks like this:

Node -2
=======

2015-10-09 13:42:57 14456 [ERROR] Slave SQL: Could not execute Delete_rows event on table gartner1.nosql_mysql; Can't find record in 'nosql_mysql', Error_code: 1032; handler error

HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 205, Error_code: 1032
2015-10-09 13:42:57 14456 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 2070
2015-10-09 13:42:57 14456 [Warning] WSREP: Failed to apply app buffer: seqno: 2070, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
2015-10-09 13:42:57 14456 [ERROR] Slave SQL: Could not execute Delete_rows event on table gartner1.nosql_mysql; Can't find record in 'nosql_mysql', Error_code: 1032; handler error

HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 205, Error_code: 1032
2015-10-09 13:42:57 14456 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 2070
2015-10-09 13:42:57 14456 [Warning] WSREP: Failed to apply app buffer: seqno: 2070, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-10-09 13:42:57 14456 [ERROR] Slave SQL: Could not execute Delete_rows event on table gartner1.nosql_mysql; Can't find record in 'nosql_mysql', Error_code: 1032; handler error

HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 205, Error_code: 1032
2015-10-09 13:42:57 14456 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 2070
2015-10-09 13:42:57 14456 [Warning] WSREP: Failed to apply app buffer: seqno: 2070, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2015-10-09 13:42:57 14456 [ERROR] Slave SQL: Could not execute Delete_rows event on table gartner1.nosql_mysql; Can't find record in 'nosql_mysql', Error_code: 1032; handler error

HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 205, Error_code: 1032
2015-10-09 13:42:57 14456 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 2070
2015-10-09 13:42:57 14456 [ERROR] WSREP: Failed to apply trx: source: a8283cf6-518d-11e5-94c0-eafb6f6639d1 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 89827 trx_id: 61225252 seqnos (l:

11, g: 2070, s: 2069, d: 2067, ts: 13408212760377206)
2015-10-09 13:42:57 14456 [ERROR] WSREP: Failed to apply trx 2070 4 times
2015-10-09 13:42:57 14456 [ERROR] WSREP: Node consistency compromized, aborting...
2015-10-09 13:42:57 14456 [Note] WSREP: Closing send monitor...
2015-10-09 13:42:57 14456 [Note] WSREP: Closed send monitor.
2015-10-09 13:42:57 14456 [Note] WSREP: gcomm: terminating thread
2015-10-09 13:42:57 14456 [Note] WSREP: gcomm: joining thread
2015-10-09 13:42:57 14456 [Note] WSREP: gcomm: closing backend

Node - 3
========

2015-10-08 22:42:01 1235 [Note] WSREP: (2751e6a3, 'tcp://0.0.0.0:4567') turning message relay requesting off
2015-10-09 13:42:57 1235 ...

