wsrep_recover position unexpectedly gets RBR error

Bug #1180802 reported by Jay Janssen
This bug affects 1 person
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Fix Released
Importance: High
Assigned to: Raghavendra D Prabhu
Milestone: 5.5.31-24.8

Bug Description

Scenario:
-- 3 node cluster, latest PXC, running sysbench on node1
-- killall -9 mysqld on node3, restart
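
For reference, a minimal sketch of the kill/restart step on node3 (the init-script invocation is an assumption; the sysbench run against node1 keeps going throughout):

  killall -9 mysqld                 # hard-kill mysqld on node3
  /etc/init.d/mysql start           # on restart the node recovers its wsrep position and requests IST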

node3 gets recovery position, does IST and promptly gets RBR error:

130516 9:36:25 [Note] Event Scheduler: Loaded 0 events
130516 9:36:25 [Note] WSREP: Signalling provider to continue.
130516 9:36:25 [Note] WSREP: SST received: ecec24ed-ab72-11e2-0800-008b24f29e8e:194044
130516 9:36:25 [Note] WSREP: Receiving IST: 259 writesets, seqnos 194044-194303
130516 9:36:25 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4.r3843
130516 9:36:25 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.sbtest1; Can't find record in 'sbtest1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 750, Error_code: 1032
130516 9:36:25 [Warning] WSREP: RBR event 4 Delete_rows apply warning: 120, 194045
130516 9:36:25 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply app buffer: seqno: 194045, status: WSREP_FATAL
  at galera/src/replicator_smm.cpp:apply_wscoll():52
  at galera/src/replicator_smm.cpp:apply_trx_ws():118
130516 9:36:25 [Note] WSREP: Closing send monitor...
130516 9:36:25 [Note] WSREP: Closed send monitor.
130516 9:36:25 [Note] WSREP: gcomm: terminating thread
130516 9:36:25 [Note] WSREP: gcomm: joining thread
130516 9:36:25 [Note] WSREP: gcomm: closing backend
130516 9:36:26 [Note] WSREP: view(view_id(NON_PRIM,0807cf68-b18d-11e2-0800-0a143587da15,11) memb {
 981d5954-be2d-11e2-0800-1be3b6228da4,
} joined {
} left {
} partitioned {
 0807cf68-b18d-11e2-0800-0a143587da15,
 eab9e1dc-b18c-11e2-0800-08406de49074,
})
130516 9:36:26 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130516 9:36:26 [Note] WSREP: view((empty))
130516 9:36:26 [Note] WSREP: gcomm: closed
130516 9:36:26 [Note] WSREP: Flow-control interval: [16, 16]
130516 9:36:26 [Note] WSREP: Received NON-PRIMARY.
130516 9:36:26 [Note] WSREP: Shifting JOINER -> OPEN (TO: 194339)
130516 9:36:26 [Note] WSREP: Received self-leave message.
130516 9:36:26 [Note] WSREP: Flow-control interval: [0, 0]
130516 9:36:26 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130516 9:36:26 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 194339)
130516 9:36:26 [Note] WSREP: RECV thread exiting 0: Success
130516 9:36:26 [Note] WSREP: recv_thread() joined.
130516 9:36:26 [Note] WSREP: Closing slave action queue.
130516 9:36:26 [Note] WSREP: /usr/sbin/mysqld: Terminated.
130516 09:36:26 mysqld_safe Number of processes running now: 0
130516 09:36:26 mysqld_safe WSREP: not restarting wsrep node automatically

Is this a regression in recovery? I can reproduce this in multiple environments.

Alex Yurchenko (ayurchen) wrote :

Jay,
1) What is the innodb_doublewrite setting in your cluster?
2) This is an InnoDB-related thing, so it would be great if you could reproduce it with Codership's binaries.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Interesting.

a) Has it always been 1032, or has it ever been 1062 (Duplicate entry)?

b) Which sysbench is running (since there is a Delete_rows event)? Also, I presume from the description that sysbench is still running while you SIGKILL the other node's mysqld?

c) Now, from ecec24ed-ab72-11e2-0800-008b24f29e8e:194044 and the error, it looks like it is failing at the very next event.

Now, this means that when it got killed, only the events up to a seqno lower than 194044 actually made it into InnoDB, whereas the log recorded 194044.

What is the value of innodb-flush-log-at-trx-commit?

@Alexey,

How do the values flushed to the log interact with innodb-flush-log-at-trx-commit?

Jay Janssen (jay-janssen) wrote : Re: [Bug 1180802] wsrep_recover position unexpectedly gets RBR error

On May 16, 2013, at 12:18 PM, Alex Yurchenko <email address hidden> wrote:

> 1) what is innodb_doublewrite setting in your cluster?

Defaults.

> 2) this is an InnoDB-related thing, so it would be great if you could reproduce it with Codership's binaries.

No guarantee I have time for this. I know this worked in the past, so this seems to be some kind of recent regression (whether it's PS's fault or not, I'm unsure).

Jay Janssen, MySQL Consulting Lead, Percona
http://about.me/jay.janssen

Alex Yurchenko (ayurchen) wrote :

Raghu, great points!

Except that in (c) you can also have a situation where the 194045 trx also made it into InnoDB. And this is more likely, since a missing trx may not be noticed for a while (not 100% sure about sysbench, but its transactions seem pretty independent), whereas a duplicate trx would be hit immediately with certainty. So I'd speculate that here the 194045 seqno didn't make it into InnoDB but the transaction did.

As for the question, I'm not the one to answer. But I will ;). I don't think there is any "interaction", at least by design. I think a more important question is how the InnoDB tablespace is recovered. Does it honour the logs, and is it rolled back properly?

Jay Janssen (jay-janssen) wrote :

On May 16, 2013, at 12:49 PM, Raghavendra D Prabhu <email address hidden> wrote:

> Interesting.
>
> a) Has it always been 1032, or has it ever been 1062 (Duplicate entry)?
>

When I saw it, it was only the Delete_rows event.

> b) Which sysbench is running (since there is a Delete_rows event)?
> Also, I presume from the description that sysbench is still running
> while you SIGKILL the other node's mysqld?

Some version of sysbench 0.5 from lefred (search for 'lefred sysbench rpm').

sysbench --test=sysbench_tests/db/common.lua \
        --mysql-host=node1 --mysql-user=test --mysql-db=test --oltp-table-size=250000 \
        prepare

sysbench --test=sysbench_tests/db/oltp.lua --mysql-host=node1 --mysql-user=test --mysql-db=test --oltp-table-size=250000 --report-interval=1 --max-requests=0 --tx-rate=10 run

>
> c) Now, from ecec24ed-ab72-11e2-0800-008b24f29e8e:194044 and the error,
> it looks like it is failing at the very next event.
>
> Now, this means that when it got killed, only the events up to a seqno lower
> than 194044 actually made it into InnoDB, whereas the log recorded 194044.

OK, so I can reproduce the problem (more reliably than I'm comfortable with):

130516 12:28:01 InnoDB: Waiting for the background threads to start
130516 12:28:02 Percona XtraDB (http://www.percona.com) 5.5.30-rel30.2 started; log sequence number 885672889
130516 12:28:02 [Note] Event Scheduler: Loaded 0 events
130516 12:28:02 [Note] WSREP: Signalling provider to continue.
130516 12:28:02 [Note] WSREP: SST received: ecec24ed-ab72-11e2-0800-008b24f29e8e:281000
130516 12:28:02 [Note] WSREP: Receiving IST: 183 writesets, seqnos 281000-281183
130516 12:28:02 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4.r3843
130516 12:28:02 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.sbtest1; Can't find record in 'sbtest1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 750, Error_code: 1032
130516 12:28:02 [Warning] WSREP: RBR event 4 Delete_rows apply warning: 120, 281001
130516 12:28:02 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply app buffer: seqno: 281001, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118
130516 12:28:02 [Note] WSREP: Closing send monitor...
130516 12:28:02 [Note] WSREP: Closed send monitor.
130516 12:28:02 [Note] WSREP: gcomm: terminating thread
130516 12:28:02 [Note] WSREP: gcomm: joining thread
130516 12:28:02 [Note] WSREP: gcomm: closing backend
130516 12:28:03 [Note] WSREP: view(view_id(NON_PRIM,0807cf68-b18d-11e2-0800-0a143587da15,15) memb {
        917fa1cd-be45-11e2-0800-a5f5946af022,

Seems like we fail on 281001 again here. If I hack the grastate to start at 281002, it does indeed restart right.
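
For reference, a hypothetical sketch of that grastate edit (the path and the exact seqno are assumptions; mysqld must be stopped first):

  sed -i 's/^seqno:.*/seqno:   281001/' /var/lib/mysql/grastate.dat   # saved seqno is the last applied writeset, so IST then resumes from 281002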

>
> What is the value of innodb-flush-log-at-trx-commit?

Default. Entire config:

[root@node3 mysql]# cat /etc/my.cnf
[mysqld]
datadir = /var/lib/mysql
log_error = error....


Raghavendra D Prabhu (raghavendra-prabhu) wrote :

"

Except that in (c) you can also have a situation where the 194045 trx also made it into InnoDB. And this is more likely, since a missing trx may not be noticed for a while (not 100% sure about sysbench, but its transactions seem pretty independent), whereas a duplicate trx would be hit immediately with certainty. So I'd speculate that here the 194045 seqno didn't make it into InnoDB but the transaction did.

"

But in the case where the seqno didn't make it into the logs but the
transaction did, won't it fail with 1062 rather than 1032?

"
As for the question, I'm not the one to answer. But I will ;). I don't think there is any "interaction", at least by design. I think a more important question is how the InnoDB tablespace is recovered. Does it honour the logs, and is it rolled back properly?
"

I will have a look at the wsrep_recover code, but it should be
fine (since InnoDB recovery is idempotent).

"
Seems like we fail on 281001 again here. If I hack the grastate to start at 281002, it does indeed restart right.
"

Is it always off by 1? If so, then it looks like a bug in the
calculation of the log coordinates.

Also, does/can wsrep_recovery advance the log by one? If yes,
then that may need to be considered.

Also, I have been looking at trx_prepare_off_kernel in view of
another bug - https://bugs.launchpad.net/percona-server/+bug/1092593

+		trx_sys_update_mysql_binlog_offset(
+			sys_header,
+			trx->mysql_relay_log_file_name,
+			trx->mysql_relay_log_pos,
+			TRX_SYS_MYSQL_RELAY_LOG_INFO, &mtr);
+
+		trx_sys_update_mysql_binlog_offset(
+			sys_header,
+			trx->mysql_master_log_file_name,
+			trx->mysql_master_log_pos,
+			TRX_SYS_MYSQL_MASTER_LOG_INFO, &mtr);
+

This may be needed for TRX_SYS_WSREP_XID_INFO as well.

Teemu Ollakka (teemu-ollakka) wrote :

This seems to be related to the innodb_flush_log_at_trx_commit setting with PXC. I managed to reproduce the crash with values 1 and 2, but not with 0. I could not reproduce this at all with Codership's version (5.5-23 branch head).
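
For reference, one way to confirm the effective value on a node (hostname and credentials are assumptions):

  mysql -h node3 -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"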

Changed in percona-xtradb-cluster:
milestone: none → 5.5.31-24.8
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Teemu,

Thanks for testing! I will look into the
innodb_flush_log_at_trx_commit variable.

Alex Yurchenko (ayurchen) wrote :

I doubt it is related to innodb_flush_log_at_trx_commit. Setting it to 1 simply makes it bulletproof. InnoDB should recover to a consistent state regardless of this setting.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

a) Yes, innodb_flush_log_at_trx_commit being 1 should make it pretty
reliable.

b) I also did a diff of trx0trx.c between PXC and Codership; I
didn't see any differences at the moment.

@Teemu,

c) So, with a value of 0, is it consistently not crashing?

In the meantime I am also looking at fix for https://bugs.launchpad.net/percona-server/+bug/1092593

Teemu Ollakka (teemu-ollakka) wrote :

It crashed also with innodb_flush_log_at_trx_commit=0; it just required many attempts.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I tested with the following config:

Node3 is the one killed with kill -9.

Node3 log: http://sprunge.us/dDjY

Node1 log: http://sprunge.us/KiPB

Node2 log: http://sprunge.us/aZWI

Node3 config: https://gist.github.com/a730749c9d1d5cada2db

Node2 config: https://gist.github.com/702a668a466389c7a201

Node1 config: https://gist.github.com/f19ee8a14c86f41804a8

Sysbench cli: http://sprunge.us/IiAF

Kill cli: http://sprunge.us/FRHE

Bazaar branch: https://code.launchpad.net/~raghavendra-prabhu/percona-xtradb-cluster/release-5.5.31

Having reported the above, I was not able to reproduce it. The only
difference, as you can see, is that I am using the next release branch,
release-5.5.31. However, that branch doesn't have the PS 5.5.31 merges in it yet.

Let me know if you have any questions. Also, let me know if
anything is different with the config (I tested with
innodb_flush_log_at_trx_commit set to 2).

I will also try to get an RPM of that branch for others to
test.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

One more thing: this is a debug build, but I think it shouldn't matter (it should crash more readily if there is a bug). I will also test with a release build next.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Also, as you can see, I was not using mysqld_safe; I was manually getting the wsrep start coordinates and feeding them to mysqld in the next step.
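
For reference, a rough sketch of that manual flow (the wsrep options are standard, but the log path and the recovered value shown here are illustrative):

  mysqld --wsrep-recover --log-error=/tmp/wsrep_recover.log   # run as the mysql user against the node's datadir
  grep 'WSREP: Recovered position' /tmp/wsrep_recover.log
  # feed the printed <uuid>:<seqno> back in, e.g.:
  mysqld --wsrep-start-position='ecec24ed-ab72-11e2-0800-008b24f29e8e:194044' &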

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@#13: I hit https://bugs.launchpad.net/codership-mysql/+bug/1182187 while testing with binlogging disabled. I need to test with a release build then.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Fixed that bug, now testing with binlogging disabled.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Testing with binlogging disabled to see if https://bugs.launchpad.net/percona-server/+bug/1092593 may be affecting this.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

=== modified file 'Percona-Server/storage/innobase/trx/trx0trx.c'
--- Percona-Server/storage/innobase/trx/trx0trx.c 2013-04-05 17:15:41 +0000
+++ Percona-Server/storage/innobase/trx/trx0trx.c 2013-05-20 20:22:02 +0000
@@ -2222,6 +2222,10 @@

   mutex_exit(&(rseg->mutex));

+	if (wsrep_is_wsrep_xid(&trx->xid)) {
+		trx_sys_update_wsrep_checkpoint(&trx->xid, &mtr);
+	}
+
   if (trx->mysql_master_log_file_name[0] != '\0') {
    /* This database server is a MySQL replication slave */
    trx_sysf_t* sys_header = trx_sysf_get(&mtr);

is what may be required here (because of #define TRX_SYS_WSREP_XID_INFO TRX_SYS_MYSQL_MASTER_LOG_INFO).

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Teemu, @Jay:

When you tested it, was binlogging turned on or off? Also, a 'workaround' can be to turn on binlogging.
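
A sketch of what that workaround would look like in the joiner's my.cnf under [mysqld] (the log-bin basename is an assumption), followed by a mysqld restart:

  log-bin           = mysql-bin
  log_slave_updates = 1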

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I am able to reproduce the 1032 easily with binlogging disabled. So the 'workaround' (of enabling the binlog) seems to be working.

Also reported a related bug here https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1182441

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

a) Not able to reproduce this with the 5.5-23 branch of Codership.

b) Not able to reproduce with binlog enabled and
log_slave_updates ON for PXC.

c) The 1032 also occurs on Update_rows events.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I checked and compared the InnoDB code of PXC and Codership; there
isn't any significant difference.

However, there are a few differences with binlog group commit, since
vanilla MySQL doesn't have it.

Does MariaDB Galera Cluster (which also has XtraDB and group
commit) have this issue?

Also, there were a few changes done to the PS trx code in 5.5.30.2 which
may not be transaction safe.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I will test this again after the merge of PS 5.5.31 is complete.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

So, I am still hitting this bug with the 5.5.31 tree. To bisect, I
tested with 5.5.29 and I am not able to reproduce it there. I am also
not able to reproduce it with the Codership tree (at least the 5.5.30
tree; I didn't check the 5.5.31 tree). So, some changes in PS
5.5.30{.2} have broken crash recovery in PXC. (Note: it crashes
with binlogging enabled as well.)

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

One more thing: I also noticed that in one case the coordinates were such that none worked, i.e.

for uuid:X it gave a duplicate-key error and for uuid:X+1 it gave a key-not-found error. So, it looks like the writesets weren't atomic either.

a) Is there an fsync in the Galera code as well (I see one in gcache but don't see where it is called from)?

b) I also see trx_sys_update_wsrep_checkpoint being called from innobase_wsrep_set_checkpoint and trx_sys_read_wsrep_checkpoint, but I don't see an associated log flush with them. Is there an fsync called alongside them somewhere else?

Changed in percona-xtradb-cluster:
importance: Undecided → High
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I ran with 3 nodes, of which one was codership-mysql and two were PXC.

I crashed the Codership node and one of the PXC nodes (almost simultaneously).

Logs: http://paste.wnohang.net/354b25 - Codership node

      http://paste.wnohang.net/16ab2b - PXC node

As you can see, the seqno is almost the same in both cases, yet it is only the PXC node that fails.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

One more thing: I tested with a single-node instance and crashed it; that case is not affected.

Also, in one instance it also crashed: http://paste.wnohang.net/9da106

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Resolved stack for crash http://paste.wnohang.net/df2053

Changed in percona-xtradb-cluster:
assignee: nobody → Raghavendra D Prabhu (raghavendra-prabhu)
status: New → Incomplete
status: Incomplete → In Progress
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Since the earlier commit was a merge commit, the specific commit is https://bazaar.launchpad.net/~akopytov/percona-server/bug1131189/revision/452

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Pushed a test branch to https://code.launchpad.net/~raghavendra-prabhu/percona-xtradb-cluster/release-5.5.31-fix without that commit. It has not crashed so far.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Also backed out the commit for lp:1170103 (since the earlier commit introduced the memory leak).

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

The fix for https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1182441 seems to have fixed this without backing out any commits from PS.

Changed in percona-xtradb-cluster:
status: In Progress → Fix Committed
Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-964
