Simultaneous crash of two nodes

Bug #1019473 reported by Frederik Kraus
This bug affects 5 people
Affects: MySQL patches by Codership | Status: Fix Released | Importance: Medium | Assigned to: Seppo Jaakola
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Status: Fix Released | Importance: Undecided | Assigned to: Unassigned

Bug Description

For some days now, two (out of 4) of our nodes have been crashing at exactly the same time, with the following error messages in the log.

Recovery is only possible with a full SST.

-- snip --

-- snip --

today:

Slave SQL: Could not execute Delete_rows event on table somedb.some_table_one; Can't find record in 'some_table_one', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 2724, Error_code: 1032
120630 1:30:10 [Warning] WSREP: RBR event 4 Delete_rows apply warning: 120, 30131914
120630 1:30:10 [ERROR] WSREP: Failed to apply trx: source: f879462c-c10d-11e1-0800-1f65fede6e7d version: 2 local: 0 state: CERTIFYING flags: 1 conn_id: 8664969 trx_id: 3980374692 seqnos (l: 4549962, g: 30131914, s: 30131911, d: 30131862, ts: 1341012609977406000)

Thursday:

box1:

120628 8:51:06 [ERROR] Slave SQL: Could not execute Delete_rows event on table somedb.some_table_two; Can't find record in 'some_table_two', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 398, Error_code: 1032
120628 8:51:06 [Warning] WSREP: RBR event 4 Delete_rows apply warning: 120, 25220082
120628 8:51:06 [ERROR] WSREP: Failed to apply trx: source: 1337a766-b992-11e1-0800-185c16a204a2 version: 2 local: 0 state: CERTIFYING flags: 1 conn_id: 52899913 trx_id: 3397040790 seqnos (l: 17019008, g: 25220082, s: 25220081, d: 25220080, ts: 1340866265202857000)
120628 8:51:06 [ERROR] WSREP: Failed to apply app buffer: ÃþëO^SÃ<8a>^A, seqno: 25220082, status: WSREP_FATAL
at galera/src/replicator_smm.cpp:apply_wscoll():50
at galera/src/replicator_smm.cpp:apply_trx_ws():121
120628 8:51:06 [ERROR] WSREP: Node consistency compromized, aborting...
120628 8:51:06 [Note] WSREP: Closing send monitor...
120628 8:51:06 [Note] WSREP: Closed send monitor.

box2:

120628 8:51:06 [ERROR] Slave SQL: Could not execute Delete_rows event on table somedb.some_table_two; Can't find record in 'some_table_two', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 9845, Error_code: 1032
120628 8:51:06 [Warning] WSREP: RBR event 10 Delete_rows apply warning: 120, 25220082
120628 8:51:06 [ERROR] WSREP: Failed to apply trx: source: 1337a766-b992-11e1-0800-185c16a204a2 version: 2 local: 0 state: CERTIFYING flags: 1 conn_id: 52899913 trx_id: 3397040790 seqnos (l: 17537022, g: 25220082, s: 25220081, d: 25220080, ts: 1340866265202857000)
120628 8:51:06 [ERROR] WSREP: Failed to apply app buffer: ÃþëO^SÃ<8a>^A, seqno: 25220082, status: WSREP_FATAL
at galera/src/replicator_smm.cpp:apply_wscoll():50
at galera/src/replicator_smm.cpp:apply_trx_ws():121
120628 8:51:06 [ERROR] WSREP: Node consistency compromized, aborting...
120628 8:51:06 [Note] WSREP: Closing send monitor...
120628 8:51:06 [Note] WSREP: Closed send monitor.

Tags: i29551 i29618
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Please attach the following information:

1. full mysql error logs from all the cluster nodes.

2. mysql configuration, preferably by:
    mysql> show global variables\G

3. table definitions for:
   mysql> show create table somedb.some_table_one;
   mysql> show create table somedb.some_table_two;

If this is sensitive information, you can send it directly to: <email address hidden>

What kind of SQL load is issued against the cluster? In particular, what do the DELETE statements for these tables look like?

What is the lifetime of the rows in 'some_table_one' and 'some_table_two'?
* when is the row first inserted?
* are there later updates on the row?
* can several sessions try to delete the row?

Changed in codership-mysql:
assignee: nobody → Seppo Jaakola (seppo-jaakola)
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

The fact that the nodes crash at the same seqno suggests (but is not limited to) a simple database inconsistency. In that case it is important to know how those nodes were initialized, whether this is a LAN or WAN cluster, and whether there was any tinkering with the pc.ignore_sb Galera parameter.
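
For context, pc.ignore_sb is a Galera provider option; on a wsrep-enabled server it is normally read and changed through the wsrep_provider_options variable, roughly like this (illustrative only, not a recommendation):

    mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G
    mysql> SET GLOBAL wsrep_provider_options = 'pc.ignore_sb=true';

Leaving pc.ignore_sb enabled lets a node keep accepting writes while partitioned from the primary component, which by itself can produce exactly this kind of inconsistency.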

Revision history for this message
Oliver Hankeln (21-oliver-vp) wrote :

The cluster consists of 3 nodes (including the two failing ones) in one LAN, plus a 4th node in a different data center (connected via VPN).

Changed in codership-mysql:
status: New → Incomplete
assignee: Seppo Jaakola (seppo-jaakola) → nobody
Revision history for this message
Josiah Poirier (josiahp) wrote :

I am also experiencing this error. It happened yesterday and again today. I have a 3-node cluster running on Amazon EC2 on m2.xlarge instances. Each node is in a different Availability Zone, but they are all in the same region.

Each time this happens, I'm left with one node that still has a bunch of mysql threads running, but it won't let me log in with the mysql client; it just hangs after I enter the password, and I have to close the terminal and ssh to the box again.

The only way I'm able to recover the cluster is to kill all the mysql processes, enable split-brain mode, start one node, let the next node do a full re-sync, and then start the third node, which also does a full re-sync.

There is one PHP process that inserts records into the table frequently, and another process that runs every 20 seconds, can hit any of the three nodes in the cluster, and could possibly hit more than one node at the same time. The queries look like this:

$time = time();
$query1 = "SELECT column1 FROM some_table_one WHERE column2 <=".$time;
...php while loop...
$query2 = "DELETE FROM some_table_one WHERE column2 <=".$time;

This code is from third-party software, but it is open source...

I have pulled the error logs from the SQL servers and can pull the variables and table create info. If you need them, let me know where to send them.

Revision history for this message
Josiah Poirier (josiahp) wrote :

I did some digging in the application and looked specifically into the table that was causing the cluster to crash. One thing I noticed is that there was no primary key on the table causing the problem, so I added a generic auto-increment ID column to the table. I also moved the process that does the select/loop/delete to a separate process called by a cron job on one server, rather than running it from within the web code on any of the three web servers. Hopefully that will prevent the cluster from crashing.
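
For reference, adding such a surrogate key is a one-statement DDL change, along the lines of this sketch (the table and column names are only illustrative):

    mysql> ALTER TABLE some_table_one ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;

An explicit primary key gives Galera a unique, deterministic row identity to certify and apply deletes against, instead of having to identify rows by their full contents.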

Changed in percona-xtradb-cluster:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Percona XtraDB Cluster because there has been no activity for 60 days.]

Changed in percona-xtradb-cluster:
status: Incomplete → Expired
Revision history for this message
Dan Rogers (drogers-l) wrote :

We are also affected by this issue.

Three weeks ago, we started using Percona Cluster as our main database for our production site. Twice during those three weeks, we've had the entire cluster go down due to failed replication of DELETEs. As with Josiah's report, we are:

- On Amazon EC2, with each node in a different AZ of the same region (us-east-1). (Though our servers are hi1.4xlarge, with the data on the local SSDs.)
- We ended up with two (or three) dead nodes, and one that was hung.
- The tables that were being deleted from have no primary keys.

One other thing to note: we use autocommit on our queries, so every time an INSERT/UPDATE/DELETE is run, it goes across the entire cluster right away.

The first failure was on 1/28 (five days after we started using the cluster), when we had three cluster nodes, all running Percona Cluster 5.5.28-23.7 (the newest then available). The crash was triggered by the following statement:

DELETE FROM profile_values WHERE fid = 20 AND uid = 3692832

This table contains user profile data, and each row gets deleted and reinserted when a user object is saved (bad, I know). So, it would be entirely possible for the same user to get saved twice at the same time, on two different nodes.
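
To make the race concrete, the save path boils down to something like the following sketch (the DELETE is the one quoted above; the value column in the INSERT is hypothetical), and the same pair of statements can run concurrently on two different nodes for the same user:

    mysql> DELETE FROM profile_values WHERE fid = 20 AND uid = 3692832;
    mysql> INSERT INTO profile_values (fid, uid, value) VALUES (20, 3692832, '...');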

The second failure was on 2/12, by which time we had added a fourth node to use as a dedicated donor. db00 (the dedicated donor) and db01 were both running 5.5.29-23.7.1, but db02 and db03 were still running 5.5.28-23.7.

The statement causing the second crash was:

DELETE FROM ons_relations WHERE right_id = 27902927 AND (relationship = 'gallery-image' || relationship = 'gallery-image-unpublished')

This table is used to connect gallery images to their galleries, and, similar to the above, in the case of an update the entries are all deleted and then re-saved. While less probable, a user could have gotten twitchy and hit the save button twice, causing everything to try to happen twice at the same time.

My theory here is that two processes are racing through these entries, deleting and re-inserting them, on two different cluster nodes. At some point, node A and node B both run the same delete nearly simultaneously. Then, when the replication data is sent to the rest of the cluster, some node gets a delete for a record that has already been removed but not yet re-inserted, and the whole thing goes sideways.

After last night's crash, I've upgraded all four servers to 5.5.29-23.7.2, though I don't have any particular reason to think this problem is solved in that release (though I see that other issues are, so it's a win regardless).

I've attached the error logs from all four servers, and also the table definitions for the tables.

Thanks for your help.

Dan.

Changed in percona-xtradb-cluster:
status: Expired → Confirmed
Changed in codership-mysql:
status: Incomplete → Confirmed
Jaime Sicam (jssicam)
tags: added: i29618
Changed in codership-mysql:
status: Confirmed → In Progress
importance: Undecided → Medium
assignee: nobody → Seppo Jaakola (seppo-jaakola)
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Trying to reproduce this issue. Dan, can you post the my.cnf file used on these nodes?

Revision history for this message
Dan Rogers (drogers-l) wrote :

Config file added.

Revision history for this message
Jervin R (revin) wrote :

We have also had reports of a similar nature on 5.5.29-23.7.2:

130220 19:29:14 [ERROR] Slave SQL: Could not execute Write_rows event on table db.tbl1; Cannot add or update a child row: a foreign key constraint fails (`db`.`tbl1`, CONSTRAINT `cm_email_customcm_activity_email` FOREIGN KEY (`custom_email_id`) REFERENCES `tbl1` (`custom_email_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 115, Error_code: 1452
130220 19:29:14 [Warning] WSREP: RBR event 2 Write_rows apply warning: 151, 2726260
130220 19:29:14 [ERROR] WSREP: Failed to apply trx: source: bce7b305-7b8e-11e2-0800-6d35e4e6862e version: 2 local: 0 state: APPLYING flags: 1 conn_id: 7389 trx_id: 38835836 seqnos (l: 4436, g: 2726260, s: 2726259, d: 2726084, ts: 1361413754140954569)
130220 19:29:14 [ERROR] WSREP: Failed to apply app buffer: z%Q, seqno: 2726260, status: WSREP_FATAL
at galera/src/replicator_smm.cpp:apply_wscoll():49
at galera/src/replicator_smm.cpp:apply_trx_ws():116
130220 19:29:14 [ERROR] WSREP: Node consistency compromized, aborting...

And on 5.5.27 - Version: '5.5.27-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.6.r356:

130215 1:02:15 [ERROR] Slave SQL: Could not execute Write_rows event on table db.tbl2; Cannot add or update a child row: a foreign key constraint fails (`db`.`tbl2`, CONSTRAINT `lms_xxxxx_instance` FOREIGN KEY (`user_id`) REFERENCES `lms_user` (`user_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 120, Error_code: 1452
130215 1:02:15 [Warning] WSREP: RBR event 2 Write_rows apply warning: 151, 10968
130215 1:02:15 [ERROR] WSREP: Failed to apply trx: source: 54b4a428-76e0-11e2-0800-907af715c84c version: 2 local: 0 state: APPLYING flags: 1 conn_id: 8649 trx_id: 400479 seqnos (l: 37, g: 10968, s: 10967, d: 10961, ts: 1360890117848877021)
130215 1:02:15 [ERROR] WSREP: Failed to apply app buffer: �
Q, seqno: 10968, status: WSREP_FATAL
at galera/src/replicator_smm.cpp:apply_wscoll():49
at galera/src/replicator_smm.cpp:apply_trx_ws():120
130215 1:02:15 [ERROR] WSREP: Node consistency compromized, aborting...

Full error log is available and can be sent privately.

Revision history for this message
Jervin R (revin) wrote :

With regard to my comments above, I am still waiting to hear back on the results of checking for any pre-existing inconsistencies.

tags: added: i29551
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Jervin, I believe you are hitting this case: https://bugs.launchpad.net/codership-mysql/+bug/1130888, which deals with cascading foreign key constraints.

The bug being troubleshot here is about parallel applying of replication events for a table with no primary key. I am having a hard time getting this issue reproduced. I'm trying with Dan's parameters and have just started a test with a new scenario; it sure does not surface easily for me.
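
For context, the degree of parallel applying is controlled by the wsrep_slave_threads variable, e.g. (illustrative value):

    mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';
    mysql> SET GLOBAL wsrep_slave_threads = 4;

With more than one applier thread, write-sets whose keys do not overlap are applied concurrently; for PK-less tables those keys are derived from row digests, which is why their determinism matters.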

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Okay, I now have a regression test which reproduces this bug reliably, so I can move to the next step in troubleshooting.

Revision history for this message
Dan Rogers (drogers-l) wrote :

Glad to hear it!

Please keep me posted and let me know if there's anything else you need!

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

The method of generating an MD5 digest over the tuples of a table with no primary key was not deterministic. The digest calculation was refactored in revision: http://bazaar.launchpad.net/~codership/codership-mysql/5.5-23/revision/3848
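
Conceptually, for a PK-less table the replication key of a row is a digest over the row's column values, in the spirit of this purely illustrative SQL (the real calculation happens inside Galera, not in SQL):

    mysql> SELECT MD5(CONCAT_WS('#', column1, column2)) AS row_key FROM some_table_one;

If the bytes fed into the digest are not assembled identically on every node, different nodes can derive different keys for the same physical row, which is the non-determinism addressed by this revision.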

Changed in codership-mysql:
status: In Progress → Fix Committed
milestone: none → 5.5.29-23.7.3
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

The previous fix caused a regression with bug lp:1003929; it was changed to use the row digest as key only for tables with neither a PK nor a UK. The fix was pushed in revision: http://bazaar.launchpad.net/~codership/codership-mysql/5.5-23/revision/3852

Revision history for this message
Dan Rogers (drogers-l) wrote :

Seppo,

Thank you for tracking this down.

Could you please contact someone at Percona to get this into their release? I feel like our cluster is something of a ticking time bomb without this fix.

Thanks.

Dan.

Revision history for this message
Vadim Tkachenko (vadim-tk) wrote :

Dan,

You are writing in the Percona XtraDB Cluster bug tracker, so we at Percona are aware.

In any case, our releases are based on Codership releases, so Codership should do a release first and then we will merge.

Revision history for this message
Dan Rogers (drogers-l) wrote :

Thanks, Vadim. It's good to know you guys are on top of these things.

Dan.

Changed in codership-mysql:
status: Fix Committed → Fix Released
Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-23.7.4
status: Confirmed → Fix Released
Revision history for this message
walito (walito) wrote :

Hi, I have the same problem in MariaDB 10.0.X

Is it a bug?

Thanks in advance

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1229
