Comment 7 for bug 1019473

Dan Rogers (drogers-l) wrote:

We are also affected by this issue.

Three weeks ago, we started using Percona Cluster as the main database for our production site. Twice during those three weeks, we've had the entire cluster go down due to failed replication of DELETEs. As with Josiah's report:

- We're on Amazon EC2, with each node in a different AZ of the same region (us-east-1). (Though our servers are hi1.4xlarge, with the data on the local SSDs.)
- We ended up with two (or three) dead nodes, and one that was hung.
- The tables that were being deleted from have no primary keys.
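For what it's worth, Galera replicates rows rather than statements, and without a primary key it falls back to matching entire rows, which is exactly where these tables are fragile. The workaround we're considering is adding a surrogate key; a sketch (the column name pv_id is my placeholder, not from our actual schema):

ALTER TABLE profile_values ADD COLUMN pv_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;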

One other thing to note: we use auto-commit on our queries, so every time an INSERT/UPDATE/DELETE is run, it is replicated across the entire cluster right away.

The first failure was on 1/28 (five days after we started using the cluster), when we had three cluster nodes, all running Percona Cluster 5.5.28-23.7 (the newest then available). The crash was triggered by the following statement:

DELETE FROM profile_values WHERE fid = 20 AND uid = 3692832

This table contains user profile data, and each row gets deleted and reinserted when a user object is saved (bad, I know). So, it would be entirely possible for the same user to get saved twice at the same time, on two different nodes.
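To make that concrete, the save path effectively does something like the following (paraphrased from memory, not our actual code; the value column is a guess, see the attached table definitions for the real schema):

DELETE FROM profile_values WHERE uid = 3692832;
INSERT INTO profile_values (fid, uid, value) VALUES (20, 3692832, '...');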

The second failure was on 2/12, by which time we had added a fourth node to use as a dedicated donor. db00 (the dedicated donor) and db01 were both running 5.5.29-23.7.1, but db02 and db03 were still running 5.5.28-23.7.

The statement causing the second crash was:

DELETE FROM ons_relations WHERE right_id = 27902927 AND (relationship = 'gallery-image' || relationship = 'gallery-image-unpublished')

This table is used to connect gallery images to their galleries, and, similar to the profile case above, on an update the entries are all deleted and then re-saved. While less probable, a user could have gotten twitchy and hit the save button twice, causing everything to happen twice at the same time.

My theory here is that two processes are racing through these entries, deleting and re-inserting them, on two different cluster nodes. At some point, node A and node B both run the same DELETE nearly simultaneously. Then, when the replication data is sent to the rest of the cluster, one node receives a DELETE for a record that has already been removed but not yet re-inserted, and the whole thing goes sideways.
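Roughly, the interleaving I'm imagining looks like this (the timeline is my guess, not taken from the logs):

-- t1, node A: a save deletes the rows locally
DELETE FROM profile_values WHERE fid = 20 AND uid = 3692832;
-- t2, node B: a second save runs the same delete before A's write set is applied
DELETE FROM profile_values WHERE fid = 20 AND uid = 3692832;
-- t3: A's write set is certified and applied cluster-wide; the rows are gone
-- t4: B's write set arrives at the other nodes; with no primary key, the
--     applier can't find the rows it expected to delete, and the node aborts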

After last night's crash, I've upgraded all four servers to 5.5.29-23.7.2, though I don't have any particular reason to think this problem is fixed in that release (I do see that other bugs are, so it's a win regardless).

I've attached the error logs from all four servers, and also the table definitions for the tables.

Thanks for your help.

Dan.