A new value of slave_exec_mode to handle slave errors such as "HA_ERR_KEY_NOT_FOUND" or "HA_ERR_FOUND_DUPP_KEY"

Bug #1018685 reported by yinfeng
Affects: Percona Server (moved to https://jira.percona.com/projects/PS); status tracked in 5.7

Series  Status     Importance  Assigned to
5.1     Won't Fix  Wishlist    Unassigned
5.5     Triaged    Wishlist    Unassigned
5.6     Triaged    Wishlist    Unassigned
5.7     Triaged    Wishlist    Unassigned

Bug Description

Replication is often interrupted by errors; the most common ones we have encountered recently are "HA_ERR_KEY_NOT_FOUND" and "HA_ERR_FOUND_DUPP_KEY".

There are two methods to handle these errors:
1) ignore the error by setting "sql_slave_skip_counter";
2) set slave_exec_mode = "IDEMPOTENT", which handles "HA_ERR_FOUND_DUPP_KEY" (overwrite the record) and "HA_ERR_KEY_NOT_FOUND" (just ignore the error).

Both methods may lead to inconsistencies between master and slave.

If we are using row-based replication and the InnoDB storage engine, why don't we fix these errors instead of simply ignoring them?

So I introduce a new value of slave_exec_mode: SMART.

A simple idea is:

1) HA_ERR_KEY_NOT_FOUND
       UPDATE_ROWS_EVENT: write the 'Before Image' of the record, then update it
       DELETE_ROWS_EVENT: write the record and then delete it, or just ignore the error

2) HA_ERR_FOUND_DUPP_KEY
      WRITE_ROWS_EVENT: just overwrite the record
      UPDATE_ROWS_EVENT: delete the duplicate record and then update (if the error is caused by a duplicate unique key and the table also has a primary key, the fix may fail, in which case all changes are rolled back)
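The rules above can be sketched as a tiny in-memory simulation (illustrative only; the table is a dict keyed by primary key, and none of these names mirror the actual patch):

```python
# A toy model of the proposed SMART conflict-resolution rules.

def apply_write(table, after):
    # WRITE_ROWS_EVENT: on HA_ERR_FOUND_DUPP_KEY, just overwrite the record.
    table[after["id"]] = dict(after)

def apply_update(table, before, after):
    pk = before["id"]
    if pk not in table:
        # HA_ERR_KEY_NOT_FOUND: write the before image, then update it.
        table[pk] = dict(before)
    del table[pk]
    if after["id"] in table:
        # HA_ERR_FOUND_DUPP_KEY: delete the duplicate record, then update.
        del table[after["id"]]
    table[after["id"]] = dict(after)

def apply_delete(table, before):
    # DELETE_ROWS_EVENT: on HA_ERR_KEY_NOT_FOUND, just ignore the error.
    table.pop(before["id"], None)

table = {}
apply_write(table, {"id": 1, "v": "a"})
apply_write(table, {"id": 1, "v": "b"})                        # dup key: overwrite
apply_update(table, {"id": 2, "v": "x"}, {"id": 2, "v": "y"})  # not found: repair
apply_delete(table, {"id": 3, "v": "z"})                       # not found: ignore
print(table)  # {1: {'id': 1, 'v': 'b'}, 2: {'id': 2, 'v': 'y'}}
```

This captures only the happy path with a single primary key; the later comments discuss exactly the cases (secondary unique keys) where this simple model breaks down.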

The attached file is a simple patch based on Percona-Server-5.5.24-rel26.0; it is still under testing.

Tags: contribution
yinfeng (yinfeng-zwx) wrote :
description: updated
Hui Liu (hickey) wrote :

Add stats for smart handle.

tags: added: contribution
Changed in percona-server:
assignee: nobody → Raghavendra D Prabhu (raghavendra-prabhu)
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@hickey, @yinfeng,

Thanks for the patch. I tested the latest patch against the latest PS bzr head; it applied cleanly and built. -- http://sprunge.us/abXX

My.cnf -- http://sprunge.us/JZKD

Now,

1) HA_ERR_KEY_NOT_FOUND
       UPDATE_ROWS_EVENT: write the 'Before Image' of the record, then update it
       DELETE_ROWS_EVENT: write the record and then delete it, or just ignore the error

--------> Worked fine for both events.

2) HA_ERR_FOUND_DUPP_KEY
      WRITE_ROWS_EVENT: just overwrite the record
      UPDATE_ROWS_EVENT: delete the duplicate record and then update (if the error is caused by a duplicate unique key and the table also has a primary key, the fix may fail, in which case all changes are rolled back)

Worked for WRITE_ROWS_EVENT but failed for UPDATE_ROWS_EVENT, for both a primary key and a unique key (on a non-primary-key column).

I will also attach the full master and slave logs.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Master sql client log: http://sprunge.us/WFUS

Slave sql client log: http://sprunge.us/agQb

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Marking this Incomplete (since this is a wishlist item) until UPDATE_ROWS_EVENT is fixed, as described in comment #3.

yinfeng (yinfeng-zwx) wrote :

Hi Raghavendra,

Yes, you are right: if the HA_ERR_FOUND_DUPP_KEY is caused by a duplicate unique key with a different primary key, the fix for UPDATE_ROWS_EVENT will fail.

The HA_ERR_KEY_NOT_FOUND fix for UPDATE_ROWS_EVENT may also fail if inserting the record fails because of a duplicate unique key.

These two cases confused me for a long time.
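The failing case can be illustrated with a toy check (hypothetical rows; this assumes a primary key `id` plus a separate unique column `u`, not the actual patch code):

```python
# Why the SMART repair for UPDATE_ROWS_EVENT can fail: the before image is
# missing, and the after image collides on a *unique* key with a row whose
# primary key differs.

slave_rows = [
    {"id": 1, "u": "x"},   # row present only on the slave
]

# Replicated UPDATE event: (id=2, u='w') -> (id=2, u='x').
before, after = {"id": 2, "u": "w"}, {"id": 2, "u": "x"}

# Step 1: the before image is missing -> HA_ERR_KEY_NOT_FOUND.
key_not_found = all(r["id"] != before["id"] for r in slave_rows)

# Step 2: SMART would insert the after image, but u='x' collides with a
# row whose primary key differs -> HA_ERR_FOUND_DUPP_KEY on the unique
# key, which deleting by primary key cannot resolve.
dup_on_unique = any(r["u"] == after["u"] and r["id"] != after["id"]
                    for r in slave_rows)

print(key_not_found, dup_on_unique)  # True True
```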

Actually, the code handling HA_ERR_FOUND_DUPP_KEY for UPDATE_ROWS_EVENT has been removed in our production environment, because some felt deleting a record is too crude.

I hope someone at Percona, or anyone else familiar with the source code, can help perfect this patch :)

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@hickey,
 Thanks for the feedback.

Hui Liu (hickey) wrote :

Hi Raghavendra, thanks for the deep thought and feedback on this patch. If the update fails with a duplicate key error, we do not use the "delete and then update" strategy, so the behavior is exactly what you tested; you can refer to the second patch (only HA_ERR_KEY_NOT_FOUND is handled):

@@ -9782,13 +9810,30 @@ Update_rows_log_event::do_exec_row(const
+ if ((slave_exec_mode == SLAVE_EXEC_MODE_SMART) &&
+ (error == HA_ERR_KEY_NOT_FOUND)) {

That is the unfinished part of the patch. Even so, it works well in our production environment, because we do not update the PK in most of the applications at Alibaba.

My thinking on why we did not handle this situation:
No matter whether the update failed on a UNIQ_KEY or the PRIMARY_KEY (both return DUP_KEY), the after image of the update event really does conflict with the existing records in the table. So how should we handle such a scenario?
1) If it failed on a UNIQ_KEY, the existing record R' that already holds the UNIQ_KEY should not be changed, so the only thing we could do is ignore the UNIQ_KEY columns in record R and update just the remaining columns.
2) If it failed on the PRIMARY_KEY, we might update the conflicting record with the after image of the update event and then delete the old record R by the before image of the update event.

Lots of changes would be needed, such as splitting UNIQ_KEY and PRIMARY_KEY out of DUP_KEY in some functions, and the risk of the code change grows considerably. What's more, the record that results from handling #1 or #2 is not really correct for the slave. That is why we have not done it yet.
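A minimal sketch of the two proposed (unimplemented) strategies, with hypothetical names, assuming rows keyed by primary key `id` with `u` as the unique column; as the comment notes, the resulting rows are not guaranteed correct for the slave:

```python
# Strategy sketches for DUP_KEY conflicts on UPDATE_ROWS_EVENT.

def resolve_unique_conflict(rows, before, after, unique_cols=("u",)):
    # Strategy 1: the record already holding the unique value is left
    # alone; only the non-unique columns of the updated row R change.
    target = rows[before["id"]]
    for col, val in after.items():
        if col not in unique_cols:
            target[col] = val

def resolve_pk_conflict(rows, before, after):
    # Strategy 2: overwrite the conflicting record with the after image,
    # then delete the old record identified by the before image.
    rows[after["id"]] = dict(after)
    rows.pop(before["id"], None)

rows = {1: {"id": 1, "u": "x", "v": 0},
        2: {"id": 2, "u": "y", "v": 0}}
# UPDATE (id=2) wants u='x', which collides with id=1 on the unique key:
# only the non-unique column v is applied, u stays 'y'.
resolve_unique_conflict(rows, {"id": 2, "u": "y", "v": 0},
                        {"id": 2, "u": "x", "v": 9})
print(rows[2])  # {'id': 2, 'u': 'y', 'v': 9}

rows2 = {1: {"id": 1, "v": 0}, 3: {"id": 3, "v": 7}}
# UPDATE (id=3) -> (id=1) collides with the existing primary key 1.
resolve_pk_conflict(rows2, {"id": 3, "v": 7}, {"id": 1, "v": 8})
print(rows2)  # {1: {'id': 1, 'v': 8}}
```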

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PS-2374
