Percona Server with XtraDB

Table id overflow caused the replication SQL thread to fail to execute events

Reported by Hui Liu on 2012-10-23
Affects          Status                 Importance  Assigned to         Milestone
MySQL Server     Unknown                Unknown
Percona Server   Status tracked in 5.6
  5.1                                   Undecided   Unassigned
  5.5                                   High        Sergei Glushchenko
  5.6                                   High        Unassigned

Bug Description

For a production application with 4 MySQL masters and 4 slaves, configured with BINLOG_FORMAT=ROW, 2 of the slaves stopped replicating DML (while DDL still worked) two days ago. That is, all the events in the master binlogs were sent to the relay logs and applied by the slaves, and everything looked okay in "show slave status", but inserted records could not be found on the slave.

We spent the better part of half a day on this issue, and finally found the root cause.
In Table_map_log_event::do_apply_event, m_table_id is correct as parsed from the Table_map_log_event in the relay log: 5126232549. That value is assigned to table_list->table_id, but there it overflows to 831265253. Then, in the following Rows_log_event::do_apply_event, table_list->table_id (831265253) is inserted into the HASH m_table_ids, but the later lookup into m_table_ids uses the Rows_log_event's m_table_id, which is 5126232549. Obviously, the lookup fails.

Here is part of GDB output:
-----------------------------------------------------
Breakpoint 2, table_mapping::set_table (this=0x65720df8, table_id=831265084, table=0x7fbe8c011320) at /home/xiyu.lh/alimysql/sql/rpl_tblmap.cc:109
109 entry *e= find_entry(table_id);
(gdb) c
Continuing.

Breakpoint 3, table_mapping::get_table (this=0x65720df8, table_id=5126232380) at /home/xiyu.lh/alimysql/sql/rpl_tblmap.cc:64
64 entry *e= find_entry(table_id);
(gdb) c
Continuing.

Breakpoint 1, Table_map_log_event::do_apply_event (this=0x7fbe8c00d6a0, rli=0x6571eb08) at /home/xiyu.lh/alimysql/sql/log_event.cc:8518
8518 thd->set_query_id(next_query_id());
(gdb) n
8520 if (!(memory= my_multi_malloc(MYF(MY_WME),
(gdb) n
8527 strmov(db_mem, rpl_filter->get_rewrite_db(m_dbnam, &dummy_len));
(gdb)
8528 strmov(tname_mem, m_tblnam);
(gdb)
8532 tname_mem, TL_WRITE);
(gdb)
8534 table_list->table_id= DBUG_EVALUATE_IF("inject_tblmap_same_id_maps_diff_table", 0, m_table_id);
(gdb) p table_list->table_id
$8 = 0
(gdb) p m_table_id
$9 = 5126232549
(gdb) s
8535 table_list->updating= 1;
(gdb) p (uint)m_table_id
$10 = 831265253
(gdb) p table_list->table_id
$11 = 831265253
(gdb) set var table_list->table_id=m_table_id
(gdb) p table_list->table_id
$12 = 831265253
-----------------------------------------------------

The problem is the line below:
 table_list->table_id= DBUG_EVALUATE_IF("inject_tblmap_same_id_maps_diff_table", 0, m_table_id);
Because table_list->table_id is a uint while m_table_id is a ulong, the assignment overflows table_list->table_id.

This causes the following logic to behave unexpectedly:
-----------------------------------------------------
int Rows_log_event::do_apply_event(Relay_log_info const *rli)
{

for (TABLE_LIST *ptr= rli->tables_to_lock ; ptr ; ptr= ptr->next_global)
    {
      const_cast<Relay_log_info*>(rli)->m_table_map.set_table(ptr->table_id, ptr->table);
    }
#ifdef HAVE_QUERY_CACHE
    query_cache.invalidate_locked_for_write(rli->tables_to_lock);
#endif
  }

  TABLE*
    table=
    m_table= const_cast<Relay_log_info*>(rli)->m_table_map.get_table(m_table_id);

  DBUG_PRINT("debug", ("m_table: 0x%lx, m_table_id: %lu", (ulong) m_table, m_table_id));

  if (table)
  {
-----------------------------------------------------

table_mapping::set_table inserts the overflowed value (831265253) into m_table_ids, and then
table_mapping::get_table cannot find an entry for m_table_id (5126232549) in the HASH m_table_ids, so table == 0x0 and the core part of do_apply_event is skipped.

Solution:
Change table_id from uint to ulong in sql/table.h to avoid the overflow.
This bug should exist in 5.1/5.5/5.6 on 64-bit OS kernels.

yinfeng (yinfeng-zwx) wrote :

This bug is very serious and leads to data loss on the slave.
Even more seriously, no error is printed in the error log or shown by "show slave status".

yinfeng (yinfeng-zwx) wrote :

Some operations increase the value of last_table_id, such as CREATE TABLE, FLUSH TABLES, or frequent adding/removing of table shares from the table_def_cache when table_definition_cache is not large enough.

After running for a long time, last_table_id becomes extremely large. If last_table_id grows beyond 2^32 (the maximum value of a uint), this bug is triggered.

Verified by code review. In the current mysql-5.5 from Oracle we see the following:

sql/log_event.h:3495: ulong m_table_id;
sql/table.h:1457: uint table_id; /* table id (from binlog) for opened table */

Obviously there is overflow potential here.

Changed in percona-server:
status: New → Confirmed

Reported http://bugs.mysql.com/bug.php?id=67352 for versions 5.1, 5.5 and 5.6.

Davi Arnaut (davi) wrote :

I think even ulong is wrong; it should be ulonglong everywhere, because ulong can be 32 bits long yet the table_id (in table map events) is 48 bits long.

Hui Liu (hickey) wrote :

For 32-bit builds, all the related types would need to change, since the table map id is 6 bytes. We use a 64-bit OS, so we don't need to consider that situation, but we still hope Oracle/Percona will fix the risk.

Stewart Smith (stewart) on 2012-12-12
Changed in percona-server:
importance: Undecided → High
status: Confirmed → Triaged

Closing the bug due to the Oracle fix in 5.6.11 and 5.7.1. We may backport the fix to earlier versions if there is interest.

tags: added: upstream
Agustín G (guriandoro) on 2014-02-28
tags: added: i39577