Slave crash with malformed binlog event

Bug #1464468 reported by markus_albe on 2015-06-12
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
Fix Released
High
Vlad Lesin
5.5
New
Undecided
Unassigned
5.6
Fix Released
High
Vlad Lesin

Bug Description

The fix for https://bugs.launchpad.net/percona-server/+bug/1438990 seems to have introduced a bug of it's own:

mysqlbinlog attempts on binlog from surviving (master) node show:

"ERROR: Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 2180319, event_type: 0
ERROR: Could not read entry at offset 216385052: Error in log format or read error.
WARNING: The range of printed events ends with a row event or a table map event that does not have the STMT_END_F flag set. This might be because the last statement was not fully written to the log, or because you are using a --stop-position or --stop-datetime that refers to an event in the middle of a statement. The event(s) from the partial statement have not been written to output."

Customer added:
If I look in the error log of a crashed node I see that the last applied Xid is:
"2015-06-11 09:51:25 48997 [ERROR] WSREP: applier could not read binlog event, seqno: 12575799, len: 551
2015-06-11 09:51:25 48997 [Warning] WSREP: Failed to apply app buffer: seqno: 12575799, status: 1
at galera/src/trx_handle.cpp:apply():351"

I searched for that Xid in binary logs on the primary and could not find it. The last one to be displayed 12575798.

So, given the fix for lp:1438990 was related to adding a table map, I'm assuming that it could have introduced a bug where it would not add a STMT_END_F under certain conditions.

Customer is working to isolate steps to reproduce, I will update here when we have more details.

Changed in percona-server:
assignee: nobody → Vlad Lesin (vlad-lesin)
Vlad Lesin (vlad-lesin) wrote :
Download full text (12.2 KiB)

The following analysis is based on the code revision number c250ec9791ed6c7695ae6a191397a3b65db2e994:

Analysis #1.

1) The assertion takes place on debug build here:

(gdb) bt
#0 0x00007ffff67c6cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff67ca0d8 in __GI_abort () at abort.c:89
#2 0x00007ffff67bfb86 in __assert_fail_base (fmt=0x7ffff6910830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x10886c9 "is_binlog_empty()", file=file@entry=0x1088548 "./sql/binlog.cc",
line=line@entry=433,
function=function@entry=0x108cdc0 <binlog_cache_data::reset()::PRETTY_FUNCTION> "virtual void binlog_cache_data::reset()") at assert.c:92
#3 0x00007ffff67bfc32 in GI_assert_fail (assertion=0x10886c9 "is_binlog_empty()",
file=0x1088548 "./sql/binlog.cc", line=433,
function=0x108cdc0 <binlog_cache_data::reset()::PRETTY_FUNCTION> "virtual void binlog_cache_data::reset()") at assert.c:101
#4 0x0000000000a47967 in binlog_cache_data::reset (this=0x7fff98024fe0) at ./sql/binlog.cc:433
#5 0x0000000000a47d2c in binlog_trx_cache_data::reset (this=0x7fff98024fe0) at ./sql/binlog.cc:648
#6 0x0000000000a32a7e in binlog_cache_data::flush (this=0x7fff98024fe0, thd=0x20adb10, bytes_written=0x7fffdc0d14b8, wrote_xid=0x7fffdc0d14eb)
at ./sql/binlog.cc:1270
#7 0x0000000000a4827a in binlog_cache_mngr::flush (this=0x7fff98024e70, thd=0x20adb10, bytes_written=0x7fffdc0d14f0, wrote_xid=0x7fffdc0d14eb)
at ./sql/binlog.cc:778
#8 0x0000000000a40b63 in MYSQL_BIN_LOG::flush_thread_caches (this=0x1859dc0 <mysql_bin_log>, thd=0x20adb10)
at ./sql/binlog.cc:6779
#9 0x0000000000a40cfc in MYSQL_BIN_LOG::process_flush_stage_queue (this=0x1859dc0 <mysql_bin_log>, total_bytes_var=0x7fffdc0d15f0,
rotate_var=0x7fffdc0d15de, out_queue_var=0x7fffdc0d15f8) at ./sql/binlog.cc:6835
#21 0x00007ffff737d182 in start_thread (arg=0x7fffdc0d5700) at pthread_create.c:312
#22 0x00007ffff688a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

binlog_cache_data::is_binlog_empty() returns false because binlog_cache_data::cache_log.pos_in_file have some suspicious value (10344644715844964239).

2) binlog_cache_data::cache_log.pos_in_file is set from here:

Old value = 368
New value = 10344644715844964239
reinit_io_cache (info=0x7fff9404db48, type=WRITE_CACHE, seek_offset=10344644715844964239, use_async_io=0 '\000', clear_cache=0 '\000')
at ./mysys/mf_iocache.c:387
387 info->seek_not_done=1;
(gdb) bt
#0 reinit_io_cache (info=0x7fff9404db48, type=WRITE_CACHE, seek_offset=10344644715844964239, use_async_io=0 '\000', clear_cache=0 '\000')
at ./mysys/mf_iocache.c:387
#1 0x0000000000a47aaa in binlog_cache_data::truncate (this=0x7fff9404db40, pos=10344644715844964239)
at ./sql/binlog.cc:486
#2 0x0000000000a48011 in binlog_trx_cache_data::restore_savepoint (this=0x7fff9404db40, pos=10344644715844964239)
at ./sql/binlog.cc:691
#3 0x0000000000a34368 in binlog_savepoint_rollback (hton=0x188ca40, thd=0x20b38c0, sv=0x7fff94007040)
at ./sql/binlog.cc:1900
#4 0x00000000006391c7 in ha_rollback_to_savepoint (thd=0x20b38c0, sv=0x7fff94007010)
at ./sql/handler.cc:2132
#5 0x00000000008aac33 in trans_rollback_to_savepoint (thd=0x20b38c0, name=...) at ./...

Vlad Lesin (vlad-lesin) wrote :
Download full text (13.1 KiB)

I have simplified the test case a little bit more:

DROP TABLE IF EXISTS `t`;
CREATE TABLE `t` (
`f1` int(10) unsigned NOT NULL,
PRIMARY KEY (`f1`)
) ENGINE=InnoDB;

DROP TABLE IF EXISTS `t1`;
CREATE TABLE `t1` (
`f1` int(10) unsigned NOT NULL,
PRIMARY KEY (`f1`)
) ENGINE=InnoDB;

DROP TRIGGER IF EXISTS t_insert_trig;

DELIMITER ;;
CREATE TRIGGER t_insert_trig AFTER INSERT ON t
FOR EACH ROW
BEGIN

DECLARE EXIT HANDLER FOR SQLEXCEPTION
BEGIN
ROLLBACK TO savepoint_1;
INSERT INTO t1 VALUES(NEW.f1); //<------- Crash Line
END;

SAVEPOINT savepoint_1; //<---------SAVEPOINT Line

INSERT INTO dummy VALUES (1);

END ;;
DELIMITER ;

INSERT INTO `t` VALUES (2);
INSERT INTO `t` VALUES (3); //<-------SECOND INSERT Line

"Crash Line" is the line which absence cause the crash. If this line is commented out we have crash.

Why trans->ha_list does not contain binlog handlerton item to initialize it's per-engine savepoint data.

So there are two cases. The one is without crash (Case_Good) when "Crash Line" presents, the other one is with crash (Case_Crash) when "Crash Line" absents. The question is why thd->transaction.stmt.ha_list does not contain item which corresponds to binlog_handlerton for Case_Crash?

1) Consider Case_Good. The question can be restated as: "Why thd->transaction.stmt.ha_list contains item which corresponds to binlog_handlerton for the Case_Good?"

Consider events after executing first ha_savepoint() and ha_rollback_to_savepoint() calls.

a) binlog handlerton is set in write_locked_table_maps() if thd->binlog_table_maps is 0.
b) thd->binlog_table_maps is set to 0 only if binlog_cache_data::m_pending is not 0

#0 THD::clear_binlog_table_maps (this=0x20adb10)
at ./sql/sql_class.h:2646
#1 0x0000000000a47b30 in binlog_cache_data::flush_pending_event (this=0x7fffa000e660, thd=0x20adb10)
at ./sql/binlog.cc:499
#2 0x0000000000a3280f in binlog_cache_data::finalize (this=0x7fffa000e660, thd=0x20adb10,
end_event=0x7fffdc0d16d0) at ./sql/binlog.cc:1207
#3 0x0000000000a406e3 in MYSQL_BIN_LOG::commit (this=0x1859dc0 <mysql_bin_log>, thd=0x20adb10, all=false)
at ./sql/binlog.cc:6675
#4 0x0000000000637b29 in ha_commit_trans (thd=0x20adb10, all=false, ignore_global_read_lock=false)
at ./sql/handler.cc:1513
#5 0x00000000008aa3d2 in trans_commit_stmt (thd=0x20adb10)
at ./sql/transaction.cc:436
#6 0x00000000007dcace in mysql_execute_command (thd=0x20adb10)
at ./sql/sql_parse.cc:5573
#7 0x00000000007dfe00 in mysql_parse (thd=0x20adb10, rawbuf=0x7fffa0004ff0 "INSERT INTO `t` VALUES (2)",
length=26, parser_state=0x7fffdc0d41d0) at ./sql/sql_parse.cc:6955
#8 0x00000000007d1da0 in dispatch_command (command=COM_QUERY, thd=0x20adb10,
packet=0x2193b61 "INSERT INTO `t` VALUES (2)", packet_length=26)
at ./sql/sql_parse.cc:1442
#9 0x00000000007d0c55 in do_command (thd=0x20adb10)
at ./sql/sql_parse.cc:1054
#10 0x000000000079a5d8 in do_handle_one_connection (thd_arg=0x20adb10)
at ./sql/sql_connect.cc:1532
#11 0x000000000079a0ef in handle_one_connection (arg=0x20adb10)
at ./sql/sql_connect.cc:1443
#12 0x0000000000da1942 in pfs_spawn_thread (arg=0x214f5c0)
at ./storage/perfschema/pfs.cc:1860
#13 0x00007ffff737d182 in start_thread (arg=0x7fffdc0d5700) at pthread_create.c:3...

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-908

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers