slave applier broken on SMP (row_insert_for_mysql())

Bug #343238 reported by Alex Yurchenko
2
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Fix Released
High
Seppo Jaakola

Bug Description

During DBT2 benchmarking on Amazon large nodes the following happened:

On three nodes (out of 4) all connections (5) are sleeping waiting for flow control event.

On the fourth node, 4 connections are waiting for TO in commit phase, like that:

#0 0x00002b06d20b0376 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aac000f2c6c in gcs_to_grab (to=0xbe72990, seqno=87944) at gcs_to.c:190
#2 0x00002aabffcd3dd9 in mm_galera_commit (gh=<value optimized out>, trx_id=159180, conn_id=<value optimized out>, rbr_data=<value optimized out>, rbr_data_len=<value optimized out>) at mm_galera.c:585
#3 0x0000000000747dc4 in wsrep_run_wsrep_commit (thd=0xd29afb0, trx=0x2aaaabb0f0b8, hton=0xd81830, all=true) at handler/ha_innodb.cc:8413
#4 0x0000000000748140 in innobase_commit (hton=0xd81830, thd=0xd29afb0, all=true) at handler/ha_innodb.cc:1965
#5 0x00000000006a898c in ha_commit_one_phase (thd=0xd29afb0, all=true) at handler.cc:1202
#6 0x00000000006a9615 in ha_commit_trans (thd=0xd29afb0, all=<value optimized out>) at handler.cc:1171
#7 0x00000000005c144c in end_trans (thd=0xd29afb0, completion=COMMIT) at sql_parse.cc:742
#8 0x00000000005c43d0 in mysql_execute_command (thd=0xd29afb0) at sql_parse.cc:4273
#9 0x00000000005ca0b1 in mysql_parse (thd=0xd29afb0, inBuf=0xd2fefb0 "COMMIT", length=6, found_semicolon=0x451cfe48) at sql_parse.cc:6140
#10 0x00000000005cb02a in dispatch_command (command=COM_QUERY, thd=0xd29afb0, packet=<value optimized out>, packet_length=<value optimized out>) at sql_parse.cc:1339
#11 0x00000000005cc036 in do_command (thd=0xd29afb0) at sql_parse.cc:925
#12 0x00000000005bb0d9 in handle_one_connection (arg=<value optimized out>) at sql_connect.cc:1144
#13 0x00002b06d20ac1b5 in start_thread () from /lib64/libpthread.so.0
#14 0x00002b06d2e7536d in clone () from /lib64/libc.so.6

The 5th connection does not show in mysql processlist, but can be seen in debugger:

#0 0x00002b06d20b0376 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aac000f2c6c in gcs_to_grab (to=0xbe72990, seqno=87943) at gcs_to.c:190
#2 0x00002aabffcd4191 in process_query_write_set_applying (applier=0xbf52a58, app_ctx=0xc3d48d0, ws=0xcb7b7d0, seqno_g=87210, seqno_l=87943) at mm_galera.c:585
#3 0x00002aabffcd4bc3 in mm_galera_replay_trx (gh=<value optimized out>, trx_id=159061, app_ctx=0xc3d48d0) at mm_galera.c:1876
#4 0x00000000005caa28 in dispatch_command (command=<value optimized out>, thd=0xc3d48d0, packet=<value optimized out>, packet_length=272014576) at sql_parse.cc:1752
#5 0x00000000005cc036 in do_command (thd=0xc3d48d0) at sql_parse.cc:925
#6 0x00000000005bb0d9 in handle_one_connection (arg=<value optimized out>) at sql_connect.cc:1144
#7 0x00002b06d20ac1b5 in start_thread () from /lib64/libpthread.so.0
#8 0x00002b06d2e7536d in clone () from /lib64/libc.so.6

Slave thread at the same time goes as follows:

#0 0x00002b06d20b0376 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000007af9e5 in os_event_wait_low (event=0x1521af0, reset_sig_count=<value optimized out>) at os/os0sync.c:422
#2 0x00000000007dfdf8 in srv_suspend_mysql_thread (thr=0x2aaaabb8d0b8) at srv/srv0srv.c:1533
#3 0x00000000007c5c6e in row_mysql_handle_errors (new_err=0x4508a628, trx=0x2aaaabaff0b8, thr=0x2aaaabb8d0b8, savept=0x4508a610) at row/row0mysql.c:496
#4 0x00000000007c703d in row_insert_for_mysql ( mysql_rec=<value optimized out>, prebuilt=0x2aaaabb89cb8) at row/row0mysql.c:1170
#5 0x00000000007487cd in ha_innobase::write_row (this=0xc8ec070, record=0xc8ec260 "��\v") at handler/ha_innodb.cc:3779
#6 0x00000000006ac22f in handler::ha_write_row (this=0xc8ec070, buf=0xc8ec260 "��\v") at handler.cc:4578
#7 0x000000000067a0b6 in Rows_log_event::write_row (this=0x2aac04d6daa0, rli=0xc0b9480, overwrite=false) at log_event.cc:8352
#8 0x000000000067a442 in Write_rows_log_event::do_exec_row (this=0x1521b2c, rli=0x0) at log_event.cc:8513
#9 0x0000000000674f5b in Rows_log_event::do_apply_event (this=0x2aac04d6daa0, rli=0xc0b9480) at log_event.cc:7281
#10 0x0000000000684176 in wsrep_bf_execute_cb_rbr (ctx=0xc0a6950, rbr_buf=<value optimized out>, buf_len=8626, ts=<value optimized out>, rand=<value optimized out>) at log_event.h:1056
#11 0x00002aabffcd228d in apply_write_set (app_ctx=0xc0a6950, ws=0x4508afd0) at mm_galera.c:504
#12 0x00002aabffcd415d in process_query_write_set_applying (applier=0xbf52a08, app_ctx=0xc0a6950, ws=0x4508afd0, seqno_g=87209, seqno_l=87942) at mm_galera.c:723
#13 0x00002aabffcd46ec in process_query_write_set (applier=0xbf52a08, app_ctx=0xc0a6950, ws=0x4508afd0, seqno_g=87209, seqno_l=87942) at mm_galera.c:843
#14 0x00002aabffcd51b1 in mm_galera_recv (gh=<value optimized out>, app_ctx=0xc0a6950) at mm_galera.c:902
#15 0x00000000005c03ff in wsrep_replication_process (thd=0xc0a6950) at sql_parse.cc:7911
#16 0x00000000005b7f83 in start_wsrep_THD (arg=0x5c0330) at mysqld.cc:4301
#17 0x00002b06d20ac1b5 in start_thread () from /lib64/libpthread.so.0
#18 0x00002b06d2e7536d in clone () from /lib64/libc.so.6

Note that replay in 5th connection waits for 87943, while slave thread holds 87942

Tags: apply slave smp
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Situation is quite reproduceable. With debug messages turned on we get the following in error log (transaction being replayed is 258545, slave thread deals with 258544):

090315 13:45:21 [Note] DEBUG: local.c:wsdb_deref_seqno():123: DISCARD cont: safe = 256590, next = 256690, next refs = 4
090315 13:45:21 [Note] DEBUG: mm_galera.c:mm_galera_recv():1048: worker: 0 with seqno: (256696 - 258544) type: 0 recvd
090315 13:45:21 [Note] DEBUG: mm_galera.c:process_query_write_set():839: remote trx seqno: 256696 258544 last_seen_trx: 256671, cert: 0
090315 13:45:21 [Note] DEBUG: job_queue.c:job_queue_start_job():141: job: 0 starting

WSREP BF kill, with seqno: 258544
090315 13:45:21 [Warning] kill trx QUERY_COMMITTING for 593775
090315 13:45:21 [Note] DEBUG: mm_galera.c:mm_galera_cancel_commit():1158: interrupting trx commit: trx_id 593775 seqno 258545
090315 13:45:21 [Note] DEBUG: gcs_to.c:gcs_to_interrupt():392: trying to interrupt in use seqno: seqno = 258545, TO seqno = 258545
090315 13:45:21 [Note] DEBUG: mm_galera.c:mm_galera_cancel_commit():1162: trx interupt fail in to_queue: -34
090315 13:45:21 [Note] DEBUG: mm_galera.c:mm_galera_commit():1491: local trx certified, seqno: 256697 258545 last_seen_trx: 256672
090315 13:45:21 [Note] DEBUG: mm_galera.c:mm_galera_commit():1538: interrupted in commit queue for 258545
090315 13:45:21 [ERROR] WSREP commit failed for: 4
090315 13:45:21 [Note] WSREP replay trx
090315 13:45:21 [Note] DEBUG: mm_galera.c:mm_galera_replay_trx():1840: trx_replay for: 258545 256697 state: 3, rbr len: 33996
090315 13:45:21 [Note] DEBUG: mm_galera.c:mm_galera_replay_trx():1858: applier 1
090315 13:45:21 [Note] DEBUG: job_queue.c:job_queue_start_job():141: job: 1 starting
090315 13:45:21 [Note] DEBUG: local.c:read_next_block():612: could not read full cache block, last:96574 cur:96574
090315 13:45:21 [Note] DEBUG: local.c:read_next_block():612: could not read full cache block, last:96574 cur:96574

WSREP BF conflict,order: 258544 258545
WSREP BF kill, with seqno: 258544
090315 13:45:21 [Note] WSREP withdraw for BF trx: 594053, state: 5
090315 13:45:21 [Note] WSREP victim 594053 in state 5
090315 13:45:21 [Note] DEBUG: mm_galera.c:mm_galera_rolledback():1280: trx state: 0 at galera_rolledback for: 9223372036854775807
090315 13:52:13 [Note] DEBUG: mm_galera.c:mm_galera_set_variable():1680: GALERA set var: sql_mode
090315 13:52:14 [Note] DEBUG: mm_galera.c:mm_galera_set_variable():1680: GALERA set var: time_zone

These are the last words of the node.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

This was identified as a bug in galera library and has been fixed already

Changed in codership-mysql:
assignee: nobody → seppo-jaakola
importance: Undecided → High
milestone: none → demo-2
status: New → Fix Committed
Changed in codership-mysql:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.