slave applier broken on SMP (row_insert_for_mysql())
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL patches by Codership | Fix Released | High | Seppo Jaakola |
Bug Description
During DBT2 benchmarking on Amazon large nodes the following happened:
On three of the four nodes, all five connections are sleeping, waiting for a flow control event.
On the fourth node, four connections are waiting for TO (total order) in the commit phase, like this:
```
#0 0x00002b06d20b0376 in pthread_
#1 0x00002aac000f2c6c in gcs_to_grab (to=0xbe72990, seqno=87944) at gcs_to.c:190
#2 0x00002aabffcd3dd9 in mm_galera_commit (gh=<value optimized out>, trx_id=159180, conn_id=<value optimized out>, rbr_data=<value optimized out>, rbr_data_len=<value optimized out>) at mm_galera.c:585
#3 0x0000000000747dc4 in wsrep_run_
#4 0x0000000000748140 in innobase_commit (hton=0xd81830, thd=0xd29afb0, all=true) at handler/
#5 0x00000000006a898c in ha_commit_one_phase (thd=0xd29afb0, all=true) at handler.cc:1202
#6 0x00000000006a9615 in ha_commit_trans (thd=0xd29afb0, all=<value optimized out>) at handler.cc:1171
#7 0x00000000005c144c in end_trans (thd=0xd29afb0, completion=COMMIT) at sql_parse.cc:742
#8 0x00000000005c43d0 in mysql_execute_
#9 0x00000000005ca0b1 in mysql_parse (thd=0xd29afb0, inBuf=0xd2fefb0 "COMMIT", length=6, found_semicolon
#10 0x00000000005cb02a in dispatch_command (command=COM_QUERY, thd=0xd29afb0, packet=<value optimized out>, packet_
#11 0x00000000005cc036 in do_command (thd=0xd29afb0) at sql_parse.cc:925
#12 0x00000000005bb0d9 in handle_
#13 0x00002b06d20ac1b5 in start_thread () from /lib64/
#14 0x00002b06d2e7536d in clone () from /lib64/libc.so.6
```
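For context, the `gcs_to_grab` frames above come from Galera's total-order (TO) monitor: a committing transaction may enter the critical section only when its seqno is next in line. The following is a minimal sketch of such a monitor, with hypothetical names and a deliberately simplified API (the real implementation lives in gcs_to.c and is considerably more involved):

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical total-order monitor sketch: a transaction with
 * sequence number `seqno` may enter its critical section only after
 * every earlier seqno has been released. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int64_t         next;   /* lowest seqno not yet released */
} to_monitor_t;

void to_init(to_monitor_t *to, int64_t first)
{
    pthread_mutex_init(&to->lock, NULL);
    pthread_cond_init(&to->cond, NULL);
    to->next = first;
}

/* Blocks until it is seqno's turn; this is where the connections in
 * the backtraces above are parked. */
void to_grab(to_monitor_t *to, int64_t seqno)
{
    pthread_mutex_lock(&to->lock);
    while (to->next != seqno)
        pthread_cond_wait(&to->cond, &to->lock);
    pthread_mutex_unlock(&to->lock);
}

/* Lets the next seqno proceed.  If the holder of `seqno` never gets
 * here (e.g. it is stuck inside InnoDB), everyone queued behind it
 * waits forever -- the hang reported in this bug. */
void to_release(to_monitor_t *to, int64_t seqno)
{
    pthread_mutex_lock(&to->lock);
    if (to->next == seqno) {
        to->next = seqno + 1;
        pthread_cond_broadcast(&to->cond);
    }
    pthread_mutex_unlock(&to->lock);
}
```

The key property is that releases must happen in seqno order; one holder that never releases stalls the entire commit pipeline.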
The 5th connection does not show up in the MySQL processlist, but can be seen in the debugger:
```
#0 0x00002b06d20b0376 in pthread_
#1 0x00002aac000f2c6c in gcs_to_grab (to=0xbe72990, seqno=87943) at gcs_to.c:190
#2 0x00002aabffcd4191 in process_
#3 0x00002aabffcd4bc3 in mm_galera_
#4 0x00000000005caa28 in dispatch_command (command=<value optimized out>, thd=0xc3d48d0, packet=<value optimized out>, packet_
#5 0x00000000005cc036 in do_command (thd=0xc3d48d0) at sql_parse.cc:925
#6 0x00000000005bb0d9 in handle_
#7 0x00002b06d20ac1b5 in start_thread () from /lib64/
#8 0x00002b06d2e7536d in clone () from /lib64/libc.so.6
```
The slave thread, at the same time, looks like this:
```
#0 0x00002b06d20b0376 in pthread_
#1 0x00000000007af9e5 in os_event_wait_low (event=0x1521af0, reset_sig_
#2 0x00000000007dfdf8 in srv_suspend_
#3 0x00000000007c5c6e in row_mysql_
#4 0x00000000007c703d in row_insert_
#5 0x00000000007487cd in ha_innobase:
#6 0x00000000006ac22f in handler:
#7 0x000000000067a0b6 in Rows_log_
#8 0x000000000067a442 in Write_rows_
#9 0x0000000000674f5b in Rows_log_
#10 0x0000000000684176 in wsrep_bf_
#11 0x00002aabffcd228d in apply_write_set (app_ctx=0xc0a6950, ws=0x4508afd0) at mm_galera.c:504
#12 0x00002aabffcd415d in process_
#13 0x00002aabffcd46ec in process_
#14 0x00002aabffcd51b1 in mm_galera_recv (gh=<value optimized out>, app_ctx=0xc0a6950) at mm_galera.c:902
#15 0x00000000005c03ff in wsrep_replicati
#16 0x00000000005b7f83 in start_wsrep_THD (arg=0x5c0330) at mysqld.cc:4301
#17 0x00002b06d20ac1b5 in start_thread () from /lib64/
#18 0x00002b06d2e7536d in clone () from /lib64/libc.so.6
```
Note that the replay in the 5th connection waits for seqno 87943, while the slave thread still holds seqno 87942. The slave applier is suspended inside InnoDB (under `row_insert_for_mysql()`, frames #2–#4 above), so it never releases 87942, and neither the replay nor any commit queued behind it can ever proceed.
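The invariant behind this hang can be stated very simply: a seqno may enter the TO critical section only once every earlier seqno has been released. A tiny sketch (hypothetical names, not the gcs_to.c API) of why 87943 cannot make progress while 87942 is still held:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: last_released tracks the highest seqno whose holder has
 * left the TO critical section. */
typedef struct {
    int64_t last_released;
} to_state_t;

/* seqno may enter only if all earlier seqnos are done. */
bool to_can_grab(const to_state_t *to, int64_t seqno)
{
    return seqno == to->last_released + 1;
}
```

With the slave applier stuck inside InnoDB while holding 87942, `last_released` stays at 87941 indefinitely, so a grab of 87943 can never succeed; the four stalled COMMITs on the fourth node are queued behind the same gap.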
Changed in codership-mysql:
status: Fix Committed → Fix Released
The situation is quite reproducible. With debug messages turned on, we get the following in the error log (the transaction being replayed is 258545; the slave thread is applying 258544):
```
090315 13:45:21 [Note] DEBUG: local.c: wsdb_deref_seqno():123: DISCARD cont: safe = 256590, next = 256690, next refs = 4
090315 13:45:21 [Note] DEBUG: mm_galera.c: mm_galera_recv():1048: worker: 0 with seqno: (256696 - 258544) type: 0 recvd
090315 13:45:21 [Note] DEBUG: mm_galera.c: process_query_write_set():839: remote trx seqno: 256696 258544 last_seen_trx: 256671, cert: 0
090315 13:45:21 [Note] DEBUG: job_queue.c: job_queue_start_job():141: job: 0 starting
090315 13:45:21 [Warning] kill trx QUERY_COMMITTING for 593775
WSREP BF kill, with seqno: 258544
090315 13:45:21 [Note] DEBUG: mm_galera.c: mm_galera_cancel_commit():1158: interrupting trx commit: trx_id 593775 seqno 258545
090315 13:45:21 [Note] DEBUG: gcs_to.c: gcs_to_interrupt():392: trying to interrupt in use seqno: seqno = 258545, TO seqno = 258545
090315 13:45:21 [Note] DEBUG: mm_galera.c: mm_galera_cancel_commit():1162: trx interupt fail in to_queue: -34
090315 13:45:21 [Note] DEBUG: mm_galera.c: mm_galera_commit():1491: local trx certified, seqno: 256697 258545 last_seen_trx: 256672
090315 13:45:21 [Note] DEBUG: mm_galera.c: mm_galera_commit():1538: interrupted in commit queue for 258545
090315 13:45:21 [ERROR] WSREP commit failed for: 4
090315 13:45:21 [Note] WSREP replay trx
090315 13:45:21 [Note] DEBUG: mm_galera.c: mm_galera_replay_trx():1840: trx_replay for: 258545 256697 state: 3, rbr len: 33996
090315 13:45:21 [Note] DEBUG: mm_galera.c: mm_galera_replay_trx():1858: applier 1
090315 13:45:21 [Note] DEBUG: job_queue.c: job_queue_start_job():141: job: 1 starting
090315 13:45:21 [Note] DEBUG: local.c: read_next_block():612: could not read full cache block, last:96574 cur:96574
090315 13:45:21 [Note] DEBUG: local.c: read_next_block():612: could not read full cache block, last:96574 cur:96574
WSREP BF conflict, order: 258544 258545
WSREP BF kill, with seqno: 258544
090315 13:45:21 [Note] WSREP withdraw for BF trx: 594053, state: 5
090315 13:45:21 [Note] WSREP victim 594053 in state 5
090315 13:45:21 [Note] DEBUG: mm_galera.c: mm_galera_rolledback():1280: trx state: 0 at galera_rolledback for: 9223372036854775807
090315 13:52:13 [Note] DEBUG: mm_galera.c: mm_galera_set_variable():1680: GALERA set var: sql_mode
090315 13:52:14 [Note] DEBUG: mm_galera.c: mm_galera_set_variable():1680: GALERA set var: time_zone
```
These are the last words of the node.