tokudb+mariadb: server stalled
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Percona Server moved to https://jira.percona.com/projects/PS |
Opinion
|
Undecided
|
Unassigned |
Bug Description
We have a server running several processes of Mariadb (10.0.22) and tables running tokudb engine (tokudb-7.5.7) replicating from a master in a normal master-slave setup.
We realised that one of the proceses was lagging a lot found a query that was stuck. The process was completely stuck (stop slave didn't even work, it was just hanging trying to kill the replication thread). The only solution was to kill -9 mysqld.
This is some troubleshooting.
Affected table structure:
CREATE TABLE `user_groups` (
`ug_user` int(5) unsigned NOT NULL DEFAULT '0',
`ug_group` varbinary(255) NOT NULL DEFAULT '',
PRIMARY KEY (`ug_user`
KEY `ug_group` (`ug_group`)
) ENGINE=TokuDB DEFAULT CHARSET=binary `compression`
The following query has been running for a long time:
*******
Id: 44
User: system user
Host:
db: zhwiki
Command: Connect
Time: 81310
State: update
Info: INSERT /* User::addGroup 127.0.0.1 */ IGNORE INTO `user_groups` (ug_user,ug_group) VALUES ('1546118','bot')
Progress: 0.000
Nothing else is running
Id User Host db Command Time State Info Progress
2 event_scheduler localhost NULL Daemon 177 Waiting for next activation NULL 0.000
42 system user NULL Connect 6676093 Waiting for master to send event NULL 0.000
43 system user NULL Connect 81133 Waiting for prior transaction to commit NULL 0.000
44 system user zhwiki Connect 81133 update INSERT /* User::addGroup 127.0.0.1 */ IGNORE INTO `user_groups` (ug_user,ug_group) VALUES ('1546118','bot') 0.000
45 system user NULL Connect 97087 Waiting for room in worker thread event queue NULL 0.000
830653 repl 10.64.37.5:57304 NULL Binlog Dump 2815399 Master has sent all binlog to slave; waiting for binlog to be updated NULL 0.000
839855 repl 10.64.4.11:44118 NULL Binlog Dump 2769700 Master has sent all binlog to slave; waiting for binlog to be updated NULL 0.000
1254774 watchdog 10.64.0.15:51206 information_schema Sleep 5 NULL 0.000
1254775 watchdog 10.64.0.15:51627 mysql Sleep 0 NULL 0.000
1410875 root localhost NULL Sleep 469 NULL 0.000
1410978 root localhost NULL Query 0 init show full processlist 0.000
*******
Type: InnoDB
Name:
Status:
=======
2016-09-08 13:16:53 7ffa5a5e7700 INNODB MONITOR OUTPUT
=======
Per second averages calculated from the last 9 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 6475205 srv_active, 0 srv_shutdown, 177284 srv_idle
srv_master_thread log flush and writes: 6652339
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 34309329
OS WAIT ARRAY INFO: signal count 51354558
Mutex spin waits 57127135, rounds 892504058, OS waits 25672980
RW-shared spins 13506675, rounds 235123829, OS waits 6441900
RW-excl spins 9307856, rounds 143680432, OS waits 1919206
Spin rounds per wait: 15.62 mutex, 17.41 RW-shared, 15.44 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 5158085477
Purge done for trx's n:o < 5158085476 undo n:o < 0 state: running but idle
History list length 572
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 1410965, OS thread handle 0x7ffa5a5e7700, query id 1309267452 localhost root init
show engine innodb status
---TRANSACTION 4225399955, not started
MySQL thread id 42, OS thread handle 0x7ffa5a585700, query id 0 Waiting for master to send event
---TRANSACTION 5158070750, not started
MySQL thread id 44, OS thread handle 0x7ffa56dce700, query id 1309267436 update
INSERT /* User::addGroup 127.0.0.1 */ IGNORE INTO `user_groups` (ug_user,ug_group) VALUES ('1546118','bot')
---TRANSACTION 4225399812, not started
MySQL thread id 1, OS thread handle 0x7ffa5a73e700, query id 0 Waiting for background binlog tasks
---TRANSACTION 5158070754, ACTIVE (PREPARED) 81069 sec
3 lock struct(s), heap size 1184, 1 row lock(s), undo log entries 3
MySQL thread id 43, OS thread handle 0x7ffa58433700, query id 1309131308 Waiting for prior transaction to commit
Trx #rec lock waits 3121 #table lock waits 0
Trx total rec lock wait time 18 SEC
Trx total table lock wait time 0 SEC
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (read thread)
I/O thread 4 state: waiting for i/o request (read thread)
I/O thread 5 state: waiting for i/o request (read thread)
I/O thread 6 state: waiting for i/o request (read thread)
I/O thread 7 state: waiting for i/o request (read thread)
I/O thread 8 state: waiting for i/o request (read thread)
I/O thread 9 state: waiting for i/o request (read thread)
I/O thread 10 state: waiting for i/o request (read thread)
I/O thread 11 state: waiting for i/o request (read thread)
I/O thread 12 state: waiting for i/o request (read thread)
I/O thread 13 state: waiting for i/o request (read thread)
I/O thread 14 state: waiting for i/o request (read thread)
I/O thread 15 state: waiting for i/o request (read thread)
I/O thread 16 state: waiting for i/o request (read thread)
I/O thread 17 state: waiting for i/o request (read thread)
I/O thread 18 state: waiting for i/o request (write thread)
I/O thread 19 state: waiting for i/o request (write thread)
I/O thread 20 state: waiting for i/o request (write thread)
I/O thread 21 state: waiting for i/o request (write thread)
I/O thread 22 state: waiting for i/o request (write thread)
I/O thread 23 state: waiting for i/o request (write thread)
I/O thread 24 state: waiting for i/o request (write thread)
I/O thread 25 state: waiting for i/o request (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0, 0, 0, 0, 0] ,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
4463767 OS file reads, 562387708 OS file writes, 535312345 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------
Ibuf: size 1, free list len 2600, seg size 2602, 888183 merges
merged operations:
insert 2856041, delete mark 2732633, delete 855981
discarded operations:
insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 2594485252999
Log flushed up to 2594485252999
Pages flushed up to 2594485252999
Last checkpoint at 2594485252999
Max checkpoint age 3478212404
Checkpoint age target 3369518267
Modified age 0
Checkpoint age 0
0 pending log writes, 0 pending chkp writes
514432583 log i/o's done, 0.00 log i/o's/second
-------
BUFFER POOL AND MEMORY
-------
Total memory allocated 8791261184; in additional pool allocated 0
Total memory allocated by read views 432
Internal hash tables (constant factor + variable factor)
Adaptive hash index 833192288 (151384312 + 681807976)
Page hash 1107208 (buffer pool 0 only)
Dictionary cache 37243056 (35402768 + 1840288)
File system 905296 (812272 + 93024)
Lock system 21253192 (21250568 + 2624)
Recovery system 0 (0 + 0)
Dictionary memory allocated 1840288
Buffer pool size 524280
Buffer pool size, bytes 8589803520
Free buffers 8904
Database pages 473762
Old database pages 174877
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 2754509, not young 811412628
0.00 youngs/s, 0.00 non-youngs/s
Pages read 5226290, created 4170968, written 54158543
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 473762, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
-------
INDIVIDUAL BUFFER POOL INFO
-------
---BUFFER POOL 0
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1144
Database pages 59149
Old database pages 21829
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 351299, not young 98612103
0.00 youngs/s, 0.00 non-youngs/s
Pages read 661252, created 524914, written 7393879
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 59149, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 1
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1196
Database pages 59098
Old database pages 21829
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 347217, not young 98373232
0.00 youngs/s, 0.00 non-youngs/s
Pages read 660394, created 522916, written 7025071
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 59098, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 2
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1144
Database pages 59210
Old database pages 21860
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 343753, not young 104864852
0.00 youngs/s, 0.00 non-youngs/s
Pages read 658594, created 526885, written 6525218
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 59210, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 3
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1124
Database pages 59221
Old database pages 21870
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 338562, not young 100821994
0.00 youngs/s, 0.00 non-youngs/s
Pages read 650016, created 518134, written 6897026
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 59221, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 4
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1026
Database pages 59380
Old database pages 21899
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 347327, not young 103727368
0.00 youngs/s, 0.00 non-youngs/s
Pages read 653023, created 523331, written 6911900
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 59380, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 5
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1108
Database pages 59204
Old database pages 21861
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 341341, not young 102578398
0.00 youngs/s, 0.00 non-youngs/s
Pages read 647211, created 518967, written 6812852
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 59204, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 6
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1101
Database pages 59203
Old database pages 21861
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 343242, not young 102572143
0.00 youngs/s, 0.00 non-youngs/s
Pages read 646626, created 518401, written 6240981
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 59203, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 7
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1061
Database pages 59297
Old database pages 21868
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 341768, not young 99862538
0.00 youngs/s, 0.00 non-youngs/s
Pages read 649174, created 517420, written 6351616
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 59297, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
1 RW transactions active inside InnoDB
0 RO transactions active inside InnoDB
1 out of 1000 descriptors used
Main thread process no. 16572, id 140703350900480, state: sleeping
Number of rows inserted 293087206, updated 887359, deleted 58720205, read 73067195
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Number of system rows inserted 505799786, updated 0, deleted 505799788, read 505799789
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
-------
END OF INNODB MONITOR OUTPUT
=======
The replication thread never goes forward, the query is stuck.
An strace doesn't show anything relevant
Process 16572 attached
restart_
fcntl(32, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(32, F_SETFL, O_RDWR|O_NONBLOCK) = 0
accept(32, {sa_family=
fcntl(32, F_SETFL, O_RDWR) = 0
fcntl(3579, F_SETFL, O_RDONLY|
setsockopt(3579, SOL_IP, IP_TOS, [8], 4) = 0
setsockopt(3579, SOL_TCP, TCP_NODELAY, [1], 4) = 0
futex(0x13dafe4, FUTEX_WAKE_
futex(0x13d8660, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=32, events=POLLIN}, {fd=33, events=POLLIN}], 2, 4294967295) = 1 ([{fd=32, revents=POLLIN}])
fcntl(32, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(32, F_SETFL, O_RDWR|O_NONBLOCK) = 0
accept(32, {sa_family=
fcntl(32, F_SETFL, O_RDWR) = 0
fcntl(3579, F_SETFL, O_RDONLY|
setsockopt(3579, SOL_IP, IP_TOS, [8], 4) = 0
setsockopt(3579, SOL_TCP, TCP_NODELAY, [1], 4) = 0
futex(0x13dafe4, FUTEX_WAKE_
poll([{fd=32, events=POLLIN}, {fd=33, events=POLLIN}], 2, 4294967295
Stopping the slave doesn't even work and hangs forever (strace not showing anything relevant when we try to stop it)
accept(32, {sa_family=
fcntl(32, F_SETFL, O_RDWR) = 0
fcntl(3580, F_SETFL, O_RDONLY|
setsockopt(3580, SOL_IP, IP_TOS, [8], 4) = 0
setsockopt(3580, SOL_TCP, TCP_NODELAY, [1], 4) = 0
futex(0x13dafe4, FUTEX_WAKE_
futex(0x13d8660, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=32, events=POLLIN}, {fd=33, events=POLLIN}], 2, 4294967295) = 1 ([{fd=32, revents=POLLIN}])
fcntl(32, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(32, F_SETFL, O_RDWR|O_NONBLOCK) = 0
accept(32, {sa_family=
fcntl(32, F_SETFL, O_RDWR) = 0
fcntl(3581, F_SETFL, O_RDONLY|
setsockopt(3581, SOL_IP, IP_TOS, [8], 4) = 0
setsockopt(3581, SOL_TCP, TCP_NODELAY, [1], 4) = 0
clone(child_
poll([{fd=32, events=POLLIN}, {fd=33, events=POLLIN}], 2, 4294967295) = 1 ([{fd=32, revents=POLLIN}])
fcntl(32, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(32, F_SETFL, O_RDWR|O_NONBLOCK) = 0
accept(32, {sa_family=
fcntl(32, F_SETFL, O_RDWR) = 0
fcntl(3582, F_SETFL, O_RDONLY|
setsockopt(3582, SOL_IP, IP_TOS, [8], 4) = 0
setsockopt(3582, SOL_TCP, TCP_NODELAY, [1], 4) = 0
clone(child_
poll([{fd=32, events=POLLIN}, {fd=33, events=POLLIN}], 2, 4294967295
The only solution is to kill -9 mysqld and let it come back again
After a few hours it happened again, and we migrated the table to InnoDB and never happened again.
We have seen this error in the past and migrating to InnoDB stopped it from happening again.
description: | updated |
Can you ensure that you have debug symbols, then attach gdb and capture a "thread apply all backtrace". Since this is within MariaDB and not Percona Server I also suggest you log it with them and reference this issue as there is the possibility that it is related to some interaction with the server specific code.