range_query_buff and blob_buff in ha_tokudb handler could easily cause OOM
Affects | Status | Importance | Assigned to
---|---|---|---
Percona Server (moved to https://jira.percona.com/projects/PS) | Status tracked in 5.7 | |
5.5 | Invalid | Undecided | Unassigned
5.6 | Confirmed | Undecided | Unassigned
5.7 | Confirmed | Undecided | Unassigned
Bug Description
I tested this with PS 5.6.36-82.0.
## test case as follows:
###
### prepare stage, create a TokuDB partitioned table with 100 partitions
###
#### change the connection string as needed
MYSQL="mysql -h127.0.0.1 -P70011 -uroot"
$MYSQL <<EOF
CREATE TABLE test.t1(
id int,
uid int,
post longtext,
index idx_uid (uid)
) ENGINE=TokuDB
PARTITION BY RANGE(uid) (
PARTITION p0 VALUES LESS THAN (100),
PARTITION p1 VALUES LESS THAN (200),
PARTITION p2 VALUES LESS THAN (300),
PARTITION p3 VALUES LESS THAN (400),
PARTITION p4 VALUES LESS THAN (500),
PARTITION p5 VALUES LESS THAN (600),
PARTITION p6 VALUES LESS THAN (700),
PARTITION p7 VALUES LESS THAN (800),
PARTITION p8 VALUES LESS THAN (900),
PARTITION p9 VALUES LESS THAN (1000),
PARTITION p10 VALUES LESS THAN (1100),
PARTITION p11 VALUES LESS THAN (1200),
PARTITION p12 VALUES LESS THAN (1300),
PARTITION p13 VALUES LESS THAN (1400),
PARTITION p14 VALUES LESS THAN (1500),
PARTITION p15 VALUES LESS THAN (1600),
PARTITION p16 VALUES LESS THAN (1700),
PARTITION p17 VALUES LESS THAN (1800),
PARTITION p18 VALUES LESS THAN (1900),
PARTITION p19 VALUES LESS THAN (2000),
PARTITION p20 VALUES LESS THAN (2100),
PARTITION p21 VALUES LESS THAN (2200),
PARTITION p22 VALUES LESS THAN (2300),
PARTITION p23 VALUES LESS THAN (2400),
PARTITION p24 VALUES LESS THAN (2500),
PARTITION p25 VALUES LESS THAN (2600),
PARTITION p26 VALUES LESS THAN (2700),
PARTITION p27 VALUES LESS THAN (2800),
PARTITION p28 VALUES LESS THAN (2900),
PARTITION p29 VALUES LESS THAN (3000),
PARTITION p30 VALUES LESS THAN (3100),
PARTITION p31 VALUES LESS THAN (3200),
PARTITION p32 VALUES LESS THAN (3300),
PARTITION p33 VALUES LESS THAN (3400),
PARTITION p34 VALUES LESS THAN (3500),
PARTITION p35 VALUES LESS THAN (3600),
PARTITION p36 VALUES LESS THAN (3700),
PARTITION p37 VALUES LESS THAN (3800),
PARTITION p38 VALUES LESS THAN (3900),
PARTITION p39 VALUES LESS THAN (4000),
PARTITION p40 VALUES LESS THAN (4100),
PARTITION p41 VALUES LESS THAN (4200),
PARTITION p42 VALUES LESS THAN (4300),
PARTITION p43 VALUES LESS THAN (4400),
PARTITION p44 VALUES LESS THAN (4500),
PARTITION p45 VALUES LESS THAN (4600),
PARTITION p46 VALUES LESS THAN (4700),
PARTITION p47 VALUES LESS THAN (4800),
PARTITION p48 VALUES LESS THAN (4900),
PARTITION p49 VALUES LESS THAN (5000),
PARTITION p50 VALUES LESS THAN (5100),
PARTITION p51 VALUES LESS THAN (5200),
PARTITION p52 VALUES LESS THAN (5300),
PARTITION p53 VALUES LESS THAN (5400),
PARTITION p54 VALUES LESS THAN (5500),
PARTITION p55 VALUES LESS THAN (5600),
PARTITION p56 VALUES LESS THAN (5700),
PARTITION p57 VALUES LESS THAN (5800),
PARTITION p58 VALUES LESS THAN (5900),
PARTITION p59 VALUES LESS THAN (6000),
PARTITION p60 VALUES LESS THAN (6100),
PARTITION p61 VALUES LESS THAN (6200),
PARTITION p62 VALUES LESS THAN (6300),
PARTITION p63 VALUES LESS THAN (6400),
PARTITION p64 VALUES LESS THAN (6500),
PARTITION p65 VALUES LESS THAN (6600),
PARTITION p66 VALUES LESS THAN (6700),
PARTITION p67 VALUES LESS THAN (6800),
PARTITION p68 VALUES LESS THAN (6900),
PARTITION p69 VALUES LESS THAN (7000),
PARTITION p70 VALUES LESS THAN (7100),
PARTITION p71 VALUES LESS THAN (7200),
PARTITION p72 VALUES LESS THAN (7300),
PARTITION p73 VALUES LESS THAN (7400),
PARTITION p74 VALUES LESS THAN (7500),
PARTITION p75 VALUES LESS THAN (7600),
PARTITION p76 VALUES LESS THAN (7700),
PARTITION p77 VALUES LESS THAN (7800),
PARTITION p78 VALUES LESS THAN (7900),
PARTITION p79 VALUES LESS THAN (8000),
PARTITION p80 VALUES LESS THAN (8100),
PARTITION p81 VALUES LESS THAN (8200),
PARTITION p82 VALUES LESS THAN (8300),
PARTITION p83 VALUES LESS THAN (8400),
PARTITION p84 VALUES LESS THAN (8500),
PARTITION p85 VALUES LESS THAN (8600),
PARTITION p86 VALUES LESS THAN (8700),
PARTITION p87 VALUES LESS THAN (8800),
PARTITION p88 VALUES LESS THAN (8900),
PARTITION p89 VALUES LESS THAN (9000),
PARTITION p90 VALUES LESS THAN (9100),
PARTITION p91 VALUES LESS THAN (9200),
PARTITION p92 VALUES LESS THAN (9300),
PARTITION p93 VALUES LESS THAN (9400),
PARTITION p94 VALUES LESS THAN (9500),
PARTITION p95 VALUES LESS THAN (9600),
PARTITION p96 VALUES LESS THAN (9700),
PARTITION p97 VALUES LESS THAN (9800),
PARTITION p98 VALUES LESS THAN (9900),
PARTITION p99 VALUES LESS THAN MAXVALUE
);
EOF
#### populate the partitions, one 10MB longtext record per partition (iterations past i=98 all land in the MAXVALUE partition)
for i in `seq 0 199`; do value=$((($i + 1) * 100 - 1)); $MYSQL -e "insert into test.t1 values($i, $value, repeat('MySQL', 2*1024*1024));"; done
###
### sysbench pressure stage: run the Lua script below with sysbench 0.5 and 64 threads
###
pathtest = string.match(test, "(.*/)")
if pathtest then
   dofile(pathtest .. "common.lua")
else
   require("common")
end

function thread_init(thread_id)
   set_vars()
end

function event(thread_id)
   -- use a bad query to make sure every partition is searched
   db_query("select * from t1 where id = 100")
end
###
### monitor stage, use top or `pidstat -r` to monitor mysqld memory usage
### we can see that RSS memory usage will grow up to 129G !!!!!!
###
### And even more memory is consumed as the number of sysbench threads increases.
###
08:28:39 PM PID minflt/s majflt/s VSZ RSS %MEM Command
08:28:40 PM 15748 0.00 0.00 7963124 437492 0.08 mysqld
08:28:41 PM 15748 0.00 0.00 7963124 437492 0.08 mysqld
08:28:42 PM 15748 166724.00 0.00 8596212 671604 0.13 mysqld
08:28:43 PM 15748 249058.00 0.00 9456372 971248 0.18 mysqld
08:28:44 PM 15748 180253.00 0.00 10873588 1446404 0.27 mysqld
08:28:45 PM 15748 1173937.00 0.00 15592180 4637980 0.88 mysqld
08:28:46 PM 15748 2926548.00 0.00 28386036 15495996 2.93 mysqld
08:28:47 PM 15748 2972244.00 0.00 42666740 27307432 5.16 mysqld
08:28:48 PM 15748 2987646.00 0.00 56935156 39181144 7.40 mysqld
08:28:49 PM 15748 2988845.00 0.00 71178996 51060980 9.65 mysqld
08:28:50 PM 15748 2996056.00 0.00 85445364 62969896 11.90 mysqld
08:28:51 PM 15748 2984125.00 0.00 99676916 74830384 14.14 mysqld
08:28:52 PM 15748 2983838.00 0.00 113871604 86689420 16.38 mysqld
08:28:53 PM 15748 2975173.00 0.00 128140020 98513336 18.61 mysqld
08:28:54 PM 15748 2962377.00 0.00 142236404 110287912 20.84 mysqld
08:28:55 PM 15748 2977520.00 0.00 156381940 122122028 23.07 mysqld
08:28:56 PM 15748 2881326.00 0.00 170052340 133579136 25.24 mysqld
08:28:57 PM 15748 348169.00 0.00 171418356 134846956 25.48 mysqld
08:28:58 PM 15748 79710.00 0.00 171760372 135022380 25.51 mysqld
08:28:59 PM 15748 79712.00 0.00 172102388 135197812 25.54 mysqld
08:29:00 PM 15748 72843.00 0.00 172430068 135366320 25.57 mysqld
08:29:01 PM 15748 79006.00 0.00 172772084 135538936 25.61 mysqld
08:29:02 PM 15748 79604.00 0.00 173099764 135713864 25.64 mysqld
08:29:03 PM 15748 34403.00 0.00 173243124 135779768 25.65 mysqld
08:29:04 PM 15748 0.00 0.00 173243124 135779768 25.65 mysqld
08:29:05 PM 15748 0.00 0.00 173243124 135779768 25.65 mysqld
08:29:06 PM 15748 0.00 0.00 173243124 135779768 25.65 mysqld
The RSS grew from 437492 KB (~427 MB) to 135779768 KB (~129 GB) in less than 30s!
BTW, the tokudb_cache_size I used is about 5G.
After sysbench stopped, the RSS stayed at this high level until I issued
`FLUSH TABLE test.t1` to close the table.
###
### problem analysis
###
If we start mysqld under `valgrind --tool=massif` (massif is the memory profiling tool in valgrind), we can see clearly where the memory goes:
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
-------
85 80,910,947,821 127,076,668,408 127,007,881,441 68,786,967 0
86 81,002,324,005 127,201,890,632 127,133,055,395 68,835,237 0
99.95% (127,133,055,395B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->49.18% (62,561,194,013B) 0xF909B1: my_malloc (my_malloc.c:38)
| ->48.80% (62,075,750,631B) 0xF91070: my_realloc (my_malloc.c:95)
| | ->48.80% (62,075,722,880B) 0x1692F5E: tokudb:
| | | ->48.80% (62,075,722,880B) 0x1638FDE: ha_tokudb:
| | | ->48.80% (62,075,722,880B) 0x16498E8: ha_tokudb:
| | | ->48.80% (62,075,722,880B) 0x164AF92: ha_tokudb:
| | | ->48.80% (62,075,722,880B) 0x164CF6C: ha_tokudb:
| | | ->48.80% (62,075,722,880B) 0x6CDF3F: handler:
| | | ->48.80% (62,075,722,880B) 0x193ECE5: ha_partition:
| | | ->48.80% (62,075,722,880B) 0x6CDF3F: handler:
->48.95% (62,265,682,762B) 0xF9109F: my_realloc (my_malloc.c:99)
| ->48.95% (62,264,615,082B) 0x1692F5E: tokudb:
| | ->48.95% (62,264,615,082B) 0x164A096: ha_tokudb: (ha_tokudb.cc:5299)
| | ->48.95% (62,264,615,082B) 0x1649AA4: smart_dbt_
| | ->48.95% (62,264,615,082B) 0x17E5006: c_getf_
| | ->48.95% (62,264,615,082B) 0x17BCA55: ft_search_
| | ->48.95% (62,264,615,082B) 0x17BE43A: ft_search_
| | ->48.65% (61,887,126,678B) 0x17BE9D0: toku_ft_
We can see that almost half of the memory (49.18%, 62,561,194,013 B) was allocated via my_malloc and nearly all of the rest via my_realloc, all from the ha_tokudb call stacks above.
To make things clearer, I added some print info to ha_tokudb::close():
```
diff --git a/storage/tokudb/ha_tokudb.cc b/storage/tokudb/ha_tokudb.cc
index 9079d15..a1be44b 100644
--- a/storage/tokudb/ha_tokudb.cc
+++ b/storage/tokudb/ha_tokudb.cc
@@ -2218,6 +2218,11 @@ int ha_tokudb::close() {
     TOKUDB_...
+    sql_print_error(
+        "TokuDB: Table %s blob_buff size is %lu, range_query_buff size is %u",
+        share->..., ..., size_range_query_buff);
     tokudb:...
```
We'll see many error messages in the error log like these:
2017-07-23 20:35:41 15748 [ERROR] TokuDB: Table ./test/t1#P#p40 blob_buff size is 10485764, range_query_buff size is 10485789
2017-07-23 20:35:41 15748 [ERROR] TokuDB: Table ./test/t1#P#p41 blob_buff size is 10485764, range_query_buff size is 10485789
Now the problem is very clear: the memory is mainly consumed by `ha_tokudb::range_query_buff` and `ha_tokudb::blob_buff`. Each ha_tokudb handler
has two such buffer instances, and a BAD query that needs to search every partition requires 100 TokuDB handlers.
So the total memory needed for one thread is 100*2*10M = 2G, and the total memory for 64 threads is 64*2G = 128G!!!
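The back-of-the-envelope arithmetic above can be checked with a quick script (buffer sizes are taken from the error-log lines above; the handler and thread counts come from the test setup):

```python
MIB = 1024 * 1024

# Both per-handler buffers grow to roughly 10 MB, because every row
# carries a 10 MB longtext (see the blob_buff/range_query_buff sizes
# printed in the error log above).
range_query_buff = 10 * MIB
blob_buff = 10 * MIB

partitions = 100   # one ha_tokudb handler per partition for the bad query
threads = 64       # sysbench threads, each holding its own set of handlers

per_thread = partitions * (range_query_buff + blob_buff)
total = threads * per_thread

print(per_thread // MIB, "MiB per thread")   # 2000 MiB per thread (~2G)
print(total // (1024 * MIB), "GiB total")    # 125 GiB total (matches the ~129G RSS)
```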
Maybe we should release `ha_tokudb::range_query_buff` and `ha_tokudb::blob_buff` when each query ends, instead of holding them until the table is closed.
Or some more elegant approach could be taken.
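The grow-and-hold behaviour, and the effect of releasing the buffers at end of query, can be modelled abstractly. This is a toy Python sketch, not TokuDB code: `ToyHandler`, `read_row`, and `release_after` are invented names standing in for the handler, its blob/range reads, and whatever hook would free the buffers.

```python
class ToyHandler:
    """Toy model of a per-partition handler with a grow-only read buffer."""

    def __init__(self):
        self.buff = bytearray()  # stands in for range_query_buff / blob_buff

    def read_row(self, row_size, release_after=False):
        # my_realloc-style growth: the buffer only ever gets bigger, and
        # today it is freed only when the handler itself is closed.
        if len(self.buff) < row_size:
            self.buff = bytearray(row_size)
        if release_after:  # sketch of the proposed fix: free per query
            self.buff = bytearray()

ROW = 10 * 1024 * 1024  # one 10 MB longtext per partition

# Current behaviour: 100 handlers each pin ~10 MB until FLUSH TABLE/close.
handlers = [ToyHandler() for _ in range(100)]
for h in handlers:
    h.read_row(ROW)
pinned = sum(len(h.buff) for h in handlers)    # ~1000 MiB per buffer type

# With release-at-end-of-query, nothing stays pinned between queries.
for h in handlers:
    h.read_row(ROW, release_after=True)
released = sum(len(h.buff) for h in handlers)  # 0
```

The trade-off is re-allocating the buffers on every query versus pinning worst-case-sized buffers for the lifetime of each open handler.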
Changed in percona-server:
assignee: nobody → Sveta Smirnova (svetasmirnova)
Thank you for the report.
Verified as described. Maybe related to https://bugs.mysql.com/bug.php?id=38002 and https://bugs.mysql.com/bug.php?id=15089, though not repeatable with InnoDB.