range_query_buff and blob_buff in ha_tokudb handler could easily cause OOM

Bug #1705914 reported by Fungo Wang
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
Status tracked in 5.7
5.5
Invalid
Undecided
Unassigned
5.6
Confirmed
Undecided
Unassigned
5.7
Confirmed
Undecided
Unassigned

Bug Description

I test this with PS 5.6.36-82.0:

## test case as follow:

###
### prepare stage, create a TokuDB partition table with 100 paritions
###

#### change the connection string as needed
MYSQL="mysql -h127.0.0.1 -P70011 -uroot"

$MYSQL <<EOF
CREATE TABLE test.t1(
id int,
uid int,
post longtext,
index idx_uid (uid)
) ENGINE=TokuDB
PARTITION BY RANGE(uid) (
PARTITION p0 VALUES LESS THAN (100),
PARTITION p1 VALUES LESS THAN (200),
PARTITION p2 VALUES LESS THAN (300),
PARTITION p3 VALUES LESS THAN (400),
PARTITION p4 VALUES LESS THAN (500),
PARTITION p5 VALUES LESS THAN (600),
PARTITION p6 VALUES LESS THAN (700),
PARTITION p7 VALUES LESS THAN (800),
PARTITION p8 VALUES LESS THAN (900),
PARTITION p9 VALUES LESS THAN (1000),
PARTITION p10 VALUES LESS THAN (1100),
PARTITION p11 VALUES LESS THAN (1200),
PARTITION p12 VALUES LESS THAN (1300),
PARTITION p13 VALUES LESS THAN (1400),
PARTITION p14 VALUES LESS THAN (1500),
PARTITION p15 VALUES LESS THAN (1600),
PARTITION p16 VALUES LESS THAN (1700),
PARTITION p17 VALUES LESS THAN (1800),
PARTITION p18 VALUES LESS THAN (1900),
PARTITION p19 VALUES LESS THAN (2000),
PARTITION p20 VALUES LESS THAN (2100),
PARTITION p21 VALUES LESS THAN (2200),
PARTITION p22 VALUES LESS THAN (2300),
PARTITION p23 VALUES LESS THAN (2400),
PARTITION p24 VALUES LESS THAN (2500),
PARTITION p25 VALUES LESS THAN (2600),
PARTITION p26 VALUES LESS THAN (2700),
PARTITION p27 VALUES LESS THAN (2800),
PARTITION p28 VALUES LESS THAN (2900),
PARTITION p29 VALUES LESS THAN (3000),
PARTITION p30 VALUES LESS THAN (3100),
PARTITION p31 VALUES LESS THAN (3200),
PARTITION p32 VALUES LESS THAN (3300),
PARTITION p33 VALUES LESS THAN (3400),
PARTITION p34 VALUES LESS THAN (3500),
PARTITION p35 VALUES LESS THAN (3600),
PARTITION p36 VALUES LESS THAN (3700),
PARTITION p37 VALUES LESS THAN (3800),
PARTITION p38 VALUES LESS THAN (3900),
PARTITION p39 VALUES LESS THAN (4000),
PARTITION p40 VALUES LESS THAN (4100),
PARTITION p41 VALUES LESS THAN (4200),
PARTITION p42 VALUES LESS THAN (4300),
PARTITION p43 VALUES LESS THAN (4400),
PARTITION p44 VALUES LESS THAN (4500),
PARTITION p45 VALUES LESS THAN (4600),
PARTITION p46 VALUES LESS THAN (4700),
PARTITION p47 VALUES LESS THAN (4800),
PARTITION p48 VALUES LESS THAN (4900),
PARTITION p49 VALUES LESS THAN (5000),
PARTITION p50 VALUES LESS THAN (5100),
PARTITION p51 VALUES LESS THAN (5200),
PARTITION p52 VALUES LESS THAN (5300),
PARTITION p53 VALUES LESS THAN (5400),
PARTITION p54 VALUES LESS THAN (5500),
PARTITION p55 VALUES LESS THAN (5600),
PARTITION p56 VALUES LESS THAN (5700),
PARTITION p57 VALUES LESS THAN (5800),
PARTITION p58 VALUES LESS THAN (5900),
PARTITION p59 VALUES LESS THAN (6000),
PARTITION p60 VALUES LESS THAN (6100),
PARTITION p61 VALUES LESS THAN (6200),
PARTITION p62 VALUES LESS THAN (6300),
PARTITION p63 VALUES LESS THAN (6400),
PARTITION p64 VALUES LESS THAN (6500),
PARTITION p65 VALUES LESS THAN (6600),
PARTITION p66 VALUES LESS THAN (6700),
PARTITION p67 VALUES LESS THAN (6800),
PARTITION p68 VALUES LESS THAN (6900),
PARTITION p69 VALUES LESS THAN (7000),
PARTITION p70 VALUES LESS THAN (7100),
PARTITION p71 VALUES LESS THAN (7200),
PARTITION p72 VALUES LESS THAN (7300),
PARTITION p73 VALUES LESS THAN (7400),
PARTITION p74 VALUES LESS THAN (7500),
PARTITION p75 VALUES LESS THAN (7600),
PARTITION p76 VALUES LESS THAN (7700),
PARTITION p77 VALUES LESS THAN (7800),
PARTITION p78 VALUES LESS THAN (7900),
PARTITION p79 VALUES LESS THAN (8000),
PARTITION p80 VALUES LESS THAN (8100),
PARTITION p81 VALUES LESS THAN (8200),
PARTITION p82 VALUES LESS THAN (8300),
PARTITION p83 VALUES LESS THAN (8400),
PARTITION p84 VALUES LESS THAN (8500),
PARTITION p85 VALUES LESS THAN (8600),
PARTITION p86 VALUES LESS THAN (8700),
PARTITION p87 VALUES LESS THAN (8800),
PARTITION p88 VALUES LESS THAN (8900),
PARTITION p89 VALUES LESS THAN (9000),
PARTITION p90 VALUES LESS THAN (9100),
PARTITION p91 VALUES LESS THAN (9200),
PARTITION p92 VALUES LESS THAN (9300),
PARTITION p93 VALUES LESS THAN (9400),
PARTITION p94 VALUES LESS THAN (9500),
PARTITION p95 VALUES LESS THAN (9600),
PARTITION p96 VALUES LESS THAN (9700),
PARTITION p97 VALUES LESS THAN (9800),
PARTITION p98 VALUES LESS THAN (9900),
PARTITION p99 VALUES LESS THAN MAXVALUE
);
EOF

#### populate every parition with one record, each has a 10MB longtext
for i in `seq 0 199`; do value=$((($i + 1) * 100 - 1)); $MYSQL -e "insert into test.t1 values($i, $value, repeat('MySQL', 2*1024*1024));"; done

###
### sysbench presure stage, run below lua scripts with sysbench 0.5 with 64 threads
###

pathtest = string.match(test, "(.*/)")

if pathtest then
   dofile(pathtest .. "common.lua")
else
   require("common")
end

function thread_init(thread_id)
   set_vars()
end

function event(thread_id)
   --use a bad query to make sure every partition is searched
   db_query("select * from t1 where id = 100")
end

###
### monitor stage, use top or `pidstat -r` to monitor mysqld memeory usage
### we can see that RSS memory usage will grow up to 129G !!!!!!
###
### And the more memory will be consumed if the more sysbench threads number is given.
###

08:28:39 PM PID minflt/s majflt/s VSZ RSS %MEM Command
08:28:40 PM 15748 0.00 0.00 7963124 437492 0.08 mysqld
08:28:41 PM 15748 0.00 0.00 7963124 437492 0.08 mysqld
08:28:42 PM 15748 166724.00 0.00 8596212 671604 0.13 mysqld
08:28:43 PM 15748 249058.00 0.00 9456372 971248 0.18 mysqld
08:28:44 PM 15748 180253.00 0.00 10873588 1446404 0.27 mysqld
08:28:45 PM 15748 1173937.00 0.00 15592180 4637980 0.88 mysqld
08:28:46 PM 15748 2926548.00 0.00 28386036 15495996 2.93 mysqld
08:28:47 PM 15748 2972244.00 0.00 42666740 27307432 5.16 mysqld
08:28:48 PM 15748 2987646.00 0.00 56935156 39181144 7.40 mysqld
08:28:49 PM 15748 2988845.00 0.00 71178996 51060980 9.65 mysqld
08:28:50 PM 15748 2996056.00 0.00 85445364 62969896 11.90 mysqld
08:28:51 PM 15748 2984125.00 0.00 99676916 74830384 14.14 mysqld
08:28:52 PM 15748 2983838.00 0.00 113871604 86689420 16.38 mysqld
08:28:53 PM 15748 2975173.00 0.00 128140020 98513336 18.61 mysqld
08:28:54 PM 15748 2962377.00 0.00 142236404 110287912 20.84 mysqld
08:28:55 PM 15748 2977520.00 0.00 156381940 122122028 23.07 mysqld
08:28:56 PM 15748 2881326.00 0.00 170052340 133579136 25.24 mysqld
08:28:57 PM 15748 348169.00 0.00 171418356 134846956 25.48 mysqld
08:28:58 PM 15748 79710.00 0.00 171760372 135022380 25.51 mysqld
08:28:59 PM 15748 79712.00 0.00 172102388 135197812 25.54 mysqld
08:29:00 PM 15748 72843.00 0.00 172430068 135366320 25.57 mysqld
08:29:01 PM 15748 79006.00 0.00 172772084 135538936 25.61 mysqld
08:29:02 PM 15748 79604.00 0.00 173099764 135713864 25.64 mysqld
08:29:03 PM 15748 34403.00 0.00 173243124 135779768 25.65 mysqld
08:29:04 PM 15748 0.00 0.00 173243124 135779768 25.65 mysqld
08:29:05 PM 15748 0.00 0.00 173243124 135779768 25.65 mysqld
08:29:06 PM 15748 0.00 0.00 173243124 135779768 25.65 mysqld

The RSS grow from 437492(427M) to 135779768(129G) in less than 30s!
BTW, the tokudb_cache_size I used is about 5G.

After the sysbench stopped, the RSS still keep at a high number, until I issued
`FLUSH TABLE test.t1` to close this table.

###
### problem analysis
###

If start mysql using `valgrind --tool=massif mysqld`(massif is a memory profiling tool in valgrind), we will see clearly where the memory goes to:

  n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
 85 80,910,947,821 127,076,668,408 127,007,881,441 68,786,967 0
 86 81,002,324,005 127,201,890,632 127,133,055,395 68,835,237 0
99.95% (127,133,055,395B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->49.18% (62,561,194,013B) 0xF909B1: my_malloc (my_malloc.c:38)
| ->48.80% (62,075,750,631B) 0xF91070: my_realloc (my_malloc.c:95)
| | ->48.80% (62,075,722,880B) 0x1692F5E: tokudb::memory::realloc(void*, unsigned long, int) (tokudb_memory.h:55)
| | | ->48.80% (62,075,722,880B) 0x1638FDE: ha_tokudb::unpack_blobs(unsigned char*, unsigned char const*, unsigned int, bool) (ha_tokudb.cc:2412)
| | | ->48.80% (62,075,722,880B) 0x16498E8: ha_tokudb::read_data_from_range_query_buff(unsigned char*, bool, bool) (ha_tokudb.cc:5155)
| | | ->48.80% (62,075,722,880B) 0x164AF92: ha_tokudb::get_next(unsigned char*, int, __toku_dbt*, bool) (ha_tokudb.cc:5566)
| | | ->48.80% (62,075,722,880B) 0x164CF6C: ha_tokudb::rnd_next(unsigned char*) (ha_tokudb.cc:5809)
| | | ->48.80% (62,075,722,880B) 0x6CDF3F: handler::ha_rnd_next(unsigned char*) (handler.cc:2942)
| | | ->48.80% (62,075,722,880B) 0x193ECE5: ha_partition::rnd_next(unsigned char*) (ha_partition.cc:4805)
| | | ->48.80% (62,075,722,880B) 0x6CDF3F: handler::ha_rnd_next(unsigned char*) (handler.cc:2942)

->48.95% (62,265,682,762B) 0xF9109F: my_realloc (my_malloc.c:99)
| ->48.95% (62,264,615,082B) 0x1692F5E: tokudb::memory::realloc(void*, unsigned long, int) (tokudb_memory.h:55)
| | ->48.95% (62,264,615,082B) 0x164A096: ha_tokudb::fill_range_query_buf(bool, __toku_dbt const*, __toku_dbt const*, int, THD*, unsigned char*, __toku_dbt*) (ha_tokud
b.cc:5299)
| | ->48.95% (62,264,615,082B) 0x1649AA4: smart_dbt_bf_callback(__toku_dbt const*, __toku_dbt const*, void*) (ha_tokudb.cc:5185)
| | ->48.95% (62,264,615,082B) 0x17E5006: c_getf_first_callback(unsigned int, void const*, unsigned int, void const*, void*, bool) (ydb_cursor.cc:260)
| | ->48.95% (62,264,615,082B) 0x17BCA55: ft_search_basement_node(ftnode_leaf_basement_node*, ft_search*, int (*)(unsigned int, void const*, unsigned int, void const*, void*, bool), void*, bool*, ft_cursor*, bool) (ft-ops.cc:3393)
| | ->48.95% (62,264,615,082B) 0x17BE43A: ft_search_node(ft_handle*, ftnode*, ft_search*, int, int (*)(unsigned int, void const*, unsigned int, void const*, void*, bool), void*, bool*, ft_cursor*, unlockers*, ancestors*, pivot_bounds const&, bool) (ft-ops.cc:3751)
| | ->48.65% (61,887,126,678B) 0x17BE9D0: toku_ft_search(ft_handle*, ft_search*, int (*)(unsigned int, void const*, unsigned int, void const*, void*, bool), void*, ft_cursor*, bool) (ft-ops.cc:3872)

we can see that almost half memory(49.18% (62,561,194,013B) consumed by ha_tokudb::unpack_blobs, another half 48.95% (62,265,682,762B) caused by ha_tokudb::fill_range_query_buf.

To make things more clearly, I added some print info to ha_tokudb::close()

```
diff --git a/storage/tokudb/ha_tokudb.cc b/storage/tokudb/ha_tokudb.cc
index 9079d15..a1be44b 100644
--- a/storage/tokudb/ha_tokudb.cc +++ b/storage/tokudb/ha_tokudb.cc
@@ -2218,6 +2218,11 @@ int ha_tokudb::close() { int ha_tokudb::__close() {
     TOKUDB_HANDLER_DBUG_ENTER(""); TOKUDB_HANDLER_TRACE_FOR_FLAGS(TOKUDB_DEBUG_OPEN, "close:%p", this);
+ sql_print_error("TokuDB: Table %s blob_buff size is %d, " + "range_query_buff size is %d\n",
+ share->full_table_name(), + num_blob_bytes,
+ size_range_query_buff); tokudb::memory::free(rec_buff);
     tokudb::memory::free(rec_update_buff); tokudb::memory::free(blob_buff);

```

We'll see many error message in error log like these:
2017-07-23 20:35:41 15748 [ERROR] TokuDB: Table ./test/t1#P#p40 blob_buff size is 10485764, range_query_buff size is 10485789
2017-07-23 20:35:41 15748 [ERROR] TokuDB: Table ./test/t1#P#p41 blob_buff size is 10485764, range_query_buff size is 10485789

Now the problem is very clear, the memory is mainly consumed by `ha_tokudb::range_query_buff` and `ha_tokudb::blob_buff`, and each tokudb handler
has two such buffer instances, for a BAD query that need search every partition, 100 tokudb handler is needed.
So the total memory needed for one thread is 100*2*10M=2G, and total memory for 64 thread is 64*2G = 128G!!!

Maybe we should release `ha_tokudb::range_query_buff` and `ha_tokudb::blob_buff` as soon as possible, not put off to `ha_tokudb::__close()`;
Or some more elagant way should be conducted.

Tags: tokudb
Changed in percona-server:
assignee: nobody → Sveta Smirnova (svetasmirnova)
Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

Thank you for the report.

Verified as described. Maybe related to https://bugs.mysql.com/bug.php?id=38002 and https://bugs.mysql.com/bug.php?id=15089, thought not repeatable with InnoDB

Changed in percona-server:
assignee: Sveta Smirnova (svetasmirnova) → nobody
status: New → Confirmed
Revision history for this message
Fungo Wang (fungo) wrote :

As the stack trace showed, this is a TokuDB engine problem, temporary buffer in ha_tokudb handler layer is not handled properly.

And this problem just gets amplified by table_cache and partition table.

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

Sveta, did you confirm specifically 5.6 w/ InnoDB? If this is at all related to partitioning, 5.7 InnoDB has native partitioning implemented which is specifically done for memory utilization issues with large numbers of partitions where 5.6 InnoDB uses the same legacy partition handler that TokuDB uses in both 5.6 and 5.7.

Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

Just checked with 5.6.36 and InnoDB: issue does not exist

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-3725

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.