Partition pruning doesn't seem to work with TokuDB engine

Bug #1656022 reported by Rick Pizzi
Affects: Percona Server (moved to https://jira.percona.com/projects/PS) - Status tracked in 5.7
  5.6: Fix Released / Importance: High / Assigned to: George Ormond Lorch III
  5.7: Fix Released / Importance: High / Assigned to: George Ormond Lorch III

Bug Description

We noticed that partition pruning, although correctly shown by the EXPLAIN PARTITIONS command, is not honoured for the TokuDB engine.

Please consider the following table:

show create table mytable\G
*************************** 1. row ***************************
       Table: mytable
Create Table: CREATE TABLE `mytable` (
  `ID` char(64) NOT NULL,
  `DATA_BLOB` mediumblob NOT NULL,
  `EXPIRE_DATE` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`ID`,`EXPIRE_DATE`),
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1 ROW_FORMAT=TOKUDB_UNCOMPRESSED
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(EXPIRE_DATE))
(PARTITION p1003 VALUES LESS THAN (1484240400) ENGINE = TokuDB,
 PARTITION p1004 VALUES LESS THAN (1484244000) ENGINE = TokuDB,
 PARTITION p1005 VALUES LESS THAN (1484247600) ENGINE = TokuDB,
 PARTITION p1006 VALUES LESS THAN (1484251200) ENGINE = TokuDB,
 PARTITION p1007 VALUES LESS THAN (1484254800) ENGINE = TokuDB,
 PARTITION p1008 VALUES LESS THAN (1484258400) ENGINE = TokuDB,
 PARTITION p1009 VALUES LESS THAN (1484262000) ENGINE = TokuDB,
 PARTITION p1010 VALUES LESS THAN (1484265600) ENGINE = TokuDB,
 PARTITION p1011 VALUES LESS THAN (1484269200) ENGINE = TokuDB,
 PARTITION p1012 VALUES LESS THAN (1484272800) ENGINE = TokuDB,
 PARTITION p1013 VALUES LESS THAN (1484276400) ENGINE = TokuDB,
 PARTITION p1014 VALUES LESS THAN (1484280000) ENGINE = TokuDB,
 PARTITION p1015 VALUES LESS THAN (1484283600) ENGINE = TokuDB,
 PARTITION p1016 VALUES LESS THAN (1484287200) ENGINE = TokuDB,
 PARTITION p1017 VALUES LESS THAN (1484290800) ENGINE = TokuDB,
 PARTITION p1018 VALUES LESS THAN (1484294400) ENGINE = TokuDB,
 PARTITION p1019 VALUES LESS THAN (1484298000) ENGINE = TokuDB,
 PARTITION p1020 VALUES LESS THAN (1484301600) ENGINE = TokuDB,
 PARTITION p1021 VALUES LESS THAN (1484305200) ENGINE = TokuDB,
 PARTITION p1022 VALUES LESS THAN (1484308800) ENGINE = TokuDB,
 PARTITION p1023 VALUES LESS THAN (1484312400) ENGINE = TokuDB,
 PARTITION p1024 VALUES LESS THAN (1484316000) ENGINE = TokuDB,
 PARTITION p1025 VALUES LESS THAN (1484319600) ENGINE = TokuDB,
 PARTITION p1026 VALUES LESS THAN (1484323200) ENGINE = TokuDB,
 PARTITION p1027 VALUES LESS THAN (1484326800) ENGINE = TokuDB,
 PARTITION p1028 VALUES LESS THAN (1484330400) ENGINE = TokuDB,
 PARTITION p1029 VALUES LESS THAN (1484334000) ENGINE = TokuDB,
 PARTITION p1030 VALUES LESS THAN (1484337600) ENGINE = TokuDB,
 PARTITION p1031 VALUES LESS THAN (1484341200) ENGINE = TokuDB,
 PARTITION p1032 VALUES LESS THAN (1484344800) ENGINE = TokuDB,
 PARTITION p1033 VALUES LESS THAN (1484348400) ENGINE = TokuDB,
 PARTITION p1034 VALUES LESS THAN (1484352000) ENGINE = TokuDB,
 PARTITION p1035 VALUES LESS THAN (1484355600) ENGINE = TokuDB,
 PARTITION p1036 VALUES LESS THAN (1484359200) ENGINE = TokuDB,
 PARTITION p1037 VALUES LESS THAN (1484362800) ENGINE = TokuDB,
 PARTITION p1038 VALUES LESS THAN (1484366400) ENGINE = TokuDB,
 PARTITION p1039 VALUES LESS THAN (1484370000) ENGINE = TokuDB,
 PARTITION p1040 VALUES LESS THAN (1484373600) ENGINE = TokuDB,
 PARTITION p1041 VALUES LESS THAN (1484377200) ENGINE = TokuDB,
 PARTITION p1042 VALUES LESS THAN (1484380800) ENGINE = TokuDB,
 PARTITION p1043 VALUES LESS THAN (1484384400) ENGINE = TokuDB,
 PARTITION p1044 VALUES LESS THAN (1484388000) ENGINE = TokuDB,
 PARTITION p1045 VALUES LESS THAN (1484391600) ENGINE = TokuDB,
 PARTITION p1046 VALUES LESS THAN (1484395200) ENGINE = TokuDB,
 PARTITION p1047 VALUES LESS THAN (1484398800) ENGINE = TokuDB,
 PARTITION p1048 VALUES LESS THAN (1484402400) ENGINE = TokuDB,
 PARTITION p1049 VALUES LESS THAN (1484406000) ENGINE = TokuDB,
 PARTITION p1050 VALUES LESS THAN (1484409600) ENGINE = TokuDB,
 PARTITION p1051 VALUES LESS THAN (1484413200) ENGINE = TokuDB,
 PARTITION p1052 VALUES LESS THAN (1484416800) ENGINE = TokuDB,
 PARTITION p1053 VALUES LESS THAN (1484420400) ENGINE = TokuDB,
 PARTITION p1054 VALUES LESS THAN (1484424000) ENGINE = TokuDB,
 PARTITION p1055 VALUES LESS THAN (1484427600) ENGINE = TokuDB,
 PARTITION p1056 VALUES LESS THAN (1484431200) ENGINE = TokuDB,
 PARTITION p1057 VALUES LESS THAN (1484434800) ENGINE = TokuDB,
 PARTITION p1058 VALUES LESS THAN (1484438400) ENGINE = TokuDB,
 PARTITION p1059 VALUES LESS THAN (1484442000) ENGINE = TokuDB,
 PARTITION p1060 VALUES LESS THAN (1484445600) ENGINE = TokuDB,
 PARTITION p1061 VALUES LESS THAN (1484449200) ENGINE = TokuDB,
 PARTITION p1062 VALUES LESS THAN (1484452800) ENGINE = TokuDB,
 PARTITION p1063 VALUES LESS THAN (1484456400) ENGINE = TokuDB,
 PARTITION p1064 VALUES LESS THAN (1484460000) ENGINE = TokuDB) */

If we run the following explain, all seems okay:

mysql>explain partitions DELETE FROM mytable WHERE ID = 'aa60185e07d29c8866f00a5fa9e951aca07f3e3ae42b7ebb57474bf1a26721b9' AND EXPIRE_DATE > date_add(NOW(), interval 1 hour)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: mytable
   partitions: p1004,p1005,p1006,p1007,p1008,p1009,p1010,p1011,p1012,p1013,p1014,p1015,p1016,p1017,p1018,p1019,p1020,p1021,p1022,p1023,p1024,p1025,p1026,p1027,p1028,p1029,p1030,p1031,p1032,p1033,p1034,p1035,p1036,p1037,p1038,p1039,p1040,p1041,p1042,p1043,p1044,p1045,p1046,p1047,p1048,p1049,p1050,p1051,p1052,p1053,p1054,p1055,p1056,p1057,p1058,p1059,p1060,p1061,p1062,p1063,p1064
         type: range
possible_keys: PRIMARY,EXPIRE_DATE_IX
          key: PRIMARY
      key_len: 68
          ref: const,const
         rows: 1
        Extra: Using where
1 row in set (0.00 sec)

We can see that partition p1003 is pruned because it contains only stale data (rows whose EXPIRE_DATE is less than one hour away), which the WHERE clause excludes.
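For context, the partition ranges above are exactly one hour wide, which can be verified from the first two boundary values in the table definition:

SELECT 1484244000 - 1484240400 AS seconds_per_partition,  -- 3600 seconds, i.e. hourly partitions
       FROM_UNIXTIME(1484240400) AS p1003_upper_bound;     -- exclusive upper bound of p1003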

However, when running DROP PARTITION p1003, we see stalls in SELECT and DELETE statements that should NOT hit that partition because of pruning:

# Time: 170112 16:02:15
# Schema: myschema Last_errno: 0 Killed: 0
# Query_time: 12.219223 Lock_time: 9.126238 Rows_sent: 0 Rows_examined: 0 Rows_affected: 0
# Bytes_sent: 11 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0
# QC_Hit: No Full_scan: No Full_join: No Tmp_table: No Tmp_table_on_disk: No
# Filesort: No Filesort_on_disk: No Merge_passes: 0
SET timestamp=1484233335;
DELETE FROM mytable WHERE ID = '70f8536e1c07367e591be7d05092d2bf10fd3b2df463a7bdad412e6d705477f1' AND EXPIRE_DATE > DATE_ADD(NOW(), INTERVAL 1 HOUR);

You can see that this query took 12 seconds and spent 9 of them waiting for what seems to be a lock on the partition we expected to be pruned away.

We wouldn't expect the above DELETE to lock if the partition we are dropping is pruned and therefore out of scope for the DML.
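For reference, the retention operation that triggers the stalls is the plain partition drop of the oldest (already pruned) partition:

ALTER TABLE mytable DROP PARTITION p1003;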

Server version: 5.6.34-79.1-log Percona Server (GPL), Release 79.1, Revision 1c589f9

thanks
Rick

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

All fractal trees share the same space within the cachetable and other common resources down in PerconaFT. It is a known issue that a drop/delete of tables/indices can take a long time and possibly hold some common locks that might cause other accessors to slow or stall; see https://tokutek.atlassian.net/browse/FT-715

I can't say for certain if that is what you are seeing but it sounds like a possibility. It would be helpful if we could get a PMP or similar trace when you experience these stalls. It might help point us in some direction.

Revision history for this message
Rick Pizzi (pizzi) wrote :

I will try and get a PMP stack for you. Please note, however, that dropping a partition shouldn't cause stalls as it is the normal way for people to apply retention to data.

Revision history for this message
Rick Pizzi (pizzi) wrote :

Here you go.

Thu Jan 12 18:04:11 CET 2017
    871 pthread_cond_wait,ha_tokudb::??(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
    188 pthread_cond_timedwait,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
    162 __lll_lock_wait(libpthread.so.0),_L_lock_854(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),block_table::_mutex_lock(ha_tokudb.so),toku_ft_get_fragmentation(ha_tokudb.so),ha_tokudb::info(ha_tokudb.so),ha_partition::info,::??,handler::ha_open,open_table_from_share,open_table,open_tables,open_normal_and_derived_tables,mysql_delete,mysql_execute_command,mysql_parse,dispatch_command,threadpool_process_request,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
    106 epoll_wait(libc.so.6),::??,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     27 __lll_lock_wait(libpthread.so.0),_L_lock_854(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),block_table::_mutex_lock(ha_tokudb.so),block_table::translate_blocknum_to_offset_size(ha_tokudb.so),toku_deserialize_bp_from_disk(ha_tokudb.so),void*,(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),toku_cachetable_get_and_pin_nonblocking(ha_tokudb.so),toku_pin_ftnode_for_query(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),toku_ft_keysrange(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::records_in_range(ha_tokudb.so),ha_partition::records_in_range,int,,::??,SQL_SELECT::test_quick_select,mysql_delete,mysql_execute_command,mysql_parse,dispatch_command,threadpool_process_request,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     18 libaio::??(libaio.so.1),::??,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      4 pthread_cond_timedwait,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      3 select(libc.so.6),::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      2 pthread_cond_timedwait,ha_tokudb::??(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwaitinfo(libc.so.6),::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwait(libpthread.so.0),signal_hand,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 recv(libpthread.so.0),vio_read,::??,::??,my_net_read,do_command,threadpool_process_request,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_mutex_lock(libpthread.so.0),checkpointer::checkpoint_pending_pairs(ha_tokudb.so),checkpointer::end_checkpoint(ha_tokudb.so),toku_checkpoint(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,tokudb::background::job_manager_t::real_thread_func(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,::??,??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,evictor::run_eviction_thread(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,::??,ib...


Revision history for this message
George Ormond Lorch III (gl-az) wrote :

This is ... different ...

You have 162 + 27 instances awaiting the blocktable mutex(es), 1 instance with a blocktable lock doing some calculations, and the checkpointer is running (which also darts in and out of blocktable locks).

A lot of this seems to be stacking up on the blocktable fragmentation statistics. That makes sense: with tiny block sizes you end up with many, many blocks that this statistics gathering must iterate through to do its calculation, so it is quite possible that it takes some time.

This is all within the context of an 'open' dictionary call, which also means that there is a higher level global lock in TokuDB that protects the master list of open dictionaries, thus it pretty much blocks all new dictionary openings (cached open or not).

And all of this is just so the handler can return the amount of empty space within an index file :O

So there are two issues here:

One we already know of: a user/thread opening or closing a dictionary should not block all other users/threads that are opening other dictionaries (it will, of course, block others trying to open the same dictionary). This is already recorded in https://tokutek.atlassian.net/browse/DB-1031

Second, PerconaFT was never intended or optimized for huge data sets with tiny block sizes. This is evident in numerous places: the default values for block sizes and tree shapes, the use of suboptimal calculations and data structures (the map block file allocator that recently bit you, for example), and this particular 'free space' calculation. I believe you can reduce the effect by increasing your block size, which should also reduce the message-flusher load you mentioned in the Google Groups discussion. Increasing your open table cache size might also mitigate it somewhat, but because of your time-based access pattern it may not help much.
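To make the suggestion concrete, the two knobs referred to above are presumably tokudb_block_size and table_open_cache; the values below are illustrative only, and a block-size change only takes effect for dictionaries created or rebuilt afterwards:

SET GLOBAL tokudb_block_size = 64 * 1024;  -- default is 4M; applies to newly created TokuDB dictionaries only
SET GLOBAL table_open_cache = 4000;        -- larger server-wide table cache (illustrative value)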

I acknowledge that these are real issues, but they are unlikely to be addressed in the very immediate future, so you should weigh the trade-offs of tiny block sizes against these kinds of stalls and high flusher loads. I will see if I can find time to address the first issue, which might be enough to reduce the apparent total stall, but it will not change the single-thread opening stall that comes from a long-running iteration over all blocks in a dictionary file's map header.

Revision history for this message
Rick Pizzi (pizzi) wrote :

George, from your answer it is not clear to me why this stall only happens when I drop a partition.
Is that due to the "open" dictionary call? I understand your point that the small block size makes the situation worse, but again, this only happens on the hour, when we drop an unused partition that no query is looking at.

Increasing the block size will increase I/O and wear our SSDs more. We ran several tests with different block sizes, and the 16k setting we use is optimal for our workload. With the default block size of 4M we were saturating the PCIe SSD cards (which max out at 2GB/sec sequential writes); now we are well below 1/4 of I/O capacity with the same workload.

Checkpointer: I can disable it while the DROP PARTITION runs if that would mitigate the issue; will it? From the stack trace it looks like it is holding a mutex.

You should also define "huge" data set. This dataset is not huge in my view, it is about 25G per table and 8 tables total ... 100G total.

Thanks,
Rick

Revision history for this message
Rick Pizzi (pizzi) wrote :

Another question on the subject: would TRUNCATE be a better option than DROP PARTITION, locking-wise? I will try that out and see what happens. I could defer the drop operation.
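For the record, the two variants being compared are, in SQL terms, roughly:

ALTER TABLE mytable DROP PARTITION p1003;      -- removes the partition definition and its data
ALTER TABLE mytable TRUNCATE PARTITION p1003;  -- empties the partition but keeps it defined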

Revision history for this message
Rick Pizzi (pizzi) wrote :

I tried a truncate operation; the stalls are still there, with more or less the same impact.

gio 12 gen 2017, 20.04.48, CET
    871 pthread_cond_wait,ha_tokudb::??(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
    307 pthread_cond_timedwait,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
    118 epoll_wait(libc.so.6),::??,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     34 __lll_lock_wait(libpthread.so.0),_L_lock_854(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),block_table::_mutex_lock(ha_tokudb.so),toku_ft_get_fragmentation(ha_tokudb.so),ha_tokudb::info(ha_tokudb.so),ha_partition::info,::??,handler::ha_open,open_table_from_share,open_table,open_tables,open_normal_and_derived_tables,mysql_delete,mysql_execute_command,mysql_parse,dispatch_command,threadpool_process_request,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     18 libaio::??(libaio.so.1),::??,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
     16 __lll_lock_wait(libpthread.so.0),_L_lock_854(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),block_table::_mutex_lock(ha_tokudb.so),block_table::translate_blocknum_to_offset_size(ha_tokudb.so),toku_deserialize_bp_from_disk(ha_tokudb.so),void*,(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),toku_cachetable_get_and_pin_nonblocking(ha_tokudb.so),toku_pin_ftnode_for_query(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),toku_ft_keysrange(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::records_in_range(ha_tokudb.so),ha_partition::records_in_range,int,,::??,SQL_SELECT::test_quick_select,mysql_delete,mysql_execute_command,mysql_parse,dispatch_command,threadpool_process_request,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      4 pthread_cond_timedwait,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      3 select(libc.so.6),::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      2 pthread_cond_timedwait,ha_tokudb::??(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwait(libpthread.so.0),signal_hand,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwaitinfo(libc.so.6),::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 recv(libpthread.so.0),vio_read,::??,::??,my_net_read,do_command,threadpool_process_request,::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,toku::frwlock::write_lock(ha_tokudb.so),checkpointer::checkpoint_pending_pairs(ha_tokudb.so),checkpointer::end_checkpoint(ha_tokudb.so),toku_checkpoint(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),ha_tokudb::??(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,tokudb::background::job_manager_t::real_thread_func(ha_tokudb.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,::??,??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,::??,ib_wqueue_timedwait,::??,start_thread(libpthread.so.0),clone(libc....


Revision history for this message
Rick Pizzi (pizzi) wrote :

Tried the operation with flusher and cleaner disabled (set to 0 globally then waited for I/O to quiesce), no change.

Revision history for this message
Rick Pizzi (pizzi) wrote :

Sorry I meant "checkpointer" above not flusher.

Revision history for this message
Rick Pizzi (pizzi) wrote :

George, I wonder whether a smaller cachetable would make this better or worse. I have quite a large cache_size (200G); maybe making this smaller would reduce the time spent hunting around in the block table?

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

Well, it is almost impossible to tell exactly what is happening from a single snapshot, but I think we have a pretty good guess so far.

The DROP must re-open the dictionaries (table + indices) being dropped. I don't know all of the reasons why, but MySQL does this to us. In your workload, with the table cache in operation, this may be the only time that this actually happens (cold opening of a populated table). During this open, the lock that protects the global TokuDB cache of opened tables is held. Any new SQL query that references any TokuDB table must go through this lock when trying to get handles to the tables and will wait until this one open is complete. This is a problem that needs to be addressed, as I mentioned above, and we know about it (it seems more a problem of closing than opening, but you might have found an issue with opening). It may or may not be a significant contributor to the issue, as I don't see that specific pathway in the call stacks above.

The small block size means that there are many more entries in the block file map for the same amount of data, so there are now many more things that need to be iterated over, for whatever reasons they are needed. In the stack traces above we can see the partition handler calling ha_tokudb::info, which likely has some locks held up in the server as well as causing internal locking in TokuDB and PerconaFT. This ::info call is trying to get some size info from TokuDB. TokuDB is in turn calling down into PerconaFT for a size report. That size report iterates all of the block table map entries and calculates how much free space exists within the file. This means it walks the records of all of the blocks (nodes) in the file and adds up the size taken by each block and the size taken by each of the holes between blocks. This can take some time with huge numbers of blocks (nodes) and might even be getting called rather frequently from the server code. This is a problem: this operation should be fast and use as much pre-calculated info as possible. It is possible that the partition handler has the entire table locked at this point, thinking that ::info is a quick, cheap call when, in this instance, we know it is not.
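As an aside, ha_tokudb::info is presumably also what backs ordinary table status reporting, so the same expensive toku_ft_get_fragmentation scan can likely be observed outside of a DROP as well, for example with:

SHOW TABLE STATUS LIKE 'mytable';   -- triggers handler::info for every partition of the table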

I understand that you like the smaller block size for SSD endurance and performance, but that doesn't change the fundamental fact that TokuDB and PerconaFT were designed, optimized, and tested for huge data sets on slow spinning disk. You are trying to use it (with some success as well) for exactly the opposite, small data on super fast SSD. You are able to tune it to a point by manipulating block sizes, thread counts, etc..., but you can't change the internal assumptions, algorithms, optimizations, and data structures that were built around the premise of 'big, slow storage' and huge nodes. We're trying to make that better as that is the future of the world, but it doesn't happen overnight.

Let's look at some hard numbers to get a sense of scope and of what small node sizes do to us:

1 node = 1 block in the block file

With your 16KB page size and 25GB per table (let's assume it is a single index for now, even though we know it is not), your data files would have a _minimum_ 1.6 milli...
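Back-of-the-envelope, the node counts implied by different block sizes for a 25 GB file can be computed directly (illustrative arithmetic only):

SELECT ROUND(25 * POW(1024, 3) / (16 * 1024))        AS nodes_at_16k,  -- 1,638,400 blocks
       ROUND(25 * POW(1024, 3) / (64 * 1024))        AS nodes_at_64k,  --   409,600 blocks
       ROUND(25 * POW(1024, 3) / (4 * POW(1024, 2))) AS nodes_at_4m;   --     6,400 blocks (default block size)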


Revision history for this message
Rick Pizzi (pizzi) wrote :

George, if I'm not mistaken, changing the block size from the current 16k to something larger will also make the dataset footprint larger. I can certainly try increasing the block size again to see how it goes, but then the footprint will not be 25G per table anymore, so your computation above will not hold true... Still, suppose I move back to 64k and assume the dataset size does not change (which is not true in my opinion): I would have 400,000 nodes instead of 1,600,000, so I should see approximately 1/4 of the stalls (or 1/4 of the stall time). I should see improvements, generally speaking, right?
I can certainly give it a try, but before that I will try running with a small cache size and see what happens. I remember that InnoDB has the same problem with drop and truncate, where a lock is held during the operation while MySQL iterates through the buffer pool to evict all the pages belonging to what is being dropped. I think it still has this problem, at least in 5.6... and in that situation a smaller buffer pool means a shorter wait. If the same is true for the TokuDB cache, maybe I would see some improvement? Again, as I said in the Percona forum, my workload is so random that the cache is pretty much useless to me. I'll see what happens and report back.

Rick

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

There are two reasons why a larger block size might create a file (slightly) larger than the actual data size. One is fragmentation within the file: nodes grow and shrink in size and get moved around the file, leaving small holes of unusable space. The other is messages in flight. This is why I was asking about flushing and row/data size in the groups thread. If your row size is large compared to your node size, you are gorging internal nodes with messages very quickly and flushing (applying) them downwards, so there is little duplication of row data within the tree (fewer versions). In your case, where you are purely insert (no delete or update), there should be no size amplification due to multiple messages and versions of the same row within the index. My math is not (nor will it ever be) exact because FT node sizes vary from just a few tens of bytes for an empty internal node up to the specified block size plus some allowed overage. They are not fixed-size nodes like InnoDB's, which is why they require the block file, and tree shapes may not evolve perfectly linearly. That does not mean we can dismiss the difference in node counts, which is the crux of the issue here.

From the PMP output, I think the time/stall is probably linear so yes, reducing your node count by half should in turn cut the time of that linear scan by half.

As for slow drops: yes, PerconaFT does suffer from slow drops, as already discussed and reported in https://tokutek.atlassian.net/browse/FT-715, and you might be seeing some of that, but it is not what shows in the PMP output, not even a hint of it. The trace shows the bulk of the time being spent with threads counting nodes in the FT block file map for some index.

That is what this thread is doing:
BlockAllocator::UnusedStatistics(ha_tokudb.so),toku_ft_get_fragmentation(ha_tokudb.so)

And these threads are waiting on:
block_table::_mutex_lock(ha_tokudb.so),block_table::translate_blocknum_to_offset_size(ha_tokudb.so)
block_table::_mutex_lock(ha_tokudb.so),toku_ft_get_fragmentation(ha_tokudb.so),ha_tokudb::info(ha_tokudb.so),ha_partition::info

Taking a closer look at this, I see that this entire structure (block_table) is guarded by a single mutex per file. I bet this could benefit from a read-write lock instead; in fact, I remember we discussed this a while back when we were working on the block allocator linear-scan-on-allocation issue that you just got past.

I will try to make some time to investigate this as it might be a fairly easy change to make.

Revision history for this message
Rick Pizzi (pizzi) wrote :

I have made some more tests.

I tried with reduced tokudb_cache_memory: 8G, 1G, and 64MB. No change in the first two cases, and the stalls worsened with 64MB. But, as expected, leaving more RAM to the OS cache brought a benefit: the read I/O used for key lookups (we _do_ deletes) has almost disappeared, as all the TokuDB files fit in memory. So we kept it at 8G.

I tried truncate instead of drop: same effect, I guess the high level mutex taken is the same.

I tried deleting all rows from expired partitions (with DELETE ... WHERE EXPIRE_DATE < NOW()). This works well (partitions are pruned and it does not impact the workload at all), but unfortunately it does not reclaim space from the partition. Even after deleting all rows in the partition, dropping it produces big stalls.

Something funny happened when I reduced the tokudb_cache_memory to 64M: the value of tokudb_leaf_nodes_flushed_checkpoint_bytes dropped close to zero, as if the checkpointing activity were nonexistent. For reference, with 8G of TokuDB cache I average 150 MB/s of such activity as graphed by PMM, but when I set the cache to 64M it drops to 4MB/s!!!
Can you please elaborate on why a smaller cache influences checkpointing?

Next steps: will move block size from current 16k to 64k (4x) and post the results as far as stalls go.
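For reference, the counter graphed by PMM can also be read directly on the server, assuming the status variable carries the same name:

SHOW GLOBAL STATUS LIKE 'Tokudb_leaf_nodes_flushed_checkpoint_bytes';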

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

Hello Rick,
Correct, deletions will mostly just sit as messages up in the tree until they are pushed downwards, and it takes further activity and checkpointing to eventually shrink the file. It's pretty non-deterministic in nature and relies entirely on the subsequent workload applied to the dictionary.

Reducing cache memory will have a direct impact on checkpointing. In order to maintain the memory limit, PerconaFT has this thing called The Evictor. Its job is to remove the least recently used nodes from memory in order to maintain the cache limit. If those nodes happen to be dirty, they will be dispatched to the cachetable thread pool for compression and writing out to disk.

Now, along comes the Checkpointer. Part of the role of the Checkpointer is to ensure that all nodes that were dirty at the beginning of the checkpoint are cleaned and written to disk by the time the Checkpointer completes. So, reducing the number of dirty nodes in memory via a smaller cache size and Evictor activity will reduce the amount of real work that the Checkpointer needs to do when it wakes up.

The flip side of this is that over time you will likely be writing (and reading) more (of the same) data if a dirty node is evicted due to cache pressure, then re-read and dirtied again in the period between checkpoints.
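The knobs involved in this trade-off can be inspected as shown below; tokudb_checkpointing_period is assumed here to be the variable that controls how often the Checkpointer wakes up:

SHOW GLOBAL VARIABLES
 WHERE Variable_name IN ('tokudb_cache_size', 'tokudb_checkpointing_period');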

Last evening I worked on changing the block_table from using a mutex to using a reader-writer lock. So far it passes basic tests, but it needs much more thorough vetting before it can be released, to ensure it does not introduce a performance regression for other workloads. If you have some sort of non-production environment in which you can try it out to see if there is any impact on your stalls, please contact me privately and we can work on getting you some experimental binaries to try out.

Revision history for this message
Rick Pizzi (pizzi) wrote :

I am now running with a 64k block size; so far (about 2 hours) the results are encouraging, but let's see how it goes through peak hours. If the stalls are still around I will be glad to test your binary (this is a prod env, but the application is resilient to cache failures, so we can just test it there).
I'll be back tomorrow with some results.

Thanks for continued support!

Rick

Revision history for this message
Rick Pizzi (pizzi) wrote :

The situation with the 64k block size is a little better, but the stalls are still there.
I will contact you privately for that patched binary.

Revision history for this message
Rick Pizzi (pizzi) wrote :

Addendum: we had a new traffic peak, and the situation with the 8 GB cache and 64k blocks got MUCH worse, with stalls of 40 seconds and more...

By looking at the PMM graphs I was able to spot a bottleneck in tokudb_cachetable_pool_threads: the graph shows that pool saturated for several seconds (at 2 x NCPU, since I did not change that one). There is surely a relationship. I am going to increase that pool from the default to something higher to see if that solves the problem.

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

Hi Rick,
I've been looking around this one call and decided that a slightly different approach should be taken to correct it, assuming it is what I think it is, as evidenced by the PMP output.

Here is the issue tracker for TokuDB https://tokutek.atlassian.net/browse/DB-1033

Once I get this fixed and tested I will try to get you an experimental binary to test out.

Revision history for this message
Rick Pizzi (pizzi) wrote : Re: [Bug 1656022] Re: Partition pruning doesn't seem to work with TokuDB engine

Saw the issue, thanks!
Looking forward to trying the fix.

I have noticed that once the dataset grows beyond what the OS cache can handle, we are back to square one as far as stalls go.
Tomorrow I will go back to 16k (sorry!) and to 50% tokudb_cache_mem, which has offered the best results in our situation so far.

Mind you, with the 8 GB toku cache and 64k blocks, once the dataset hit the disk I was able to saturate the tokudb_cachetable_pool_threads default value of 80 (the only one out of the 3 that I left at its default).
Not sure why this happened or what it actually means, but having the pool saturated for one minute when I had the stall just seemed related.

Thanks!

Rick


Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-1048
