> Now that the internals lesson is over, there are a couple of things here that I think can be done.
>
> - First, I think that since the FT logic has narrowed us down to a basement node where we know that there is some cross section of the start key, it might make sense, when the basement is not in memory, to compare the end key to the start key for equality, and if so, just return 1 match, else go ahead and return the phony estimate. This will not fix the same issue for narrow range scans, but it will fix point operations such as the point deletion in this example.
> - Second, it can be argued that this optimization is incorrect and that PerconaFT should bring at least one of, if not all of, the needed basements into memory in order to obtain a more accurate answer. Then this is no longer an estimate, it is a real count. If the optimizer then chooses not to use this index, we just did possibly a whole lot of read I/O for nothing. This will add to the TokuDB over-read woes. Think of the case of a table with many indices where someone does a "SELECT * FROM table WHERE id BETWEEN reallysmall AND reallylarge". It is possible that the optimizer would call ::records_in_range for all matching indices with this huge range, scanning several indices, then just resorting to a table scan anyway. So you now have an index scan for each matching index, just to test the index, then whatever scan the optimizer chooses, and the final fetch. So a lot of potential for over-read.
>
> I think implementing the first idea would be nearly a no-op in terms of chances of breaking something. Going into the second is a possible rat hole of breaking established performance characteristics.

Thank you, the "internals lesson" was a very interesting read. I believe fixing the point deletions is a good idea and much needed; it will solve the problem in the current test case. I am not sure what would happen when we add a condition on EXPIRE_DATE (which we actually do, for partition pruning purposes), but considering that MySQL will use at most one index, perhaps it will just work fine, as HASH_ID_IX will be selected by the optimizer and the other condition used only for pruning. I see you have found more issues with the row estimates too. I am eager to test whatever fix you submit.

Thanks Rick

— Riccardo Pizzi
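
A minimal sketch of the first idea above, in C++, under stated assumptions: all names here (`Key`, `key_compare`, `estimate_rows_in_unloaded_basement`, `phony_rows_estimate`) are hypothetical placeholders, not PerconaFT's actual types or functions. It only illustrates where a start-key/end-key equality check would short-circuit the phony estimate once the search has narrowed to a basement node that is not resident in memory.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical key representation; PerconaFT's real key type differs.
struct Key {
    const void *data;
    uint32_t    size;
};

// Hypothetical stand-in for the FT key comparator: memcmp on the common
// prefix, shorter key sorts first on a tie.
static int key_compare(const Key &a, const Key &b) {
    uint32_t common = a.size < b.size ? a.size : b.size;
    int c = std::memcmp(a.data, b.data, common);
    if (c != 0) return c;
    return static_cast<int>(a.size) - static_cast<int>(b.size);
}

// Hypothetical estimator called on the path where the range estimate has
// narrowed to a single basement node that is NOT in memory.
static uint64_t estimate_rows_in_unloaded_basement(const Key &start_key,
                                                   const Key &end_key,
                                                   uint64_t phony_rows_estimate) {
    // Idea #1: a point range (start == end) can report a single match
    // without reading the basement from disk.
    if (key_compare(start_key, end_key) == 0) {
        return 1;
    }
    // Narrow-but-not-point ranges still fall back to the phony estimate;
    // fixing those would require actually reading the basement (idea #2).
    return phony_rows_estimate;
}
```

The check sits only on the not-in-memory path, so in-memory basements would keep returning their exact counts and nothing else about the estimator's behavior would change, which matches the claim that the first idea is nearly a no-op in terms of risk.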