OK, some data. This is kind of tough to explain without getting into detailed FT internals, so bear with me:

1) The partition handler walks the 'active' partitions (partitions noted to be involved in the query), of which there are three in this query (don't ask me why or where three comes from; it is the same for InnoDB in the matching case). It walks them in descending order from 'largest' to 'smallest' based on the reported number of rows in each partition.

2) It calls ::records_in_range and is asked how many records match between two keys, but the keys are identical. This is a particularly important component of this specific issue (it is a point select, not a range).

3) We work our way down into PerconaFT, all the way down to the leaf nodes, to try and figure out how many rows exist between the two keys.
- PerconaFT nodes are 'partitioned'.
- A node has some metadata, a set of pivot keys, and a set of partitions.
- Only the partitions of a node that are required for the immediate operation are loaded into memory.
- Any write to a node requires that the node be fully in memory, as the node must be written as a contiguous data blob; reads only read through the partitions needed to satisfy the query.
- Partitions may be individually evicted without evicting the entire node. This is known as partial eviction and is controlled via tokudb_enable_partial_evictions (default=true).
- Internal/non-leaf nodes:
  - The tokudb_fanout parameter (default 16) defines the maximum number of child nodes that an internal node can have. An internal node can, and often does, have fewer children, but never more.
  - Each internal node has a set of 'pivot keys' that define the key boundaries between the edges of the internal node and each individual child node.
  - There is one message buffer for each child node; this is a partition.
- External/leaf nodes:
  - Are partitioned based on tokudb_block_size/tokudb_read_block_size, which defines the maximum number of partitions in a leaf node. A leaf node can have fewer partitions than that maximum, but never more.
  - Leaf node partitions are called basement nodes.
  - A basement node contains a series of 'leaf entries'. A leaf entry represents a 'row', with a key and the MVCC stack for that key.
- PerconaFT contains an optimization when estimating the number of keys within a range inside a leaf node. If the target basement node is not in memory, it 'guesses' the number of keys that might be within that range. If the basement node is in memory, an accurate count is fetched: messages are transiently pulled down from above and applied to the leaf entries.
- This 'guess' is a bit of a multi-layered calculation (see the sketch after this list):
  - It takes the total number of _physical_ rows in the entire leaf node and divides that by the number of basement nodes in the leaf node.
    ** Note: It uses the physical count of leaf entries, not the logical count tracked on the whole tree, so many rows may be deleted but still physically present; this is why an OPTIMIZED table yields better results.
  - This can result in a rather large, but totally false, number of matching keys, particularly on secondary indices where there is no data component to the record, only the key and a back reference to the PK.
  - Since these records tend to be very much smaller than their PK + data counterparts, many more of them fit into a node.

Now that the internals lesson is over, there are a couple of things here that I think can be done.
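To make the shape of that 'guess' concrete, here is a minimal sketch of how the estimate behaves when the target basement node is not in memory. This is not the actual PerconaFT code; the struct and names (leafnode, physical_leaf_entries, n_basement_nodes) are mine and only model the calculation described above.

```cpp
#include <cstdint>

// Hypothetical, simplified model of a leaf node, for illustration only;
// the real PerconaFT structures and function names differ.
struct leafnode {
    uint64_t physical_leaf_entries; // everything physically present, including
                                    // deleted-but-not-yet-garbage-collected rows
    uint32_t n_basement_nodes;      // partitions (basement nodes) in this leaf
};

// Estimate of "keys in range" for one basement node that is NOT in memory.
// The whole-leaf physical count is spread evenly across basement nodes,
// regardless of what the range actually is -- even start_key == end_key
// (a point lookup) gets this same estimate.
static uint64_t estimate_keys_in_unloaded_basement(const leafnode &ln) {
    if (ln.n_basement_nodes == 0) return 0;
    return ln.physical_leaf_entries / ln.n_basement_nodes;
}
```

On a secondary index whose entries are just (key, PK reference) and therefore pack densely, physical_leaf_entries can be large, so a single unloaded basement node can report thousands of 'matching' rows for what is really a single-key probe; that inflated number is what the optimizer then weighs against the other indices.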
- First, since the FT logic has already narrowed us down to a basement node that we know intersects the start key, it might make sense, when that basement is not in memory, to compare the end key to the start key for equality and, if they are equal, just return 1 match; otherwise go ahead and return the phony estimate (see the sketch below). This will not fix the same issue for narrow range scans, but it will fix point operations such as the point deletion in this example.
- Second, it can be argued that this optimization is incorrect and that PerconaFT should bring at least one of, if not all of, the needed basements into memory in order to obtain a more accurate count. Then it is no longer an estimate, it is a real count. But if the optimizer then chooses not to use this index, we just did possibly a whole lot of read I/O for nothing, which will add to the TokuDB over-read woes. Think of a table with many indices where someone does a "SELECT * from table WHERE id between reallysmall AND reallylarge". It is possible that the optimizer would call ::records_in_range for all matching indices with this huge range, scanning several indices, then just resort to a table scan anyway. So you now have an index scan for each matching index just to test the index, then whatever scan the optimizer chooses, and then the final fetch. A lot of potential for over-read.

I think implementing the first idea would be nearly a no-op in terms of the chances of breaking something. Going into the second is a possible rat hole of breaking established performance characteristics.
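A rough sketch of what the first idea could look like at the point where the estimate is produced. Again, these are hypothetical names and a simplified key type; the real records-in-range path in PerconaFT uses DBTs and a pluggable comparison function, and is structured differently.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical key type and comparison, for illustration only.
struct range_key {
    const void *data;
    uint32_t    size;
};

static bool keys_equal(const range_key &a, const range_key &b) {
    return a.size == b.size && std::memcmp(a.data, b.data, a.size) == 0;
}

// Sketch of the proposed tweak for the "basement node not in memory" case:
// if the range has collapsed to a single key (a point operation), claim
// exactly one match; otherwise keep returning the existing per-basement
// average, unchanged.
static uint64_t estimate_when_basement_not_in_memory(
        const range_key &start_key,
        const range_key &end_key,
        uint64_t physical_leaf_entries,
        uint32_t n_basement_nodes) {
    if (keys_equal(start_key, end_key)) {
        return 1;  // point select/delete: at most one key can match
    }
    // unchanged behavior for genuine (possibly narrow) ranges
    return n_basement_nodes ? physical_leaf_entries / n_basement_nodes : 0;
}
```

The point of confining the change to the equal-keys branch is that every other path keeps its current behavior, which is why it should be close to a no-op risk-wise.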