Comment 16 for bug 1671152

George Ormond Lorch III (gl-az) wrote :

OK, some data. This is kind of tough to explain without getting into detailed FT
internals, so bear with me:
1) The partition handler walks the 'active' partitions (partitions noted to be
   involved in the query) of which there are three in this query (don't ask me
   why or where three comes from, it is the same for InnoDB in the matching
   case). It walks them in descending order of 'largest' to 'smallest' based on
   the reported number of rows in the partition.
2) The handler calls ::records_in_range, which asks how many records match
   between two keys. Here the two keys are identical; this is a particularly
   important component of this specific issue (it is a point select, not a
   range).
3) We work our way down into PerconaFT, all the way down to the leaf nodes, to
   try to figure out how many rows exist between the two keys.
   - PerconaFT nodes are 'partitioned'.
     - A node has some meta data, a set of pivot keys, and a set of
       partitions.
     - Only the partitions of a node that are required for the immediate
       operation are loaded into memory.
     - Any write to a node requires that the node is fully in memory, as the
       node must be written as a contiguous data blob; reads only touch the
       partitions needed to satisfy the query.
       - Partitions may be individually evicted without evicting the entire
         node, this is known as partial eviction and is controlled via
         tokudb_enable_partial_evictions (default=true)
       - Internal/non-leaf nodes:
         - The tokudb_fanout parameter (default 16) defines the maximum number
           of child nodes that an internal node can have. An internal node can
           and often does have fewer children, but never more.
         - Each internal node has a set of 'pivot keys' that define the key
           boundaries between the edges of the internal node and each individual
           child node.
         - There is one message buffer for each child node; each buffer is a
           partition.
       - External/leaf nodes:
         - Are partitioned based on tokudb_block_size/tokudb_read_block_size.
           This defines the number of partitions in a leaf node. A leaf node can
           have fewer partitions than the defined maximum, but never more.
         - Leaf node partitions are called basement nodes.
         - A basement node contains a series of 'leaf entries'. Leaf entries
           represent a 'row', with a key and the MVCC stack for that key.
   - PerconaFT contains an optimization when estimating the number of keys
     within a range inside a leaf node. If the target basement node is not in
     memory, it 'guesses' the number of keys that might be within that range. If
     the basement node is in memory, an accurate count is fetched: messages are
     transiently pulled down from above and applied to the leaf entries.
     - This 'guess' is a bit of a multi-layered calculation:
       - It takes the total number of _physical_ rows in the entire leaf node
         and divides that by the number of basement nodes in the leaf node.
         ** Note: It uses the physical count of leaf entries, not the logical
         count tracked on the whole tree, so many rows may be deleted but still
         physically present; this is why running OPTIMIZE TABLE yields better
         results.
       - This can result in a rather large, but totally false, number of
         matching keys, particularly on secondary indices where there is no data
         component to record, only the key and back referral to the PK.
       - Since these records tend to be very much smaller than their PK + data
         counterpart, many more can fit into a node.
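The 'guess' above boils down to a simple per-basement average. A minimal sketch
(all names here are illustrative, not the real PerconaFT identifiers):

```python
def guess_keys_in_range(physical_leaf_entries, num_basement_nodes):
    """Estimate used when the target basement node is NOT in memory:
    total physical leaf entries in the leaf node, divided evenly
    across its basement nodes. Deleted-but-not-yet-garbage-collected
    entries still count, which inflates the result."""
    return physical_leaf_entries // num_basement_nodes

# A leaf node with 4096 physical entries split across 4 basement nodes
# reports 1024 "matching" keys, even for a point select:
print(guess_keys_in_range(4096, 4))  # 1024
```

On a narrow secondary index, where thousands of such entries fit in one leaf
node, this average can be wildly larger than the true match count.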

Now that the internals lesson is over, there are a couple of things here that I
think can be done.
- First, I think that since the FT logic has narrowed us down to a basement node
  that we know overlaps the start key, it might make sense, when the basement is
  not in memory, to compare the end key to the start key for equality: if they
  are equal, just return 1 match; otherwise go ahead and return the phony
  estimate. This will not fix the same issue for narrow range scans, but it will
  fix point operations such as the point deletion in this example.
- Second, it can be argued that this optimization is incorrect and that
  PerconaFT should bring at least one of, if not all of, the needed basements
  into memory in order to obtain a more accurate count. Then this is no longer
  an estimate, it is a real count. But if the optimizer then chooses not to use
  this index, we may have done a whole lot of read I/O for nothing. This will
  add to the TokuDB
  over-read woes. Think of the case of maybe a table with many indices and
  someone does a "SELECT * from table WHERE id between reallysmall AND
  reallylarge". It is possible that the optimizer would call
  ::records_in_range for all matching indices with this huge range, scanning
  several indices, then just resorting to a table scan anyway. So you now have
  an index scan for each matching index, just to test the index, then whatever
  scan the optimizer chooses and the final fetch. So a lot of potential for
  over-read.
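The first idea can be sketched as a small guard in front of the existing guess.
This is a hypothetical illustration of the logic, not the actual PerconaFT code
(function and parameter names are invented for clarity):

```python
def estimate_keys_in_range(start_key, end_key, basement_in_memory,
                           physical_leaf_entries, num_basement_nodes,
                           exact_count):
    """Proposed estimation order for one basement node:
    1. Basement in memory -> do the accurate count (messages applied).
    2. Basement cold, but start == end -> point lookup, return 1.
    3. Otherwise fall back to the existing per-basement guess."""
    if basement_in_memory:
        return exact_count()
    if start_key == end_key:
        return 1  # point select: at most one match, skip the phony estimate
    return physical_leaf_entries // num_basement_nodes

# Point select against a cold basement now returns 1 instead of 1024:
print(estimate_keys_in_range(b"k1", b"k1", False, 4096, 4, lambda: 7))  # 1
```

A true range (start != end) against a cold basement still gets the old guess,
which is why this only helps point operations like the deletion in this bug.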

I think implementing the first idea would be nearly a no-op in terms of chances
of breaking something. Going into the second is a possible rat hole of breaking
established performance characteristics.