TokuDB does not use an index even if cardinality is good

Bug #1671152 reported by Rick Pizzi on 2017-03-08
This bug affects 3 people

Percona Server moved to https://jira.percona.com/projects/PS
Status tracked in 5.6 and 5.7:
  5.6: Fix Released, Importance High, assigned to George Ormond Lorch III
  5.7: Fix Released, Importance High, assigned to George Ormond Lorch III

Bug Description

Please consider the following partitioned table:

CREATE TABLE `BIG_STORAGE_TOKU` (
  `HASH_ID` char(64) NOT NULL,
  `SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
  `COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `LAST_UPDATE` datetime NOT NULL,
  `EXPIRE_DATE` date NOT NULL,
  KEY `HASH_ID_IX` (`HASH_ID`),
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1 ROW_FORMAT=TOKUDB_UNCOMPRESSED
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p364 VALUES LESS THAN (736753) ENGINE = TokuDB,
 PARTITION p365 VALUES LESS THAN (736754) ENGINE = TokuDB,
 PARTITION p366 VALUES LESS THAN (736755) ENGINE = TokuDB,
 PARTITION p367 VALUES LESS THAN (736756) ENGINE = TokuDB,
 PARTITION p368 VALUES LESS THAN (736757) ENGINE = TokuDB,
 PARTITION p369 VALUES LESS THAN (736758) ENGINE = TokuDB,
 PARTITION p370 VALUES LESS THAN (736759) ENGINE = TokuDB,
 PARTITION p371 VALUES LESS THAN (736760) ENGINE = TokuDB,
 PARTITION p372 VALUES LESS THAN (736761) ENGINE = TokuDB,
 PARTITION p373 VALUES LESS THAN (736762) ENGINE = TokuDB,
 PARTITION p374 VALUES LESS THAN (736763) ENGINE = TokuDB,
 PARTITION p375 VALUES LESS THAN (736764) ENGINE = TokuDB,
 PARTITION p376 VALUES LESS THAN (736765) ENGINE = TokuDB,
 PARTITION p377 VALUES LESS THAN (736766) ENGINE = TokuDB,
 PARTITION p378 VALUES LESS THAN (736767) ENGINE = TokuDB,
 PARTITION p379 VALUES LESS THAN (736768) ENGINE = TokuDB,
 PARTITION p380 VALUES LESS THAN (736769) ENGINE = TokuDB,
 PARTITION p381 VALUES LESS THAN (736770) ENGINE = TokuDB,
 PARTITION p382 VALUES LESS THAN (736771) ENGINE = TokuDB,
 PARTITION p383 VALUES LESS THAN (736772) ENGINE = TokuDB,
 PARTITION p384 VALUES LESS THAN (736773) ENGINE = TokuDB,
 PARTITION p385 VALUES LESS THAN (736774) ENGINE = TokuDB,
 PARTITION p386 VALUES LESS THAN (736775) ENGINE = TokuDB,
 PARTITION p387 VALUES LESS THAN (736776) ENGINE = TokuDB,
 PARTITION p388 VALUES LESS THAN (736777) ENGINE = TokuDB,
 PARTITION p389 VALUES LESS THAN (736778) ENGINE = TokuDB,
 PARTITION p390 VALUES LESS THAN (736779) ENGINE = TokuDB,
 PARTITION p391 VALUES LESS THAN (736780) ENGINE = TokuDB,
 PARTITION p392 VALUES LESS THAN (736781) ENGINE = TokuDB,
 PARTITION p393 VALUES LESS THAN (736782) ENGINE = TokuDB,
 PARTITION p394 VALUES LESS THAN (736783) ENGINE = TokuDB,
 PARTITION p395 VALUES LESS THAN (736784) ENGINE = TokuDB) */

Load the table with sysbench using the following Lua code, and let it run until the row count is about 100,000. (The script below targets the InnoDB copy; substitute BIG_STORAGE_TOKU to load the TokuDB table.)

function event(thread_id)
   local s1 = sb_rand_str("#####")
   db_query("DELETE FROM BIG_STORAGE_INNO WHERE HASH_ID = SHA2('" .. s1 .. "', 256) AND EXPIRE_DATE > NOW()");
   db_query("INSERT INTO BIG_STORAGE_INNO VALUES(SHA2('" .. s1 .. "', 256), 'PROTOSTUFF', 'GZIP', REPEAT(CHAR(FLOOR(RAND()*96)+32), FLOOR(RAND()*16384)), NOW(), DATE_ADD(NOW(), INTERVAL FLOOR(RAND()*10) +1 day))")
end

This will create a distribution of roughly 100,000 random records, each with a random payload between 0 and 16 KB, scattered throughout 10 partitions.
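
As a side note on the key space: the Lua script draws 5-digit strings, so a short Python sketch (hashlib standing in for MySQL's SHA2; the iteration count is illustrative, not from the report) shows why the number of distinct HASH_ID values saturates at 100,000 however long sysbench runs:

```python
import hashlib
import random

def hash_id(key: str) -> str:
    """Stand-in for MySQL's SHA2(key, 256): a 64-char hex HASH_ID."""
    return hashlib.sha256(key.encode()).hexdigest()

# sb_rand_str("#####") yields a 5-digit string, so the key space is
# '00000'..'99999': at most 100,000 distinct HASH_ID values.
random.seed(1)
keys = {hash_id(f"{random.randrange(100000):05d}") for _ in range(500000)}
print(len(keys))  # close to, but never more than, 100,000
```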

Run ANALYZE TABLE on it, then try the following EXPLAINs:

explain DELETE from BIG_STORAGE_TOKU where HASH_ID = SHA2('70164', 256);
explain SELECT * from BIG_STORAGE_TOKU where HASH_ID = SHA2('70164', 256);

You will see that TokuDB refuses to use any index.

InnoDB works fine in the exact same setup and the index is always chosen; see below.
We are experiencing this in production with much larger tables.

dbcache05>analyze table BIG_STORAGE_TOKU;
+-------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+-------------------------+---------+----------+----------+
| sbtest.BIG_STORAGE_TOKU | analyze | status | OK |
+-------------------------+---------+----------+----------+
1 row in set (0.42 sec)

dbcache05>show indexes from BIG_STORAGE_TOKU;
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| BIG_STORAGE_TOKU | 1 | HASH_ID_IX | 1 | HASH_ID | A | 100024 | NULL | NULL | | BTREE | | |
| BIG_STORAGE_TOKU | 1 | EXPIRE_DATE_IX | 1 | EXPIRE_DATE | A | 19 | NULL | NULL | | BTREE | | |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.02 sec)

dbcache05>explain select * from BIG_STORAGE_TOKU where HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | ALL | HASH_ID_IX | NULL | NULL | NULL | 100024 | Using where |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

dbcache05>explain delete from BIG_STORAGE_TOKU where HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | ALL | HASH_ID_IX | NULL | NULL | NULL | 100024 | Using where |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

This is result with identical steps but using InnoDB:

dbcache05>analyze table BIG_STORAGE_INNO;
+-------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+-------------------------+---------+----------+----------+
| sbtest.BIG_STORAGE_INNO | analyze | status | OK |
+-------------------------+---------+----------+----------+
1 row in set (0.10 sec)

dbcache05>show indexes from BIG_STORAGE_INNO;
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| BIG_STORAGE_INNO | 1 | HASH_ID_IX | 1 | HASH_ID | A | 64665 | NULL | NULL | | BTREE | | |
| BIG_STORAGE_INNO | 1 | EXPIRE_DATE_IX | 1 | EXPIRE_DATE | A | 14 | NULL | NULL | | BTREE | | |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.01 sec)

dbcache05>explain select * from BIG_STORAGE_INNO where HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------------+---------+-------+------+-------------+
| 1 | SIMPLE | BIG_STORAGE_INNO | ref | HASH_ID_IX | HASH_ID_IX | 64 | const | 7 | Using where |
+----+-------------+------------------+------+---------------+------------+---------+-------+------+-------------+
1 row in set (0.00 sec)

dbcache05>explain delete from BIG_STORAGE_INNO where HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
| 1 | SIMPLE | BIG_STORAGE_INNO | range | HASH_ID_IX | HASH_ID_IX | 64 | const | 7 | Using where |
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
1 row in set (0.00 sec)

Rick Pizzi (pizzi) wrote :

Here's the optimizer trace, which indicates that MySQL thinks the cost of the index scan is higher than the cost of a full table scan. This is of course wrong.

{
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          {
            "expanded_query": "/* select#1 */ select `BIG_STORAGE_TOKU`.`EXPIRE_DATE` AS `EXPIRE_DATE` from `BIG_STORAGE_TOKU` where (`BIG_STORAGE_TOKU`.`HASH_ID` = sha2('70164',256))"
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          {
            "condition_processing": {
              "condition": "WHERE",
              "original_condition": "(`BIG_STORAGE_TOKU`.`HASH_ID` = sha2('70164',256))",
              "steps": [
                {
                  "transformation": "equality_propagation",
                  "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
                },
                {
                  "transformation": "constant_propagation",
                  "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
                },
                {
                  "transformation": "trivial_condition_removal",
                  "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
                }
              ]
            }
          },
          {
            "table_dependencies": [
              {
                "table": "`BIG_STORAGE_TOKU`",
                "row_may_be_null": false,
                "map_bit": 0,
                "depends_on_map_bits": [
                ]
              }
            ]
          },
          {
            "ref_optimizer_key_uses": [
              {
                "table": "`BIG_STORAGE_TOKU`",
                "field": "HASH_ID",
                "equals": "sha2('70164',256)",
                "null_rejecting": false
              }
            ]
          },
          {
            "rows_estimation": [
              {
                "table": "`BIG_STORAGE_TOKU`",
                "range_analysis": {
                  "table_scan": {
                    "rows": 100030,
                    "cost": 53351
                  },
                  "potential_range_indices": [
                    {
                      "index": "HASH_ID_IX",
                      "usable": true,
                      "key_parts": [
                        "HASH_ID"
                      ]
                    },
                    {
                      "index": "EXPIRE_DATE_IX",
                      "usable": false,
                      "cause": "not_applicable"
                    }
                  ],
                  "setup_range_conditions": [
                  ],
                  "group_index_range": {
                    "chosen": false,
                    "cause": "not_group_by_or_distinct"
                  },
                  "analyzing_range_alternatives": {
                    "range_scan_alternatives": [
                      {
                        "index": "HASH_ID_IX",
                        "ranges": [
                          "ffff925a44d...


Rick Pizzi (pizzi) wrote :

Running OPTIMIZE TABLE and repeating the test brings the desired result.
I believe that the issue is due to a wrong estimate of the number of rows in the table, which is probably TokuDB not subtracting the deleted rows from the count.
If this is correct, then in a workload with many deletes this quickly becomes a serious problem.
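
Rick's hypothesis can be restated as a toy model (illustrative Python only, not TokuDB code; the class, methods, and numbers are invented for the sketch): if delete messages are buffered rather than applied immediately, the logical row count an index reports lags reality and stays inflated.

```python
# Toy model of the hypothesis: deletes are buffered as messages, so the
# logical count seen by the optimizer ignores them until they reach the
# leaf nodes (e.g. after OPTIMIZE TABLE flushes the message buffers).
class ToyFractalIndex:
    def __init__(self):
        self.committed_rows = 0    # rows visible in leaf nodes
        self.pending_deletes = 0   # delete messages not yet applied

    def insert(self, n=1):
        self.committed_rows += n

    def delete(self, n=1):
        # Only buffered; the logical count is not reduced yet.
        self.pending_deletes += n

    def estimated_rows(self):
        # What a lagging logical count would report to the optimizer.
        return self.committed_rows

    def actual_rows(self):
        return self.committed_rows - self.pending_deletes

idx = ToyFractalIndex()
idx.insert(100000)
idx.delete(40000)
print(idx.estimated_rows(), idx.actual_rows())  # 100000 60000
```

With the estimate stuck at 100,000 while only 60,000 rows remain, every cost computed from it is proportionally wrong.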

Rick Pizzi (pizzi) wrote :

After OPTIMIZE TABLE, cost for using index is correct and index is used:

                "table": "`BIG_STORAGE_TOKU`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "HASH_ID_IX",
                      "rows": 9,
                      "cost": 10.8,
                      "chosen": true

George Ormond Lorch III (gl-az) wrote :

So we had an extended offline discussion and some debugging, and it comes down to a few things.

1) Logical row counting in PerconaFT is a moving target: some operations can be counted at the time of the operation, while others cannot be known until the message is persistently delivered to the leaf nodes. The count 'floats' towards accuracy but is never going to be 100% accurate, except for the moment after either an OPTIMIZE TABLE or an ANALYZE TABLE with RECOUNT_ROWS.
2) Different indices (PK and secondaries) on the same table will report different logical row count estimates simply due to the dynamics of the PerconaFT messaging scheme: different record sizes, data concentration within nodes, background threads randomly picking nodes to perform work on, etc. All of this impacts the 'in-flight' messaging and means that no two indices will ever report identical counts.
3) When an OPTIMIZE is run, it flushes all messages to the leaf nodes and removes any garbage from all of the index trees, and as a result normalizes the row counts across the indices.
4) There is a bug (I say bug because it is a different behavior from InnoDB) in TokuDB's ha_tokudb::records_in_range method when the optimizer asks for the number of records in the range -infinity to +infinity: TokuDB returns the logical row count for that index, not for the table/PK, which tends to be more accurate. This causes the optimizer to calculate different costs for different indices as a result of getting sometimes drastically different logical row counts. InnoDB returns the same 'table count' for all indices. The fix should be to have TokuDB return the logical row count from the PK for all indices, so that the optimizer can compare apples to apples w.r.t. cardinality statistics rather than apples to Volkswagens.
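
A minimal sketch of what point 4 proposes, in Python pseudo-handler form (the real method is the C++ ha_tokudb::records_in_range; the function shape, table layout, and numbers below are illustrative assumptions, not Percona code):

```python
# Sentinels for the unbounded-range probe the optimizer issues.
NEG_INF, POS_INF = object(), object()

def records_in_range(table, index_name, min_key, max_key):
    """For -infinity..+infinity, report the PK's logical count for every
    index (the proposed fix, matching InnoDB's 'table count' behavior);
    otherwise fall back to the index's own range estimate."""
    if min_key is NEG_INF and max_key is POS_INF:
        return table["pk_logical_count"]
    return table["index_estimates"][index_name]

table = {
    "pk_logical_count": 100000,
    # Per-index logical counts drift independently while messages are
    # in flight (numbers invented for illustration):
    "index_estimates": {"HASH_ID_IX": 154749, "EXPIRE_DATE_IX": 83210},
}

# After the fix, the unbounded probe gives the same answer for every
# index, so the optimizer compares apples to apples:
print(records_in_range(table, "HASH_ID_IX", NEG_INF, POS_INF))      # 100000
print(records_in_range(table, "EXPIRE_DATE_IX", NEG_INF, POS_INF))  # 100000
```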

Rick Pizzi (pizzi) wrote :

I just tried the above fix (compiled from source after applying the above pull request for 5.6 by hand) and it does NOT fix the problem for us in the above benchmark.

dbcache05>explain DELETE FROM BIG_STORAGE_TOKU WHERE HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | ALL | HASH_ID_IX | NULL | NULL | NULL | 100035 | Using where |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

The optimizer still prefers a full table scan due to the cost of the index dive:

{
  "steps": [
    {
      "condition_processing": {
        "condition": "WHERE",
        "original_condition": "(`BIG_STORAGE_TOKU`.`HASH_ID` = sha2('70164',256))",
        "steps": [
          {
            "transformation": "equality_propagation",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          },
          {
            "transformation": "constant_propagation",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          },
          {
            "transformation": "trivial_condition_removal",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          }
        ]
      }
    },
    {
      "table": "`BIG_STORAGE_TOKU`",
      "range_analysis": {
        "table_scan": {
          "rows": 100036,
          "cost": 53355
        },
        "potential_range_indices": [
          {
            "index": "HASH_ID_IX",
            "usable": true,
            "key_parts": [
              "HASH_ID"
            ]
          },
          {
            "index": "EXPIRE_DATE_IX2",
            "usable": true,
            "key_parts": [
              "EXPIRE_DATE"
            ]
          }
        ],
        "setup_range_conditions": [
        ],
        "group_index_range": {
          "chosen": false,
          "cause": "no_join"
        },
        "analyzing_range_alternatives": {
          "range_scan_alternatives": [
            {
              "index": "HASH_ID_IX",
              "ranges": [
                "ffff925a44dfc908915841bdafe81d3ee3b5238610e4901d0d806e4740881084 <= HASH_ID <= ffff925a44dfc908915841bdafe81d3ee3b5238610e4901d0d806e4740881084"
              ],
              "index_dives_for_eq_ranges": true,
              "rowid_ordered": true,
              "using_mrr": false,
              "index_only": false,
              "rows": 154749,
              "cost": 185700,
              "chosen": false,
              "cause": "cost"
            }
          ]
        }
      }
    }
  ]
}
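
Stripped down to the numbers in the trace above, the optimizer's rejection of the range scan is a plain cost comparison (a toy restatement of the trace, not MySQL's actual cost model):

```python
# Estimates taken verbatim from the optimizer trace above.
table_scan = {"rows": 100036, "cost": 53355}
range_scan = {"rows": 154749, "cost": 185700}  # rejected, "cause": "cost"

chosen = "table_scan" if table_scan["cost"] <= range_scan["cost"] else "range_scan"
print(chosen)  # table_scan

# The tell-tale symptom: the range estimate claims MORE matching rows
# for a single-key point range than exist in the entire table.
print(range_scan["rows"] > table_scan["rows"])  # True
```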

[root@dbcache05 data]# tokuftdump --header _sbtest_sql_3af4_38_P_p376_main_10d2efd87_2_1d.tokudb
ft:
 layout_version=29
 layout_version_original=29
 layout_version_read_from_disk=29
 build_id=0
 build_id_original=0...


Rick Pizzi (pizzi) wrote :

There are some records in the dataset which trigger the table-scan effect, and others which do not:

dbcache05>explain DELETE FROM BIG_STORAGE_TOKU WHERE HASH_ID = SHA2('30398', 256);
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | ALL | HASH_ID_IX | NULL | NULL | NULL | 100014 | Using where |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

dbcache05>explain DELETE FROM BIG_STORAGE_TOKU WHERE HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | range | HASH_ID_IX | HASH_ID_IX | 64 | const | 14 | Using where |
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
1 row in set (0.00 sec)

dbcache05>SET optimizer_trace='enabled=on';
Query OK, 0 rows affected (0.00 sec)

dbcache05>DELETE FROM BIG_STORAGE_TOKU WHERE HASH_ID = SHA2('70164', 256);
Query OK, 1 row affected (0.00 sec)

dbcache05>SELECT trace FROM information_schema.optimizer_trace \G
*************************** 1. row ***************************
trace: {
  "steps": [
    {
      "condition_processing": {
        "condition": "WHERE",
        "original_condition": "(`BIG_STORAGE_TOKU`.`HASH_ID` = sha2('70164',256))",
        "steps": [
          {
            "transformation": "equality_propagation",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          },
          {
            "transformation": "constant_propagation",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          },
          {
            "transformation": "trivial_condition_removal",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          }
        ]
      }
    },
    {
      "table": "`BIG_STORAGE_TOKU`",
      "range_analysis": {
        "table_scan": {
          "rows": 100022,
          "cost": 53347
        },
        "potential_range_indices": [
          {
            "index": "HASH_ID_IX",
            "usable": true,
            "key_parts": [
              "HASH_ID"
            ]
          },
          {
            "index": "EXPIRE_DATE_IX2",
            "usable": true,
            "key_parts": [
              "EXPIRE_DATE"
            ]
          }
        ],
        "setup_range_conditions": [
        ],
        "group_index_range": {
          "chosen": false,
          "cause": "no_join"
        },
   ...


George Ormond Lorch III (gl-az) wrote :

This is strange Rick. With the current 'fix' I used exactly your table structure to create one TokuDB table and one InnoDB table. I then took your lua script and 'bash'ed it and got about 150K rows spread across 9 or 10 partitions into the TokuDB table. I then did an INSERT INTO test.BIG_STORAGE_INNO SELECT * FROM BIG_STORAGE_TOKU followed by ANALYZE TABLE on both tables.

So we now have two tables that are identical in content and partition distribution.

Attached are the SHOW INDEXES and EXPLAIN results from the two tables. They seem to be very close in responses and choose the same plans.

So I am unsure of where the differences lie. Maybe I'll see if I can get more rows in the tables for a larger sampling set.

Rick Pizzi (pizzi) wrote :

The Lua script creates rows with HASH_ID based on a SHA-256 hash of numeric values between 00000 and 99999, so it can create at most 100,000 different values
(and it will, very quickly).
If you want to see the actual problem in action, it should be fairly easy.
Run sysbench with 16 threads and my Lua posted here; let it run, and when the count is around 100,000, start looking at the output of SHOW PROCESSLIST while sysbench runs.
You will see that from time to time one of the DELETEs is doing a full table scan (and Toku will clearly mark it as such, because
you will see something like "scanned 20,000 rows of about 100,000").
When you see this happening, just EXPLAIN the DELETE for that key and you will see it is not using any index.
At that point you can stop sysbench and run your EXPLAIN again, which will give the same result (no index used).
Different keys _do_ use the index, which adds to the weirdness. See my example above.

From my tests with your patch, things are better (there are fewer full table scans) but still not okay.
The estimate is still not correct.
Again, I wonder why you can’t just return the logical row count as the estimate? It always looks correct.

Thanks
Rick

Riccardo Pizzi
<email address hidden>
http://www.pinballowners.com/webmaster

On 10/mar/2017, at 19:26, George Ormond Lorch III <email address hidden> wrote:

> This is strange Rick. With the current 'fix' I used exactly your table
> structure to create one TokuDB table and one InnoDB table. I then took
> your lua script and 'bash'ed it and got about 150K rows spread across 9
> or 10 partitions into the TokuDB table. I then did an INSERT INTO
> test.BIG_STORAGE_INNO SELECT * FROM BIG_STORAGE_TOKU followed by ANALYZE
> TABLE on both tables.
>
> So we now have two tables that are identical in content and partition
> distribution.
>
> Attached are the SHOW INDEXES and EXPLAIN results from the two tables.
> They seem to be very close in responses and choose the same plans.
>
> So I am unsure of where the differences lie. Maybe I'll see if I can get
> more rows in the tables for a larger sampling set.
>
> ** Attachment added: "inno-toku.txt"
> https://bugs.launchpad.net/percona-server/+bug/1671152/+attachment/4835459/+files/inno-toku.txt

George Ormond Lorch III (gl-az) wrote :

Ahh, OK, so you are running to 100K rows, then _continuing_ to run, so 100K is roughly the max and you are just then continuing to work within that range of keys. That means more deletes or actual hits on existing rows. My bash does the same, originally was limited to 32K rows but I changed that and went to 3 million rows.

The logical row estimate with messages in flight will never be accurate; it cannot be, within a deferred-work system, without performing an actual recount (a table scan) each time it is asked for. So what is being returned now from ::records_in_range when queried for -infinity..+infinity, and from ::info, is the logical row estimate for the PK. Prior to this fix, ::info was returning the logical row estimate for the PK while ::records_in_range for -infinity..+infinity was returning the logical row estimate for that particular key, which could be very different from the PK.

There may be some other issue, which I am trying to get into, when ::records_in_range is provided actual key range values. This goes deep into the fractal tree and requires debugging, as this is a code area that I have never been into before. It is entirely possible that this code is just dropping down to leaf nodes and making some wild assumptions about the distance between two keys, without taking into account what messages are above it in the tree, and is returning some insane number. If that is the case, it is possible that there is no easy or practical fix that avoids impacting the performance of ::records_in_range while returning reasonably accurate estimates for all workloads and states of the tree. Meaning the 'fix' is to keep the trees optimized, which has been a historic problem with TokuDB and PerconaFT and is precisely what drove the switch from physical counts to logical counts, as well as automatic table analysis.

Either way, I can't debug until I actually reproduce the issue and have a known key/query that I can target once I reach that state.

Rick Pizzi (pizzi) wrote :

Please see answers inline.

On 10/mar/2017, at 21:48, George Ormond Lorch III <email address hidden> wrote:

> Ahh, OK, so you are running to 100K rows, then _continuing_ to run, so
> 100K is roughly the max and you are just then continuing to work within
> that range of keys. That means more deletes or actual hits on existing
> rows. My bash does the same, originally was limited to 32K rows but I
> changed that and went to 3 million rows.

Correct, this is to simulate our workload when we are bitten by the bug.
We have more than 100k possible keys in that workload of course, I have restricted
the distribution amplitude to reproduce the problem quickly.

I would recommend to try using sysbench (we use 0.5.1) with the LUA code I posted.

>
> The logical row estimate with messages in flight will never be accurate,
> it can not be within a deferred work system without performing an actual
> recount each time it is asked for (a table scan).

Well, from the dumps I have done with the tokuft_dump utility, the logical count was always
accurate for some reason. Just see the samples I posted above in this bug report…. the estimate
was way off, but the logical row count was always approximately 100,000. Maybe a coincidence?

> So what is being
> returned now from ::records_in_range when queried for -+infinity and
> ::info is the logical row estimate for the PK. Prior to this fix, ::info
> was returning the logical row estimate for the PK and ::records_in_range
> -_infinity was returning the logical row estimate for that particular
> key, which could be very different from the PK.

I understood that and I thought that would have solved the issue, but unfortunately it didn’t,
although things are a bit better than before in our tests.

>
> There may be some other issue that I am trying to get in to if
> ::records_in_range is being provided actual key range values, this goes
> deep into the fractal tree and requires debugging as this is a code area
> that I have never been into before. It is entirely possible that this
> code is just dropping down to leaf nodes and making some wild
> assumptions about what the distance is between two keys without taking
> into account what messages are above it in the tree and is returning
> some insane number. If that is the case, it is possible that there is
> not an easy or practical fix that doesn't impact the performance of
> ::records_in_range and returns reasonably accurate estimates for all
> workloads and cases of the tree state. Meaning the 'fix' is to keep the
> trees optimized, which has been a historic problem with TokuDB and
> PerconaFT and is precisely what drove the switch from physical counts to
> logical counts as well as automatic table analysis.

I understand this may not be an easy thing to fix, nonetheless this is an important bug
and can prevent use of Toku in specific workloads where there are many deletions.
I have been trying to run OPTIMIZE TABLE (throttled) during the benchmark but it
seems to be taking too much time and the bug hits nonetheless.
I will have to try with more aggressive values (less throttling) hoping that it will not
create FT locking issues. But even if I succeed, t...


I just want to be sure that you do not have tokudb_auto_analyze=(anything > 0) and tokudb_analyze_in_background=off; if so, what you are seeing is an operation that triggered a table analysis and is performing that analysis in the same context as the thread that triggered the operation. This is not the default set of options, and it looks like this in SHOW PROCESSLIST:

analyze table standard test.toku#P#p376.EXPIRE_DATE_IX 1 of 2 2% rows 0% time, scanning forward

OK, got one...debugging

So I see something interesting, the bad query only seems to happen on the largest hash_id in the whole table. Not the smallest and not the largest and smallest in a partition, only the largest in the entire table.

Confirmed, it happens on the largest compared hash_id in the hash_id_idx. It seems that the key must have existed at one point, as just doing a DELETE with any arbitrary higher value does not trigger it.

mysql> select SHA2('70164', 256);
+------------------------------------------------------------------+
| SHA2('70164', 256) |
+------------------------------------------------------------------+
| ffff925a44dfc908915841bdafe81d3ee3b5238610e4901d0d806e4740881084 |
+------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select MAX(hash_id) from test.toku;
+------------------------------------------------------------------+
| MAX(hash_id) |
+------------------------------------------------------------------+
| ffff925a44dfc908915841bdafe81d3ee3b5238610e4901d0d806e4740881084 |
+------------------------------------------------------------------+
1 row in set (0.01 sec)
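For reference, MySQL's SHA2(str, 256) is a plain SHA-256 hex digest, so the observation can be reproduced outside the server (a quick Python check using the values pasted in the session above):

```python
import hashlib

# SHA2('70164', 256) in MySQL is the SHA-256 hex digest of the string '70164'.
digest = hashlib.sha256(b"70164").hexdigest()

# Per the session above, this same digest is also MAX(hash_id) for the whole
# table, i.e. the problematic DELETE targets the largest key in the index.
print(digest)
```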


OK, some data. This is kind of tough to explain without getting into detailed FT
internals so bear with me:
1) The partition handler walks the 'active' partitions (partitions noted to be
   involved in the query) of which there are three in this query (don't ask me
   why or where three comes from, it is the same for InnoDB in the matching
   case). It walks them in descending order of 'largest' to 'smallest' based on
   the reported number of rows in the partition.
2) It calls ::records_in_range and is being asked for how many records match
   between two keys, but the keys are identical, this is a particularly
   important component of this specific issue (it is a point select, not a
   range).
3) We work our way down into PerconaFT, all the way down to the leaf nodes to
   try and figure out how many rows exist between the two keys.
   - PerconaFT nodes are 'partitioned'.
     - A node has some meta data, a set of pivot keys, and a set of
       partitions.
     - Only the partitions of a node that are required for the immediate
       operation are loaded into memory.
     - Any write to a node requires that the node is fully in memory as the node
       must be written as a contiguous data blob, reads only read through the
       partitions needed to satisfy the query.
       - Partitions may be individually evicted without evicting the entire
         node, this is known as partial eviction and is controlled via
         tokudb_enable_partial_evictions (default=true)
       - Internal/non-leaf nodes:
         - The tokudb_fanout parameter (default 16) defines the maximum number
           of child nodes that an internal node can have. An internal node can
           and often does have fewer children, but never more.
         - Each internal node has a set of 'pivot keys' that define the key
           boundaries between the edges of the internal node and each individual
           child node.
         - There is one message buffer for each child node, this is a partition.
       - External/leaf nodes:
         - Are partitioned based on tokudb_block_size/tokudb_read_block_size.
           This defines the number of partitions in a leaf node. A leaf node can
           have fewer partitions than the defined maximum, but never more.
         - Leaf node partitions are called basement nodes.
         - A basement node contains a series of 'leaf entries'. Leaf entries
           represent a 'row', with a key and the MVCC stack for that key.
   - PerconaFT contains an optimization when estimating the number of keys
     between a range within a leaf node. If the target basement node is not in
     memory, it 'guesses' the number of keys that might be within that range. If
     the basement node is in memory, an accurate count is fetched: messages are
     transiently pulled down from above and applied to the leaf entries.
     - This 'guess' is a bit of a multi-layered calculation:
       - It takes the total number of _physical_ rows in the entire leaf node,
         and divides that by the number of basement nodes in the leaf node.
         ** Note: It uses the physical count of leaf entries, not the logical
         tr...
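The "guess" arithmetic described above can be sketched as follows (the function and variable names are made up for illustration, not real PerconaFT symbols):

```python
def estimate_keys_in_basement(leaf_physical_rows: int, num_basements: int) -> int:
    """Per-basement 'guess' used when a basement node is not in memory.

    Note that it divides the *physical* leaf-entry count (which includes
    garbage from deletions not yet cleaned up), not the logical row count,
    so a delete-heavy table inflates this number.
    """
    return leaf_physical_rows // max(num_basements, 1)

# Example: a leaf with 1,000,000 physical entries (mostly garbage) split
# over 8 basement nodes -> each unread basement is guessed to hold 125,000
# matching keys, potentially overstating the live row count by far.
print(estimate_keys_in_basement(1_000_000, 8))  # 125000
```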


I found some other things down in the PerconaFT library that make no sense in the estimations: internal nodes are adding values to the estimation totals that have nothing to do with the actual contents of the leaf nodes. This artificially inflates the numbers coming out of ::records_in_range by orders of magnitude.

Rick Pizzi (pizzi) wrote :

> Now that the internals lesson is over, there are a couple of things here that I
> think can be done.
> - First, I think that since the FT logic has narrowed us down to a basement node
> where we know that there is some cross section of the start key, it might make
> sense when the basement is not in memory to compare the end key to the start
> key for equality, and if so, just return 1 match, else go ahead and return the
> phony estimate. This will not fix the same issue for narrow range scans, but
> it will fix point operations such as the point deletion in this example.
> - Second, it can be argued that this optimization is incorrect and that PerconaFT
> should bring at least one of, if not all of, the needed basements into memory
> in order to obtain a more accurate count. Then this is no longer an estimate, it is a
> real count. If the optimizer then chooses not to use this index, we just did
> possibly a whole lot of read I/O for nothing. This will add to the TokuDB
> over-read woes. Think of the case of maybe a table with many indices and
> someone does a "SELECT * from table WHERE id between reallysmall AND
> reallylarge". It is possible that the optimizer would call
> ::records_in_range for all matching indices with this huge range, scanning
> several indices, then just resorting to a table scan anyway. So you now have
> an index scan for each matching index, just to test the index, then whatever
> scan the optimizer chooses and the final fetch. So a lot of potential for
> over-read.
>
> I think implementing the first idea would be nearly a no-op in terms of chances
> of breaking something. Going into the second is a possible rat hole of breaking
> established performance characteristics.
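The first idea above can be sketched in a few lines (hypothetical names; the real logic would live in PerconaFT's keyrange estimation path):

```python
def records_in_range_estimate(start_key: bytes, end_key: bytes,
                              basement_in_memory: bool,
                              guessed_rows: int) -> int:
    """Sketch of the proposed point-lookup shortcut.

    When the target basement node is not cached and the range degenerates
    to a single key (a point operation), return 1 instead of the phony
    per-basement estimate. Genuine narrow ranges still get the guess.
    """
    if not basement_in_memory and start_key == end_key:
        return 1            # point select/delete: at most one match per key
    return guessed_rows     # fall back to the existing estimate

# A point delete on an uncached basement no longer reports a huge count:
print(records_in_range_estimate(b"ffff...84", b"ffff...84", False, 250_000))  # 1
```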

Thank you, the "internals lesson" was a very interesting read.
I believe fixing the point deletions is a good idea and much needed, it will
solve the problem in the current test case.

Not sure what would happen when we add a condition on EXPIRE_DATE (which we
actually do for partition pruning purposes), but considering that MySQL will use at most
one index, perhaps it will just work fine: HASH_ID_IX will be selected by the optimizer,
and the other condition will only be used for pruning.

I see you have found more issues about the rows estimate too.
I am eager to test whatever fix you will submit.

Thanks

Rick


Riccardo Pizzi
<email address hidden>
http://www.pinballowners.com/webmaster

On 11/mar/2017, at 01:32, George Ormond Lorch III <email address hidden> wrote:

> OK some data, this is kinds of tough to explain w/o getting into detailed FT
> internals so bear with me:
> 1) The partition handler walks the 'active' partitions (partitions noted to be
> involved in the query) of which there are three in this query (don't ask me
> why or where three comes from, it is the same for InnoDB in the matching
> case). It walks them in descending order of 'largest' to 'smallest' based on
> the reported number of rows in the partition.
> 2) It calls ::records_in_range and is being asked for how many records match
> between two keys, but the keys are identical, this is a particularly
> important component of this specific issue (it is ...

OK, I think I found the root problem. The discussion above dances around it, and the previous patch is part of the fix, but none of it actually hit the real root cause.

The PerconaFT keyrange estimator starts off at the top of the tree with the estimate set to the total number of _physical_ rows in the tree, not the logical count. It then walks down the tree and gradually knocks off of that count the logical rows that fall outside of the range. So right away there is a discrepancy between what it believes the estimate is and what the real estimate should be, and that difference is the 'garbage' within the tree.

In this very specific case it hits an interesting edge: the very last pivot key on the farthest rightmost leaf node is exactly the point key that the range is searching for, the farthest rightmost basement node is empty, and the basement node just to its left is not in memory. The estimator must therefore 'guess' that both basements are full of keys matching the range, and it returns its per-basement key estimate multiplied by these two nodes. This is where the huge row count reported to the optimizer comes from. I believe a few other combinations like this could cause a bad large value to be returned, but they are the exceptions, not the rule.

By changing the keyrange estimator to start with the current logical row count instead of the physical row count, the number of estimated rows per basement node is now much smaller and more accurate logically and returns a far more accurate guess number back out to the optimizer in ::records_in_range. The optimizer, now having a more reasonable estimate, then uses it to arrive at the correct query plan.
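The shape of the fix can be illustrated like this (hypothetical numbers and names; the actual change is inside PerconaFT's keyrange estimator):

```python
def keyrange_estimate(start_total: int, rows_excluded_by_walk: int) -> int:
    """The estimator starts from a total row count and subtracts rows that
    fall outside the range as it walks down the tree."""
    return max(start_total - rows_excluded_by_walk, 0)

physical_rows = 5_000_000   # includes garbage (deleted/MVCC entries)
logical_rows  = 1_000_000   # live rows only
excluded      = 990_000     # rows knocked off while descending the tree

# Old behavior: starting from the physical count, the 4M garbage entries
# survive into the estimate handed to ::records_in_range.
print(keyrange_estimate(physical_rows, excluded))  # 4010000
# Fixed behavior: starting from the logical count keeps the result sane.
print(keyrange_estimate(logical_rows, excluded))   # 10000
```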

Rick Pizzi (pizzi) wrote :

Recompiling with the above changes (thanks George!) I confirm that the problem is now gone.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-1073
