TokuDB does not use an index even if cardinality is good

Bug #1671152 reported by Rick Pizzi on 2017-03-08
This bug affects 3 people

Percona Server moved to https://jira.percona.com/projects/PS
Status tracked in 5.6 and 5.7:
  5.6: Fix Released, Importance High, assigned to George Ormond Lorch III
  5.7: Fix Released, Importance High, assigned to George Ormond Lorch III

Bug Description

Please consider the following partitioned table:

CREATE TABLE `BIG_STORAGE_TOKU` (
  `HASH_ID` char(64) NOT NULL,
  `SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
  `COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `LAST_UPDATE` datetime NOT NULL,
  `EXPIRE_DATE` date NOT NULL,
  KEY `HASH_ID_IX` (`HASH_ID`),
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1 ROW_FORMAT=TOKUDB_UNCOMPRESSED
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p364 VALUES LESS THAN (736753) ENGINE = TokuDB,
 PARTITION p365 VALUES LESS THAN (736754) ENGINE = TokuDB,
 PARTITION p366 VALUES LESS THAN (736755) ENGINE = TokuDB,
 PARTITION p367 VALUES LESS THAN (736756) ENGINE = TokuDB,
 PARTITION p368 VALUES LESS THAN (736757) ENGINE = TokuDB,
 PARTITION p369 VALUES LESS THAN (736758) ENGINE = TokuDB,
 PARTITION p370 VALUES LESS THAN (736759) ENGINE = TokuDB,
 PARTITION p371 VALUES LESS THAN (736760) ENGINE = TokuDB,
 PARTITION p372 VALUES LESS THAN (736761) ENGINE = TokuDB,
 PARTITION p373 VALUES LESS THAN (736762) ENGINE = TokuDB,
 PARTITION p374 VALUES LESS THAN (736763) ENGINE = TokuDB,
 PARTITION p375 VALUES LESS THAN (736764) ENGINE = TokuDB,
 PARTITION p376 VALUES LESS THAN (736765) ENGINE = TokuDB,
 PARTITION p377 VALUES LESS THAN (736766) ENGINE = TokuDB,
 PARTITION p378 VALUES LESS THAN (736767) ENGINE = TokuDB,
 PARTITION p379 VALUES LESS THAN (736768) ENGINE = TokuDB,
 PARTITION p380 VALUES LESS THAN (736769) ENGINE = TokuDB,
 PARTITION p381 VALUES LESS THAN (736770) ENGINE = TokuDB,
 PARTITION p382 VALUES LESS THAN (736771) ENGINE = TokuDB,
 PARTITION p383 VALUES LESS THAN (736772) ENGINE = TokuDB,
 PARTITION p384 VALUES LESS THAN (736773) ENGINE = TokuDB,
 PARTITION p385 VALUES LESS THAN (736774) ENGINE = TokuDB,
 PARTITION p386 VALUES LESS THAN (736775) ENGINE = TokuDB,
 PARTITION p387 VALUES LESS THAN (736776) ENGINE = TokuDB,
 PARTITION p388 VALUES LESS THAN (736777) ENGINE = TokuDB,
 PARTITION p389 VALUES LESS THAN (736778) ENGINE = TokuDB,
 PARTITION p390 VALUES LESS THAN (736779) ENGINE = TokuDB,
 PARTITION p391 VALUES LESS THAN (736780) ENGINE = TokuDB,
 PARTITION p392 VALUES LESS THAN (736781) ENGINE = TokuDB,
 PARTITION p393 VALUES LESS THAN (736782) ENGINE = TokuDB,
 PARTITION p394 VALUES LESS THAN (736783) ENGINE = TokuDB,
 PARTITION p395 VALUES LESS THAN (736784) ENGINE = TokuDB) */

Load the table with sysbench using the following Lua code, and let it run until the row count is about 100,000. (The script below targets the InnoDB copy; substitute BIG_STORAGE_TOKU to load the TokuDB table.)

function event(thread_id)
   local s1 = sb_rand_str("#####")
   db_query("DELETE FROM BIG_STORAGE_INNO WHERE HASH_ID = SHA2('" .. s1 .. "', 256) AND EXPIRE_DATE > NOW()");
   db_query("INSERT INTO BIG_STORAGE_INNO VALUES(SHA2('" .. s1 .. "', 256), 'PROTOSTUFF', 'GZIP', REPEAT(CHAR(FLOOR(RAND()*96)+32), FLOOR(RAND()*16384)), NOW(), DATE_ADD(NOW(), INTERVAL FLOOR(RAND()*10) +1 day))")
end

This will create a distribution of roughly 100,000 random records, each with a random payload between 0 and 16 KB, scattered throughout 10 partitions.
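
As a side note on the key space: the Lua script draws 5-digit strings, so a short Python sketch (hashlib standing in for MySQL's SHA2; the iteration count is illustrative, not from the report) shows why the number of distinct HASH_ID values saturates at 100,000 however long sysbench runs:

```python
import hashlib
import random

def hash_id(key: str) -> str:
    """Stand-in for MySQL's SHA2(key, 256): a 64-char hex HASH_ID."""
    return hashlib.sha256(key.encode()).hexdigest()

# sb_rand_str("#####") yields a 5-digit string, so the key space is
# '00000'..'99999': at most 100,000 distinct HASH_ID values.
random.seed(1)
keys = {hash_id(f"{random.randrange(100000):05d}") for _ in range(500000)}
print(len(keys))  # close to, but never more than, 100,000
```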

Run ANALYZE TABLE on it, then try the following EXPLAINs:

explain DELETE from BIG_STORAGE_TOKU where HASH_ID = SHA2('70164', 256);
explain SELECT * from BIG_STORAGE_TOKU where HASH_ID = SHA2('70164', 256);

You will see that TokuDB refuses to use any index.

InnoDB works fine in the exact same setup and the index is always chosen; see below.
We are experiencing this in production with much larger tables.

dbcache05>analyze table BIG_STORAGE_TOKU;
+-------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+-------------------------+---------+----------+----------+
| sbtest.BIG_STORAGE_TOKU | analyze | status | OK |
+-------------------------+---------+----------+----------+
1 row in set (0.42 sec)

dbcache05>show indexes from BIG_STORAGE_TOKU;
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| BIG_STORAGE_TOKU | 1 | HASH_ID_IX | 1 | HASH_ID | A | 100024 | NULL | NULL | | BTREE | | |
| BIG_STORAGE_TOKU | 1 | EXPIRE_DATE_IX | 1 | EXPIRE_DATE | A | 19 | NULL | NULL | | BTREE | | |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.02 sec)

dbcache05>explain select * from BIG_STORAGE_TOKU where HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | ALL | HASH_ID_IX | NULL | NULL | NULL | 100024 | Using where |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

dbcache05>explain delete from BIG_STORAGE_TOKU where HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | ALL | HASH_ID_IX | NULL | NULL | NULL | 100024 | Using where |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

This is result with identical steps but using InnoDB:

dbcache05>analyze table BIG_STORAGE_INNO;
+-------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+-------------------------+---------+----------+----------+
| sbtest.BIG_STORAGE_INNO | analyze | status | OK |
+-------------------------+---------+----------+----------+
1 row in set (0.10 sec)

dbcache05>show indexes from BIG_STORAGE_INNO;
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| BIG_STORAGE_INNO | 1 | HASH_ID_IX | 1 | HASH_ID | A | 64665 | NULL | NULL | | BTREE | | |
| BIG_STORAGE_INNO | 1 | EXPIRE_DATE_IX | 1 | EXPIRE_DATE | A | 14 | NULL | NULL | | BTREE | | |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.01 sec)

dbcache05>explain select * from BIG_STORAGE_INNO where HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------------+---------+-------+------+-------------+
| 1 | SIMPLE | BIG_STORAGE_INNO | ref | HASH_ID_IX | HASH_ID_IX | 64 | const | 7 | Using where |
+----+-------------+------------------+------+---------------+------------+---------+-------+------+-------------+
1 row in set (0.00 sec)

dbcache05>explain delete from BIG_STORAGE_INNO where HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
| 1 | SIMPLE | BIG_STORAGE_INNO | range | HASH_ID_IX | HASH_ID_IX | 64 | const | 7 | Using where |
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
1 row in set (0.00 sec)

Rick Pizzi (pizzi) wrote :

Here's the optimizer trace, which indicates that MySQL thinks the cost of the index scan is higher than the cost of a full table scan. This is of course wrong.

{
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          {
            "expanded_query": "/* select#1 */ select `BIG_STORAGE_TOKU`.`EXPIRE_DATE` AS `EXPIRE_DATE` from `BIG_STORAGE_TOKU` where (`BIG_STORAGE_TOKU`.`HASH_ID` = sha2('70164',256))"
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          {
            "condition_processing": {
              "condition": "WHERE",
              "original_condition": "(`BIG_STORAGE_TOKU`.`HASH_ID` = sha2('70164',256))",
              "steps": [
                {
                  "transformation": "equality_propagation",
                  "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
                },
                {
                  "transformation": "constant_propagation",
                  "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
                },
                {
                  "transformation": "trivial_condition_removal",
                  "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
                }
              ]
            }
          },
          {
            "table_dependencies": [
              {
                "table": "`BIG_STORAGE_TOKU`",
                "row_may_be_null": false,
                "map_bit": 0,
                "depends_on_map_bits": [
                ]
              }
            ]
          },
          {
            "ref_optimizer_key_uses": [
              {
                "table": "`BIG_STORAGE_TOKU`",
                "field": "HASH_ID",
                "equals": "sha2('70164',256)",
                "null_rejecting": false
              }
            ]
          },
          {
            "rows_estimation": [
              {
                "table": "`BIG_STORAGE_TOKU`",
                "range_analysis": {
                  "table_scan": {
                    "rows": 100030,
                    "cost": 53351
                  },
                  "potential_range_indices": [
                    {
                      "index": "HASH_ID_IX",
                      "usable": true,
                      "key_parts": [
                        "HASH_ID"
                      ]
                    },
                    {
                      "index": "EXPIRE_DATE_IX",
                      "usable": false,
                      "cause": "not_applicable"
                    }
                  ],
                  "setup_range_conditions": [
                  ],
                  "group_index_range": {
                    "chosen": false,
                    "cause": "not_group_by_or_distinct"
                  },
                  "analyzing_range_alternatives": {
                    "range_scan_alternatives": [
                      {
                        "index": "HASH_ID_IX",
                        "ranges": [
                          "ffff925a44d...


Rick Pizzi (pizzi) wrote :

Running OPTIMIZE TABLE and repeating the test brings the desired result.
I believe that the issue is due to a wrong estimate of the number of rows in the table, which is probably TokuDB not subtracting the deleted rows from the count.
If this is correct, then in a workload with many deletes this quickly becomes a serious problem.
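
Rick's hypothesis can be restated as a toy model (illustrative Python only, not TokuDB code; the class, methods, and numbers are invented for the sketch): if delete messages are buffered rather than applied immediately, the logical row count an index reports lags reality and stays inflated.

```python
# Toy model of the hypothesis: deletes are buffered as messages, so the
# logical count seen by the optimizer ignores them until they reach the
# leaf nodes (e.g. after OPTIMIZE TABLE flushes the message buffers).
class ToyFractalIndex:
    def __init__(self):
        self.committed_rows = 0    # rows visible in leaf nodes
        self.pending_deletes = 0   # delete messages not yet applied

    def insert(self, n=1):
        self.committed_rows += n

    def delete(self, n=1):
        # Only buffered; the logical count is not reduced yet.
        self.pending_deletes += n

    def estimated_rows(self):
        # What a lagging logical count would report to the optimizer.
        return self.committed_rows

    def actual_rows(self):
        return self.committed_rows - self.pending_deletes

idx = ToyFractalIndex()
idx.insert(100000)
idx.delete(40000)
print(idx.estimated_rows(), idx.actual_rows())  # 100000 60000
```

With the estimate stuck at 100,000 while only 60,000 rows remain, every cost computed from it is proportionally wrong.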

Rick Pizzi (pizzi) wrote :

After OPTIMIZE TABLE, cost for using index is correct and index is used:

                "table": "`BIG_STORAGE_TOKU`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "HASH_ID_IX",
                      "rows": 9,
                      "cost": 10.8,
                      "chosen": true

George Ormond Lorch III (gl-az) wrote :

So we had an extended offline discussion and some debugging, and it comes down to a few things.

1) Logical row counting in PerconaFT is a moving target: some operations can be counted at the time of the operation, while others cannot be known until the message is persistently delivered to the leaf nodes. The count 'floats' towards accuracy but is never going to be 100% accurate, except for the moment after either an OPTIMIZE TABLE or an ANALYZE TABLE with RECOUNT_ROWS.
2) Different indices (PK and secondaries) on the same table will report different logical row count estimates simply due to the dynamics of the PerconaFT messaging scheme: different record sizes, data concentration within nodes, background threads randomly picking nodes to perform work on, etc. All of this impacts the 'in-flight' messaging and means that no two indices will ever report identical counts.
3) When an OPTIMIZE is run, it flushes all messages to the leaf nodes and removes any garbage from all of the index trees, and as a result normalizes the row counts across the indices.
4) There is a bug (I say bug because it is a different behavior from InnoDB) in TokuDB's ha_tokudb::records_in_range method when the optimizer asks for the number of records in the range -infinity to +infinity: TokuDB returns the logical row count for that index, not for the table/PK, which tends to be more accurate. This causes the optimizer to calculate different costs for different indices as a result of getting sometimes drastically different logical row counts. InnoDB returns the same 'table count' for all indices. The fix should be to have TokuDB return the logical row count from the PK for all indices, so that the optimizer can compare apples to apples w.r.t. cardinality statistics rather than apples to Volkswagens.
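
A minimal sketch of what point 4 proposes, in Python pseudo-handler form (the real method is the C++ ha_tokudb::records_in_range; the function shape, table layout, and numbers below are illustrative assumptions, not Percona code):

```python
# Sentinels for the unbounded-range probe the optimizer issues.
NEG_INF, POS_INF = object(), object()

def records_in_range(table, index_name, min_key, max_key):
    """For -infinity..+infinity, report the PK's logical count for every
    index (the proposed fix, matching InnoDB's 'table count' behavior);
    otherwise fall back to the index's own range estimate."""
    if min_key is NEG_INF and max_key is POS_INF:
        return table["pk_logical_count"]
    return table["index_estimates"][index_name]

table = {
    "pk_logical_count": 100000,
    # Per-index logical counts drift independently while messages are
    # in flight (numbers invented for illustration):
    "index_estimates": {"HASH_ID_IX": 154749, "EXPIRE_DATE_IX": 83210},
}

# After the fix, the unbounded probe gives the same answer for every
# index, so the optimizer compares apples to apples:
print(records_in_range(table, "HASH_ID_IX", NEG_INF, POS_INF))      # 100000
print(records_in_range(table, "EXPIRE_DATE_IX", NEG_INF, POS_INF))  # 100000
```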

Rick Pizzi (pizzi) wrote :

I just tried the above fix (compiled from source after applying the above pull request for 5.6 by hand) and it does NOT fix the problem for us in the above benchmark.

dbcache05>explain DELETE FROM BIG_STORAGE_TOKU WHERE HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | ALL | HASH_ID_IX | NULL | NULL | NULL | 100035 | Using where |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

The optimizer still prefers a full table scan due to the cost of the index dive:

{
  "steps": [
    {
      "condition_processing": {
        "condition": "WHERE",
        "original_condition": "(`BIG_STORAGE_TOKU`.`HASH_ID` = sha2('70164',256))",
        "steps": [
          {
            "transformation": "equality_propagation",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          },
          {
            "transformation": "constant_propagation",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          },
          {
            "transformation": "trivial_condition_removal",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          }
        ]
      }
    },
    {
      "table": "`BIG_STORAGE_TOKU`",
      "range_analysis": {
        "table_scan": {
          "rows": 100036,
          "cost": 53355
        },
        "potential_range_indices": [
          {
            "index": "HASH_ID_IX",
            "usable": true,
            "key_parts": [
              "HASH_ID"
            ]
          },
          {
            "index": "EXPIRE_DATE_IX2",
            "usable": true,
            "key_parts": [
              "EXPIRE_DATE"
            ]
          }
        ],
        "setup_range_conditions": [
        ],
        "group_index_range": {
          "chosen": false,
          "cause": "no_join"
        },
        "analyzing_range_alternatives": {
          "range_scan_alternatives": [
            {
              "index": "HASH_ID_IX",
              "ranges": [
                "ffff925a44dfc908915841bdafe81d3ee3b5238610e4901d0d806e4740881084 <= HASH_ID <= ffff925a44dfc908915841bdafe81d3ee3b5238610e4901d0d806e4740881084"
              ],
              "index_dives_for_eq_ranges": true,
              "rowid_ordered": true,
              "using_mrr": false,
              "index_only": false,
              "rows": 154749,
              "cost": 185700,
              "chosen": false,
              "cause": "cost"
            }
          ]
        }
      }
    }
  ]
}
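
Stripped down to the numbers in the trace above, the optimizer's rejection of the range scan is a plain cost comparison (a toy restatement of the trace, not MySQL's actual cost model):

```python
# Estimates taken verbatim from the optimizer trace above.
table_scan = {"rows": 100036, "cost": 53355}
range_scan = {"rows": 154749, "cost": 185700}  # rejected, "cause": "cost"

chosen = "table_scan" if table_scan["cost"] <= range_scan["cost"] else "range_scan"
print(chosen)  # table_scan

# The tell-tale symptom: the range estimate claims MORE matching rows
# for a single-key point range than exist in the entire table.
print(range_scan["rows"] > table_scan["rows"])  # True
```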

[root@dbcache05 data]# tokuftdump --header _sbtest_sql_3af4_38_P_p376_main_10d2efd87_2_1d.tokudb
ft:
 layout_version=29
 layout_version_original=29
 layout_version_read_from_disk=29
 build_id=0
 build_id_original=0...


Rick Pizzi (pizzi) wrote :

There are some records in the dataset which trigger the table-scan effect, and others which do not:

dbcache05>explain DELETE FROM BIG_STORAGE_TOKU WHERE HASH_ID = SHA2('30398', 256);
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | ALL | HASH_ID_IX | NULL | NULL | NULL | 100014 | Using where |
+----+-------------+------------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

dbcache05>explain DELETE FROM BIG_STORAGE_TOKU WHERE HASH_ID = SHA2('70164', 256);
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
| 1 | SIMPLE | BIG_STORAGE_TOKU | range | HASH_ID_IX | HASH_ID_IX | 64 | const | 14 | Using where |
+----+-------------+------------------+-------+---------------+------------+---------+-------+------+-------------+
1 row in set (0.00 sec)

dbcache05>SET optimizer_trace='enabled=on';
Query OK, 0 rows affected (0.00 sec)

dbcache05>DELETE FROM BIG_STORAGE_TOKU WHERE HASH_ID = SHA2('70164', 256);
Query OK, 1 row affected (0.00 sec)

dbcache05>SELECT trace FROM information_schema.optimizer_trace \G
*************************** 1. row ***************************
trace: {
  "steps": [
    {
      "condition_processing": {
        "condition": "WHERE",
        "original_condition": "(`BIG_STORAGE_TOKU`.`HASH_ID` = sha2('70164',256))",
        "steps": [
          {
            "transformation": "equality_propagation",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          },
          {
            "transformation": "constant_propagation",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          },
          {
            "transformation": "trivial_condition_removal",
            "resulting_condition": "multiple equal(sha2('70164',256), `BIG_STORAGE_TOKU`.`HASH_ID`)"
          }
        ]
      }
    },
    {
      "table": "`BIG_STORAGE_TOKU`",
      "range_analysis": {
        "table_scan": {
          "rows": 100022,
          "cost": 53347
        },
        "potential_range_indices": [
          {
            "index": "HASH_ID_IX",
            "usable": true,
            "key_parts": [
              "HASH_ID"
            ]
          },
          {
            "index": "EXPIRE_DATE_IX2",
            "usable": true,
            "key_parts": [
              "EXPIRE_DATE"
            ]
          }
        ],
        "setup_range_conditions": [
        ],
        "group_index_range": {
          "chosen": false,
          "cause": "no_join"
        },
   ...


George Ormond Lorch III (gl-az) wrote :

This is strange Rick. With the current 'fix' I used exactly your table structure to create one TokuDB table and one InnoDB table. I then took your lua script and 'bash'ed it and got about 150K rows spread across 9 or 10 partitions into the TokuDB table. I then did an INSERT INTO test.BIG_STORAGE_INNO SELECT * FROM BIG_STORAGE_TOKU followed by ANALYZE TABLE on both tables.

So we now have two tables that are identical in content and partition distribution.

Attached are the SHOW INDEXES and EXPLAIN results from the two tables. They seem to be very close in responses and choose the same plans.

So I am unsure of where the differences lie. Maybe I'll see if I can get more rows in the tables for a larger sampling set.

Rick Pizzi (pizzi) wrote :

The Lua script creates rows with HASH_ID based on a SHA-256 hash of numeric values between 00000 and 99999, so it can create at most 100,000 different values
(and it will, very quickly).
If you want to see the actual problem in action, it should be fairly easy.
Run sysbench with 16 threads and my Lua posted here; let it run, and when the count is around 100,000, start looking at the output of SHOW PROCESSLIST while sysbench runs.
You will see that from time to time one of the DELETEs is doing a full table scan (and Toku will clearly mark it as such, because
you will see something like "scanned 20,000 rows of about 100,000").
When you see this happening, just EXPLAIN the DELETE for that key and you will see it is not using any index.
At that point you can stop sysbench and run your EXPLAIN again, which will give the same result (no index used).
Different keys _do_ use the index, which adds to the weirdness. See my example above.

From my tests with your patch, things are better (there are fewer full table scans) but still not okay.
The estimate is still not correct.
Again, I wonder why you can’t just return the logical row count as the estimate? It always looks correct.

Thanks
Rick

Riccardo Pizzi
<email address hidden>
http://www.pinballowners.com/webmaster

On 10/mar/2017, at 19:26, George Ormond Lorch III <email address hidden> wrote:

> This is strange Rick. With the current 'fix' I used exactly your table
> structure to create one TokuDB table and one InnoDB table. I then took
> your lua script and 'bash'ed it and got about 150K rows spread across 9
> or 10 partitions into the TokuDB table. I then did an INSERT INTO
> test.BIG_STORAGE_INNO SELECT * FROM BIG_STORAGE_TOKU followed by ANALYZE
> TABLE on both tables.
>
> So we now have two tables that are identical in content and partition
> distribution.
>
> Attached are the SHOW INDEXES and EXPLAIN results from the two tables.
> They seem to be very close in responses and choose the same plans.
>
> So I am unsure of where the differences lie. Maybe I'll see if I can get
> more rows in the tables for a larger sampling set.
>
> ** Attachment added: "inno-toku.txt"
> https://bugs.launchpad.net/percona-server/+bug/1671152/+attachment/4835459/+files/inno-toku.txt

George Ormond Lorch III (gl-az) wrote :

Ahh, OK, so you are running to 100K rows, then _continuing_ to run, so 100K is roughly the max and you are just then continuing to work within that range of keys. That means more deletes or actual hits on existing rows. My bash does the same, originally was limited to 32K rows but I changed that and went to 3 million rows.

The logical row estimate with messages in flight will never be accurate; it cannot be, within a deferred-work system, without performing an actual recount (a table scan) each time it is asked for. So what is being returned now from ::records_in_range when queried for -infinity..+infinity, and from ::info, is the logical row estimate for the PK. Prior to this fix, ::info was returning the logical row estimate for the PK while ::records_in_range for -infinity..+infinity was returning the logical row estimate for that particular key, which could be very different from the PK.

There may be some other issue, which I am trying to get into, when ::records_in_range is provided actual key range values. This goes deep into the fractal tree and requires debugging, as this is a code area that I have never been into before. It is entirely possible that this code is just dropping down to leaf nodes and making some wild assumptions about the distance between two keys, without taking into account what messages are above it in the tree, and is returning some insane number. If that is the case, it is possible that there is no easy or practical fix that avoids impacting the performance of ::records_in_range while returning reasonably accurate estimates for all workloads and states of the tree. Meaning the 'fix' is to keep the trees optimized, which has been a historic problem with TokuDB and PerconaFT and is precisely what drove the switch from physical counts to logical counts, as well as automatic table analysis.

Either way, I can't debug until I actually reproduce the issue and have a known key/query that I can target once I reach that state.

Rick Pizzi (pizzi) wrote :

Please see answers inline.

On 10/mar/2017, at 21:48, George Ormond Lorch III <email address hidden> wrote:

> Ahh, OK, so you are running to 100K rows, then _continuing_ to run, so
> 100K is roughly the max and you are just then continuing to work within
> that range of keys. That means more deletes or actual hits on existing
> rows. My bash does the same, originally was limited to 32K rows but I
> changed that and went to 3 million rows.

Correct, this is to simulate our workload when we are bitten by the bug.
We have more than 100k possible keys in that workload of course, I have restricted
the distribution amplitude to reproduce the problem quickly.

I would recommend to try using sysbench (we use 0.5.1) with the LUA code I posted.

>
> The logical row estimate with messages in flight will never be accurate,
> it can not be within a deferred work system without performing an actual
> recount each time it is asked for (a table scan).

Well, from the dumps I have done with the tokuft_dump utility, the logical count was always
accurate for some reason. Just see the samples I posted above in this bug report…. the estimate
was way off, but the logical row count was always approximately 100,000. Maybe a coincidence?

> So what is being
> returned now from ::records_in_range when queried for -+infinity and
> ::info is the logical row estimate for the PK. Prior to this fix, ::info
> was returning the logical row estimate for the PK and ::records_in_range
> -_infinity was returning the logical row estimate for that particular
> key, which could be very different from the PK.

I understood that and I thought that would have solved the issue, but unfortunately it didn’t,
although things are a bit better than before in our tests.

>
> There may be some other issue that I am trying to get in to if
> ::records_in_range is being provided actual key range values, this goes
> deep into the fractal tree and requires debugging as this is a code area
> that I have never been into before. It is entirely possible that this
> code is just dropping down to leaf nodes and making some wild
> assumptions about what the distance is between two keys without taking
> into account what messages are above it in the tree and is returning
> some insane number. If that is the case, it is possible that there is
> not an easy or practical fix that doesn't impact the performance of
> ::records_in_range and returns reasonably accurate estimates for all
> workloads and cases of the tree state. Meaning the 'fix' is to keep the
> trees optimized, which has been a historic problem with TokuDB and
> PerconaFT and is precisely what drove the switch from physical counts to
> logical counts as well as automatic table analysis.

I understand this may not be an easy thing to fix, nonetheless this is an important bug
and can prevent use of Toku in specific workloads where there are many deletions.
I have been trying to run OPTIMIZE TABLE (throttled) during the benchmark but it
seems to be taking too much time and the bug hits nonetheless.
I will have to try with more aggressive values (less throttling) hoping that it will not
create FT locking issues. But even if I succeed, t...


I just want to be sure that you do not have tokudb_auto_analyze=(anything > 0) and tokudb_analyze_in_background=off; if so, what you are seeing is an operation that triggered a table analysis and is performing that analysis in the same context as the thread that triggered the operation. This is not the default set of options, and it looks like this in SHOW PROCESSLIST:

analyze table standard test.toku#P#p376.EXPIRE_DATE_IX 1 of 2 2% rows 0% time, scanning forward

OK, got one...debugging

So I see something interesting, the bad query only seems to happen on the largest hash_id in the whole table. Not the smallest and not the largest and smallest in a partition, only the largest in the entire table.

Confirmed, it happens on the largest compared hash_id in the hash_id_idx. It seems that the key must have existed at one point, as just doing a DELETE with any arbitrary higher value does not trigger it.

mysql> select SHA2('70164', 256);
+------------------------------------------------------------------+
| SHA2('70164', 256) |
+------------------------------------------------------------------+
| ffff925a44dfc908915841bdafe81d3ee3b5238610e4901d0d806e4740881084 |
+------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select MAX(hash_id) from test.toku;
+------------------------------------------------------------------+
| MAX(hash_id) |
+------------------------------------------------------------------+
| ffff925a44dfc908915841bdafe81d3ee3b5238610e4901d0d806e4740881084 |
+------------------------------------------------------------------+
1 row in set (0.01 sec)
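For reference, MySQL's SHA2(str, 256) is a plain SHA-256 hex digest, so the observation can be reproduced outside the server (a quick Python check using the values pasted in the session above):

```python
import hashlib

# SHA2('70164', 256) in MySQL is the SHA-256 hex digest of the string '70164'.
digest = hashlib.sha256(b"70164").hexdigest()

# Per the session above, this same digest is also MAX(hash_id) for the whole
# table, i.e. the problematic DELETE targets the largest key in the index.
print(digest)
```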


OK, some data. This is kind of tough to explain without getting into detailed FT
internals so bear with me:
1) The partition handler walks the 'active' partitions (partitions noted to be
   involved in the query) of which there are three in this query (don't ask me
   why or where three comes from, it is the same for InnoDB in the matching
   case). It walks them in descending order of 'largest' to 'smallest' based on
   the reported number of rows in the partition.
2) It calls ::records_in_range and is being asked for how many records match
   between two keys, but the keys are identical, this is a particularly
   important component of this specific issue (it is a point select, not a
   range).
3) We work our way down into PerconaFT, all the way down to the leaf nodes to
   try and figure out how many rows exist between the two keys.
   - PerconaFT nodes are 'partitioned'.
     - A node has some meta data, a set of pivot keys, and a set of
       partitions.
     - Only the partitions of a node that are required for the immediate
       operation are loaded into memory.
     - Any write to a node requires that the node is fully in memory as the node
       must be written as a contiguous data blob, reads only read through the
       partitions needed to satisfy the query.
       - Partitions may be individually evicted without evicting the entire
         node, this is known as partial eviction and is controlled via
         tokudb_enable_partial_evictions (default=true)
       - Internal/non-leaf nodes:
         - The tokudb_fanout parameter (default 16) defines the maximum number
           of child nodes that an internal node can have. An internal node can
           and often does have fewer children, but never more.
         - Each internal node has a set of 'pivot keys' that define the key
           boundaries between the edges of the internal node and each individual
           child node.
         - There is one message buffer for each child node, this is a partition.
       - External/leaf nodes:
         - Are partitioned based on tokudb_block_size/tokudb_read_block_size.
           This defines the number of partitions in a leaf node. A leaf node can
           have fewer partitions than the defined maximum, but never more.
         - Leaf node partitions are called basement nodes.
         - A basement node contains a series of 'leaf entries'. Leaf entries
           represent a 'row', with a key and the MVCC stack for that key.
   - PerconaFT contains an optimization when estimating the number of keys
     between a range within a leaf node. If the target basement node is not in
     memory, it 'guesses' the number of keys that might be within that range. If
     the basement node is in memory, an accurate count is fetched: messages are
     transiently pulled down from above and applied to the leaf entries.
     - This 'guess' is a bit of a multi-layered calculation:
       - It takes the total number of _physical_ rows in the entire leaf node,
         and divides that by the number of basement nodes in the leaf node.
         ** Note: It uses the physical count of leaf entries, not the logical
         tr...
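The "guess" arithmetic described above can be sketched as follows (the function and variable names are made up for illustration, not real PerconaFT symbols):

```python
def estimate_keys_in_basement(leaf_physical_rows: int, num_basements: int) -> int:
    """Per-basement 'guess' used when a basement node is not in memory.

    Note that it divides the *physical* leaf-entry count (which includes
    garbage from deletions not yet cleaned up), not the logical row count,
    so a delete-heavy table inflates this number.
    """
    return leaf_physical_rows // max(num_basements, 1)

# Example: a leaf with 1,000,000 physical entries (mostly garbage) split
# over 8 basement nodes -> each unread basement is guessed to hold 125,000
# matching keys, potentially overstating the live row count by far.
print(estimate_keys_in_basement(1_000_000, 8))  # 125000
```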


I found some other things down in the PerconaFT library that make no sense in the estimations: internal nodes are adding values to the estimation totals that have nothing to do with the actual contents of the leaf nodes. This artificially inflates the numbers coming out of ::records_in_range by orders of magnitude.

Rick Pizzi (pizzi) wrote :

> Now that the internals lesson is over, there are a couple of things here that I
> think can be done.
> - First, I think that since the FT logic has narrowed us down to a basement node
> where we know that there is some cross section of the start key, it might make
> sense when the basement is not in memory to compare the end key to the start
> key for equality, and if so, just return 1 match, else go ahead and return the
> phony estimate. This will not fix the same issue for narrow range scans, but
> it will fix point operations such as the point deletion in this example.
> - Second, it can be argued that this optimization is incorrect and that PerconaFT
> should bring at least one of, if not all of, the needed basements into memory
> in order to obtain a more accurate count. Then this is no longer an estimate, it is a
> real count. If the optimizer then chooses not to use this index, we just did
> possibly a whole lot of read I/O for nothing. This will add to the TokuDB
> over-read woes. Think of the case of maybe a table with many indices and
> someone does a "SELECT * from table WHERE id between reallysmall AND
> reallylarge". It is possible that the optimizer would call
> ::records_in_range for all matching indices with this huge range, scanning
> several indices, then just resorting to a table scan anyway. So you now have
> an index scan for each matching index, just to test the index, then whatever
> scan the optimizer chooses and the final fetch. So a lot of potential for
> over-read.
>
> I think implementing the first idea would be nearly a no-op in terms of chances
> of breaking something. Going into the second is a possible rat hole of breaking
> established performance characteristics.
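The first idea above can be sketched in a few lines (hypothetical names; the real logic would live in PerconaFT's keyrange estimation path):

```python
def records_in_range_estimate(start_key: bytes, end_key: bytes,
                              basement_in_memory: bool,
                              guessed_rows: int) -> int:
    """Sketch of the proposed point-lookup shortcut.

    When the target basement node is not cached and the range degenerates
    to a single key (a point operation), return 1 instead of the phony
    per-basement estimate. Genuine narrow ranges still get the guess.
    """
    if not basement_in_memory and start_key == end_key:
        return 1            # point select/delete: at most one match per key
    return guessed_rows     # fall back to the existing estimate

# A point delete on an uncached basement no longer reports a huge count:
print(records_in_range_estimate(b"ffff...84", b"ffff...84", False, 250_000))  # 1
```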

Thank you, the "internals lesson" was a very interesting read.
I believe fixing the point deletions is a good idea and much needed, it will
solve the problem in the current test case.

Not sure what would happen when we add a condition on EXPIRE_DATE (which we
actually do for partition pruning purposes), but considering that MySQL will use at most
one index, perhaps it will just work fine: HASH_ID_IX will be selected by the optimizer,
and the other condition will only be used for pruning.

I see you have found more issues about the rows estimate too.
I am eager to test whatever fix you will submit.

Thanks

Rick


Riccardo Pizzi
<email address hidden>
http://www.pinballowners.com/webmaster

On 11/mar/2017, at 01:32, George Ormond Lorch III <email address hidden> wrote:

> OK some data, this is kinds of tough to explain w/o getting into detailed FT
> internals so bear with me:
> 1) The partition handler walks the 'active' partitions (partitions noted to be
> involved in the query) of which there are three in this query (don't ask me
> why or where three comes from, it is the same for InnoDB in the matching
> case). It walks them in descending order of 'largest' to 'smallest' based on
> the reported number of rows in the partition.
> 2) It calls ::records_in_range and is being asked for how many records match
> between two keys, but the keys are identical, this is a particularly
> important component of this specific issue (it is ...

OK, I think I found the root problem. The discussion above dances around it, and the previous patch is part of the fix, but none of it actually hit the real root cause.

The PerconaFT keyrange estimator starts off at the top of the tree with the estimate set to the total number of _physical_ rows in the tree, not the logical count. It then walks down the tree and gradually knocks off of that count the logical rows that fall outside of the range. So right away there is a discrepancy between what it believes the estimate is and what the real estimate should be, and that difference is the 'garbage' within the tree.

In this very specific case it hits an interesting edge: the very last pivot key on the farthest rightmost leaf node is exactly the point key that the range is searching for, the farthest rightmost basement node is empty, and the basement node just to its left is not in memory. The estimator must therefore 'guess' that both basements are full of keys matching the range, and it returns its per-basement key estimate multiplied by these two nodes. This is where the huge row count reported to the optimizer comes from. I believe a few other combinations like this could cause a bad large value to be returned, but they are the exceptions, not the rule.

By changing the keyrange estimator to start with the current logical row count instead of the physical row count, the number of estimated rows per basement node is now much smaller and more accurate logically and returns a far more accurate guess number back out to the optimizer in ::records_in_range. The optimizer, now having a more reasonable estimate, then uses it to arrive at the correct query plan.
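The shape of the fix can be illustrated like this (hypothetical numbers and names; the actual change is inside PerconaFT's keyrange estimator):

```python
def keyrange_estimate(start_total: int, rows_excluded_by_walk: int) -> int:
    """The estimator starts from a total row count and subtracts rows that
    fall outside the range as it walks down the tree."""
    return max(start_total - rows_excluded_by_walk, 0)

physical_rows = 5_000_000   # includes garbage (deleted/MVCC entries)
logical_rows  = 1_000_000   # live rows only
excluded      = 990_000     # rows knocked off while descending the tree

# Old behavior: starting from the physical count, the 4M garbage entries
# survive into the estimate handed to ::records_in_range.
print(keyrange_estimate(physical_rows, excluded))  # 4010000
# Fixed behavior: starting from the logical count keeps the result sane.
print(keyrange_estimate(logical_rows, excluded))   # 10000
```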

Rick Pizzi (pizzi) wrote :

Recompiling with the above changes (thanks George!) I confirm that the problem is now gone.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-1073
