Decrease the indexed text size on pgtextindex

Bug #1461637 reported by Paul Everitt
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
KARL4
Fix Released
Medium
Paul Everitt

Bug Description

Prefix searches are slow, for a number of reasons related to relevance ranking, which requires a table scan of the tsvector for any matching docids.

When the searchable text and thus tsvector size is over the TOAST threshold, this means the tsvector gets gzip'd and written to disk. Since these are documents with many words, they are likely to match on many prefix searches.

Thus, just about any prefix search is going to match all of those big entries and trigger a TOAST read/decompress. Some searches cause 500 MB to get read and decompressed.

In 2012, Shane did this analysis:

"""
select min(textlen), count(1), sum(textlen) from (select length(text_vector::varchar) as textlen from pgtextindex) tmp where textlen > 0 group by floor(log(10, textlen));

 min | count | sum
--------+-------+-----------
     5 | 4659 | 37390
    10 | 41436 | 1765747
   100 | 41345 | 16814598
  1000 | 49416 | 188352636
 10000 | 19041 | 566536929
100009 | 879 | 120819463

This says OSF has:

- 4659 documents with text_vectors containing 1-9 characters
- 41436 documents with text_vectors containing 10-99 characters

Postgres can comfortably handle text vectors up to around 10,000 characters in length, but after that it moves the vectors to TOAST tables. So OSF has around 20,000 documents getting TOASTed.
"""

For this ticket:

- Get some measurements beforehand
- Decrease the maxlen in karl.textindex.KarlPGTextIndex to 10000
- reindex
- Re-do the measurements

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Some values beforehand from production, starting with counts at various text_vector sizes.

osf-karl=# select count(docid) from pgtextindex;
 count
--------
 365338
(1 row)

osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 9000;
 count
-------
 33214

(1 row)
osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 10000;
 count
-------
 30520
(1 row)

osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 200000;
 count
-------
   112
(1 row)

osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 300000;
 count
-------
    20
(1 row)

osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 400000;
 count
-------
     5
(1 row)

Changed in karl4:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.