KARL4

Decrease the indexed text size on pgtextindex

Bug #1461637 reported by Paul Everitt on 2015-06-03

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	KARL4	Fix Released	Medium	Paul Everitt	KARL4 008 "June 2015"

Bug Description

Prefix searches are slow, for a number of reasons related to relevance ranking, which requires a table scan of the tsvector for any matching docids.

When the searchable text and thus tsvector size is over the TOAST threshold, this means the tsvector gets gzip'd and written to disk. Since these are documents with many words, they are likely to match on many prefix searches.

Thus, just about any prefix search is going to match all of those big entries and trigger a TOAST read/decompress. Some searches cause 500 MB to get read and decompressed.

In 2012, Shane did this analysis:

"""
select min(textlen), count(1), sum(textlen) from (select length(text_vector::varchar) as textlen from pgtextindex) tmp where textlen > 0 group by floor(log(10, textlen));

min | count | sum
--------+-------+-----------
     5 | 4659 | 37390
    10 | 41436 | 1765747
   100 | 41345 | 16814598
  1000 | 49416 | 188352636
10000 | 19041 | 566536929
100009 | 879 | 120819463

This says OSF has:

- 4659 documents with text_vectors containing 1-9 characters
- 41436 documents with text_vectors containing 10-99 characters

Postgres can comfortably handle text vectors up to around 10,000 characters in length, but after that it moves the vectors to TOAST tables. So OSF has around 20,000 documents getting TOASTed.
"""

For this ticket:

- Get some measurements beforehand
- Decrease the maxlen in karl.textindex.KarlPGTextIndex to 10000
- reindex
- Re-do the measurements

Revision history for this message

Paul Everitt (paul-agendaless) wrote on 2015-06-03:

Some values beforehand from production, starting with counts at various text_vector sizes.

osf-karl=# select count(docid) from pgtextindex;
count
--------
365338
(1 row)

osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 9000;
count
-------
33214

(1 row)
osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 10000;
count
-------
30520
(1 row)

osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 200000;
count
-------
112
(1 row)

osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 300000;
count
-------
20
(1 row)

osf-karl=# select count(docid) from pgtextindex where length(text_vector::varchar) > 400000;
count
-------
5
(1 row)

Paul Everitt (paul-agendaless) on 2015-06-22

Changed in karl4:
status:	New → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.