Reduce the ts_vector size to 250KB
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
KARL3 | Fix Released | Medium | Chris Rossi |
Bug Description
Shane wrote:
We can reduce the amount of text we are willing to index. We currently limit the indexable text to 1MB due to a hard limit in Postgres. Could we reasonably limit it to, say, 250KB? 250KB is roughly a 250-page book. I think I should try it out on karlstaging. I'm guessing we'll get a 2X improvement.
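A minimal sketch of what such a cap could look like on the SQL side is below; the table and column names (documents, document_text) are placeholders, and only the left() truncation before to_tsvector is the point:

    -- Hypothetical sketch: "documents" and "document_text" are placeholder names.
    -- Truncating the source text to 250,000 characters before building the
    -- tsvector keeps every vector far below the 1MB tsvector limit.
    SELECT id,
           to_tsvector('english', left(coalesce(document_text, ''), 250000)) AS text_vector
    FROM documents;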
Here is a histogram (using the following query).
select min(textlen), count(1), sum(textlen) from (select length(
  min   | count |    sum
--------+-------+-----------
      5 |  4659 |     37390
     10 | 41436 |   1765747
    100 | 41345 |  16814598
   1000 | 49416 | 188352636
  10000 | 19041 | 566536929
 100009 |   879 | 120819463
This says OSF has:
- 4659 documents with text_vectors containing 1-9 characters
- 41436 documents with text_vectors containing 10-99 characters
and so on.
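For reference, one way such an order-of-magnitude histogram could be computed is sketched below; the table and column names (pgtextindex, text_vector) are assumptions, not the actual schema:

    -- Hypothetical sketch: "pgtextindex" and "text_vector" are placeholder names.
    -- Bucket documents by the order of magnitude of their text-vector length.
    SELECT min(textlen), count(1), sum(textlen)
    FROM (SELECT length(text_vector::text) AS textlen FROM pgtextindex) AS lengths
    WHERE textlen > 0
    GROUP BY floor(log(textlen))
    ORDER BY 1;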
Postgres can comfortably handle text vectors up to around 10,000 characters in length, but after that it moves the vectors to TOAST tables. So OSF has around 20,000 documents getting TOASTed.
To make matters worse, large documents are more likely to match any given query, so the larger the document, the more frequently ts_rank has to fetch it. Some of the largest text vectors are probably being fetched on nearly every query.
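A rough way to see how much of this data is being stored out of line is sketched below; again the table and column names are assumptions, and pg_column_size() reports the stored (possibly compressed) size of each value:

    -- Hypothetical sketch: "pgtextindex" and "text_vector" are placeholder names.
    -- Postgres starts compressing and moving values to the TOAST table once a
    -- row grows past roughly 2KB, so large stored column sizes indicate TOASTed rows.
    SELECT count(*) AS big_vectors,
           pg_size_pretty(sum(pg_column_size(text_vector))) AS total_size
    FROM pgtextindex
    WHERE pg_column_size(text_vector) > 2000;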
Changed in karl3:
milestone: m120 → m121

Changed in karl3:
milestone: m121 → m122

Changed in karl3:
status: Fix Committed → Fix Released
Shane, I think I'm willing to ok this. Is it possible to make this a configurable option?