Reduce the ts_vector size to 250KB

Bug #1063834 reported by Paul Everitt
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
KARL3
Fix Released
Medium
Chris Rossi

Bug Description

Shane wrote:

We can reduce the amount of text we are willing to index. We currently limit the indexable text to 1MB due to a hard limit in Postgres. Could we reasonably limit it to, say, 250KB? 250KB is like a 250 page book. I think I should try it out on karlstaging. I'm guessing we'll get a 2X improvement.

Here is a histogram (using the following query).

select min(textlen), count(1), sum(textlen) from (select length(text_vector::varchar) as textlen from pgtextindex) tmp where textlen > 0 group by floor(log(10, textlen));

 min | count | sum
--------+-------+-----------
     5 | 4659 | 37390
    10 | 41436 | 1765747
   100 | 41345 | 16814598
  1000 | 49416 | 188352636
 10000 | 19041 | 566536929
100009 | 879 | 120819463

This says OSF has:

- 4659 documents with text_vectors containing 1-9 characters
- 41436 documents with text_vectors containing 10-99 characters

and so on.

Postgres can comfortably handle text vectors up to around 10,000 characters in length, but after that it moves the vectors to TOAST tables. So OSF has around 20,000 documents getting TOASTed.

To make matters worse, large documents are more likely to match any given query, so the larger the document, the more frequently ts_rank has to fetch it. Some of the largest text vectors are probably being fetched on nearly every query.

Tags: pgtextindex
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Shane, I think I'm willing to ok this. Is it possible to make this a configurable option?

Revision history for this message
Shane Hathaway (shane-hathawaymix) wrote :

Yes, this can be configurable.

Revision history for this message
Shane Hathaway (shane-hathawaymix) wrote :

I have added the maxlen option to pgtextindex and karlserve. A new release of karlserve is required.

JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m120 → m121
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m121 → m122
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Chris, assigning to you. I'd like you to confirm that the production update actually did this.

Changed in karl3:
assignee: Shane Hathaway (shane-hathawaymix) → Chris Rossi (chris-archimedeanco)
status: New → In Progress
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Oops, forgot I asked Chris to confirm this.

Changed in karl3:
status: In Progress → Fix Released
status: Fix Released → Fix Committed
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Hmm, not sure:

osf-karl=# select max(length(text_vector::varchar)) from pgtextindex;
  max
--------
 295802
(1 row)

Shane is this probably just a bytes versus characters issue?

Revision history for this message
Shane Hathaway (shane-hathawaymix) wrote :

That size looks fine. The text_vector is not the actual document; it's derived from the document and often has a similar length. 295k is a reasonable text_vector size for a 250k document.

Changed in karl3:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.