KARL3

Reduce the ts_vector size to 250KB

Bug #1063834 reported by Paul Everitt on 2012-10-08

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	KARL3	Fix Released	Medium	Chris Rossi	KARL3 m122

Bug Description

Shane wrote:

We can reduce the amount of text we are willing to index. We currently limit the indexable text to 1MB due to a hard limit in Postgres. Could we reasonably limit it to, say, 250KB? 250KB is like a 250 page book. I think I should try it out on karlstaging. I'm guessing we'll get a 2X improvement.

Here is a histogram (using the following query).

select min(textlen), count(1), sum(textlen) from (select length(text_vector::varchar) as textlen from pgtextindex) tmp where textlen > 0 group by floor(log(10, textlen));

min | count | sum
--------+-------+-----------
     5 | 4659 | 37390
    10 | 41436 | 1765747
   100 | 41345 | 16814598
  1000 | 49416 | 188352636
10000 | 19041 | 566536929
100009 | 879 | 120819463

This says OSF has:

- 4659 documents with text_vectors containing 1-9 characters
- 41436 documents with text_vectors containing 10-99 characters

and so on.

Postgres can comfortably handle text vectors up to around 10,000 characters in length, but after that it moves the vectors to TOAST tables. So OSF has around 20,000 documents getting TOASTed.

To make matters worse, large documents are more likely to match any given query, so the larger the document, the more frequently ts_rank has to fetch it. Some of the largest text vectors are probably being fetched on nearly every query.

Tags:

Revision history for this message

Paul Everitt (paul-agendaless) wrote on 2012-10-08:

Shane, I think I'm willing to ok this. Is it possible to make this a configurable option?

Revision history for this message

Shane Hathaway (shane-hathawaymix) wrote on 2012-10-09:

Yes, this can be configurable.

Revision history for this message

Shane Hathaway (shane-hathawaymix) wrote on 2012-10-09:

I have added the maxlen option to pgtextindex and karlserve. A new release of karlserve is required.

JimPGlenn (jpglenn09) on 2012-10-22

Changed in karl3:
milestone:	m120 → m121

JimPGlenn (jpglenn09) on 2012-11-05

Changed in karl3:
milestone:	m121 → m122

Revision history for this message

Paul Everitt (paul-agendaless) wrote on 2012-11-16:

Chris, assigning to you. I'd like you to confirm that the production update actually did this.

Changed in karl3:
assignee:	Shane Hathaway (shane-hathawaymix) → Chris Rossi (chris-archimedeanco)
status:	New → In Progress

Revision history for this message

Paul Everitt (paul-agendaless) wrote on 2012-11-19:

Oops, forgot I asked Chris to confirm this.

Changed in karl3:
status:	In Progress → Fix Released
status:	Fix Released → Fix Committed

Revision history for this message

Chris Rossi (chris-archimedeanco) wrote on 2012-11-19:

Hmm, not sure:

osf-karl=# select max(length(text_vector::varchar)) from pgtextindex;
max
--------
295802
(1 row)

Shane is this probably just a bytes versus characters issue?

Revision history for this message

Shane Hathaway (shane-hathawaymix) wrote on 2012-11-20:

That size looks fine. The text_vector is not the actual document; it's derived from the document and often has a similar length. 295k is a reasonable text_vector size for a 250k document.

Paul Everitt (paul-agendaless) on 2012-11-20

Changed in karl3:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.