broad searches are slow

Bug #788379 reported by James Fournie
This bug affects 4 people
Affects      Status         Importance   Assigned to   Milestone
Evergreen    Fix Released   Undecided    Unassigned
2.0          Fix Released   Undecided    Unassigned
2.1          Fix Released   Undecided    Unassigned

Bug Description

In general, on any Evergreen system with a large dataset, a very broad search term is slower than most people would consider acceptable, to the point that many users would give up before the results return. Broad searches include keyword searches for terms like "Europe" or "Fish". This is reproducible on KCLS, Conifer, Sitka, and GAPINES.

Revision history for this message
Ben Shum (bshum) wrote :

I've noticed that sometimes with broad searches the spinning search logo keeps appearing as though the search were still running, when it has actually completed without rendering my results.

For example, searching for "the help" in our catalog returns over 2,500 results, but they aren't shown to me until I hit the refresh button.

I strongly agree that searching is pretty slow if you don't add more narrowing search terms though.

Revision history for this message
Dan Scott (denials) wrote :

Ben, the continuation of the spinning search logo (I believe this happens when the OpenSRF call times out; the default timeout is 60 seconds) should be treated as a separate issue.

Let's focus on the core issue: single broad search terms being unacceptably slow. Having just upgraded to 2.0, such searches are indeed much slower for Conifer than when we were running 1.6.1. However, while sanity-checking our configuration, I realized that my carefully tuned PostgreSQL config file was sitting in a different directory, and that our 48-core/48 GB RAM database server was running with the insane PostgreSQL defaults (maintenance_work_mem: 16MB / work_mem: 1MB / shared_buffers: 24MB / checkpoint_segments: 3 / effective_cache_size: 128MB), which is obviously handicapping our system at the moment.
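
(For comparison, rule-of-thumb values for a box that size would be along the following lines; these are illustrative numbers only, not our actual tuned settings.)

# postgresql.conf sketch for a ~48 GB RAM database server (illustrative values)
shared_buffers = 12GB            # roughly 25% of RAM
effective_cache_size = 36GB      # roughly 75% of RAM; a planner hint, not an allocation
work_mem = 64MB                  # per sort/hash operation, per backend
maintenance_work_mem = 1GB       # speeds up index builds and vacuum
checkpoint_segments = 32         # spreads out checkpoint I/O (PostgreSQL 9.0-era setting)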

That said, let's take a look at one simple change that can be made to the default configuration: switching to GIN indexes. We're running a PostgreSQL 9.0 server, so my confidence level is a lot higher than it was in the early days of GIN indexes. I created a GIN index on one table, the obvious candidate metabib.keyword_field_entry, and here's a comparison:

EXPLAIN ANALYZE SELECT COUNT(*) FROM metabib.keyword_field_entry WHERE index_vector @@ to_tsquery('public.keyword', 'canada');

was 38 seconds with a GiST index, and 9.5 seconds with a GIN index. "Canada" is probably a worst-case scenario for us, with 420K matching rows.
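
(A minimal sketch of the index I created for that test, assuming the stock index_vector column; the index name is illustrative:)

-- Replacement keyword index using GIN instead of GiST
CREATE INDEX metabib_keyword_field_entry_index_vector_gin
  ON metabib.keyword_field_entry
  USING gin (index_vector);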

9.5 seconds is still too long to simply return the number of matching rows for a text query, but shaving 28.5 seconds off the base query is a good start.

Trying one of your examples:

EXPLAIN ANALYZE SELECT COUNT(*) FROM metabib.keyword_field_entry WHERE index_vector @@ to_tsquery('public.keyword', 'europe');

That one comes back in 1.191 seconds using a GIN index (60K matching rows).

More GIN - "fish" is 95 ms for 11K matching rows, "the" is 7.5 seconds for 1.9 M rows; that's pretty acceptable to me.

Given the current crazy configuration of our database server, I'll report back on the results once PostgreSQL is actually allowed to use its shared_buffers and work_mem to augment performance. Hopefully I'll have a span of downtime sometime in the next 12 hours to apply the new configuration.

Of course, here we're simply dealing with the core full-text search matching; some significant delays come from ranking and limiting within the search routine.

Revision history for this message
Dan Scott (denials) wrote :

Just for fun, rather than retrieving the set of bib records that matches the text part of the query, and then iterating over those records to filter out those that don't match other conditions (owning library / circ library / deleted / copy status), I built those conditions into a single query:

EXPLAIN ANALYZE
  SELECT mkfe.source
    FROM metabib.keyword_field_entry mkfe
      INNER JOIN asset.call_number acn ON acn.record = mkfe.source
      INNER JOIN asset.copy ac ON ac.call_number = acn.id
  WHERE mkfe.index_vector @@ to_tsquery('europe')
    AND acn.owning_lib = 103
    AND ac.circ_lib = 103
    AND ac.deleted IS FALSE
    AND ac.status = 0;

I needed to create an index on asset.call_number (owning_lib) to avoid a sequential scan, but the results came back in 3 seconds. Not bad.
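
(Again, a minimal sketch of that index, with an illustrative name:)

-- Avoids the sequential scan on asset.call_number when filtering by owning_lib
CREATE INDEX asset_call_number_owning_lib_idx
  ON asset.call_number (owning_lib);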

Ordering by title slows things down considerably:

EXPLAIN ANALYZE SELECT mkfe.source
  FROM metabib.keyword_field_entry mkfe
    INNER JOIN asset.call_number acn ON acn.record = mkfe.source
    INNER JOIN asset.copy ac ON ac.call_number = acn.id
    INNER JOIN metabib.full_rec mfr ON mfr.record = acn.record
  WHERE mkfe.index_vector @@ to_tsquery('europe')
    AND acn.owning_lib = 103
    AND ac.circ_lib = 103
    AND ac.deleted IS FALSE
    AND ac.status = 0
    AND mfr.tag = 'tnf'
  ORDER BY ts_rank_cd(mfr.index_vector, to_tsquery('europe'));

... that comes in at 18 seconds. Of course, bear in mind that our system is still horribly hobbled by its use of default settings...

Revision history for this message
Dan Scott (denials) wrote :

Ah, note that adding "AND mfr.subfield IS NULL" enables the query to use an existing index, and the resulting queries return in 0.5 to 1.5 seconds. (At this point my eyes are bleary and I need to sleep, but perhaps this suggests a fruitful direction to explore, at least for the most common query cases.) Note that real queries would have to be more sophisticated ('Available' can mean ac.status = 0, but can also mean an asset.uri with no copies necessary).

Revision history for this message
Dan Scott (denials) wrote :

Also note that no rows in mfr actually have a NULL subfield, which is why that last one was so fast (heh). /me blames that one on late night experimentation...

Revision history for this message
James Fournie (jfournie) wrote :

Hi Dan, I know I've asked you this before, but I have never been really clear on your reservations regarding GIN indexes. I tested switching my indexes and found a drastic improvement, but given your considerable database experience I am somewhat unnerved by your hesitance to use them. Can you clarify what the potential problems are? I saw a few news posts that you were concerned about, but it wasn't clear to me how they would affect Evergreen.

We have tweaked our PG settings a bit; I'm not sure of the rationale for our current settings, but in the past we've used pgtune. I haven't really noticed much (qualitative) change in response times, though.

Revision history for this message
Mike Rylander (mrylander) wrote :

James, GIN indexes are (probably) fine for Evergreen. They build slower, but run much faster. The primary runtime drawback is that if you feed them an empty (NULL) query (which we're pretty good at guarding against in 2.0+, thanks to QueryParser and some in-SQL sanity checks) the query will explode in an ugly way (but, on the up side, quickly). A secondary consideration is that they become bloated much faster than GiST indexes. But bibliographic data probably doesn't change fast enough for that to be a problem in practice, and autovacuum should make that problem mostly transparent anyway.

All in all, I'd support (though can't promise to code for) officially switching to GIN for all FTS indexing in 2.2, by looking for all GiST indexes, creating GIN duplicates of them, and then dropping the GiST versions. This can be done live using CREATE INDEX CONCURRENTLY. GIN indexes have been WAL-logged for a while now, so I'm quite comfortable with them, personally.
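
(A rough sketch of that live swap for one keyword index; both index names here are illustrative rather than the stock schema names, and CREATE INDEX CONCURRENTLY must run outside a transaction block:)

-- Build the GIN replacement without locking out writers, then drop the GiST index
CREATE INDEX CONCURRENTLY metabib_keyword_field_entry_index_vector_gin
  ON metabib.keyword_field_entry
  USING gin (index_vector);
DROP INDEX metabib.metabib_keyword_field_entry_index_vector_gist;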

Regarding the 'tnf' title sorting, this goes away in 2.1, as SVF supplies a stock 'titlesort' attribute (as well as 'authorsort'), both of which are directly indexed by default.

Revision history for this message
Dan Scott (denials) wrote :

James - I think we were originally talking about GIN indexes in the PostgreSQL 8.2-ish days? There have been a lot of bug fixes through the 8.4 series, which is why my confidence level is higher now.

But in any case, GIN indexes alone don't solve the problem that is at the heart of this bug report. And the 'tnf' title sorting issue is reported in a separate bug for 2.0.

So getting back to the "broad searches are slow" issue...

Revision history for this message
George Duimovich (george-duimovich) wrote :

FWIW, we're also seeing this, and on first investigation it seems to be more of an issue with Staff Client searches than with the equivalent search done in the public OPAC.

Revision history for this message
Galen Charlton (gmc) wrote :

A small optimization relevant to the 2.0.6 release that George is running: the staff-side copy visibility check includes the following query:

            PERFORM 1
              FROM asset.call_number cn
                    JOIN asset.copy cp ON (cp.call_number = cn.id)
                    JOIN actor.org_unit a ON (cp.circ_lib = a.id)
              WHERE NOT cn.deleted
                    AND NOT cp.deleted
                    AND cp.circ_lib IN ( SELECT * FROM search.explode_array( search_org_list ) )
                    AND cn.record IN ( SELECT * FROM search.explode_array( core_result.records ) )
              LIMIT 1;

Note that the join on actor.org_unit is unnecessary, and it also has a bad effect on the query plan. Compare:

evergreen=# explain SELECT 1
evergreen-# FROM asset.call_number cn
evergreen-# JOIN asset.copy cp ON (cp.call_number = cn.id)
evergreen-# JOIN actor.org_unit a ON (cp.circ_lib = a.id)
evergreen-# WHERE NOT cn.deleted
evergreen-# AND NOT cp.deleted
evergreen-# AND cp.circ_lib IN ( SELECT * FROM search.explode_array( ARRAY[1, 2] ) )
evergreen-# AND cn.record IN ( SELECT * FROM search.explode_array( ARRAY[2, 3] ) )
evergreen-# LIMIT 1;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit (cost=20.88..82.78 rows=1 width=0)
   -> Nested Loop Semi Join (cost=20.88..1569722.78 rows=25360 width=0)
         Join Filter: (cn.record = ('{2,3}'::integer[])[s.s])
         -> Nested Loop (cost=20.88..301722.78 rows=50720 width=8)
               -> Hash Join (cost=20.88..25198.05 rows=50720 width=8)
                     Hash Cond: (cp.circ_lib = a.id)
                     -> Append (cost=0.00..21833.43 rows=751401 width=12)
                           -> Seq Scan on copy cp (cost=0.00..21821.73 rows=751316 width=12)
                                 Filter: (NOT deleted)
                           -> Seq Scan on unit cp (cost=0.00..11.70 rows=85 width=12)
                                 Filter: (NOT deleted)
                     -> Hash (cost=20.71..20.71 rows=14 width=8)
                           -> Hash Join (cost=16.61..20.71 rows=14 width=8)
                                 Hash Cond: (('{1,2}'::integer[])[s.s] = a.id)
                                 -> HashAggregate (cost=15.00..17.00 rows=200 width=4)
                                       -> Function Scan on generate_series s (cost=0.00..12.50 rows=1000 width=4)
                                 -> Hash (cost=1.27..1.27 rows=27 width=4)
                                       -> Seq Scan on org_unit a (cost=0.00..1.27 rows=27 width=4)
               -> Index Scan using call_number_pkey on call_number cn (cost=0.00..5.44 rows=1 width=16)
                     Index Cond: (cn.id = cp.call_number)
                     Filter: (NOT cn.deleted)
         -> Function Scan on generate_series s (cost=0.00..12.50 rows=1000 width=4)
(22 rows...
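
(For illustration, the same visibility check with the actor.org_unit join dropped; a sketch of the shape of the change, not the committed patch:)

            PERFORM 1
              FROM asset.call_number cn
                    JOIN asset.copy cp ON (cp.call_number = cn.id)
              WHERE NOT cn.deleted
                    AND NOT cp.deleted
                    AND cp.circ_lib IN ( SELECT * FROM search.explode_array( search_org_list ) )
                    AND cn.record IN ( SELECT * FROM search.explode_array( core_result.records ) )
              LIMIT 1;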


Revision history for this message
Jason Stephenson (jstephenson) wrote :

Has the proposed "fix" been added to 2.0 yet?

Revision history for this message
Dan Scott (denials) wrote :

Commit 86cf8555661a1c8b4b9c5a15f013c8c4fbbb8e05 indicates this was committed on June 10, 2011. Marking as "Fix Committed" for 2.0.

Changed in evergreen:
status: New → Fix Released