OPAC searching significantly slowed by adding format filters

Bug #1438136 reported by Chris Sharp
This bug affects 4 people
Affects     Status         Importance   Assigned to   Milestone
Evergreen   Fix Released   High         Unassigned
2.7         Fix Released   High         Unassigned
2.8         Fix Released   High         Unassigned

Bug Description

We have discovered that adding a format filter to OPAC search causes a significant delay. Example from PINES:

Keyword search for the term "lusitania" scoped to the Leesburg Library returns results within a few seconds:

http://gapines.org/eg/opac/results?query=lusitania&qtype=keyword&fi%3Asearch_format=&locg=117

However, if you add a format filter by selecting "All Books" from the "Limit to:" dropdown menu, the search takes far longer (nearly 30 seconds):

http://gapines.org/eg/opac/results?query=lusitania&qtype=keyword&fi%3Asearch_format=book&locg=117&sort=

Evergreen 2.7.2
OpenSRF 2.4
PostgreSQL 9.3
Ubuntu LTS

Revision history for this message
Mike Rylander (mrylander) wrote :

Chris,

An explain-analyze of the core query would be really good to have. I suspect bad stats on the attribute vector list table, but there's no way to tell without more info.
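
Something of this shape, run in psql, is what I'm after (a simplified stand-in using just the vlist filter; substitute the actual core query from your logs):

EXPLAIN ANALYZE
SELECT COUNT(*)
  FROM metabib.record_attr_vector_list
 WHERE vlist @@ '61|311'::query_int;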

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Okay - attaching explain analyze output.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

And the core query for that explain.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Following Mike's advice:

09:07 < eeevil> csharp: yep: Bitmap Index Scan on mrca_vlist_idx (cost=0.00..337.66 rows=1821 width=0) (actual time=515.026..515.026 rows=1587948 loops=1) ... 1.8k est rows vs 1.6M actual rows. first thing I'd
                suggest is remove the duplicate condition in the vlist clause (change '(61|311)&(61|311)' to '61|311')
09:09 < eeevil> csharp: if that changes the plan, then there's a code change that could help (dedup vlist searches before passing to PG). if not, you'll want to increase the stats target on record_attr_vlist
                and vac-analyze

I removed the duplicate condition from the core query, which did not change the plan. I then did (within a transaction):

alter table metabib.record_attr_vector_list alter COLUMN vlist set statistics 10000;

and ran VACUUM ANALYZE, but I did not see those statistics change and the query is still very slow.
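
For the record, here are a couple of probes to confirm whether the new target took (a sketch; note that VACUUM cannot run inside a transaction block, so the ALTER has to be committed before the VACUUM ANALYZE):

-- Did the per-column statistics target stick?
SELECT attstattarget
  FROM pg_attribute
 WHERE attrelid = 'metabib.record_attr_vector_list'::regclass
   AND attname = 'vlist';

-- After ANALYZE, how many common elements did the analyzer gather?
SELECT array_length(most_common_elems::text::int[], 1)
  FROM pg_stats
 WHERE tablename = 'record_attr_vector_list'
   AND attname = 'vlist';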

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Adding an example query using the 'book' filter.

Revision history for this message
Mike Rylander (mrylander) wrote :

Chris,

Can you get an EXPLAIN ANALYZE of the core query from that for the ticket? I'll look at other data sets, also.

TIA

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Sorry, Mike - should've thought to do that before. Attached below.

Revision history for this message
Mike Rylander (mrylander) wrote :

No problem.

I'm working with a comparable data set based on your estimates and actual costs, so I think I have a good handle on it now. What I'm seeing is that the statistics for the array are actually really good, but the planner is not, for some reason, noticing that. For instance:

# explain analyze select count(*) from metabib.record_attr_vector_list where vlist @@ '(610)';
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=3809.32..3809.33 rows=1 width=0) (actual time=902.652..902.652 rows=1 loops=1)
   -> Bitmap Heap Scan on record_attr_vector_list (cost=38.79..3803.90 rows=2167 width=0) (actual time=342.924..724.453 rows=1268532 loops=1)
         Recheck Cond: (vlist @@ '610'::query_int)
         -> Bitmap Index Scan on mrca_vlist_idx (cost=0.00..38.25 rows=2167 width=0) (actual time=333.671..333.671 rows=1486582 loops=1)
               Index Cond: (vlist @@ '610'::query_int)
 Total runtime: 902.712 ms
(6 rows)

Even with no constraining join conditions, it still wants to use a bitmap index scan. However, the stats say that the value 610 is in approximately 73% of the rows in that table:

# select row_number from (select unnest, row_number() over () from (select unnest(most_common_elems::text::int[]) from pg_stats where tablename = 'record_attr_vector_list' and attname = 'vlist')x)y where unnest=610;
 row_number
------------
        326
(1 row)

# select most_common_elem_freqs[326] from pg_stats where tablename = 'record_attr_vector_list' and attname = 'vlist';
 most_common_elem_freqs
------------------------
                 0.7364
(1 row)

So, either the intarray-supplied operators are not using the stats, or Pg has a bug or deficiency. Either way, I'm continuing to look for a way to construct the query such that the stats are used.

More as the story develops...

Revision history for this message
Galen Charlton (gmc) wrote :

Chris: just in case it's relevant to the potential query planner issue, could you let us know which point release of Pg 9.3 you're using?

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Galen,

I'm using 9.3.5. I have an identical-to-production test server that I can upgrade to the most recent 9.3 point release if that makes a difference in ferreting out the problem.

Thanks,

Chris

Revision history for this message
Mike Rylander (mrylander) wrote :

SO! I'm confident that I've pinned down the issue, and that I have a likely solution.

The problem: Array query selectivity is Hard(tm). Very simple operators can use the data in pg_stats to estimate selectivity, but complicated ones (like generalized overlap/contains/contained, and especially query_int) require things like data distribution and correlation information that is 1) hard to gather, 2) large to store, and 3) expensive to use. So, instead, the default selectivity estimate for arrays is used, and that estimate is 0.5%. That is bad for very common values in complicated queries (such as our core search query).

The solution: We need to be able to tell the planner that a particular WHERE clause is expensive. But, as noted above, that's hard! AH! But all WHERE clauses boil down to a boolean expression in the end, and the estimated selectivity of a user-defined function that returns a BOOL in Postgres is 0.333, or 33%, which is well past the "it's super cheap, use whatever index you can find!" threshold. By wrapping the @@ operator use in a function like this:

CREATE FUNCTION evergreen.query_int_wrapper (INT[],TEXT)
 RETURNS BOOL AS $$ BEGIN RETURN $1 @@ $2::query_int; END; $$ LANGUAGE PLPGSQL STABLE;

we can convince the planner to move the condition to the end of the list of tests, or at least convince it to avoid a nested loop over the result of that filter. Common attribute value inference achievement unlocked!
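
As a simplified illustration against the big table (the real payoff is inside the full core query):

-- Raw operator: planner assumes ~0.5% selectivity and reaches for the index
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE vlist @@ '610'::query_int;

-- Wrapped: the function call is estimated at 33% selectivity and is not
-- indexable, so the planner plans around it instead of nested-looping on it
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE evergreen.query_int_wrapper(vlist, '610');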

"But wait," you say, "what about rare attr values! Those are SUPER fast with an index! We want to keep that speedup!"

The optimization: In my previous comment I showed how to find the most common values (as seen by the analyzer in PG) for array elements. So, we'll pull those at QueryParser startup (appropriately fenced by a PG version check), just like we do for query rewriting information, and sorters and whatnot, and then decide whether we use the @@ operator directly, or the function, based on whether any of the values in the query_int we construct is one of those super-common values and is not negated. (The reason for "and not negated" is that negated values are always a recheck issue, and the recheck is a sunk cost regardless, so we ignore negated common values.)
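
As a sketch, the startup probe could look something like this (the 0.05 cutoff is a placeholder, not a settled threshold):

-- Hypothetical stats probe: element values the analyzer considers common
SELECT (most_common_elems::text::int[])[i] AS attr_value,
       most_common_elem_freqs[i] AS freq
  FROM pg_stats,
       LATERAL generate_subscripts(most_common_elems::text::int[], 1) AS i
 WHERE tablename = 'record_attr_vector_list'
   AND attname = 'vlist'
   AND most_common_elem_freqs[i] > 0.05;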

Thoughts?

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Mike,

Given my level of expertise on the QP, I'll have to trust you and others that this solution is a good one. Are there particular people you'd like to have feedback from? I don't mind poking them directly if so.

Thanks,

Chris

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Mike,

Make a branch. I'll be more than happy to test it.

Jason

Revision history for this message
Mike Rylander (mrylander) wrote :

I was hoping for feedback on the analysis and method described before digging into code, to avoid any wasted effort in case someone sees a different/simpler/better solution.

Chris (and Jason, or anyone else), if I put together sample core queries that demonstrate the effect, would you put them through their paces? I'd like to see the effect on different real-world data sets before figuring out the details of the code generation.

TIA

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Mike,

Yes, I am very willing to test sample core queries. Just let me know what you've got and I'll give it a go.

Thanks!

Chris

Revision history for this message
Mike Rylander (mrylander) wrote :

All, here's a file containing the same core query twice, differing only in the function-hiding of the attr filter in the presence of the known-common value of 610. It dumps the output of EXPLAIN ANALYZE into a file called core-query-compare-output via \o, so adjust to taste.
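
For anyone reconstructing it, the file is roughly this shape (illustrative, using the simplified filter rather than the full core query):

\o core-query-compare-output
\qecho without function
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE vlist @@ '610'::query_int;
\qecho with function
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE evergreen.query_int_wrapper(vlist, '610');
\o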

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Mike,

Here is my output. Obviously, the "with function" output was far faster.

Thanks,

Chris

Revision history for this message
Jason Boyer (jboyer) wrote :

Here's the output from 9.1.14 on EI's database: nearly 3 seconds down to around 0.8.

Revision history for this message
Mike Rylander (mrylander) wrote :

Chris,

Your second query plan is still very strange, and I'm not 100% sure why. The second query should be using the tsvector index, like Jason's. Based on a little empirical testing, I think you may need to look into your indexing, as all title searches are taking at least 4 seconds, which is consistent with your output linked above. I don't think your indexing is quite right...

But, it's also obviously better than before.

I don't have time right this moment to test the inverse of this, where a very rare attribute is selected and a (relatively speaking) common search term is used. We don't want to lose the gains made in those cases, and that's where the complicated code comes into play. If anyone has time to apply the same structural change to such a core query, I would be very appreciative.
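
Roughly, something like this (the value 9999 is a made-up stand-in for a genuinely rare attribute):

-- Rare value, raw operator: the index scan is the win we want to keep
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE vlist @@ '9999'::query_int;

-- Rare value, wrapped: likely forfeits the index advantage
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE evergreen.query_int_wrapper(vlist, '9999');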

TIA.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Okay, Mike,

With indexes now corrected, I'm attaching the core query output again.

Chris

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I get these timings on my production 9.1.x database server:

BEGIN
Timing is on.
without function
Time: 171.886 ms
Time: 318.326 ms
with function
Time: 75.914 ms
Timing is off.
ROLLBACK

And I've attached the query plans.

Revision history for this message
Mike Rylander (mrylander) wrote :

Jason (et al), there's a branch implementing what I describe above at http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1438136-query_int_wrapper ... I can confirm that the functions work, and that the perl syntax is good, but I've not tested thoroughly.

TIA for any testing.

Revision history for this message
Mike Rylander (mrylander) wrote :

I force-pushed an update to improve the first commit's message, and to add another commit updating the baseline schema with the appropriate functions.

Revision history for this message
Mike Rylander (mrylander) wrote :

I've (not-force) pushed another commit to protect PG 9.1 and before from the new stats probe. The query will die on older versions, and there's no need to clutter the logs with that mess. We assume that all attributes are common (foregoing the rare-attribute optimization) on older PGs.
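
The guard amounts to a server version test before running the stats probe; conceptually (a sketch, not the committed code):

-- Element statistics (most_common_elems) only exist in pg_stats on PG 9.2+
SELECT current_setting('server_version_num')::INT >= 90200 AS can_probe_stats;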

Revision history for this message
Mike Rylander (mrylander) wrote :

I've pushed one final commit to make the function creation SQL forgiving in the case that the functions have already been created by a hot-fix.
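
That amounts to the usual idempotency trick (a sketch; the branch itself may differ):

-- Tolerate a function already installed by a hot-fix
CREATE OR REPLACE FUNCTION evergreen.query_int_wrapper (INT[],TEXT)
 RETURNS BOOL AS $$ BEGIN RETURN $1 @@ $2::query_int; END; $$ LANGUAGE PLPGSQL STABLE;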

Changed in evergreen:
milestone: none → 2.8.1
milestone: 2.8.1 → none
importance: Undecided → High
Changed in evergreen:
status: New → Triaged
Revision history for this message
Jason Boyer (jboyer) wrote :

A quick note to say that I've tested this on our dev server for a bit and not seen any issues, and we've been running this branch live since last Thursday. Our pgbadger Top Queries page had probably 12+ format searches in the top 20 long-running queries each day (some taking up to 35 minutes); that's now down to 3-4 that take a max of a minute or so. I'd say this is looking really good so far.

Revision history for this message
Ben Shum (bshum) wrote :

So this is marked "High" importance, meaning fix before we make the next releases, but there isn't an actual pullrequest tag on it. That said, it seems mostly ready to go, and it picked cleanly to master and related branches in my testing.

Should we push it on with Jason's +1 from testing?

Are we worried about requiring PG 9.2+ for the pg_statistics functionality? Do we only want to include this as a new enhancement for 2.9+, where we can try enforcing a harder database version restriction on the next major version of Evergreen?

Revision history for this message
Jason Boyer (jboyer) wrote :

The only concern I have about the pg_statistics check is that I can't test it. I would think that Mike has tested it to the extent that it's functional and the syntax is correct, but I haven't been able to test it with real data to make sure nothing goes pear-shaped.

I have added a signoff to the tip of this branch (cleanly rebased to master): http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/jboyer/lp-1438136-query_int_wrapper_signoff but I would feel more comfortable with an additional signoff from someone running 9.2 or newer.

Revision history for this message
Mike Rylander (mrylander) wrote :

Ben,

The lack of a pullrequest tag was my oversight. I've added that, now.

I think we should push this, yes.

There are no worries about PG 9.2+ or, for that matter, versions before 9.2. The code is built specifically to handle different versions of Postgres.

Once PG 9.6 is out we may augment that check more, but until its release we won't know the exact nature of the change.

Thanks!

tags: added: pullrequest
Changed in evergreen:
milestone: none → 2.9-beta
Revision history for this message
Chris Sharp (chrissharp123) wrote :
tags: added: signedoff
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Thanks, everyone!

I pushed this to master. I leave it to berick and bshum to work out whether/how they want to backport this.

Changed in evergreen:
status: Triaged → Fix Committed
Revision history for this message
Ben Shum (bshum) wrote :

Backported to rel_2_8 and rel_2_7.

Changed in evergreen:
status: Fix Committed → Fix Released