OPAC searching significantly slowed by adding format filters

Bug #1438136 reported by Chris Sharp
This bug affects 4 people
Affects     Status         Importance   Assigned to   Milestone
Evergreen   Fix Released   High         Unassigned
2.7         Fix Released   High         Unassigned
2.8         Fix Released   High         Unassigned

Bug Description

We have discovered that adding a format filter to OPAC search causes a significant delay. Example from PINES:

Keyword search for the term "lusitania" scoped to the Leesburg Library returns results within a few seconds:

http://gapines.org/eg/opac/results?query=lusitania&qtype=keyword&fi%3Asearch_format=&locg=117

However, if you add a format filter by selecting "All Books" from the "Limit to:" dropdown menu, the search takes far longer (nearly 30 seconds):

http://gapines.org/eg/opac/results?query=lusitania&qtype=keyword&fi%3Asearch_format=book&locg=117&sort=

Evergreen 2.7.2
OpenSRF 2.4
PostgreSQL 9.3
Ubuntu LTS

Revision history for this message
Mike Rylander (mrylander) wrote :

Chris,

An explain-analyze of the core query would be really good to have. I suspect bad stats on the attribute vector list table, but there's no way to tell without more info.
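
Something of this shape, run in psql, is what I'm after (a simplified stand-in using just the vlist filter; substitute the actual core query from your logs):

EXPLAIN ANALYZE
SELECT COUNT(*)
  FROM metabib.record_attr_vector_list
 WHERE vlist @@ '61|311'::query_int;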

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Okay - attaching explain analyze output.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

And the core query for that explain.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Following Mike's advice:

09:07 < eeevil> csharp: yep: Bitmap Index Scan on mrca_vlist_idx (cost=0.00..337.66 rows=1821 width=0) (actual time=515.026..515.026 rows=1587948 loops=1) ... 1.8k est rows vs 1.6M actual rows. first thing I'd
                suggest is remove the duplicate condition in the vlist clause (change '(61|311)&(61|311)' to '61|311')
09:09 < eeevil> csharp: if that changes the plan, then there's a code change that could help (dedup vlist searches before passing to PG). if not, you'll want to increase the stats target on record_attr_vlist
                and vac-analyze

I removed the duplicate condition from the core query, which did not change the plan. I then did (within a transaction):

alter table metabib.record_attr_vector_list alter COLUMN vlist set statistics 10000;

and ran VACUUM ANALYZE, but I did not see those statistics change and the query is still very slow.
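
For the record, here are a couple of probes to confirm whether the new target took (a sketch; note that VACUUM cannot run inside a transaction block, so the ALTER has to be committed before the VACUUM ANALYZE):

-- Did the per-column statistics target stick?
SELECT attstattarget
  FROM pg_attribute
 WHERE attrelid = 'metabib.record_attr_vector_list'::regclass
   AND attname = 'vlist';

-- After ANALYZE, how many common elements did the analyzer gather?
SELECT array_length(most_common_elems::text::int[], 1)
  FROM pg_stats
 WHERE tablename = 'record_attr_vector_list'
   AND attname = 'vlist';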

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Adding an example query using the 'book' filter.

Revision history for this message
Mike Rylander (mrylander) wrote :

Chris,

Can you get an EXPLAIN ANALYZE of the core query from that for the ticket? I'll look at other data sets, also.

TIA

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Sorry, Mike - should've thought to do that before. Attached below.

Revision history for this message
Mike Rylander (mrylander) wrote :

No problem.

I'm working with a comparable data set based on your estimates and actual costs, so I think I have a good handle on it now. What I'm seeing is that the statistics for the array are actually really good, but the planner is not, for some reason, noticing that. For instance:

# explain analyze select count(*) from metabib.record_attr_vector_list where vlist @@ '(610)';
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=3809.32..3809.33 rows=1 width=0) (actual time=902.652..902.652 rows=1 loops=1)
   -> Bitmap Heap Scan on record_attr_vector_list (cost=38.79..3803.90 rows=2167 width=0) (actual time=342.924..724.453 rows=1268532 loops=1)
         Recheck Cond: (vlist @@ '610'::query_int)
         -> Bitmap Index Scan on mrca_vlist_idx (cost=0.00..38.25 rows=2167 width=0) (actual time=333.671..333.671 rows=1486582 loops=1)
               Index Cond: (vlist @@ '610'::query_int)
 Total runtime: 902.712 ms
(6 rows)

Even with no constraining join conditions, it still wants to use a bitmap index scan. However, the stats say that the value 610 is in approximately 73% of the rows in that table:

# select row_number from (select unnest, row_number() over () from (select unnest(most_common_elems::text::int[]) from pg_stats where tablename = 'record_attr_vector_list' and attname = 'vlist')x)y where unnest=610;
 row_number
------------
        326
(1 row)

# select most_common_elem_freqs[326] from pg_stats where tablename = 'record_attr_vector_list' and attname = 'vlist';
 most_common_elem_freqs
------------------------
                 0.7364
(1 row)

So, either the intarray-supplied operators are not using the stats, or Pg has a bug or deficiency. Either way, I'm continuing to look for a way to construct the query such that the stats are used.

More as the story develops...

Revision history for this message
Galen Charlton (gmc) wrote :

Chris: just in case it's relevant to the potential query planner issue, could you let us know which point release of Pg 9.3 you're using?

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Galen,

I'm using 9.3.5. I have an identical-to-production test server that I can upgrade to the most recent 9.3 point release if that makes a difference in ferreting out the problem.

Thanks,

Chris

Revision history for this message
Mike Rylander (mrylander) wrote :

SO! I'm confident that I've pinned down the issue, and that I have a likely solution.

The problem: Array query selectivity is Hard(tm). Very simple operators can use the data in pg_stats to estimate selectivity, but complicated ones (like generalized overlap/contains/contained, and especially query_int) require things like data distribution and correlation information that is 1) hard to gather, 2) large to store, and 3) expensive to use. So, instead, the default selectivity estimate for arrays is used, and that estimate is 0.5%. That is bad for very common values in complicated queries (such as our core search query).

The solution: We need to be able to tell the planner that a particular WHERE clause is expensive. But, as noted above, that's hard! AH! But all WHERE clauses boil down to a boolean expression in the end, and the estimated selectivity of a user-defined function that returns a BOOL in Postgres is 0.333, or 33%, which is well past the "it's super cheap, use whatever index you can find!" threshold. By wrapping the @@ operator use in a function like this:

CREATE FUNCTION evergreen.query_int_wrapper (INT[],TEXT)
 RETURNS BOOL AS $$ BEGIN RETURN $1 @@ $2::query_int; END; $$ LANGUAGE PLPGSQL STABLE;

we can convince the planner to move the condition to the end of the list of tests, or at least convince it to avoid a nested loop over the result of that filter. Common attribute value inference achievement unlocked!
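
As a simplified illustration against the big table (the real payoff is inside the full core query):

-- Raw operator: planner assumes ~0.5% selectivity and reaches for the index
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE vlist @@ '610'::query_int;

-- Wrapped: the function call is estimated at 33% selectivity and is not
-- indexable, so the planner plans around it instead of nested-looping on it
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE evergreen.query_int_wrapper(vlist, '610');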

"But wait," you say, "what about rare attr values! Those are SUPER fast with an index! We want to keep that speedup!"

The optimization: In my previous comment I showed how to find the most common values (as seen by the analyzer in PG) for array elements. So, we'll pull those at QueryParser startup (appropriately fenced by a PG version check), just like we do for query rewriting information, and sorters and whatnot, and then decide whether we use the @@ operator directly, or the function, based on whether any of the values in the query_int we construct is one of those super-common values and is not negated. (The reason for "and not negated" is that negated values are always a recheck issue, and the recheck is a sunk cost regardless, so we ignore negated common values.)
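
As a sketch, the startup probe could look something like this (the 0.05 cutoff is a placeholder, not a settled threshold):

-- Hypothetical stats probe: element values the analyzer considers common
SELECT (most_common_elems::text::int[])[i] AS attr_value,
       most_common_elem_freqs[i] AS freq
  FROM pg_stats,
       LATERAL generate_subscripts(most_common_elems::text::int[], 1) AS i
 WHERE tablename = 'record_attr_vector_list'
   AND attname = 'vlist'
   AND most_common_elem_freqs[i] > 0.05;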

Thoughts?

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Mike,

Given my level of expertise on the QP, I'll have to trust you and others that this solution is a good one. Are there particular people you'd like to have feedback from? I don't mind poking them directly if so.

Thanks,

Chris

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Mike,

Make a branch. I'll be more than happy to test it.

Jason

Revision history for this message
Mike Rylander (mrylander) wrote :

I was hoping for feedback on the analysis and method described before digging into code, to avoid any wasted effort in case someone sees a different/simpler/better solution.

Chris (and Jason, or anyone else), if I put together sample core queries that demonstrate the effect, would you put them through their paces? I'd like to see the effect on different real-world data sets before figuring out the details of the code generation.

TIA

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Mike,

Yes, I am very willing to test sample core queries. Just let me know what you've got and I'll give it a go.

Thanks!

Chris

Revision history for this message
Mike Rylander (mrylander) wrote :

All, here's a file containing the same core query twice, differing only in the function-hiding of the attr filter in the presence of the known-common value of 610. It dumps the output of EXPLAIN ANALYZE into a file called core-query-compare-output via \o, so adjust to taste.
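
For anyone reconstructing it, the file is roughly this shape (illustrative, using the simplified filter rather than the full core query):

\o core-query-compare-output
\qecho without function
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE vlist @@ '610'::query_int;
\qecho with function
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE evergreen.query_int_wrapper(vlist, '610');
\o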

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Mike,

Here is my output. Obviously, the "with function" output was far faster.

Thanks,

Chris

Revision history for this message
Jason Boyer (jboyer) wrote :

Here's the output from 9.1.14 on EI's database: nearly 3 seconds down to around 0.8.

Revision history for this message
Mike Rylander (mrylander) wrote :

Chris,

Your second query plan is still very strange, and I'm not 100% sure why. The second query should be using the tsvector index, like Jason's. Based on a little empirical testing, I think you may need to look into your indexing, as all title searches are taking at least 4 seconds, which is consistent with your output linked above. I don't think your indexing is quite right...

But, it's also obviously better than before.

I don't have time right this moment to test the inverse of this, where a very rare attribute is selected and a (relatively speaking) common search term is used. We don't want to lose the gains made in those cases, and that's where the complicated code comes into play. If anyone has time to apply the same structural change to such a core query, I would be very appreciative.
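
Roughly, something like this (the value 9999 is a made-up stand-in for a genuinely rare attribute):

-- Rare value, raw operator: the index scan is the win we want to keep
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE vlist @@ '9999'::query_int;

-- Rare value, wrapped: likely forfeits the index advantage
EXPLAIN ANALYZE
SELECT COUNT(*) FROM metabib.record_attr_vector_list
 WHERE evergreen.query_int_wrapper(vlist, '9999');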

TIA.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Okay, Mike,

With indexes now corrected, I'm attaching the core query output again.

Chris

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I get these timings on my production 9.1.x database server:

BEGIN
Timing is on.
without function
Time: 171.886 ms
Time: 318.326 ms
with function
Time: 75.914 ms
Timing is off.
ROLLBACK

And I've attached the query plans.

Revision history for this message
Mike Rylander (mrylander) wrote :

Jason (et al), there's a branch implementing what I describe above at http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1438136-query_int_wrapper ... I can confirm that the functions work, and that the perl syntax is good, but I've not tested thoroughly.

TIA for any testing.

Revision history for this message
Mike Rylander (mrylander) wrote :

I force-pushed an update to improve the first commit's message, and to add another commit updating the baseline schema with the appropriate functions.

Revision history for this message
Mike Rylander (mrylander) wrote :

I've (not-force) pushed another commit to protect PG 9.1 and before from the new stats probe. The query will die on older versions, and there's no need to clutter the logs with that mess. We assume that all attributes are common (foregoing the rare-attribute optimization) on older PGs.
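
The guard amounts to a server version test before running the stats probe; conceptually (a sketch, not the committed code):

-- Element statistics (most_common_elems) only exist in pg_stats on PG 9.2+
SELECT current_setting('server_version_num')::INT >= 90200 AS can_probe_stats;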

Revision history for this message
Mike Rylander (mrylander) wrote :

I've pushed one final commit to make the function creation SQL forgiving in the case that the functions have already been created by a hot-fix.
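
That amounts to the usual idempotency trick (a sketch; the branch itself may differ):

-- Tolerate a function already installed by a hot-fix
CREATE OR REPLACE FUNCTION evergreen.query_int_wrapper (INT[],TEXT)
 RETURNS BOOL AS $$ BEGIN RETURN $1 @@ $2::query_int; END; $$ LANGUAGE PLPGSQL STABLE;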

Changed in evergreen:
milestone: none → 2.8.1
milestone: 2.8.1 → none
importance: Undecided → High
Changed in evergreen:
status: New → Triaged
Revision history for this message
Jason Boyer (jboyer) wrote :

A quick note to say that I've tested this on our dev server for a bit and not seen any issues, and we've been running this branch live since last Thursday. Our pgbadger Top Queries page had probably 12+ format searches in the top 20 long-running queries each day (some taking up to 35 minutes); that's now down to 3-4 that take a max of a minute or so. I'd say this is looking really good so far.

Revision history for this message
Ben Shum (bshum) wrote :

So this is marked "High" importance, meaning fix before we make the next releases, but there isn't an actual pullrequest tag on it. That said, it seems mostly ready to go, and it picked cleanly to master and related branches in my testing.

Should we push it on with Jason's +1 from testing?

Are we worried about requiring PG 9.2+ for the pg_statistics functionality? Do we only want to include this as a new enhancement for 2.9+, where we can try enforcing a harder database version restriction on the next major version of Evergreen?

Revision history for this message
Jason Boyer (jboyer) wrote :

The only concern I have about the pg_statistics check is that I can't test it. I would think that Mike has tested it to the extent that it's functional and the syntax is correct, but I haven't been able to test it with real data to make sure nothing goes pear-shaped.

I have added a signoff to the tip of this branch (cleanly rebased to master): http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/jboyer/lp-1438136-query_int_wrapper_signoff but I would feel more comfortable with an additional signoff from someone running 9.2 or newer.

Revision history for this message
Mike Rylander (mrylander) wrote :

Ben,

The lack of a pullrequest tag was my oversight. I've added that, now.

I think we should push this, yes.

There are no worries about PG 9.2+ or, for that matter, versions before 9.2. The code is built specifically to handle different versions of Postgres.

Once PG 9.6 is out we may augment that check more, but until its release we won't know the exact nature of the change.

Thanks!

tags: added: pullrequest
Changed in evergreen:
milestone: none → 2.9-beta
Revision history for this message
Chris Sharp (chrissharp123) wrote :
tags: added: signedoff
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Thanks, everyone!

I pushed this to master. I leave it to berick and bshum to work out whether/how they want to backport this.

Changed in evergreen:
status: Triaged → Fix Committed
Revision history for this message
Ben Shum (bshum) wrote :

Backported to rel_2_8 and rel_2_7.

Changed in evergreen:
status: Fix Committed → Fix Released