QueryParser usage of @> operator is not optimal

Bug #1018913 reported by James Fournie
This bug affects 1 person
Affects: Evergreen
Status: Triaged
Importance: Undecided
Assigned to: Unassigned

Bug Description

Postgres 9.1
Evergreen 2.1->2.2

We were finding on our test servers that filtering by item_type is very slow and often times out in the OPAC.

Steven Chan dug into the database queries and found that the following query is painfully slow on our dataset of over a million records. This is a much simplified version of what's generated by O:A:S:D:Pg:QueryParser.pm:

explain analyze select count(m.source)
from metabib.record_attr mra
join metabib.metarecord_source_map m on (m.source = mra.id)
where mra.attrs @> 'item_type=>a';

(this takes 30 seconds on our development database)

Some testing of indexes and reading of hstore documentation later, we found this alternative which is much faster:

explain analyze select count(m.source)
from metabib.record_attr mra
join metabib.metarecord_source_map m on (m.source = mra.id)
where (mra.attrs -> 'item_type') = 'a';

(this takes 7 seconds on our development database)

The former is probably being forced to scan every key, whereas the latter will only need to look at the item_type keys.
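For reference, the semantic difference between the two operators can be shown in isolation (a minimal sketch, assuming the hstore extension is available; not from the bug report itself):

```sql
-- Minimal illustration of the two hstore operators.
CREATE EXTENSION IF NOT EXISTS hstore;

-- @> tests containment of a whole key/value pair against the full hstore:
SELECT 'item_type=>a, lang=>eng'::hstore @> 'item_type=>a'::hstore;   -- true

-- -> extracts a single key's value, which can then be compared directly:
SELECT ('item_type=>a, lang=>eng'::hstore -> 'item_type') = 'a';      -- true
```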

Commit b90330614 in master seems to correct this issue somewhat; although we haven't tested it, it appears to remove the most problematic @> operators. However, the commit message doesn't show awareness of this performance issue. That commit is big enough that backporting it wholesale would probably be annoying, but the patch required to backport just this change to 2.2 (and probably 2.1) is likely not very big.

Tags: search
Revision history for this message
Mike Rylander (mrylander) wrote :

Actually, the reason it's slow in the larger search query is not because of a seqscan, but because of the way PG decides to join (not JOIN, mind you) the plan nodes. More generally, HSTORE data does not get accurate enough statistics for the planner to know that there are TONNES of item_type=>a values in your (and most everyone's) dataset.

The easiest way to address this in a 2.1 or 2.2 db is to remove the metabib_svf_attrs_idx on metabib.record_attr. This is only necessary on larger datasets, like yours.
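That suggestion amounts to something like the following (the index name is taken from the comment above; confirm the actual name on your database with \d metabib.record_attr before dropping anything):

```sql
-- Drop the general hstore index on metabib.record_attr (2.1/2.2 only).
-- Only worthwhile on large datasets, per the comment above.
DROP INDEX metabib.metabib_svf_attrs_idx;
```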

In the future, it may be advantageous to add specific indexes similar to the attrs->'date1' index for other HSTORE keys on that same table, particularly for low-use or high-cardinality keys. Hopefully, though, we can find a way to improve things with a single general index rather than having to create one by hand when needed. Perhaps an expression index that collapses the HSTORE to an array where each element is a combination of key and value.
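A rough sketch of that last idea, untested and not part of any release: collapse each row's hstore into a text[] of key=value strings via an IMMUTABLE helper function (the function name here is hypothetical), and put a GIN expression index on it.

```sql
-- Hypothetical helper: flatten an hstore into an array of 'key=value' strings.
-- Must be IMMUTABLE so it can be used in an expression index.
CREATE OR REPLACE FUNCTION hstore_to_kv_array (h hstore)
RETURNS text[] AS $$
    SELECT array_agg(key || '=' || value) FROM each(h);
$$ LANGUAGE SQL IMMUTABLE;

CREATE INDEX record_attr_kv_idx
    ON metabib.record_attr
    USING GIN (hstore_to_kv_array(attrs));

-- Queries would then use array containment, which GIN supports natively:
SELECT count(m.source)
  FROM metabib.record_attr mra
  JOIN metabib.metarecord_source_map m ON (m.source = mra.id)
 WHERE hstore_to_kv_array(mra.attrs) @> ARRAY['item_type=a'];
```

The planner keeps per-element statistics for array containment in a way it does not for general hstore containment, which is the hoped-for win; whether that holds up on real data would need testing.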

Aside from all that, though, I think 2.2 and before should probably get their own patch (remember, placeholder upgrade script for master!) that just changes @> to -> [NOT] IN (...), or removes that index mentioned above. The feature set implemented by the commit you cite is significantly more than a bug fix.
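In query terms, the minimal backport described above would change the generated predicate from the containment form to the key-lookup form, roughly (illustrative only, not an actual patch):

```sql
-- Generated on 2.1/2.2 today (hstore containment):
WHERE mra.attrs @> 'item_type=>a'

-- Proposed minimal change (key lookup; IN / NOT IN for multi-value filters):
WHERE (mra.attrs -> 'item_type') IN ('a')
WHERE (mra.attrs -> 'item_type') NOT IN ('a', 'b')
```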

Revision history for this message
James Fournie (jfournie) wrote :

Hi Mike,

I was with you up to "drop metabib_svf_attrs_idx". In my testing, that does nothing; the problem is actually the @> itself.

=====================
Original:

explain analyze select count(m.source)
from metabib.record_attr mra
join metabib.metarecord_source_map m on (m.source = mra.id)
where mra.attrs @> ('item_type'=> 'a');

 Aggregate (cost=36676.69..36676.70 rows=1 width=8) (actual time=29759.974..29759.974 rows=1 loops=1)
   -> Nested Loop (cost=105.07..36673.25 rows=1373 width=8) (actual time=794.358..29476.101 rows=1183405 loops=1)
         -> Bitmap Heap Scan on record_attr mra (cost=105.07..11617.84 rows=3101 width=8) (actual time=794.292..4440.120 rows=2743759 loops=1)
               Recheck Cond: (attrs @> '"item_type"=>"a"'::hstore)
               -> Bitmap Index Scan on record_attr_attrs_idx1 (cost=0.00..104.29 rows=3101 width=0) (actual time=791.309..791.309 rows=3101261 loops=1)
                     Index Cond: (attrs @> '"item_type"=>"a"'::hstore)
         -> Index Scan using metabib_metarecord_source_map_source_record_idx on metarecord_source_map m (cost=0.00..8.07 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=2743759)
               Index Cond: (source = mra.id)
 Total runtime: 29760.050 ms

=====================

With no index:

 Aggregate (cost=407613.61..407613.62 rows=1 width=8) (actual time=28434.095..28434.095 rows=1 loops=1)
   -> Nested Loop (cost=0.00..407610.18 rows=1373 width=8) (actual time=0.048..28134.122 rows=1183405 loops=1)
         -> Seq Scan on record_attr mra (cost=0.00..382554.76 rows=3101 width=8) (actual time=0.018..3589.922 rows=2743759 loops=1)
               Filter: (attrs @> '"item_type"=>"a"'::hstore)
         -> Index Scan using metabib_metarecord_source_map_source_record_idx on metarecord_source_map m (cost=0.00..8.07 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=2743759)
               Index Cond: (source = mra.id)
 Total runtime: 28434.149 ms
(7 rows)

=====================
With that index dropped, plus CREATE INDEX on metabib.record_attr ((attrs->'item_type'::text)):

 Aggregate (cost=407613.61..407613.62 rows=1 width=8) (actual time=29021.666..29021.666 rows=1 loops=1)
   -> Nested Loop (cost=0.00..407610.18 rows=1373 width=8) (actual time=0.049..28719.581 rows=1183405 loops=1)
         -> Seq Scan on record_attr mra (cost=0.00..382554.76 rows=3101 width=8) (actual time=0.019..4313.373 rows=2743759 loops=1)
               Filter: (attrs @> '"item_type"=>"a"'::hstore)
         -> Index Scan using metabib_metarecord_source_map_source_record_idx on metarecord_source_map m (cost=0.00..8.07 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=2743759)
               Index Cond: (source = mra.id)
 Total runtime: 29021.719 ms

=====================
Dropped the previous index, re-added a GIN index on attrs:
CREATE INDEX ON metabib.record_attr USING GIN (attrs);

 Aggregate (cost=36639.65..36639.66 rows=1 width=8) (actual time=30672.383..30672.384 rows=1 loops=1)
   -> Nested Loop (cost=68.03..36636.22 rows=1373 width=8) (actual time=826.967..30382.283 rows=1183405 loops=1)
         -> Bitmap Heap Scan on record_attr mra (cost=68.03..11580.80 ...


Changed in evergreen:
status: New → Triaged
tags: added: search