QueryParser usage of @> operator is not optimal
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Evergreen |
Triaged
|
Undecided
|
Unassigned |
Bug Description
Postgres 9.1
Evergreen 2.1->2.2
We were finding on our test servers that filtering by item_type is very slow and often times out in the OPAC.
Steven Chan dug into the database queries and found the following query is painfully slow on our dataset of over a million records, this is a very simplified version of what's generated by O:A:S:D:
explain analyze select count(m.source)
from metabib.record_attr mra
join metabib.
where mra.attrs @> 'item_type=>a';
(this takes 30 seconds on our development database)
Some testing of indexes and reading of hstore documentation later, we found this alternative which is much faster:
explain analyze select count(m.source)
from metabib.record_attr mra
join metabib.
where (mra.attrs -> 'item_type') = 'a';
(this takes 7 seconds on our development database)
The former is probably being forced to scan every key, whereas the latter will only need to look at the item_type keys.
Commit b90330614 in master seems to correct this issue somewhat, although we haven't tested it appears to remove the most problematic @> operators, however the commit message doesn't show an awareness of this performance issue. That commit is big enough that a backport is probably annoying, but it's likely not a very big patch that would be required to backport this change to 2.2 (and probably 2.1)
Actually, the reason it's slow in the larger search query is not because of a seqscan, but because of the way PG decides to join (not JOIN, mind you) the plan nodes. More generally, HSTORE data does not get accurate enough statistics for the planner to know that there are TONNES of item_type=>a values in your (and most everyone's) dataset.
The easiest way to address this in a 2.1 or 2.2 db is to remove the metabib_ svf_attrs_ idx on metabib. record_ attr. This is only necessary on larger datasets, like yours.
In the future, it may be advantageous to add specific indexes similar to the attrs->'date1' index for other HSTORE keys on that same table, particularly for low-use or high-cardinality keys. Hopefully, though, we can find a way to improve things with a single general index rather than having to create one by hand when needed. Perhaps an expression index that collapses the HSTORE to an array where each element is a combination of key and value.
Aside from all that, though, I think 2.2 and before should probably get their own patch (remember, placeholder upgrade script for master!) that just changes @> to -> [NOT] IN (...), or removes that index mentioned above. The feature set implemented by the commit you cite is significantly more than a bug fix.