Circ errors due to issue with metabib.record_attr_flat view

Bug #1400376 reported by Thomas Berezansky
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
High
Unassigned
2.6
Fix Released
Undecided
Unassigned

Bug Description

It would appear that a recent "improve performance" change to metabib.record_attr_flat breaks other views that rely on it. I suspect metabib.record_attr, but metabib.rec_descriptor is the view that is failing in INDB circ tests.

The problem looks to be the union of queries doing LEFT joins - By being LEFT joins we got null values in attr and value when there we no matches in a given query, which then got passed into hstore which disliked the nulls on attr at least (as far as I can tell).

By removing the LEFT portion of the joins (which brings it back in line to before the "improve performance" commit which did a normal join as well) we remove the NULL entries and the errors go away, hopefully still getting the improved performance.

The branch below makes said change.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/tsbere/fix_record_attr_flat

Tags: pullrequest
summary: - Circ errors due to issue with metabib.record_attr_fla view
+ Circ errors due to issue with metabib.record_attr_flat view
Revision history for this message
Mike Rylander (mrylander) wrote :

The postgres docs do not suggest that hstore has a problem storing NULL values: http://www.postgresql.org/docs/9.2/static/hstore.html ... I'd be concerned about a mis-diagnosis here, unless you've found that hstore is truly not acting as documented.

However, if you do want to force the metabib.record_attr view to exclude NULLs, I would recommend this instead:

CREATE VIEW metabib.record_attr AS
     SELECT id, HSTORE( ARRAY_AGG( attr ), ARRAY_AGG( value ) ) AS attrs FROM metabib.record_attr_flat WHERE value IS NOT NULL GROUP BY 1;

That way you're not changing something used by other code without having inspected that other code.

Revision history for this message
Mike Rylander (mrylander) wrote :

Based on discussion in IRC, I amend my recommendation to the following (null test on attr column, instead of value):

CREATE VIEW metabib.record_attr AS
     SELECT id, HSTORE( ARRAY_AGG( attr ), ARRAY_AGG( value ) ) AS attrs FROM metabib.record_attr_flat WHERE attr IS NOT NULL GROUP BY 1;

Revision history for this message
Thomas Berezansky (tsbere) wrote :

I have been told that the LEFT JOINs are needed for query planning in other queries when the view in question is deeper in.

I cannot find any significant differences beyond the lack of a "Left Join" modifier in query plans with vs without it, after running lots of test queries from different parts of the system.

Not using metabib.full_attr_id_map provides a major improvement. LEFT JOIN inside of the replacement does not, at least in my tests, on a copy of our production data set updated yesterday.

Given that the version that used full_attr_id_map did *not*, in my tests, output null attr and value entries, but the LEFT JOIN version of the replacement did, unless some indication as to what portion of the system gets an improvement from the LEFT JOIN variant is provided I am going to continue to recommend that my version of the fix goes in.

Implementing the extra protection on metabib.record_attr may be a good idea in addition to removing the LEFT portion of the joins, however, due to how hstore reacts to null keys.

Revision history for this message
Ben Shum (bshum) wrote :
Download full text (7.4 KiB)

Looking back at old bug 1374091 and working through some EXPLAIN ANALYZE we used at the time, this is what I saw in our database when trying with LEFT JOIN vs. (INNER) JOIN:

=== Query ===

EXPLAIN ANALYZE SELECT XMLAGG(foo.y)
          FROM (
            SELECT XMLELEMENT(
                        name field,
                        XMLATTRIBUTES(
                            mra.attr AS name,
                            cvm.value AS "coded-value",
                            cvm.id AS "cvmid",
                            rad.composite,
                            rad.multi,
                            rad.filter,
                            rad.sorter
                        ),
                        mra.value
                    )
              FROM metabib.record_attr_flat mra
                    JOIN config.record_attr_definition rad ON (mra.attr = rad.name)
                    LEFT JOIN config.coded_value_map cvm ON (cvm.ctype = mra.attr AND code = mra.value)
              WHERE mra.id = 93978
            )foo(y);

=== With LEFT JOIN ===

Aggregate (cost=65.30..65.31 rows=1 width=83) (actual time=0.024..0.025 rows=1 loops=1)
  -> Hash Right Join (cost=45.54..65.25 rows=20 width=83) (actual time=0.024..0.024 rows=0 loops=1)
        Hash Cond: ((cvm.ctype = mra.attr) AND (cvm.code = mra.value))
        -> Seq Scan on coded_value_map cvm (cost=0.00..14.26 rows=726 width=29) (never executed)
        -> Hash (cost=45.24..45.24 rows=20 width=68) (actual time=0.018..0.018 rows=0 loops=1)
              Buckets: 1024 Batches: 1 Memory Usage: 0kB
              -> Hash Join (cost=43.51..45.24 rows=20 width=68) (actual time=0.018..0.018 rows=0 loops=1)
                    Hash Cond: (rad.name = mra.attr)
                    -> Seq Scan on record_attr_definition rad (cost=0.00..1.38 rows=38 width=12) (actual time=0.004..0.004 rows=1 loops=1)
                    -> Hash (cost=43.26..43.26 rows=20 width=64) (actual time=0.010..0.010 rows=0 loops=1)
                          Buckets: 1024 Batches: 1 Memory Usage: 0kB
                          -> Subquery Scan on mra (cost=42.86..43.26 rows=20 width=64) (actual time=0.010..0.010 rows=0 loops=1)
                                -> HashAggregate (cost=42.86..43.06 rows=20 width=24) (actual time=0.010..0.010 rows=0 loops=1)
                                      -> Append (cost=0.86..42.71 rows=20 width=24) (actual time=0.010..0.010 rows=0 loops=1)
                                            -> Nested Loop Left Join (cost=0.86..24.57 rows=10 width=27) (actual time=0.008..0.008 rows=0 loops=1)
                                                  -> Index Scan using record_attr_vector_list_pkey on record_attr_vector_list v (cost=0.43..3.45 rows=1 width=90) (actual time=0.008..0.008 rows=0 loops=1)
                                                        Index Cond: (source = 93978)
                                                  -> Index Scan using uncontrolled_record_attr_value_pkey on uncontrolled_record_attr_value m (cost=0.43..21.02 rows=10 width=27) (never executed)
                                                        Index Cond: (id = ANY (v.vlist))
                        ...

Read more...

Revision history for this message
Mike Rylander (mrylander) wrote :

Ben,

Let's assume that both timings are stable. On their face, while both objectively "fast", the INNER JOIN version took very nearly twice the time as the LEFT JOIN. For one record, that may not matter. For many records, it could matter quite a lot.

Now, let's consider how the view itself might be used. If you use that view (say, through one of the dependent views) simply to probe for the existence of a row in record_attr_vector_list by selecting the id column with a where clause that filters on id (or, say, a subquery IN LIST of id's), then the LEFT JOIN version can skip the joins to the attribute value tables altogether. If used in an EXISTS (or similar) clause, it can even skip the execution of the "uncontrolled" side of the UNION, returning after the first row of the "controlled" side produces a row from record_attr_vector_list.

However, the INNER JOIN version /must/ produce the full output of one or the other arms (and, based on ORDER BY, LIMIT, or OFFSET clauses on wrapping subqueries, possible both arms) of the UNION, regardless of whether the attribute value is actually used anywhere in the query.

If you use the view to filter by attribute information, then they will act (generally) the same way in terms of planning.

Some queries will output or filter on the attribute information, some will only look at the id. One place, in particular, is search variants which use the dependent views to see if the record should be included in result output by probing the table underlying this view instead of looking at "bre", which is larger and more expensive to query.

I continue to recommend the LEFT JOIN version with a filter in the views above.

Revision history for this message
Thomas Berezansky (tsbere) wrote :

I just ran the same basic test query in a copy of our production system, basically doing what bshum did, though I grabbed a known record ID from our system first. I did more than just current master vs my proposed fix, though.

For one thing, we need to not just examine left vs inner join in record_attr_flat, but also the impact of changing record_attr to check for nulls.

Full explain output is available here: http://pastebin.com/Nciebnsf

A summary, however:

Current master with left joins and no null attr check:
Runtime of 1.183 ms

Remove left joins:
Runtime of 1.109 ms

Left joins put back, but with record_attr null check included:
Runtime of 1.110 ms

Both solutions at the same time:
Runtime of 1.133 ms

Given that I got comparable results on both solutions I don't see how either is "better" - especially as every use of record_attr_flat I can find outside of the record_attr view itself is checking against a non-null attr in some way as it is.

Revision history for this message
Mike Rylander (mrylander) wrote :

Thomas,

Sorry, but this is entirely unconvincing. You have shown exactly one run of one query (for each variant) for a single record. Your timings are worse than useless because they confuse the matter by being presented as evidence of a pattern, but they are not.

Before you start complaining about my use of Ben's timings, read my first sentence again: "assuming they are stable."

[Aside the first: In order to help future timing testers (you included), I need to point out that you're also reporting the wrong part of the output if you want to use E-A as evidence (and, in fairness, so did Ben, and I should have pointed it out before). The number you want is the ending time of the outermost part of the EXPLAIN output itself, not the post-result time, as that includes variable network latency as well (yes, even over loopback, and even over a unix socket). In that case you'll see that my solution has /exactly/ the same timing as yours, and retains the advantages I describe above. But, that fact shows the problem in your test -- the first one primed the cache for the rest, at the cost of seeming slower. ]

With that in mind, though, let's look at Ben's timing again, just for the sake of argument (it's still not statistically valid...): 0.025 vs 0.058 -- it's actually /worse/ than double. And, the first run was faster, or so the paste would make it seem.

Even so, and leaving the timing done so far out or not, the main point I raise stands: LEFT JOIN will allow for better plans (and faster execution of nearly identical plans) than INNER JOIN. If you'd tested the situations I described and found them to be faster over a statistically meaningful set of repeated runs, I would be happily convinced.

Here's some further reading as to my reasoning: http://rhaas.blogspot.com/2010/06/why-join-removal-is-cool.html (note, INNER JOIN removal, mentioned at the end, is not in PG yet, but it's being worked on for 9.5). A google for "postgres join removal" will turn up more.

[Aside the second: Finally, the solutions we respectively propose are mutually exclusive. There is no point in using mine with an INNER join on the underlying view, and there is no point in yours with the dependent view filtering null attr column values. To save everyone the trouble, testing both together can be skipped. ]

Thanks.

Revision history for this message
Mike Rylander (mrylander) wrote :

And, one more point I meant to make regarding, "...especially as every use of record_attr_flat I can find outside of the record_attr view itself..." It's exactly those uses that go through the record_attr view that we want to preserve the LEFT JOIN option for.

Revision history for this message
Mike Rylander (mrylander) wrote :
Revision history for this message
Ben Shum (bshum) wrote :

Picked Mike's change to master and backported to rel_2_7 and rel_2_6.

Changed in evergreen:
milestone: none → 2.7.2
status: New → Fix Committed
Galen Charlton (gmc)
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.