Web client: patron billing history results in long running query

Bug #1738488 reported by Blake GH
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
High
Unassigned
2.12
Won't Fix
High
Unassigned
3.0
Fix Released
High
Unassigned

Bug Description

We are seeing long running queries:

"SELECT "mp".amount, "mp".id, "mp".note, "mp".payment_ts, "mp".payment_type, "mp".xact, "mp".voided FROM money.payment_view AS "mp"
LEFT JOIN money.billable_xact AS "__title_mbt" ON ( "__title_mbt".id = "mp".xact )
LEFT JOIN action.circulation AS "__title_circ" ON ( "__title_circ".id = "__title_mbt".id )
LEFT JOIN asset.copy AS "__title_acp" ON ( "__title_acp".id = "__title_circ".target_copy )
LEFT JOIN asset.call_number AS "__title_acn" ON ( "__title_acn".id = "__title_acp".call_number )
LEFT JOIN biblio.record_entry AS "__title_bre" ON ( "__title_bre".id = "__title_acn".record )
LEFT JOIN reporter.materialized_simple_record AS "__title_rmsr" ON ( "__title_rmsr".id = "__title_bre".id )
LEFT JOIN money.billable_xact AS "__record_id_mbt" ON ( "__record_id_mbt".id = "mp".xact )
LEFT JOIN action.circulation AS "__record_id_circ" ON ( "__record_id_circ".id = "__record_id_mbt".id )
LEFT JOIN asset.copy AS "__record_id_acp" ON ( "__record_id_acp".id = "__record_id_circ".target_copy )
LEFT JOIN asset.call_number AS "__record_id_acn" ON ( "__record_id_acn".id = "__record_id_acp".call_number )
LEFT JOIN biblio.record_entry AS "__record_id_bre" ON ( "__record_id_bre".id = "__record_id_acn".record )
LEFT JOIN money.billable_xact AS "__copy_barcode_mbt" ON ( "__copy_barcode_mbt".id = "mp".xact )
LEFT JOIN action.circulation AS "__copy_barcode_circ" ON ( "__copy_barcode_circ".id = "__copy_barcode_mbt".id )
LEFT JOIN asset.copy AS "__copy_barcode_acp" ON ( "__copy_barcode_acp".id = "__copy_barcode_circ".target_copy )
LEFT JOIN money.billable_xact AS "__xact.summary.xact_type_mbt" ON ( "__xact.summary.xact_type_mbt".id = "mp".xact )
LEFT JOIN money.materialized_billable_xact_summary AS "__xact.summary.xact_type_mbts" ON ( "__xact.summary.xact_type_mbts".id = "__xact.summary.xact_type_mbt".id )
LEFT JOIN money.billable_xact AS "__xact.usr_mbt" ON ( "__xact.usr_mbt".id = "mp".xact )
WHERE "mp".payment_ts BETWEEN '2016-12-15' AND 'now' AND ( "__xact.usr_mbt".usr = '425135' ) ORDER BY "mp".payment_ts "

This is a result of opening this:
/eg/staff/circ/patron/425135/bill_history/payments

The explain analyze at the bottom.

from berick
" so, the query is generated by open-ils.fielder.flattened_search. it may just be it's not a good use case for it. the UI may need to move to a human-built pcrud query"

IRC Discussion:

http://irc.evergreen-ils.org/evergreen/2017-12-15#i_338699

"Nested Loop Left Join (cost=5.51..2093924919999451103232.00 rows=44575740147225592344870912 width=140) (actual time=225988.034..225991.355 rows=2 loops=1)"
" -> Nested Loop (cost=5.51..1050193655900159737856.00 rows=905036041403073691648 width=140) (actual time=225987.993..225991.271 rows=2 loops=1)"
" -> Nested Loop Left Join (cost=5.51..354629028297289957376.00 rows=550173885351412563968 width=140) (actual time=0.475..199966.610 rows=4568817 loops=1)"
" -> Nested Loop Left Join (cost=5.08..36915498324831752.00 rows=550173885351412563968 width=148) (actual time=0.468..182483.681 rows=4568817 loops=1)"
" -> Nested Loop Left Join (cost=4.64..1335471069989.13 rows=11170362929190354 width=140) (actual time=0.449..143065.397 rows=4568817 loops=1)"
" -> Nested Loop Left Join (cost=3.78..28878528.96 rows=226795584618 width=140) (actual time=0.427..84271.530 rows=4568817 loops=1)"
" -> Nested Loop (cost=2.91..1764603.08 rows=4604706 width=140) (actual time=0.361..15394.131 rows=4568817 loops=1)"
" -> Merge Append (cost=2.63..336448.22 rows=4604706 width=80) (actual time=0.342..4733.472 rows=4568817 loops=1)"
" Sort Key: p.payment_ts"
" -> Index Scan using m_p_time_idx on payment p (cost=0.12..4.14 rows=1 width=75) (actual time=0.005..0.005 rows=0 loops=1)"
" -> Sort (cost=0.01..0.02 rows=1 width=75) (actual time=0.010..0.010 rows=0 loops=1)"
" Sort Key: p_1.payment_ts"
" Sort Method: quicksort Memory: 25kB"
" -> Seq Scan on bnm_payment p_1 (cost=0.00..0.00 rows=1 width=75) (actual time=0.001..0.001 rows=0 loops=1)"
" -> Index Scan using money_account_adjustment_payment_ts_idx on account_adjustment p_2 (cost=0.43..141885.92 rows=3854936 width=88) (actual time=0.079..2552.360 rows=3819125 loops=1)"
" -> Sort (cost=0.01..0.02 rows=1 width=75) (actual time=0.007..0.007 rows=0 loops=1)"
" Sort Key: p_3.payment_ts"
" Sort Method: quicksort Memory: 25kB"
" -> Seq Scan on bnm_desk_payment p_3 (cost=0.00..0.00 rows=1 width=75) (actual time=0.000..0.000 rows=0 loops=1)"
" -> Index Scan using money_credit_payment_payment_ts_idx on credit_payment p_4 (cost=0.15..12.51 rows=291 width=38) (actual time=0.034..0.827 rows=203 loops=1)"
" -> Index Scan using money_forgive_payment_payment_ts_idx on forgive_payment p_5 (cost=0.29..2464.97 rows=80378 width=41) (actual time=0.044..71.766 rows=80440 loops=1)"
" -> Index Scan using money_goods_payment_payment_ts_idx on goods_payment p_6 (cost=0.28..230.42 rows=7409 width=42) (actual time=0.041..7.217 rows=7370 loops=1)"
" -> Index Scan using money_work_payment_payment_ts_idx on work_payment p_7 (cost=0.15..12.65 rows=300 width=43) (actual time=0.020..0.661 rows=262 loops=1)"
" -> Index Scan using money_cash_payment_ts_idx on cash_payment p_8 (cost=0.42..20172.07 rows=636169 width=35) (actual time=0.039..400.235 rows=636165 loops=1)"
" -> Index Scan using money_check_payment_ts_idx on check_payment p_9 (cost=0.29..557.86 rows=17437 width=41) (actual time=0.029..22.006 rows=17420 loops=1)"
" -> Index Scan using money_credit_card_payment_ts_idx on credit_card_payment p_10 (cost=0.28..299.04 rows=7783 width=35) (actual time=0.028..13.215 rows=7832 loops=1)"
" -> Index Scan using pg_class_oid_index on pg_class c (cost=0.28..0.30 rows=1 width=68) (actual time=0.001..0.002 rows=1 loops=4568817)"
" Index Cond: (oid = p.tableoid)"
" -> Nested Loop Left Join (cost=0.86..5.85 rows=4 width=8) (actual time=0.009..0.014 rows=1 loops=4568817)"
" -> Nested Loop Left Join (cost=0.43..3.26 rows=4 width=16) (actual time=0.006..0.009 rows=1 loops=4568817)"
" -> Append (cost=0.00..1.11 rows=4 width=8) (actual time=0.003..0.006 rows=1 loops=4568817)"
" -> Seq Scan on billable_xact __title_mbt (cost=0.00..0.00 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=4568817)"
" Filter: (id = p.xact)"
" -> Index Only Scan using circulation_pkey on circulation __title_mbt_1 (cost=0.43..0.51 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 3168243"
" -> Index Only Scan using reservation_pkey on reservation __title_mbt_2 (cost=0.14..0.16 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 0"
" -> Index Only Scan using grocery_pkey on grocery __title_mbt_3 (cost=0.42..0.44 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 134630"
" -> Index Scan using circulation_pkey on circulation __title_circ (cost=0.43..0.53 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=4545456)"
" Index Cond: (id = __title_mbt.id)"
" -> Append (cost=0.43..0.63 rows=2 width=16) (actual time=0.003..0.004 rows=1 loops=4545456)"
" -> Index Scan using copy_pkey on copy __title_acp (cost=0.43..0.46 rows=1 width=16) (actual time=0.003..0.003 rows=1 loops=4545456)"
" Index Cond: (id = __title_circ.target_copy)"
" -> Index Scan using unit_pkey on unit __title_acp_1 (cost=0.14..0.16 rows=1 width=16) (actual time=0.000..0.000 rows=0 loops=4545456)"
" Index Cond: (id = __title_circ.target_copy)"
" -> Nested Loop Left Join (cost=0.86..5.85 rows=4 width=8) (actual time=0.007..0.012 rows=1 loops=4568817)"
" -> Nested Loop Left Join (cost=0.43..3.26 rows=4 width=16) (actual time=0.005..0.008 rows=1 loops=4568817)"
" -> Append (cost=0.00..1.11 rows=4 width=8) (actual time=0.003..0.005 rows=1 loops=4568817)"
" -> Seq Scan on billable_xact __record_id_mbt (cost=0.00..0.00 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=4568817)"
" Filter: (id = p.xact)"
" -> Index Only Scan using circulation_pkey on circulation __record_id_mbt_1 (cost=0.43..0.51 rows=1 width=8) (actual time=0.001..0.002 rows=1 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 3168242"
" -> Index Only Scan using reservation_pkey on reservation __record_id_mbt_2 (cost=0.14..0.16 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 0"
" -> Index Only Scan using grocery_pkey on grocery __record_id_mbt_3 (cost=0.42..0.44 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 134630"
" -> Index Scan using circulation_pkey on circulation __record_id_circ (cost=0.43..0.53 rows=1 width=16) (actual time=0.001..0.002 rows=1 loops=4545456)"
" Index Cond: (id = __record_id_mbt.id)"
" -> Append (cost=0.43..0.63 rows=2 width=16) (actual time=0.002..0.003 rows=1 loops=4545456)"
" -> Index Scan using copy_pkey on copy __record_id_acp (cost=0.43..0.46 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=4545456)"
" Index Cond: (id = __record_id_circ.target_copy)"
" -> Index Scan using unit_pkey on unit __record_id_acp_1 (cost=0.14..0.16 rows=1 width=16) (actual time=0.000..0.000 rows=0 loops=4545456)"
" Index Cond: (id = __record_id_circ.target_copy)"
" -> Nested Loop Left Join (cost=0.43..3.26 rows=4 width=16) (actual time=0.005..0.008 rows=1 loops=4568817)"
" -> Append (cost=0.00..1.11 rows=4 width=8) (actual time=0.003..0.005 rows=1 loops=4568817)"
" -> Seq Scan on billable_xact __copy_barcode_mbt (cost=0.00..0.00 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=4568817)"
" Filter: (id = p.xact)"
" -> Index Only Scan using circulation_pkey on circulation __copy_barcode_mbt_1 (cost=0.43..0.51 rows=1 width=8) (actual time=0.001..0.002 rows=1 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 3168242"
" -> Index Only Scan using reservation_pkey on reservation __copy_barcode_mbt_2 (cost=0.14..0.16 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 0"
" -> Index Only Scan using grocery_pkey on grocery __copy_barcode_mbt_3 (cost=0.42..0.44 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 134630"
" -> Index Scan using circulation_pkey on circulation __copy_barcode_circ (cost=0.43..0.53 rows=1 width=16) (actual time=0.001..0.002 rows=1 loops=4545456)"
" Index Cond: (id = __copy_barcode_mbt.id)"
" -> Append (cost=0.43..0.62 rows=2 width=8) (actual time=0.002..0.003 rows=1 loops=4568817)"
" -> Index Only Scan using copy_pkey on copy __copy_barcode_acp (cost=0.43..0.46 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=4568817)"
" Index Cond: (id = __copy_barcode_circ.target_copy)"
" Heap Fetches: 3974892"
" -> Index Only Scan using unit_pkey on unit __copy_barcode_acp_1 (cost=0.14..0.16 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=4568817)"
" Index Cond: (id = __copy_barcode_circ.target_copy)"
" Heap Fetches: 0"
" -> Append (cost=0.00..1.22 rows=4 width=8) (actual time=0.005..0.005 rows=0 loops=4568817)"
" -> Seq Scan on billable_xact "__xact.usr_mbt" (cost=0.00..0.00 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=4568817)"
" Filter: ((usr = 425135) AND (p.xact = id))"
" -> Index Scan using circulation_pkey on circulation "__xact.usr_mbt_1" (cost=0.43..0.62 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Filter: (usr = 425135)"
" Rows Removed by Filter: 1"
" -> Index Scan using reservation_pkey on reservation "__xact.usr_mbt_2" (cost=0.14..0.16 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Filter: (usr = 425135)"
" -> Index Scan using grocery_pkey on grocery "__xact.usr_mbt_3" (cost=0.42..0.45 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4568817)"
" Index Cond: (id = p.xact)"
" Filter: (usr = 425135)"
" Rows Removed by Filter: 0"
" -> Append (cost=0.00..1.11 rows=4 width=8) (actual time=0.011..0.029 rows=1 loops=2)"
" -> Seq Scan on billable_xact "__xact.summary.xact_type_mbt" (cost=0.00..0.00 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=2)"
" Filter: (id = p.xact)"
" -> Index Only Scan using circulation_pkey on circulation "__xact.summary.xact_type_mbt_1" (cost=0.43..0.51 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=2)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 2"
" -> Index Only Scan using reservation_pkey on reservation "__xact.summary.xact_type_mbt_2" (cost=0.14..0.16 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=2)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 0"
" -> Index Only Scan using grocery_pkey on grocery "__xact.summary.xact_type_mbt_3" (cost=0.42..0.44 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=2)"
" Index Cond: (id = p.xact)"
" Heap Fetches: 0"
"Planning time: 4.570 ms"
"Execution time: 225992.136 ms"

Changed in evergreen:
status: New → Confirmed
Revision history for this message
Jason Stephenson (jstephenson) wrote :

At C/W MARS we're seeing this in both 2.12 and 3.0.2 in the web staff client. I've seen the above query running for 6 days on our training database (2.12) and testing database (3.0.2). The testing database is stock Evergreen 3.0.2 with anonymized local data. Training has customizations applied to the OPAC. We have not made any direct customization to the web staff client.

In production, on 2.12.8, we do see this query run long enough that no results get back to the web staff client. However, it does not run for days. I've not managed to actually time it, but it seems to run between 2 to 4 minutes. (Our production db server is a lot faster than the others.)

Changed in evergreen:
importance: Undecided → Medium
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I should add some more information relevant to my previous comment:

If I wait a couple of minutes and try that patron's payment history again, I usually get it. I believe this is because the query has finally finished and the results have been cached (somewhere).

I think the main reason for the huge discrepancy in run time between production and the other servers is not so much because of production's better hardware. We have set join_collapse_limit to 9 on production. That settings aids with the performance of queries with massive joins.

As a final thing worth noting, this does not happen in the XUL client. I've not looked, but I assume that the web staff client is doing something very different from the XUL client with respect to patron payment history.

Revision history for this message
Bill Erickson (berick) wrote :

The XUL client uses the payments API (open-ils.actor.user.payments.retrieve) while the web client uses a grid-driven open-ils.fielder call. We may need to move back to using the API.

Note the API has paging, but it will need to be tied into the grid paging controls -- fielder gives us paging for free. Alternatively, we could trim down the columns auto-fetched by fielder (i.e. the ones causing the painful joins) and fetch the extra data via pcrud after each row is rendered.

Changed in evergreen:
importance: Medium → High
importance: High → Medium
Changed in evergreen:
importance: Medium → High
Dan Wells (dbw2)
Changed in evergreen:
assignee: nobody → Dan Wells (dbw2)
Revision history for this message
Mike Rylander (mrylander) wrote :

The biggest issue with this flat_fielder query is that it's causing, like, 20 joins. That's well past both the join collapse limit and the GEQO threshold in any reasonable setup. Many of the joined tables seem duplicated, and I think we could probably trim those down significantly by only following one path to any particular table.

If we can get the list of joins down to just what we actually need, instead of providing separate paths to, for instance, bre, the obvious selectivity of the usr filter will probably be noticed by the planner. I tested this by pushing the join collapse limit and GEQO threshold way up to let the planner look for a better join order, and it did exactly that.

Revision history for this message
Dan Wells (dbw2) wrote :

Mike, we're on the same track. Here is a lightly tested branch which attempts to reduce the joins:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dbwells/lp1738488_optimize_flattener_joins

working/collab/dbwells/lp1738488_optimize_flattener_joins

More details from the commit message:

The current Flattener.pm autogenerates necessary joins for sorting and filtering, but in doing so, it gives every intermediate table a unique alias, even if the path to that table is exactly the same as another member in the map we are flattening.

Instead, let's reuse joins whenever the path is identical, even for intermediate tables. We do so by tracking every path to each core type, then reusing as much of that join path as we can. In cases where we have different paths to the same type, we still necessarily provide a new unique alias.

This problem was first noticed in the web staff billing history interface, where the particular stacking of joins resulted (for one specific library) in 17 joins and 44,575,740,147,225,592,344,870,912 potential rows.

Revision history for this message
Dan Wells (dbw2) wrote :

I also meant to point out that this is a fairly low level change with wide-ranging effects beyond this bug, so much testing and caution will be needed. On the flip side, this can potentially speed up other grids as well by reducing unnecessary joins.

Revision history for this message
Mike Rylander (mrylander) wrote :

IMO, while I'll always be in favor of more testing, especially by more individuals, I hope others agree that this should be an uncontroversial code change. In essence it takes this:

SELECT foo.bar
  FROM t1 x
        left join t2 y on (x.a = y.a)
        left join t3 foo on (y.b = foo.b)
        left join t2 a on (x.a = a.a) --duplicated JOIN branch
        left join t3 b on (b.b = b.b)
  WHERE foo.baz = '123'
  ORDER BY b.bar;

to

SELECT foo.bar
  FROM t1 x
        left join t2 y on (x.a = y.a)
        left join t3 foo on (y.b = foo.b)
  WHERE foo.baz = '123'
  ORDER BY foo.bar;

They'll produce the same results, but only because the query generator isn't trying to do anything sophisticated. If we teach json_query how to do LATERAL joins (say, by being able to include a function call in the FROM clause) then we'd very likely get what we'd consider an incorrect result. Probably in the form of duplicated rows, or unintended side effects.

Thanks, Dan!

Dan Wells (dbw2)
tags: added: pullrequest
Changed in evergreen:
assignee: Dan Wells (dbw2) → nobody
milestone: none → 3.1-rc
Revision history for this message
Dan Wells (dbw2) wrote :

I have decided to roll out this fix into production, so we should hopefully find any loose ends soon.

Revision history for this message
Mike Rylander (mrylander) wrote :

I realize I should clarify my comment above: Dan's patch will /prevent/, not create, potential future issues I theorized.

Dan, any updates? Is it going well?

Thanks!

Revision history for this message
Dan Wells (dbw2) wrote :

Yes, it's going well, I have not seen any issues caused by making this change in production. Only caveat I would add to that is that we are still using both clients, and a majority of our traffic is still from the XUL client, so this roll-out doesn't carry quite the same weight as a "full" production test.

Still, I've seen nothing to suggest it makes things worse, and I'm hoping to see this in for the RC!

Galen Charlton (gmc)
Changed in evergreen:
assignee: nobody → Galen Charlton (gmc)
Revision history for this message
Galen Charlton (gmc) wrote :

Looks good and doesn't appear to break anything. I have pushed this to master for inclusion in the 3.1 RC. Thanks, Dan!

I'm intentionally holding off on backporting to rel_3_0 to give some time to be tested more broadly (hopefully) with the RC (and, time permitting, some unit test coverage getting done), but barring problems, I agree it should make it into the next 3.0.x.

Changed in evergreen:
status: Confirmed → Won't Fix
status: Won't Fix → Fix Committed
assignee: Galen Charlton (gmc) → nobody
Changed in evergreen:
status: Fix Committed → Fix Released
Revision history for this message
Jason Stephenson (jstephenson) wrote :

We're testing this on 3.0.6 at CW MARS. If it works, we'll push to rel_3_0 for 3.0.7.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Dan's patch works for us on 3.0.6, so I've signed off and pushed to rel_3_0 for inclusion in 3.0.7.

Thanks, Dan, Galen and everyone else.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.