Holds shelf list can fail to retrieve results

Bug #1971745 reported by Galen Charlton
This bug affects 4 people
Affects     Status         Importance   Assigned to   Milestone
Evergreen   Fix Released   Undecided    Unassigned
3.9         Fix Released   Undecided    Unassigned

Bug Description

The holds shelf list can fail to retrieve results if there are a large number of holds on the shelf. Empirically, more than 350 has been observed to be problematic, particularly in environments with a fair amount of ejabberd s2s traffic.

The problem appears to be related to the fact that open-ils.circ.hold.wide_hash.stream streams a large number of results in two directions:

- fetching a stream of results from open-ils.storage.action.live_holds.wide_hash
- streaming those results back to the client

If "too many" results are received from the open-ils.storage call, streaming them back to the client eventually times out on the ejabberd c2s connection; the method fails and only partial results end up in the client's hands.
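
For illustration only, here is a minimal Perl sketch of the relay pattern described above. It assumes the standard OpenSRF Perl API (OpenSRF::AppSession, request/recv, $client->respond); the subroutine and variable names are hypothetical and are not taken from the actual Evergreen code:

    use OpenSRF::AppSession;

    # Hypothetical sketch of the pre-patch relay: results are streamed from
    # open-ils.storage and passed along to the caller one at a time.
    sub relay_hold_shelf_stream {
        my ($self, $client, $restrictions) = @_;

        my $ses = OpenSRF::AppSession->create('open-ils.storage');
        my $req = $ses->request(
            'open-ils.storage.action.live_holds.wide_hash', $restrictions);

        # Each storage response is relayed as soon as it arrives.  If the
        # inner stream is slow or very long, the caller's ejabberd c2s
        # connection can time out before the relay finishes, leaving the
        # client with partial results.
        while (my $resp = $req->recv(timeout => 60)) {
            $client->respond($resp->content);
        }

        $ses->kill_me;
        return undef;
    }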

Investigation has ruled out a number of possible suspects, including ejabberd shaper and max_stanza_size settings. At the moment, this sounds like it might be a subtle OpenSRF bug. One piece of evidence for that is that the atomic version of the open-ils.circ call works consistently even when the streaming version does not.

A patch is forthcoming that calls the open-ils.storage method atomically, then streams the results back (a rough sketch of the pattern follows the list below). This will have the following effects:

- the open-ils.circ and open-ils.storage drones may end up using a bit more memory to store the intermediate results of the entire holds shelf contents
- clients of the open-ils.circ call will see a pause while the open-ils.storage method completes, then the streaming results
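
Under the same assumptions as the sketch above (OpenSRF Perl API, hypothetical names, not the actual patch), the revised pattern would look roughly like this:

    use OpenSRF::AppSession;

    # Hypothetical sketch of the patched pattern: fetch the whole result set
    # atomically from open-ils.storage, then stream it to the caller.
    sub relay_hold_shelf_atomic {
        my ($self, $client, $restrictions) = @_;

        my $ses = OpenSRF::AppSession->create('open-ils.storage');

        # gather(1) waits for the single atomic response and returns its
        # content, so the entire hold shelf list is held in memory here
        # (the extra drone memory use noted above).
        my $holds = $ses->request(
            'open-ils.storage.action.live_holds.wide_hash.atomic',
            $restrictions
        )->gather(1);
        $ses->kill_me;

        # Streaming to the caller starts only after the storage call has
        # completed, which is the pause the client will see.
        $client->respond($_) for @$holds;
        return undef;
    }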

Evergreen 3.7+

Galen Charlton (gmc) wrote :

A patch is available at the tip of

working/user/gmcharlt/lp1971745_adjust_hold_shelf_list_streaming / https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/gmcharlt/lp1971745_adjust_hold_shelf_list_streaming

tags: added: circ-holds pullrequest
Michele Morgan (mmorgan) wrote :

I am not seeing this issue on our production system running 3.7.2. We currently have four libraries with more than 350 items on the holds shelf, and I had no trouble retrieving results.

Terran McCanna (tmccanna) wrote :

Dupe of https://bugs.launchpad.net/evergreen/+bug/1962768 but I'll mark that one as a dupe since this one has the patch.

Terran McCanna (tmccanna) wrote :

We applied this to one of our test servers that has a copy of our production data set. Prior to applying the patch it took 15-20 seconds to load a large (~700 item) hold shelf on that server.

Unfortunately, I'm really not noticing any improvement in the interface. It doesn't break anything, but it still takes around the same time to load. If there is an improvement, then it doesn't seem to be hugely significant.

Galen Charlton (gmc) wrote :

Following up, I agree that the patch isn't really about speeding it up, it's about making the pull list not fail entirely. Do you have any locations that, without the patch, sometimes fail to fetch the results at all?

Terran McCanna (tmccanna) wrote :

We haven't had any complaints about this problem on our production server (3.8 - no patch) for a while. I just tested a branch that has 975 items on their hold shelf and it loaded the full list and printed it with no problems.

(That being said, I'm certainly not opposed to the patch, I just don't have any good ability to test it other than saying that it doesn't appear to break anything.)

Josh Stompro (u-launchpad-stompro-org) wrote :

Hello, I set up a 3.9 test system a month or so ago with a copy of our production system to prepare for our upgrade. When I try to view the holdshelf on the new system for our largest branch, I get no results.

It seems like the request to open-ils.storage.action.live_holds.wide_hash is taking 34 seconds for the 450 items on the holdshelf to be processed.

The open-ils.circ call, open-ils.circ.hold.wide_hash.stream, is getting canceled after 30 seconds.

Logs of the request are at:
https://gist.github.com/stompro/5e43f8ef97989b8a9db822459e5721e5

I'll try out the patch and see if it helps.
Josh

Josh Stompro (u-launchpad-stompro-org) wrote :

I applied the changes, but I'm not seeing any change in the behavior. I'm still seeing open-ils.circ.hold.wide_hash.stream complete after 30.019 seconds, and the open-ils.storage.action.live_holds.wide_hash.atomic call complete in 34 seconds.

I attached a snippet of the logs. I'm going to enable debug logging and see what that shows.

Josh Stompro (u-launchpad-stompro-org) wrote :

I've attached a debug level log file. I didn't spot anything that looked enlightening... but I probably just don't know what to look for.

Josh

Jeff Davis (jdavis-sitka) wrote :

Josh, are you able to share EXPLAIN output for the underlying SQL query?

Josh Stompro (u-launchpad-stompro-org) wrote :

Here are details about the query and a possible solution. I'm not sure whether this still belongs in this ticket alongside Galen's fix.

I put the queries and EXPLAIN outputs in a gist at
https://gist.github.com/stompro/0a716f912c1f27d1de8bec684552bfb5

Running on PG 10.22

This left join seems to be the culprit:
LEFT JOIN asset.call_number cn ON (cn.id = cp.call_number OR (h.hold_type = 'V' AND cn.id = h.target))

The OR seems to be causing PG to do a sequential scan of asset.call_number instead of using an index. I think this is because the two conditions are not mutually exclusive: cp is already joined for volume holds, so both conditions can match.

Changing it to handle the volume-hold and non-volume-hold cases explicitly brings the execution time down to 2 seconds for me:
LEFT JOIN asset.call_number cn ON ((cn.id = cp.call_number AND h.hold_type != 'V') OR (h.hold_type = 'V' AND cn.id = h.target))

Original Explain:
    -> Nested Loop Left Join (cost=9.44..15929691.80 rows=8 width=1418) (actual time=73.670..55574.261 rows=466 loops=1)
    Join Filter: ((cp.call_number = cn.id) OR ((h.hold_type = 'V'::text) AND (cn.id = h.target)))
    Rows Removed by Join Filter: 379024828
    Buffers: shared hit=1258477

          -> Materialize (cost=0.00..24148.39 rows=813359 width=83) (actual time=0.000..32.230 rows=813359 loops=466)
          Output: cn.id, cn.creator, cn.create_date, cn.editor, cn.edit_date, cn.record, cn.owning_lib, cn.label, cn.deleted, cn.prefix, cn.suffix, cn.label_class, cn.label_sortkey
          Buffers: shared hit=11948
          -> Seq Scan on asset.call_number cn (cost=0.00..20081.59 rows=813359 width=83) (actual time=0.003..49.131 rows=813359 loops=1)
                Output: cn.id, cn.creator, cn.create_date, cn.editor, cn.edit_date, cn.record, cn.owning_lib, cn.label, cn.deleted, cn.prefix, cn.suffix, cn.label_class, cn.label_sortkey
                Buffers: shared hit=11948

Altered Explain:
-> Bitmap Heap Scan on asset.call_number cn (cost=411.19..419.15 rows=1 width=83) (actual time=0.007..0.007 rows=1 loops=466)
    Recheck Cond: ((cn.id = cp.call_number) OR (cn.id = h.target))
    Filter: (((cn.id = cp.call_number) AND (h.hold_type <> 'V'::text)) OR ((h.hold_type = 'V'::text) AND (cn.id = h.target)))
    Heap Blocks: exact=466
    Buffers: shared hit=3263
    -> BitmapOr (cost=411.19..411.19 rows=2 width=0) (actual time=0.005..0.005 rows=0 loops=466)
          Buffers: shared hit=2797
          -> Bitmap Index Scan on call_number_pkey (cost=0.00..3.59 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=466)
                Index Cond: (cn.id = cp.call_number)
                Buffers: shared hit=1399
          -> Bitmap Index Scan on call_number_pkey (cost=0.00..0.92 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=466)
                Index Cond: (cn.id = h.target)
                Buffers: shared hit=1398

Josh Stompro (u-launchpad-stompro-org) wrote :

I put the query change into a working branch at
user/stompro/lp1971745_speed_up_holds_query

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/stompro/lp1971745_speed_up_holds_query

Josh Stompro (u-launchpad-stompro-org) wrote :

Hello, I moved all the holds on the holdshelf into one OU on our test system to give a larger list to test with. With 1800 holds on the holdshelf, the updated query returns in 5 seconds.

I removed Galen's patch and tried to load the list again with 1800 holds, and it failed to load. So I think I can confirm the original issue, and that the fix allows larger lists to load fully and quickly. Without Galen's patch I get one update to the loading widget at around 10%, and then it just stops. The logs show some errors about the jabber connection closing.

I also needed my patch that changes the join applied to make it faster.

My logs with and without Galen's patch are at: https://gist.github.com/stompro/1f64ed8aaf5fa6e9ad7530f3eeac057d

Signoff branch at
user/stompro/lp1971745_adjust_hold_shelf_list_streaming_signoff / https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/stompro/lp1971745_adjust_hold_shelf_list_streaming_signoff

tags: added: signedoff
Jason Stephenson (jstephenson) wrote :

I'm setting this confirmed because this bug affects multiple people.

I have not confirmed this bug myself, but I have a slightly slower than production database where I can test the bug and the patches.

My intention is to try Galen's patch that Josh signed off on, both on its own and in combination with Josh's patch.

Changed in evergreen:
status: New → Confirmed
milestone: none → 3.10.1
assignee: nobody → Jason Stephenson (jstephenson)
Changed in evergreen:
status: Confirmed → Fix Committed
Jason Stephenson (jstephenson) wrote :

I was able to confirm the issue on production data with hold shelf lists of up to 891 copies.

I attempted to test the fixes on three different VMs using two different database servers. One db server has a known hardware issue and gave miserable results, though the patches did seem to help.

The VM using the other database server showed definite improvements with each patch, going from not working at all to working in a reasonable amount of time after applying the patches.

I pushed the patches to master, rel_3_9 and rel_3_10.

Thanks Galen and Josh for the patches, and Michele, Terran, and Jeff for the comments.

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
status: Fix Committed → Fix Released