Webclient Holds Pull List Returns 0 Results

Bug #1768022 reported by Robert J Jackson on 2018-04-30
14
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Evergreen
Undecided
Unassigned

Bug Description

webclient 3.0.6

When attempting to retrieve holds pull list for a site having close to 50 holds the web client shows a progress bar for an extended period of time and then returns a blank screen. Reducing the number of holds to be displayed to a count of 25 allows for results to be returned.

Robert J Jackson (rjackson) wrote :

This was originally noted early in the year but Jason Boyer indicated the community was aware of the slow query/time out issues and no bug was filed at that time. Sorry if this is a duplicate but initial queries of existing bugs did not find a match.

tags: added: webstaffclient
Jane Sandberg (sandbej) on 2018-05-20
tags: added: holds
Robert J Jackson (rjackson) wrote :

There was another ticket: https://bugs.launchpad.net/evergreen/+bug/1738488 that should have helped with this issue but we are running currently with that patch and still timing out on holds pull list generation when the site has 50+ holds.

Bill Erickson (berick) wrote :

Here's a potential fix:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp1768022-pull-list-chunk-size

I have not tested this. My theory is the holds are getting stuck on the server waiting for chunking to have enough data to send. Since the API as a whole is long-running, the patch tells the API to return every hold as soon as it's available by disabling chunking.

Mike Rylander (mrylander) wrote :

I think we may want to spell that "max_bundle_size" rather than "max_chunk_size" in modern OpenSRF, but otherwise the theory seems sound. Thanks, Bill!

Jason Boyer (jboyer) wrote :

It looks like max_chunk_size is what's en vogue at the moment. That said it hasn't had the desired effect on its own; still not displaying any entries. I'm trying to follow the path from call to db to see what could be the holdup but the more the merrier if there are any other ideas.

Mike Rylander (mrylander) wrote :

The name "max_chunk_size" is incorrect because this is not about chunking, it's about bundling. The fact that "max_chunk_size" is used more times doesn't mean it's the right thing to use here.

Previous to what we now call chunking (taking large OpenSRF messages and chopping them up to send across several XMPP messages) we had a feature that, unfortunately, was also called chunking that packed several small OpenSRF messages into one XMPP message. Back then we marked some methods with max_chunk_size. When the chopping-up feature was invented, it took the name "chunking" and the old functionality became bundling. Annoying, I know... Evergreen checks for OpenSRF's capabilities WRT bundling and chunking in a couple places, actually, due to this change.

Oh, one other thing -- I looked at the OpenSRF code, and a value of 0 means "no limit", so what we really want is:

  max_bundle_size => 1

If we want to force sending every hold individually, that is.

Mike Rylander (mrylander) wrote :

And, just because my morning wouldn't be complete without a small side of crow, the thing we /actually/ want is:

  max_bundle_count => 1

The count rather than size is important there. The effect would probably be the same, but for the wrong reasons, and we should really just use the feature as intended, Mike.

:)

Jason Boyer (jboyer) wrote :

Ok. My confusion came from the fact that across every perl module there are only 5 references to max_bundle_size and they're all only in $client->can() calls so I made an assumption. Had I just searched for max_bundle_ I'd have noticed that sooner. :) In any case, it didn't have any visible effect either. Looking at the JS, it looks like fielder is involved so I'm not sure that method is involved or when it's called. I do see 20 calls to a fielder drone (there are more than 20 holds).

Here's something fun: for each hold an sql query is run that takes 1.4 seconds to plan and 13ms to run.

https://pastebin.com/gMYrDxKq shows the details and timing from our dev server.

Bill Erickson (berick) wrote :

Thanks, Mike. max_bundle_count => 1 is really what I was going for.

Jason, that's my bad... The code makes 2 calls, one flattener call to get the core data, then a detail streaming call to get the extra display details for the hold (queue position, etc.).

Try also adding max_bundle_count => 1 to the "open-ils.fielder.flattened_search" API (in Fielder.pm)?

We may not want it applied to fielder in all cases, but would be good to know if that fixes it.

Jason Boyer (jboyer) wrote :

So this is interesting. I've got those 2 entries for max_bundle_count and it's just not streaming anything. If I set the grid to show 10 or 25 they will eventually appear but all at once, and if I choose 50 or 100 nothing appears after the timeout. (All of opensrf and apache were restarted after making these changes so it shouldn't be old code or anything like that.)

Bill Erickson (berick) wrote :

Huh, Jason, I'm seeing just the opposite. When I add max_bundle_count=>1 to Fielder.pm, I start seeing the pull list render as a stream, where before it rendered only after filling up the page.

My pull list has 40 holds, running OpenSRF master, minus a few patches.

I'll try to add some more holds.

Mike Rylander (mrylander) wrote :

Jason, that planning time is ... crazy. I'm seeing sub-milisecond planning times on data sets that are comparable to yours in size (within an order of magnitude). There are lots of things that could be contributing, but to check the obvious ones, what do these give you:

show geqo_threshold;
show join_collapse_limit;
show work_mem;
show shared_buffers;
show random_page_cost;

Jason Boyer (jboyer) wrote :

I definitely agree that these plans are bananas. Here are the details:

geqo: 12
join: 14
work: 4GB
shared: 240GB
random: 4

geqo and join are fairly high, but even changing those to 3-4 doesn't significantly alter the 1.4s plan. :/

Jason Boyer (jboyer) wrote :

Bill, I did apply the max_bundle_count to the wrong method initially (the .prepare method; not the plain search) but even applied to the correct method nothing appears for a while, then the 40 that I'm looking at come in rapidly in 3-4 chunks. (Possibly a web view redrawing artifact, hard to say.) It's still quite the delay from start to finish because of the 56+ seconds of query planning but it does appear that the lowered bundle count is keeping things alive long enough to eventually finish.

Bill Erickson (berick) wrote :

Thanks, Jason. That's progress, anyway.

Improving the query plan is key, of course, but as a release valve, should we apply a max_bundle_count => 5 (or 10?) to Fielder? That way it benefits from bundling, but ensures a relatively quick turnaround of initial data for all calls.

Mike Rylander (mrylander) wrote :

Bill, I'm not against that. A stop gap seems useful.

Jason, would you test with a lower work_mem? You can do that live in a single session with:
  SET work_mem TO '1GB';

It's possible you're just giving Postgres /too many/ options with that size of work_mem. I'd personally start at 32MB or 64MB and work my way up to what's needed. I can't think of a reason to have a 4GB work_mem in Evergreen... 256MB or 512MB is my normal top even for the largest installations (with the exception of some one-off processes).

Jason Boyer (jboyer) wrote :

That this is apparently limited to us is a good sign, I guess. The frustrating thing is that there doesn't appear to be anything I can do to knock that timing down. I've adjusted work_mem, join_collapse_limit, geqo_threshold, random_page_count, effective_cache_size, effective_io_concurrency, and temp_buffers to varying sizes from comically small to excessively large and it just won't budge. I haven't checked shared_buffers since the machine is in use and that one requires a restart, but I'll try that next.

If this type of bizarre planning top-heaviness is unique to us though this may as well be marked invalid.

Bill Erickson (berick) wrote :

Branch for the max_bundle_count stop-gap:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp1768022-flat-fielder-stream-bundle-count

Would be good to know if others who have reported this issue see any improvement using this patch.

Jason Boyer (jboyer) wrote :

So, uh, interesting finding here. I pulled out each join one at a time and nothing would budge the planning time until I took out ausp and it dropped to > 100ms. Not wanting to remove it entirely I looked at the ON clause and on a whim swapped ahr.usr = ausp.usr for au.id = ausp.usr and that's all it takes. Is this whole thing built by the fielder or is this something else that could be gently nudged to make all of our web client users extremely happy?

I have no idea why that particular way of making that join freaks postgres out but it doesn't seem like something I can alter settings to fix. (vacuum analyze-ing tables all over doesn't make any difference.)

Same query, same results, one line change, super fast plan: https://pastebin.com/GYmwVZ1U

Bill Erickson (berick) wrote :

Jason, see the "ahopl" class in the IDL. The SQL you want to tweak is in there.

Come to think of it, this is starting to sound familiar...

https://bugs.launchpad.net/evergreen/+bug/1469758

Jason Boyer (jboyer) wrote :

Thanks for the pointer Bill, that's definitely a best case for this fix. I had that bug in the back of my mind for a little bit but since I didn't see anything with that kind of wildly inaccurate row estimates I didn't think to look at it that closely. (Ah, but Pg 9.3 didn't break out plan/exec time, at least not in Chris's paste...)

Jason Boyer (jboyer) wrote :

Since I'm in a mood to get this fixed on my end I went ahead and threw together a branch others can test. Hopefully this will fix it for the installations where it's an issue (EI, PINES) without causing any speed issues anywhere else.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/jboyer/lp1768022_ahopl_speedup

working/user/jboyer/lp1768022_ahopl_speedup

tags: added: pullrequest
Mike Rylander (mrylander) wrote :

Jason,

In theory those join columns should be entirely fungible by the planner based on its knowledge of fkeys and by looking at other joins' ON clauses, but with so many joins to process it will give up after a while. That's why I asked about the join_collapse limit and the geqo_threshold. And, in the face of potentially inaccurate stats, all bets are off. I'm not sure why our plans are so different -- they really shouldn't be other than for stats -- but, that's beside the point, because in certain cases (such as this) we can in fact give the planner more explicit info up front, so that the plan search space starts out with better stats correlation.

That, in essence, is exactly what you've done by changing the ON clause, so kudos to you on that! There's a risk of changing query semantics with different join columns, but because ahr.usr and au.id are prior inner join keys, and therefore essentially duplicates in the projection before the left join, there's no risk in this case.

FTR, I think the win here comes from the unique index on au.id (from the pkey) that gives the planner more options based on the expected cardinality of the left join.

I will see about testing the change on some differing scales of data, but I don't expect any issues.

Thanks for your persistence!

Bill Erickson (berick) wrote :

I did some test runs with the original and new query on a large test database. Neither performed well, but the one with Jason's patch was about 10 seconds faster. My best time was 57 seconds. I can post the analyze if that would be helpful.

Jason Boyer (jboyer) wrote :

That still seems awfully long Bill, though I don't know how many holds it was looking at. I would be curious to see that analyze output if you wouldn't mind posting it somewhere. I don't suspect there's much more tweaking to be done for now though.

Bill Erickson (berick) wrote :

To compare apples to apples, I reviewed the PG settings and made some changes. The biggest impact came from changing join_collapse_limit from 8 to 14. Now the query runs about 25 seconds. Not great, but at least it doesn't time out any more.

Here's the analyze:

http://evgtesting.kcls.org/hold-pull-list-query-analyze.txt

action.hold_request has ~450k total holds, ~400k with fulfillment_time=null and cancel_time=null.

This is a test DB running on a VM with 114G RAM / SSDs. I don't expect it to be blazing fast, but it's generally usable.

Mike Rylander (mrylander) wrote :

Jason, could you post the explain (or explain analyze) of the fast version? Mine is here: https://pastebin.com/wHBNezUf ... the explain portion (and planning time) are what I'm interested in ATM.

Bill, what query generated that plan? I think Jason's and mine are focusing on one hold, where yours looks to be ... broader.

Thanks!

Bill Erickson (berick) wrote :

Arg, sorry, I'm looking at query to build the initial holds pull list. Same "ahopl" SQL, different WHERE clause.

Jason Boyer (jboyer) wrote :

https://pastebin.com/nMAwFqc1 is the explanation for the above edit. Note I've been using the same hold id (it hasn't made a difference to the planning time).

Bill Erickson (berick) wrote :

To recap the pullrequest, we have two independent branches that should help with this issue:

https://bugs.launchpad.net/evergreen/+bug/1768022/comments/18
https://bugs.launchpad.net/evergreen/+bug/1768022/comments/22

Hello, I just tried out the "max_bundle_count=> 5" change on a test 3.2.3 system, and now the pull list loads for me without any trouble.

I had initially wondered what was up with our system when I went to view the pull list and received a blank page after waiting a while. I noticed that some locations would have it, but some wouldn't. The one that always seemed to trigger the issue has 49 items on the pull list.

Josh

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers