much slower holds processing in 2.4+

Bug #1272316 reported by Chris Sharp
Affects      Status        Importance  Assigned to  Milestone
Evergreen    Fix Released  High        Unassigned
2.4          Fix Released  Undecided   Unassigned
2.5          Fix Released  Undecided   Unassigned

Bug Description

This is a new bug with the goal of generalizing the report of problems in bug 1185865 (about the holds targeter script taking a long time to run) to cover all hold targeting in 2.4+. In our newly upgraded 2.5.1 instance, we are seeing greatly increased hold targeting times, both in the hold targeter process (which, since the upgrade, has yet to finish within 24 hours even with 5 concurrent processes enabled) and in real-time hold placement. The delays in real-time hold placement have had a significant service impact within our libraries and for our OPAC patrons. From our office network connection, which is not under the normal duress that our public libraries experience, hold placement on certain titles takes 40+ seconds to complete. When you factor in network and workstation problems at local libraries, we're talking about real delays here.

I've asked our libraries for example records that may illustrate the issue. We're not sure at this point whether the number of hold requests per bib record or the number of copies/volumes attached to each bib record are a factor. We suspect one or the other or both.

I'll mention that we do not have any hold proximity adjustments in place. That was a feature we were intending to test post-upgrade, though if it comes at this cost, we will probably avoid that feature altogether until these delay issues are resolved.

Evergreen 2.5.1
OpenSRF 2.2.1
PostgreSQL 9.1.11
Ubuntu 12.04 LTS

Tags: pullrequest
Chris Sharp (chrissharp123) wrote :

Referencing IRC discussion here http://irc.evergreen-ils.org/evergreen/2014-01-24#i_62613, we now see that the delay appears to be caused by a large number of actor.org_unit_setting calls to see the unit's target_weight and whether to target when the unit is closed. On bib records with many copies attached, this becomes a very long process.

Lebbeous Fogle-Weekley (senator) provided a patch to test at working/collab/senator/slow-targeter-try-faster-settings-lookup, and I can see a modest improvement, but the frontend still times out after applying those patches.

Lebbeous Fogle-Weekley (lebbeous) wrote :

Can we get fresh (redacted) logs from after this branch was applied linked into the bug report? That would help us see whether the remaining slowness is still related to org unit setting lookups or lies in other areas.

Plus I think Thomas Berezansky and Jason Stephenson have some findings or data on this same issue?

Kathy Lussier (klussier) wrote :

I checked in with C/W MARS on this issue since they have also seen holds targeter slowness since upgrading to 2.4. The problems they have seen are that the holds targeter process run via cron takes longer to complete (bug 1185865) and that the holds targeting triggered by the "Retarget Local Holds" checkin modifier is significantly slower. The latter was also mentioned by Jason Stephenson in a comment on bug 1189617 - https://bugs.launchpad.net/evergreen/+bug/1189617/comments/3.

They have not seen the problem with slow holds placement in 2.4.

Chris Sharp (chrissharp123) wrote :

Attaching the requested log info.

Mike Rylander (mrylander) wrote :

IIUC, the improvement from Lebbeous' change was on the order of 17%. While still just over the 60s threshold, I would call that a pretty big improvement. Interestingly, the code he touched has to do with fetching org unit settings for max-loops and org-weight. Chris, is there any chance that PINES recently started making use of these settings?
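
For reference, one quick way to check from the database whether any org units actually have those settings configured would be a query along these lines (the setting names here are an assumption about what the targeter looks up, so adjust as needed):

 -- Setting names assumed; adjust to whatever the targeter actually queries.
 SELECT org_unit, name, value
   FROM actor.org_unit_setting
  WHERE name IN ('circ.holds.org_unit_target_weight',
                 'circ.holds.target_when_closed');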

Mike Rylander (mrylander) wrote :

Chris,

I have a crazy theory. First, which version of PG are you on now? Then, what does the bloat of action.hold_copy_map and action.hold_request look like, based on the query here: http://wiki.postgresql.org/wiki/Show_database_bloat ? (Note: you might need to adjust that query for some very recent versions of Postgres.)
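
As a rough first pass before running the full wiki bloat query, a quick look at the stats collector's dead-tuple counts for those two tables can be telling (this is only an approximation, not the wiki query itself):

 -- Rough dead-tuple check from the stats collector (not the wiki bloat query).
 SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
   FROM pg_stat_user_tables
  WHERE schemaname = 'action'
    AND relname IN ('hold_request', 'hold_copy_map');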

If either of those tables is significantly bloated, then autovacuum, with the default settings, will have trouble keeping up with the churn on them. To address that, you'll want to both lower the autovacuum cost delay and raise the autovacuum cost limit settings.

Crazy huge bloat on those tables would have the cumulative effect of significantly slowing down otherwise fast queries, whilst the high-churn nature of those tables, especially the hold copy map during targeting, will defeat autovacuum's attempts to clean up the dead space, exacerbating both the bloat and the speed issue.
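
If it comes to that, one way to apply those autovacuum adjustments to just these two tables is with per-table storage parameters; the values below are illustrative starting points rather than tested recommendations:

 -- Illustrative per-table autovacuum tuning: no cost delay and a higher cost
 -- limit, so cleanup has a better chance of keeping up with the churn.
 ALTER TABLE action.hold_copy_map
   SET (autovacuum_vacuum_cost_delay = 0, autovacuum_vacuum_cost_limit = 2000);
 ALTER TABLE action.hold_request
   SET (autovacuum_vacuum_cost_delay = 0, autovacuum_vacuum_cost_limit = 2000);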

Ben Shum (bshum) wrote :

For fun, I ran the query on that page against our production database (I included a WHERE clause to limit it to action.hold_request and action.hold_copy_map, as indicated). I'm going to give it some more thought and attention next week.

I put the output result at http://evergreen-ils.org/~bshum/misc/bloat-query-ahr-ahcm.html

Chris Sharp (chrissharp123) wrote :

Here is the same query as Ben's against PINES' master DB: http://georgialibraries.org/~csharp/pines-hold-bloat-2014-01-25.html

Chris Sharp (chrissharp123) wrote :

Mike,

We are not using max loops or org weight either, FYI.

Thanks!

Chris

Chris Sharp (chrissharp123) wrote :

There are probably pitfalls to this approach, so forgive me if it's naive, but could we put in new boolean YAOUSs that basically say "use weights for hold targeting" and "use max loops for hold targeting"? If either is false, we wouldn't check anything else.

Kathy Lussier (klussier) wrote :

I'm going to take back what I previously said about placing holds. I'm hearing from C/W MARS library staff that it can sometimes take up to 30 seconds to place a hold. Anecdotally, it seems to be worse on more popular titles that already have a lot of holds.

Changed in evergreen:
importance: Undecided → High
status: New → Confirmed
Kathy Lussier (klussier) wrote :

I don't know if it's helpful, but I'm adding data from some timing tests that were done at C/W MARS. It does appear that the issue is more pronounced on records with many copies and holds.

Placing holds on C/W MARS test records (seconds to place each hold):

04.88
03.85
03.91
01.75 – hold not successful – no available copies
05.07
05.
05.25
03.88

Then I placed a hold on a real, popular title, Captain Phillips: 533 current holds with 69 total copies; 11.25 seconds to place a hold.

Mrs. Lincoln's Rival: A Novel: 20 current holds with 81 total copies; 14.31 seconds to place a hold.

The Invention of Wings: 58 current holds with 44 total copies; 10.50 seconds.

The Invention of Wings: 460 current holds with 150 total copies; 20.47 seconds.

It took 06.57 seconds to return to search results after clicking Continue.

Michele Morgan (mmorgan) wrote :

Not sure if this is helpful, either, but since upgrading to 2.4, our libraries that make heavy use of the checkin modifiers that retarget local holds have seen errors when trying to check in items with large numbers of holds. Kathy Lussier pointed out that these are likely timeout errors due to the slower holds processing described in this bug report. Finding no other cause for the errors, I'm inclined to agree.

We are currently running 2.4.4 and have not yet implemented any of the 2.4 holds enhancements.

Mike Rylander (mrylander) wrote :

Looking at the logs that Chris sent, I continue to believe that the observed slowdown is due not to a specific feature or its lone cost, but to the general accretion of functionality.

That said, I did see a good opportunity for an optimization that starts with the prox adjustment and extends to the hold targeter in general. I've implemented a branch to test this out, and eyes would be appreciated. If it works well in testing, I would like to see this treated as a bug fix and get it into both 2.5 and 2.6.

No pullrequest yet, as it's not well tested, and could very well contain bugs up to and including syntax thinkos ...

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/prox-calc-speedup

Mike Rylander (mrylander) wrote :

Following Dan's lead on another bug, I'm tossing pullrequest on this to elicit some feedback. TIA!

tags: added: pullrequest
Chris Sharp (chrissharp123) wrote :

I can confirm big improvements with both Lebbeous's and Mike's patches applied. Comparing times with only Lebbeous's fixes against times after also applying Mike's:

A hold that took 55 seconds now takes 18 seconds.

A hold that took 65 seconds now takes 24 seconds.

A hold that took 24 seconds now takes 6 seconds.

Thanks to both Lebbeous and Mike for their efforts!

Ben Shum (bshum) wrote :

Continuing to test, I noticed this error coming up every time a new hold was placed.

open-ils.storage: [ERR :2504:action.pm:1819:1393482443255619] Processing of hold failed: Can't use string ("1") as an ARRAY ref while "strict refs" in use at /usr/local/share/perl/5.14.2/OpenILS/Application/AppUtils.pm line 1302.

Ben Shum (bshum) wrote :

Actually, never mind that last comment. I had trouble fully applying part of one commit; all seems well now.

Ben Shum (bshum) wrote :

Since this works for us, I've pushed it to master and rel_2_5.

rel_2_4 has some whitespace-vs-tab conflicts that need more manual resolution.

Changed in evergreen:
milestone: none → 2.6.0-rc1
status: Confirmed → Fix Committed
Mike Rylander (mrylander) wrote :

I've backported this to rel_2_4.

Here's a general tip for backporting changes like this, which cross a mass tab/space whitespace cleanup and so don't cherry-pick cleanly with git tools alone. With your commit hash in hand, from the top of your working tree, do this:

 $ git log --pretty=oneline --patch b2c70c5^..b2c70c5 | patch -p1 -l --dry-run

The -l switch tells patch to ignore whitespace differences. If you get no errors, you can remove the --dry-run part to apply the patch, then duplicate the commit metadata from the original commit when you commit the result. git-commit takes an --author switch to help you lie about the authorship, and signoffs are really just strings recognized by convention inside the commit message.

Changed in evergreen:
status: Fix Committed → Fix Released