Hold targeter taking a long time to run

Bug #1185865 reported by Ben Shum
This bug report is a duplicate of: Bug #1272316: much slower holds processing in 2.4+.
This bug affects 5 people
Affects: Evergreen
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Evergreen master / 2.4

Since our upgrade, we've noticed that the hold_targeter script is taking unusually long to run its course. Symptoms started immediately, with email notifications from cron telling us it couldn't start hold_targeter again because a previous run was still going. The first day it ran into itself throughout the whole day (our interval is every 15 minutes); the second day it quieted down a little; and by the third day it was noticeably quieter but still complaining periodically.

My assumption is that as the holds take longer to process, the work slowly spreads out over more hours, leaving fewer holds per interval and eventually fewer warnings about the hold targeter already running. Instead of running hardest from 1 am through 5 am, it now runs from 1 am through 10 am or later, or something akin to that. As a result of that time displacement, I think our libraries that run their hold pull lists in the morning are seeing more fluctuation than normal as they view holds, print holds, and check in holds. By the time they reach the last step, and even from moment to moment while printing, the actual pull list may have changed to include more titles, different titles, or fewer titles, which disrupts staff workflows.

Another symptom we've noticed is that our PostgreSQL logs regularly contain entries like:

automatic vacuum of table "evergreen.action.hold_copy_map": could not (re)acquire exclusive lock for truncate scan

This seems to indicate that the hold targeter process is taking longer than before and bumping up against the autovacuum processes. That entry shows up every couple of hours for us, too.
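
For anyone else chasing the same log noise, pg_stat_user_tables gives a quick read on how often autovacuum is actually getting through that table. This is just a generic PostgreSQL check, nothing Evergreen-specific beyond the table name:

-- How recently and how often autovacuum has handled action.hold_copy_map,
-- and how many dead tuples it still has to clean up.
SELECT schemaname,
       relname,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
  FROM pg_stat_user_tables
 WHERE schemaname = 'action'
   AND relname = 'hold_copy_map';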

Just reporting a potential performance issue and hoping to gather feedback from other master or 2.4 sites, to see whether they have also noticed the hold targeter running longer than before or impacting other areas.

Revision history for this message
Robert J Jackson (rjackson-deactivatedaccount) wrote :

Evergreen Indiana Consortium is on 2.2, so this may or may not relate. Our utility server handles the holds targeter, and it recently crashed during off hours. We restarted processes during core library hours and noticed that, because of the way the selection is performed (holds not checked in the last 24 hours or longer), holds targeting was now happening during core library hours. The processing appears to be spreading out over time as Ben describes in his original report, due to the 15-minute cron schedule and the length of time it actually takes to process the roughly 12K holds we have. However, the spread currently stays within core library hours; none of the processing has actually moved outside those hours.
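
For clarity, the selection criterion I'm describing is roughly the query below. This is only a sketch against action.hold_request; the real targeter script applies more conditions, so treat it as an approximation:

-- Roughly: open, unfrozen holds whose last targeting pass is more than
-- 24 hours old, or that have never been targeted at all.
SELECT id
  FROM action.hold_request
 WHERE capture_time IS NULL
   AND cancel_time IS NULL
   AND fulfillment_time IS NULL
   AND frozen IS FALSE
   AND (prev_check_time IS NULL
        OR prev_check_time < NOW() - INTERVAL '24 hours');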

Revision history for this message
Ruth Frasur Davis (redavis) wrote :

Just adding that we've also noticed a performance issue related to holds. We're on 2.2, of course.

Revision history for this message
Mike Rylander (mrylander) wrote :

Ben, to what degree are you parallelizing the hold targeter? See opensrf.xml, the setting at XPath //opensrf/default/hold_targetter/parallel

Revision history for this message
Ben Shum (bshum) wrote :

To Mike: ours is set to 3 at the moment for hold targeter parallelism. We haven't changed it since our upgrade, so I hadn't considered it a factor in figuring out whether the overall process was "slower" in our current version.

Revision history for this message
Mike Rylander (mrylander) wrote :

I don't doubt that hold targeting can be noticeably slower in 2.4, between the adjusted proximity calculation and the general data size increase it creates. Even when no prox adjustment rules are defined, we're calling the stored proc to attempt that calculation for every copy considered for every hold, and that is more expensive than the indexed lookup used previously.
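
For the curious, the old indexed lookup is essentially a read against the precomputed proximity table, something like the sketch below (the parameters are placeholders, not the actual targeter code):

-- Pre-adjustment behavior, roughly: natural proximity between the hold's
-- pickup library and the copy's circulating library, straight from the
-- precomputed, indexed table. :pickup_lib and :copy_circ_lib are placeholders.
SELECT prox
  FROM actor.org_unit_proximity
 WHERE from_org = :pickup_lib
   AND to_org   = :copy_circ_lib;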

There are certainly some optimization opportunities (the simplest case being "there are no rules, so use natural proximity"). In the meantime, however, if you can spare the DB server CPU cycles, increasing the parallelism should help shorten the run time.
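
If you want to gauge whether there's headroom before bumping the parallel setting, a quick look at pg_stat_activity for long-running backends is a reasonable sanity check. This is a sketch for PostgreSQL 9.2+; on older servers the columns are procpid and current_query instead:

-- Longest-running backends first; eyeball the ones belonging to the
-- hold targeter drones to see how much of the box they are tying up.
SELECT pid,
       usename,
       state,
       now() - query_start AS runtime,
       left(query, 100) AS query_snippet
  FROM pg_stat_activity
 ORDER BY runtime DESC NULLS LAST
 LIMIT 20;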

For future consideration, we may look at materializing or caching some of the prox info ... though it's hard to see many cases where re-use is likely.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

This issue is definitely affecting PINES. Increasing the parallel settings has not yet improved the situation. Still watching it, but it is unacceptably slow.

Revision history for this message
Ben Shum (bshum) wrote :

Marking as a duplicate because the new bug 1272316 supersedes this one.
