Hold targeter features and refactoring

Bug #1596595 reported by Bill Erickson
26
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Wishlist
Unassigned

Bug Description

Evergreen 2.10

I set out to make various changes to the hold targeter, but ended up rewriting it, because it allowed me to do some heavy refactoring. Linked is branch with the new code. It adds some features, improves targeting speed, and refactors the code in a way that is hopefully easier to understand and expand. (The existing targeter code has acquired a lot of additions over the years. It was probably time for general refactoring and optimization).

The goal was to mimic the existing targeter logic, with the exception of some changes to the max hold target loop behavior.

I've tested the code on 3 data sets. On 2 of them, I have confirmed that the final hold copy maps are identical and that the number of holds with non-NULL current copies match. (The exact targeted copies will vary due to a certain amount of randomness in the targeting logic).

On the relatively small concerto data set, I get a 10% increase in batch targeting speed. On a larger data set of 515 holds (w/ local hold configs), it runs 6 times faster overall. On an even larger data set of 280k holds (in a multi-server environment), it runs 4.5 times faster as a single process than the traditional hold targeter running with 3 parallel.

These speed improvements are, of course, highly subjective and will be affected by local configuration and system resources. It's encouraging, though, that it performs faster in every scenario I've tried so far (assuming the speed improvement are the result of efficiency and not logic errors).

In all cases above, it uses significantly less RAM, because open-ils.storage is not used. DB communication uses cstore.

Changing a complex, core part of the ILS logic is no small thing. This will require lots of additional testing and review.

I'll document features and changes in a follow-up comment.

Thoughts, questions, feedback, and testing appreciated.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/hold-targeter-reify-expand

Revision history for this message
Bill Erickson (berick) wrote :

From the features documented in the commit:

    * Ports hold targeter code to a Perl utility function, communicating w/ the DB via cstore instead of storage.

    * Adds a new global flag 'circ.holds.retarget_interval' for configuring the hold retarget interval in the database. (Still overridable via targeter script).

    * Adds a new DB function to regenerating hold copy maps to make map deletion and creation more efficient.

    * Adds an option for targeting holds in newest to oldest order.

    * Caches all org unit settings per targeter run.

    * Adds support for "skip_viable" option. This tells the hold targeter to avoid modifying any holds that target viable copies. AKA "fix broken" mode.

      For example, you might run in skip_viable mode with a retarget interval of 24hr once a day to repair non-viable holds, then also run the targeter in regular mode once a day with a retarget interval of 48 hours to give staff 2 days to process viable holds.

    * Hold target loops logic changes:

     ** Org units with fewer target attempts are prioritized during loop processing. So, instead of segregating org units into 2 categories, those attempted in the current loop and those not attempted, sort org units by the number number of times they have been attempted. Within each grouping, prioritize by target weight/proximity as before.

     ** Target looping treats the pickup lib like any other org unit. If a targeted copy at the pickup lib remains un-captured, at re-target time, a copy at a different branch is chosen (if one is available) even if other copies at the pickup lib are targetable.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed to a new lp-tagged branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp1596595-hold-targeter-reify-expand

Branch includes a new commit to support metarecord holdable formats filtering.

TODO (from IRC discussion):

* Add new private open-ils.hold-targeter service to provide API access to the hold targeter (as before)
* Teach any modified code to call this new service instead of calling the utility function directly.
* Similarly, recover parallel targeting support in hold_targeter.pl using new service.

Revision history for this message
Bill Erickson (berick) wrote :

Branch updated with:

1. open-ils.hold-targeter service. All code that calls the targeter now calls the new service.

2. updated hold_targer.pl to use the new service and re-support parallel targeting.

Parallel-ization (potential copy collision grouping on metarecords) logic is now baked into the targeter back-end code and no longer handled by the targeter script.

3. release notes of what we have so far.

Revision history for this message
Bill Erickson (berick) wrote :

Rebased to master. Slapping a pullrequest on this.

The branch includes release notes, but no unit tests. My concern with unit tests is that we'd have to create /a lot/ of unit test code to have even a modest amount of coverage for the hold targeter. Thoughts welcome on that, of course.

At minimum, I believe this code is ready for human testing.

tags: added: pullrequest
Changed in evergreen:
assignee: Bill Erickson (berick) → nobody
milestone: none → 2.next
Revision history for this message
Galen Charlton (gmc) wrote : Re: [Bug 1596595] Re: Hold targeter features and refactoring

> The branch includes release notes, but no unit tests. My concern with
> unit tests is that we'd have to create /a lot/ of unit test code to have
> even a modest amount of coverage for the hold targeter. Thoughts
> welcome on that, of course.

The lack of unit tests really concerns me, especially since the
patches *both* refactor the code and add new features; unexpected
regressions and changes in behavior would be painful.

Is there any chance of breaking some of this up into smaller pieces?

Revision history for this message
Bill Erickson (berick) wrote :

Thanks for the feedback, Galen.

If the lack of unit tests is the main concern, then my preference would be to start working on those over breaking the code into smaller pieces. I don't see a clear path forward there.

The one exception is that it should be possible (from a user perspective) to ignore the new features and run the code as before to get the same results as the existing hold targeter, randomness in current_copy selection notwithstanding.

==

For the unit tests, my plan is to write Perl live tests that run the new targeter against a set of specially selected/crafted holds and analyze the results for correctness. My goal is not to implement a tool for comparing the results of the existing targeter to those of the new targeter.

==

We could also consider supporting both hold targeters for a release cycle, so that people can easily fall back to the original if problems arise. This would be a simple matter of creating an alternate hold_targeter.pl.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed a small pile of tests to the same branch.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=commitdiff;h=99805dedd1df7499ab8ee6cf40d0a31bef9642f7

These are Perl live tests. Roughly half the tests focus on a simple title hold. The other half focus on a metarecord hold with a holdable_formats restriction.

I don't anticipate testing all targeting scenarios. If there are any areas of targeting logic anyone is particularly concerned with, let me know and I'll add it.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed a --skip-viable Perl live test and augmented the release notes for the --skip-viable option.

For the release notes, I wanted to clarify that hold copy maps (potential copies) are updated for all processed holds in --skip-viable mode, even if the target is still viable/capturable.

This may help address a common problem where new copies are added to the database (or existing copies are modified to become holdable) but are not treated as potential copies for op-capture until the holds that use them are fully re-targeted.

In practical terms, if the targeter runs in --skip-viable mode more or less constantly, then potential copies lists will be refreshed throughout the day without unnecessarily shuffling the pull list.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed changes to move the new targeter into a new hold_targeter_v2.pl script, so we can continue using the existing hold targeter and/or fall back to using it if serious bugs are found in the new script/service.

Did some squashing and rebased code to master. Will need another rebase after master settles down from 2.11.

Revision history for this message
Bill Erickson (berick) wrote :

* Rebased and further squashed after 2.11 code rush
* Fixed an ordering issue with the base SQL schema file.
* Repaired live tests to take into account changes from new tests.
* Added a metarecord hold to concerto data set (for live tests)

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp1596595-hold-targeter-reify-expand-2.11-rebase

Revision history for this message
Bill Erickson (berick) wrote :

Rebased to current master circa 2.11.1. Pushed to same branch as previous comment.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed a commit to apply parallel holds filtering within the main holds-to-target query. (My last TODO item from the code). With this, each parallel process only has to retrieve the holds it plans to process, instead of retrieving all targetable holds then filtering out the holds that should be ignored by the current parallel targeter slot.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Here are some timing stats. This is on a VM with 6GB of RAM and 2 processor cores, processing a full set of PINES holds (a little less than 30K):

old - 6 parallel:

real 746m42.842s
user 3m19.152s
sys 0m8.484s

v2 - 3 parallel:

real 134m17.646s
user 0m2.024s
sys 0m0.160s

Revision history for this message
Terran McCanna (tmccanna) wrote :

Hi Bill,

We are seeing a new problem at PINES that has been introduced with this that is confusing our circ staff:

Patron places a title-level hold on an age-protected title that their home library (Library A) owns, but chooses a different pickup library (Library B) that doesn't own a copy of that title.

The system allows the hold to be placed, and the holds targeter identifies the copy owned by Library A and it appears on Library A's pull list.

Staff at Library A retrieve the item and scan it to fulfill the hold, but it won't capture when it scans because it sees that it's age protected and cannot be sent to Library B for pickup. This is causing extra work for staff (having to retrieve the item and then reshelve it) as well as causing confusion for both staff and patrons.

(Prior to the upgrade, the hold would have been placed, but it would have given the message "No Copy Available" and not appeared on the holds pull list, which was easier to understand and didn't create any unnecessary work.)

Please let me know if you need any additional information from us about this, or if there is anything we can implement in our local hold rules to resolve the problem for now.

Thanks,
Terran

Revision history for this message
Bill Erickson (berick) wrote :

Thanks for the details, Terran. I'll see what I can find.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed a fix. From the commit:

    Pickup and requesting org unit IDs were passed in the wrong order to
    the copy permit test. This resulted in some items, particularly age
    protected copies, appearing on the holds pull list even though they were
    not (yet) permitted for hold capture at a remote library.

I've also merged origin/master into the branch for completeness. Fix above is at the tip of the branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=commit;h=cea03ca2fe3bad3c5a2c6237086d91f3d4b847f2

Revision history for this message
Terran McCanna (tmccanna) wrote :

You rock, Bill! Thanks!

Revision history for this message
Bill Erickson (berick) wrote :

Pushed a fix to resolve a discussion from IRC. In short, during checkin, the targeter would sometimes fail because it was passed an array of hold IDs instead of a single hold ID (notably during checkin re-targeting). Fix teaches the targeter to accept an array ref of IDs in addition to a single ID.

Pushed to same branch (user/berick/lp1596595-hold-targeter-reify-expand-2.11-rebase)

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=commitdiff;h=320cc9fe56d38c79da2da064a68fec746ae3385e

Kathy Lussier (klussier)
Changed in evergreen:
milestone: 2.next → 2.12-beta
Revision history for this message
Bill Erickson (berick) wrote :

Chris encountered a bug when testing where holds with lots of copies across a wide variety of circ libs can result in a transaction timeout as a large set of org unit settings is fetched on the non-transaction editor. Fix pushed to same branch which forces all org setting lookups to occur on the in-transaction editor when one is provided. This has the secondary benefit of causing org setting lookups (from the child targeter) to avoid a CONNECT dance for each.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=commitdiff;h=2771cbe124221842271c0b06c9b517fe6673256b

A secondary optimization not implemented here would be a batch org setting lookup DB func, similar to actor.org_unit_ancestor_setting_batch, that accepts multiple org units instead of (or in addition to) multiple setting names. Something like this could reduce the number of cstore calls significantly in these cases. I mention this as a possible addition, not as a requirement for this bug.

Revision history for this message
Bill Erickson (berick) wrote :

I have pushed 2 more commits to address the final paragraph from my previous comment.

1. Adds a DB function actor.org_unit_ancestor_setting_batch_by_org(setting_name, org1, org2, ...)
which returns org settings for a list of org units, an AppUtils function to invoke the DB function, and a Perl live test to exercise the code.

2. A commit to the hold targeter to use the new batch function when looking up values for the circ.holds.org_unit_target_weight and circ.holds.target_when_closed settings.

This primarily affects single-use targeting, like placement time targeting, checkin retargeting, etc.

For example, from Chris's issue in my previous comment, the in-transaction cstore timed out waiting on the parent cstore to make 238 cstore calls to get values for the circ.holds.org_unit_target_weight setting for every circ lib. (In this case, there were targetable copies at all of those circ libs). With the new code, those 238 cstore calls will be handled with 1 batch call.

Revision history for this message
Bill Erickson (berick) wrote :

Note, one of the commits from the previous comment adds another DB update script.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Just pasting a log threadtrace here that appeared after I applied the last two fixes:

http://pastebin.com/brVt8xrj

Key line appears to be: ERROR: cannot pass more than 100 arguments to a function

Revision history for this message
Bill Erickson (berick) wrote :

Force-pushed a fix to make the batch setting lookup use an int array instead of a variadic to avoid the max arguments issue above.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Since this has been running in PINES production for nearly a month and each of the fixes pushed has resolved each found problem, I'm ready to sign off. However, the Perl test is failing with the following:

1..15
# General hold targeter tests
Can't use an undefined value as an ARRAY reference at /usr/local/share/perl/5.18.2/OpenILS/Application/AppUtils.pm line 1303.
# Looks like your test exited with 255 before it could output anything.

Looks like the line it's failing on is:

my $result = $targeter->target(hold => $hold_id)->[0];

I've verified that hold id 1 does exist in the database (concerto).

Revision history for this message
Bill Erickson (berick) wrote :

Thanks, Chris. The test was failing because I neglected to copy the latest batch-org-setting SQL upgrade from the upgrade script into the base schema file. (tests++). I added code for that and renamed the test file from *.pl to *.t for consistency with the other test files. I squashed in these fixes and rebased to master.

Revision history for this message
Chris Sharp (chrissharp123) wrote :

Okay, tests pass now - thanks, Bill, and thanks for taking the time to redevelop this crucial part of Evergreen's functionality. We've seen a huge improvement in hold processing all over since implementing this, especially after the last commits that improved speed of settings lookups.

Signoff branch here:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/csharp/lp1596595-hold-targeter-reify-expand-2.11-rebase

tags: added: signedoff
Revision history for this message
Kathy Lussier (klussier) wrote :

I got a minor merge conflict in 950.data-seed-values when I tried loading this code. I resolved the conflict and pushed a new branch to:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/kmlussier/lp1596595-hold-targeter-reify-expand-2.11-rebase

Revision history for this message
Bill Erickson (berick) wrote :
Revision history for this message
Kathy Lussier (klussier) wrote :

Thank you to Bill for the code and to Chris for testing the code in production! I tested targeting using different hold selection sort orders and different types of holds. I know I only scratched the surface, but all of my tests worked as expected. Merged to master for inclusion in 2.12.

Changed in evergreen:
status: New → Fix Committed
Revision history for this message
Kathy Lussier (klussier) wrote :

Adding a note that we also had another test failure due to some changes in the Concerto date set. berick's branch from user/berick/lp1596595-hold-target-tests-update was committed to address the failures.

Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.