Holds not displaying in web client Holds Shelf

Bug #1827250 reported by John Amundson
28
This bug affects 7 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Medium
Unassigned
3.2
Fix Released
Medium
Unassigned
3.4
Fix Released
Medium
Unassigned

Bug Description

Observed in EG 3.2.4

Since moving to Evergreen 3.2.4, we have had some instances where holds that should be appearing on a library's holds shelf (Circulation>Holds Shelf), are not displaying.

We can't determine what makes these holds special. No issues with the attached patron records, barcodes and call numbers not deleted, bib record not deleted, hold on the correct shelf, etc. There is also nothing consistent between the holds that are affected.

The issue is not with cache or a specific workstation. No matter where the holds shelf is pulled up, the same hold(s) fail to display.

The holds also behave correctly in other ways, (i.e. still displays as Ready for Pickup while on the patron's record, shows attached to the patron from Item Status, are fulfilled with checked out, etc).

This makes the problem hard to troubleshoot.

I'll note that we still have access to a XUL client component, and the holds that fail to display in the web client still display correctly in the XUL client.

We had no reports of this while on 3.0, but I cannot confirm one way or the other if older versions of the web client are also affected.

Changed in evergreen:
status: New → Confirmed
milestone: none → 3.3.1
Revision history for this message
Jason Stephenson (jstephenson) wrote :

This appears to be caused by the last captured hold check in the new open-ils.storage.action.live_holds.wide_hash method. This method was introduced in Evergreen 3.2, so 3.1 should not be affected by this bug.

This code check attempts to find the most recently captured hold for the copy, but it fails if there are uncaptured holds that have the same current hold copy as the one being checked in. My experimentation on PostgreSQL 9.5 shows that these holds with a NULL capture_time sort before the holds with a capture_time in the below query:

SELECT id
FROM action.hold_request recheck
WHERE recheck.current_copy = cp.id
ORDER BY capture_time DESC
LIMIT 1

Thus returning a different hold from the one being put on the shelf. This id mismatch prevents the hold data from being returned to the client. This causes the hold not to show up in the holds shelf list.

The query above needs to have "AND capture_time IS NOT NULL" added to the WHERE clause. Branch forthcoming.

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote :

My fix branch is here:

working/user/dyrcona/lp1827250_last_captured_hold_fix

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp1827250_last_captured_hold_fix

Unfortunately, there is not a really good way to test this. It will most likely require production data as Concerto is likely to small and clean to turn the issue up.

If you have a hold shelf for a library that you know is missing holds in the list, then you could look at the list both before and after this patch and see that the missing holds turn up.

If pressed, I'l see if I can come up with a query to turn up the missing holds so you can know what to look for. The problem with that is the query used by the storage function to retrieve the hold shelf list is very complicated and replicating it in plain SQL would be no small feat.

tags: added: pullrequest
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Comment #1 is a case of, I should have edited my comment in a text editor instead of directly into the web form.

This isn't about putting holds on the shelf, but rather viewing the holds. I don't know why I was typing that about putting holds on the shelf. Suffice it to say, that the check for the last captured hold is preventing certain holds from show up on the holds shelf list in the 3.2+ web client and my patch appears to correct this in our testing.

Everything I said in comment #2 still stands. I'll blame the mess in comment #1 on my eagerness to explain the problem and thus rushing through typing up the comment.

Revision history for this message
Mike Rylander (mrylander) wrote :

Great catch, Jason.

It may be worth testing an alternate of

 ORDER BY capture_time DESC NULLS LAST

rather than adding the IS NOT NULL test with regard to performance. Reason being, the plans may be different in the details, and use different indexes.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Thanks for the suggestion, Mike. I tried both variants with explain analyze, and the nulls last version is overall faster. (See the attached document with the results for a single copy.)

I made a new branch using nulls last on the order by and posted it to the working repo. It still resolves the issue for our example holds shelf where there are multiple copies missing. Since it about 1 ms faster than the is not null version, we should use this one:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp1827250_last_captured_hold_fix_nulls_last

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Revision history for this message
Jason Stephenson (jstephenson) wrote :

BTW, CW MARS is using the branch from comment #5 in production since the 14th of May 2019 with no adverse consequences. It is working for us as intended.

Changed in evergreen:
milestone: 3.3.1 → 3.3.2
Changed in evergreen:
milestone: 3.3.2 → 3.3.3
Michele Morgan (mmorgan)
Changed in evergreen:
assignee: nobody → Michele Morgan (mmorgan)
Revision history for this message
Michele Morgan (mmorgan) wrote :

We've gotten reports of this on our 3.2.4 system. I'll test the patch, but here's a question:

Why do we retain current_copy in a cancelled hold?

The two examples I saw in our production system were situations where a cancelled hold existed with the same current_copy as the item that was on the holds shelf for a current ready for pickup hold.

Our production system has 414,000 cancelled holds with current_copy set. About 367,000 of these do NOT have a NULL capture_time.

Revision history for this message
John Amundson (jamundson) wrote :

Michele,

I think one of the reasons retaining the current_copy is important is the clear holds shelf function. The holds themselves may be cancelled, but staff still need to know which items are attached so that they can remove them from their holds shelf.

It is also beneficial for general troubleshooting. Cancelled holds are accessible from within a patron's record. It may be useful to see if and which item the hold had captured.

Changed in evergreen:
milestone: 3.3.3 → 3.3.4
Revision history for this message
Michele Morgan (mmorgan) wrote :

John, good points about retaining the current_copy for cancelled holds for the purposes of clearing the holds shelf and troubleshooting. I can see the value there, but it seems like retaining current_copy in every hold makes it more cumbersome to get the pertinent data out of action.hold_request.

A quick query in our production database shows hundreds of holds with the same current_copy, fulfilled or cancelled long ago. I'm not sure we need to keep that data back to the beginning of holds.

All that said, I tested both patches with production data and am adding my signoff to the nulls last patch as it seems to perform just a bit better. Here's my signoff branch:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/mmorgan/lp1827250_last_captured_hold_fix_nulls_last_signoff

tags: added: signedoff
Changed in evergreen:
assignee: Michele Morgan (mmorgan) → nobody
Changed in evergreen:
milestone: 3.3.4 → 3.3.5
Revision history for this message
Galen Charlton (gmc) wrote :

Pushed all the way back to rel_3_2. Thanks, Jason and Michele!

Changed in evergreen:
importance: Undecided → Medium
status: Confirmed → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released
Revision history for this message
Josh Stompro (u-launchpad-stompro-org) wrote :

Hello, I'm finding that the "NULLS LAST" query takes 65 seconds to return 372 rows, while the "AND RECHECK.capture_time IS NOT null" takes 1.3 seconds to return 372 rows.

We are running PG 9.4 yet, so that version is probably missing an optimization that makes the nulls first faster.

The explain analyze seems to be saying that the call number join is what is blowing it up, evaluating 250 million rows for some reason.

We will use the "AND RECHECK.capture_time IS NOT null" until we upgrade to a newer PG version.
Josh

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Josh, see bug 1855329.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.