Z39.50 Search - open-ils.search.z3950.search_class - occasional null results when use_perm in use

Bug #2044854 reported by Josh Stompro
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Confirmed
Medium
Josh Stompro

Bug Description

EG 3.11.1 but I think it has been an issue for a while.

When searching Z39.50 sources we are seeing the search results come back empty when they should have results. It was somewhat random when this would happen, the same search immediately repeated would sometimes keep failing and sometimes work successfully.

The searches are not actually getting sent to the remote z39.50 server, and we are not getting back that there were zero results, the call to open-ils.search.z3950.search_class was just returning null/undefined.

We are using the restrict z39.50 sources by permission group feature.
https://docs.evergreen-ils.org/docs/latest/admin/z3950_servers.html#_restrict_z39_50_sources_by_permission_group_2

We have two systems that each have their own set of Z39.50 servers. The use of that feature seems to be causing the problem.

This issue is hard to replicate, I think we started seeing it more because we have single VM dedicated to cataloging and acquisitions. So our z39.50 users are both hitting the same server that also doesn't have as much other searching going on.

The issue seems to be that the Z39.50 sources are cached in a %services hash, and when a user requests the list of sources that they have permission to use (open-ils.search.z3950.retrieve_services), that hash gets set with their filtered list. If the next user that happens to hit that same process doing a z39.50 search (open-ils.search.z3950.search_class) has access to a different list of sources, then the search fails because those sources are not in the %services hash.

I added a bunch of debugging notes to bug 1271559 before realizing that the original issue there didn't match what we were seeing.

I'll copy some of that over here.

Example of srfsh output from the problem.
**************
srfsh# request open-ils.search open-ils.search.z3950.search_class "validauthkey",{ "service_array": [ "oclc" ] , "username_array": [ "" ] , "password_array": [ "" ] , "limit":10, "offset":0, "search": { "isbn":"9781639731930x" } , "service": [ "oclc" ] , "username": [ "" ] , "password": [ "" ]}

Received Data: {
  "count":0,
  "service":"oclc",
  "records":[

  ]
}

------------------------------------
Request Completed Successfully
Request Time in seconds: 0.234018
------------------------------------
srfsh# request open-ils.search open-ils.search.z3950.search_class "validauthkey",{ "service_array": [ "oclc" ] , "username_array": [ "" ] , "password_array": [ "" ] , "limit":10, "offset":0, "search": { "isbn":"9781639731930x" } , "service": [ "oclc" ] , "username": [ "" ] , "password": [ "" ]}

Received no data from server

------------------------------------
Request Completed Successfully
Request Time in seconds: 0.006879
------------------------------------
**************

=================

Steps to reproduce issue.

1. Have 2 or more users with different Z39.50 use_perm permissions, that are limited to different sets of z39.50 servers.

2. Grab or generate auth keys for those users.

3. Restart open-ils.search to clear the %services cache : osrf_control --restart --service open-ils.search -l

4. Watch your system logs for z3950 requests. tail -f /openils/var/log/osrfsys.log| fgrep 'z3950'

5. start srfsh and perform a request for user 1 to get the list of z39.50 servers they can access.
   a. request open-ils.search open-ils.search.z3950.retrieve_services "user 1 auth key"
   b. Make note of which process handled the request from logs.
   c. Process 50493 handled this request for example open-ils.search [INFO:50493:Z3950.pm:454:170110717350926105] z3950:

6. Perform the request again for user 2. Hopefully you hit a different process. If you hit the same process you will need to go back and request for user 1 again.

7. Perform some searches and watch to see which process you hit. If you are searching using User2, and you hit the process that has the services for user1, then you will get a "Received no data from server" result.

Search example: Set your service name to match what you have.

srfsh# request open-ils.search open-ils.search.z3950.search_class "user 2 auth token",{ "service_array": [ "setservicename" ] , "username_array": [ "" ] , "password_array": [ "" ] , "limit":10, "offset":0, "search": { "isbn":"9781639731930y" } , "service": [ "setservicename" ] , "username": [ "" ] , "password": [ "" ]}

=================

A simple fix is to not cache the results of the filtered open-ils.search.z3950.retrieve_services, and only cache the list when performing a search, which only creates an unfiltered %services hash.

Or maybe it would be better to cache before the filtering happens to not waste the db lookups?

Revision history for this message
Josh Stompro (u-launchpad-stompro-org) wrote :
tags: added: pullrequest z3950
Revision history for this message
Galen Charlton (gmc) wrote :

After looking at the code in question, the services lookup doesn't appear to be all that expensive, just a settings lookup and a fleshed query from tables that are unlikely to be particularly large, even if you add dozens of Z39.50 targets.

Consequently, I think the simplest thing to do might be to not cache the services definitions at all (i.e., drop the package-level %services hash). If caching is truly needed for performance reasons (and I'm dubious, but of course actually measuring would trump a feeling), it would be more correct to (say) include the authtoken in the cache key (and use memcached for the actual cache).

Revision history for this message
Galen Charlton (gmc) wrote :

Josh, I've pushed an alternative patch for your consideration here:

user/gmcharlt/lp2044854_remove_z39_services_caching / https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/gmcharlt/lp2044854_remove_z39_services_caching

Galen Charlton (gmc)
Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Galen Charlton (gmc) wrote :

(Noting that my patch likely has some inefficiencies, as some of the fetch_service_defs() calls could likely be avoided by passing the $services hash around during the course of a method request. However, its current state should give an idea of the lack of need for child-process-level cache.)

tags: added: caching
Changed in evergreen:
assignee: nobody → Josh Stompro (u-launchpad-stompro-org)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.