Label in call number browse needs to be normalized for maximum correctness

Bug #690829 reported by Dan Scott
This bug affects 2 people
Affects: Evergreen
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

  * Evergreen version: trunk and 2.0-beta5
  * PostgreSQL version: 8.4

I moved this problem discussion over from https://bugs.launchpad.net/evergreen/+bug/690242, as that bug actually pointed at two distinct problems, one of which was easy to close. I'm marking this bug as "Wishlist" because it's a request for enhancement to how call number browse picks its first result, over and above correct sorting of call numbers in the remaining results that are returned.

Mike Rylander wrote, in a description of the use of asset.call_number.sortkey and its current non-use in the WHERE clause of the query used by call number browsing:

"""
<snip>
The only way to make this infrastructure useful is to construct a query like so:

evergreen=# EXPLAIN SELECT "acn".create_date, "acn".creator, "acn".deleted, "acn".edit_date, "acn".editor, "acn".id, "acn".label, "acn".owning_lib, "acn".record, "acn".label_sortkey, "acn".label_class FROM asset.call_number AS "acn" WHERE oils_text_as_bytea("acn".label_sortkey ) >= oils_text_as_bytea( asset.label_normalizer_generic('741.2 NIC') ) AND "acn".deleted = 'f' ORDER BY oils_text_as_bytea(label_sortkey) LIMIT 9 OFFSET 5;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit (cost=7.35..20.58 rows=9 width=92)
   -> Index Scan using asset_call_number_label_sortkey on call_number acn (cost=0.00..1882160.23 rows=1280568 width=92)
         Filter: ((NOT deleted) AND ((regexp_replace(upper(label_sortkey), '\\\\'::text, '\\\\\\\\'::text, 'g'::text))::bytea >= (regexp_replace(upper(asset.label_normalizer_generic('741.2 NIC'::text)), '\\\\'::text, '\\\\\\\\'::text, 'g'::text))::bytea))
(3 rows)

But, there's a problem with that -- we have to know the normalizer type to use for the user-supplied call number value so that we know how we should convert it. One option would be to use the value configured for the search OU to normalize the input. We could offer the choice to users (or, at least, to staff) for correctness, but a choice needs to be made by someone, and a single normalizer applied.

We will also need to adjust the sortkey index ( "asset_call_number_label_sortkey" btree (oils_text_as_bytea(label_sortkey)) ) to look like the old label index ("asset_call_number_upper_label_id_owning_lib_idx" btree (oils_text_as_bytea(label), id, owning_lib) ).

This may be a longer-term project than we can handle before 2.0. Therefore, I suggest an alternate short-term solution: go back to label for sorting (though, with the as_bytea work in place) which is how 1.6 works, and (while not allowing different normalization forms) is known to work well enough for institutionally uniform configurations.
"""
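
To illustrate why the comparison has to happen on normalized keys, and why the user-supplied value needs the same normalizer as the stored labels, here is a minimal Python sketch. It is not Evergreen code: `toy_lc_sortkey` is a hypothetical, heavily simplified stand-in for a Library of Congress normalizer (it only zero-pads the class number), and `browse` merely mimics a `WHERE key >= ... ORDER BY key` query over an in-memory list.

```python
import re


def toy_lc_sortkey(label: str) -> str:
    """Hypothetical, simplified LC-style sort key: zero-pad the class number.

    This is NOT asset.label_normalizer_lc; it only shows the idea.
    """
    text = label.upper().strip()
    match = re.match(r"([A-Z]+)\s*(\d+)(.*)", text)
    if match is None:
        return text  # leave anything we do not recognize untouched
    letters, number, rest = match.groups()
    return f"{letters}{int(number):04d}{rest}"


def browse(labels, start, key):
    """Mimic WHERE key(label) >= key(start) ORDER BY key(label)."""
    wanted = key(start)
    return sorted((label for label in labels if key(label) >= wanted), key=key)


labels = ["QA9 .B3", "QA76 .C5", "QA100 .A1"]

# Comparing raw (upper-cased) labels: "QA9" sorts after "QA80" character by
# character, so it is returned, while "QA100" is skipped entirely.
raw = browse(labels, "QA80", key=str.upper)

# Comparing normalized keys on both sides gives the shelf order a patron expects.
normalized = browse(labels, "QA80", key=toy_lc_sortkey)

print(raw)         # ['QA9 .B3']
print(normalized)  # ['QA100 .A1']
```

The same input produces different starting points depending on the key function, which is why a single normalizer has to be chosen for the incoming value before it can be compared against label_sortkey.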

To which Dan Scott responded:

"""
* The WHERE clause compares the raw label against the raw user-supplied call number - which means that a legitimate range of call numbers might be skipped for a given normalization. This is actually no different than how Evergreen works (including incorrect results) in previous versions. And it's important to underscore that the current use of acn.label in the WHERE clause is not the reason why the sequential scan occurs.

I disagree with the concluding assertion that "going back to label for sorting (though, with the as_bytea work in place) which is how 1.6 works, and (while not allowing different normalization forms) is known to work well enough for institutionally uniform configurations". It does not work well for libraries that use Library of Congress call numbers; it has been the source of many complaints in Conifer, and was the reason that I worked on the call number normalization in the first place. It was not simply an academic (ha ha) exercise; sorting on normalized call numbers is required to tackle actual user-visible problems.

<snip>

For absolute correctness of call number searching and browsing (which probably should be a completely separate bug, but I'll address it here for now as a start and we can move to a separate bug if necessary), we need to know how to normalize the incoming call number. Let's consider the current user-visible entry points to call number browsing:

  * Clicking on the "Shelf Browser" or "Browse Call Numbers" in the detailed item view, or clicking on the call number in the unapi htmlholdings-full format. From these points, we have access to the source acn, and therefore we have access to the source acn.label_class column, which can then be fed into the call number browsing method. I think it is a reasonable assumption that, when a person invokes the shelf browser, they expect to see other call numbers in the vicinity of this call number, and therefore we can use the source item's call number class.

So, if we give O:A:SuperCat:cn_browse() and O:A:SuperCat:cn_startwith() each an extra, optional argument for label_class, and teach O:A:Storage:Publisher:biblio:record_copy_status_count() and O:A:Storage:Publisher:biblio:record_copy_status_location_count() to return cn.label_class in their payload (as rdetail.js uses this call as the basis of building its list of call numbers for the shelf browser through _rdetailBuildInfoRows()), then I think we can mitigate this part of the problem. If cn_browse() and cn_startwith() don't receive an explicit label_class, then we fall back to the OU setting. The only problem here is that cn_browse() and cn_startwith() currently use only positional arguments, and there are already a number of optional arguments - so either we find all of the calls that don't use the positional arguments and adjust them to explicitly pass undefs for the intervening arguments, or we convert the optional arguments to a hash of named parameters and convert the existing calls to that approach. I suppose we make that choice depending on how many calls of each kind currently exist.

  * In the "Advanced Search", the call number search is currently a simplistic text field - one of the options in "Quick Search". In this case, as you suggest, we could break out the call number search into its own UI element and add user-selectable options for the normalization (defaulting to the normalization in the OU setting for the current search scope, of course).

The advanced search option currently opens cn_browse.xml, which pulls in cn_browse.js to invoke getCallnumber(), which currently just grabs the text string (the PARAM_CN (cn) GET param). We can add a PARAM_CNCLASS to this to provide the classification to the method call.

Once we get this far, we could actually teach the call number browsing methods to use label_sortkey in the WHERE clause and to normalize the incoming call number. I think we can write a two-argument and three-argument SQL function that takes the incoming call number text string, wraps it in oils_text_as_bytea(), and also accepts a context OU id and optional normalization ID so that we can look up the appropriate normalizer and return the normalized text string, rather than having to rewrite the json query to include the appropriate joins.
"""
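
The lookup order proposed above (explicit label_class if supplied, otherwise the OU setting) can be sketched in Python as follows. Everything here is an assumption made for illustration: the registry `NORMALIZERS`, the `OU_DEFAULT_CLASS` mapping, the class names, and the function `normalized_browse_key` are hypothetical and do not correspond to existing Evergreen functions or settings. The real version would be the SQL function described in the comment, and would additionally wrap the result in oils_text_as_bytea().

```python
import re
from typing import Callable, Dict, Optional


def generic_key(label: str) -> str:
    """Hypothetical generic normalizer: upper-case and collapse whitespace."""
    return " ".join(label.upper().split())


def toy_lc_key(label: str) -> str:
    """Hypothetical, simplified LC normalizer: zero-pad the class number."""
    text = generic_key(label)
    match = re.match(r"([A-Z]+)\s*(\d+)(.*)", text)
    if match is None:
        return text
    letters, number, rest = match.groups()
    return f"{letters}{int(number):04d}{rest}"


# Assumed registry of classification schemes and their normalizers.
NORMALIZERS: Dict[str, Callable[[str], str]] = {
    "generic": generic_key,
    "lc": toy_lc_key,
}

# Assumed per-OU default classification (stand-in for the OU setting).
OU_DEFAULT_CLASS: Dict[int, str] = {4: "lc"}


def normalized_browse_key(
    label: str, context_ou: int, label_class: Optional[str] = None
) -> str:
    """Two- or three-argument form: an explicit class wins, else the OU default."""
    chosen = label_class or OU_DEFAULT_CLASS.get(context_ou, "generic")
    return NORMALIZERS[chosen](label)


# Shelf browser: the source call number's class is passed explicitly.
print(normalized_browse_key("qa76 .c5", 4, label_class="generic"))
# Advanced search with no class chosen: fall back to the OU default.
print(normalized_browse_key("qa76 .c5", 4))
```

Making label_class a keyword argument with a default of None mirrors the suggestion to move away from long runs of positional optional arguments: existing callers that omit it keep working and simply get the OU default.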

Tags: opac-browse
Elaine Hardy (ehardy)
tags: added: browse
tags: added: opac-browse
removed: browse