TPAC: Search grammar in record data affects links in surprising ways
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | Fix Released | Undecided | Unassigned |
Bug Description
* Evergreen 2.3.0
Building on bug #856811, where we discovered that the # symbol in facets caused searches to fail, I've found that the problem goes deeper. It's not just facets: almost any link that we surface in the TPAC and build from MARC data can be misread as search syntax, including author: / title: / negation operator search syntax. I added another commit to user/dbs/
I then added a third commit to address some of the low-hanging fruit in the TPAC that should produce fewer negatively surprising results. From the commit log:
"""
Expand the list of filtered characters to cover all of the special characters documented for the Evergreen search grammar (http://
For example, if a title includes "Presenting a subject: tips for consultants", it should _not_ launch a search for "subject" containing "tips for consultants".
This commit addresses most of the link problems in the record display, as well as the author links in the search results table.
Still problematic are the facets (which seem to rely on exact matching, such that filtering out the problematic characters is itself problematic) and autocomplete (which requires modifying the Autocomplete Dojo widget).
In addition, this commit makes the series code actually display, as it was using a non-standard method to attempt to return the results from the BLOCK (and failing). Also, it makes the links for authors in the record details match the MODS32 definition for personal name parts and only use the "acdq" subfields. This enables a click on the link to actually return results; previously, in the case where the author field included (for example) a subfield "g" value, that value would be included in the generated link and would likely lead to 0 hits.
For authors, we substitute with a space rather than just eliding the substituted value. Authors are particularly likely to have dates like 1899-1978; "1899 1978" matches, but "18991978" will not.
Perhaps we should take the same approach with the others, or break down the search/replace logic a little further (for example, we could remove the "-" only if it is preceded by a space or is at the start of the string and is followed immediately by a character, and preserve it if it is surrounded by digits). But this seems to take us pretty far down the road of less negatively surprising results.
"""
So... please see the three commits in http://
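For anyone who wants to see the difference between eliding and space-substituting in isolation, here is a minimal Python sketch of the ideas in the commit log. It is not the TPAC template code. The character set in `SPECIAL_CHARS` is an illustrative assumption rather than the commit's actual list, and every function name is hypothetical.

```python
import re

# Assumption: an illustrative set of characters that carry meaning in a
# search grammar. The list actually used by the commit is not reproduced here.
SPECIAL_CHARS = re.compile(r'[#:\-"()|&]')

# Subfields used for personal name parts, per the commit log ("acdq").
AUTHOR_SUBFIELDS = {"a", "c", "d", "q"}


def sanitize_elide(text):
    """Drop special characters outright, then collapse whitespace."""
    return " ".join(SPECIAL_CHARS.sub("", text).split())


def sanitize_space(text):
    """Replace special characters with a space, then collapse whitespace."""
    return " ".join(SPECIAL_CHARS.sub(" ", text).split())


def strip_negation(text):
    """Remove "-" only where it would act as a negation operator.

    That is, at the start of the string or after whitespace, and followed
    immediately by a non-space character. A "-" between digits is kept.
    """
    return re.sub(r"(^|\s)-(?=\S)", r"\1", text)


def author_query(subfields):
    """Build an author search string from (code, value) pairs.

    Only the "acdq" subfields are kept, so a subfield "g" value never
    reaches the generated link.
    """
    kept = [value for code, value in subfields if code in AUTHOR_SUBFIELDS]
    return sanitize_space(" ".join(kept))
```

With these definitions, `sanitize_elide("1899-1978")` gives `"18991978"` while `sanitize_space("1899-1978")` gives `"1899 1978"`, which is the reason for substituting a space in author links. `strip_negation` is the narrower alternative floated at the end of the commit log: it leaves `1899-1978` intact and only strips a leading `-` from a term.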
Changed in evergreen:
assignee: nobody → Bill Erickson (erickson-esilibrary)
status: New → In Progress
Changed in evergreen:
milestone: 2.3.1 → 2.4.0-alpha
Changed in evergreen:
status: Fix Committed → Fix Released
* Confirmed the removal of subfield 'g' for author search
* Confirmed removal of special chars in subject links
* Confirmed "Search for related items by.." is showing up.
Signed off and pushed to master, rel_2_3.
Related, I found an issue with series extraction/display:
For "Trombone Concerto", the related series section shows two entries: one for "American Trombone Concertos ;" (440a) and one for "2" (440v). Possible solutions are to change the XPath to '//*[@tag="' _ tag _ '"]' to pick up the 440 as a whole blob, or to limit it to the 440a (or similar). Not sure which is preferred...
In hindsight, I should have waited to push these commits until this was resolved, but got ahead of myself...
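To compare the two series options side by side, here is a rough Python sketch using the standard library's ElementTree, not the TPAC template itself. The sample record is a cut-down, namespace-free stand-in for the MARCXML of "Trombone Concerto", and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified record without the MARCXML namespace, holding
# only the 440 field described above.
RECORD = """
<record>
  <datafield tag="440" ind1=" " ind2="0">
    <subfield code="a">American Trombone Concertos ;</subfield>
    <subfield code="v">2</subfield>
  </datafield>
</record>
"""


def series_whole_field(root, tag):
    """Option 1: one entry per field, joining all of its subfields."""
    entries = []
    for field in root.findall(f'.//*[@tag="{tag}"]'):
        parts = [(sf.text or "").strip() for sf in field]
        entries.append(" ".join(p for p in parts if p))
    return entries


def series_subfield_a(root, tag):
    """Option 2: one entry per field, using only subfield "a"."""
    path = f'.//*[@tag="{tag}"]/*[@code="a"]'
    return [(sf.text or "").strip() for sf in root.findall(path)]


root = ET.fromstring(RECORD)
print(series_whole_field(root, "440"))
print(series_subfield_a(root, "440"))
```

Either option yields a single series entry for this record. The first keeps the volume number in the link text, which may or may not match what is indexed; the second drops it but leaves the trailing " ;" punctuation to be cleaned up separately.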