Certain punctuation creates problems for search in public catalog

Bug #2008423 reported by Benjamin Kalish
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Evergreen | Triaged | Undecided | Unassigned |

Bug Description

Currently Evergreen does not handle some subject headings with punctuation well, for example, C++ (Computer program language).

A search for subject:C++ (Computer program language) will return works on C and C# as well as C++ resulting in many false positives. On the other hand, a search for subject:"C++ (Computer program language)" returns no results at all.

There should be a way of improving the search experience for folks searching for these headings.

Observed with Evergreen 3.7.3.

Tags: opac usability
Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
summary: - Punctuation in subject headings creates problems for search
+ Parenthesis creates problems for search in Angular Staff Catalog
Jason Stephenson (jstephenson) wrote : Re: Parenthesis creates problems for search in Angular Staff Catalog

I have sort of confirmed this bug and changed the "title" to reflect what I have determined.

First, the "++" is being ignored across the board. I don't remember the exact code or reason why, but it is common in search to ignore symbols. There is likely something that can be adjusted in search indexing or code to change this.

Second, I cannot reproduce the no search results in the OPAC. This only appears to affect the Angular Staff Catalog search.

Third, I cannot reproduce this with a stock 3.10 installation on CWMARS data.

HOWEVER, CWMARS has a security patch installed on production to protect against malicious queries. When that patch is applied to 3.10, I do not get search results in the Angular Staff Catalog for searches containing parentheses. This happens on our production 3.7.3 servers and on a test server with 3.10 installed looking at a recent copy of production data. If I remove that patch on the test server, I get results, so it definitely appears to be that patch.

I am going to update the security bug with a comment to indicate that it appears to be causing this bug.

I have set the status of this bug to triaged pending further review of the security patch.

Changed in evergreen:
status: New → Triaged
assignee: Jason Stephenson (jstephenson) → nobody
Jason Stephenson (jstephenson) wrote :

Oh. If I search "c++ computer language" without parentheses, I get results in the Angular catalog with or without the patch. The results appear to be the same as those on stock 3.10 with parentheses.

Benjamin Kalish (bkalish) wrote :

My original report was based entirely on the public catalog (hence the opac tag).

Based on Jason Stephenson's comments I have now tried this in the staff catalog as well (again on 3.7.3 using the CWMARS production server) and am getting what appears to be an unrelated problem: the interface hangs and returns neither search results, nor "No Matching Items Were Found", nor an error message. The problem in the staff client seems to be related to parentheses; the problem I reported is not. The parentheses problem in the staff client is definitely a bug, but it is not the bug I reported.

summary: - Parenthesis creates problems for search in Angular Staff Catalog
+ Certain punctuation creates problems for search in public catalog
Changed in evergreen:
status: Triaged → New
Benjamin Kalish (bkalish) wrote :

I've set the bug to new since Jason's change to triaged was based on a misunderstanding. Do we need a new bug for the problem in the staff client?

Changed in evergreen:
status: New → Triaged
Jason Stephenson (jstephenson) wrote :

Benjamin, ignoring symbols is par for the course in online search; even Google and DuckDuckGo do it. That's not necessarily a bug in my opinion. As alluded to in my previous comment, we can probably modify the normalizers so that some symbols are retained in the indexes. This may have unintended consequences for other searches, however.

My misunderstanding of the issue came from my failure to reproduce the no search results issue in the CWMARS OPAC until I added quotes around the entire string. When I don't quote the search string, I get results.

The quotes issue is also apparently caused by the security patch. I get results with the quoted string in the OPAC on a 3.10 server that does not have the fix applied. They are the same as the results without quotes. (I will attach screenshots.)

Clicking on the subject facet doesn't do what you might expect, either. You get the same results as the unquoted search, so I will agree that subject search is broken in that sense, i.e. it doesn't do a literal subject search. (Evergreen doesn't do literal searching to begin with, really.) Subject browse might be better suited to those who want to find things by subject. That's probably a different bug, or two, if you like.

As for needing a new bug for the Angular Staff Catalog issue, I don't think we need one. That appears to be entirely caused by the security patch. I think I have provided sufficient information on that bug. It is not yet publicly visible for reasons.

So, to sum up, other than '++', '#', etc. being dropped from search indexes, it looks like the issues you reported are indeed caused by the security patch.

I have set this bug back to triaged for the reasons stated before.

Jason Stephenson (jstephenson) wrote :

Here's a screenshot showing results on the CWMARS production OPAC (3.7.3) without quotes.

It has the security patch applied.

Jason Stephenson (jstephenson) wrote :

Here's a screenshot of a CWMARS 3.10 test server with customization but without the security patch. It gets results with quotes.

Jason Stephenson (jstephenson) wrote :

Finally, for completeness' sake, here's a screenshot on stock 3.10 with CWMARS data doing the subject search with a quoted search string.

Jason Stephenson (jstephenson) wrote :

Investigation has revealed that the Evergreen search normalizer functions do not strip +, &, @, or # from the indexes. You can verify this by running the functions on strings of your choice in the database. For example:

select public.search_normalize('C++ (Computer program language)'); returns c++ computer program language

You can also see the symbols in the value field of the various metabib field entry tables.

The symbols are stripped by the full text search configuration of PostgreSQL. Evergreen uses PostgreSQL Full Text Search to speed up search lookups by adding a "tsvector" field to the search table entries in the database. The advantages of full text search, such as being able to make partial matches, match on synonyms, and match words in any order, outweigh the disadvantages.

You can demonstrate that the symbols are stripped by running the to_tsvector function:

select to_tsvector('C++ (Computer program language)'); returns 'c':1 'comput':2 'languag':4 'program':3

The ts_debug function can be used to further illustrate what happens:

select alias, description, token from ts_debug('C++ (Computer program language)'); returns

   alias   |   description   |  token
-----------+-----------------+----------
 asciiword | Word, all ASCII | C
 blank     | Space symbols   | +
 blank     | Space symbols   | + (
 asciiword | Word, all ASCII | Computer
 blank     | Space symbols   |
 asciiword | Word, all ASCII | program
 blank     | Space symbols   |
 asciiword | Word, all ASCII | language
 blank     | Space symbols   | )

Notice that '+' is being treated as a blank.

The output for C# (Computer program language) looks very much the same.
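
To make the consequence concrete, here is a rough Python approximation of the tokenization shown above. It is not the PostgreSQL parser: the function name ascii_word_tokens is made up for this illustration, it only keeps runs of ASCII letters, and it skips stemming (so it yields "computer" rather than "comput"). Under those assumptions, it shows why the C, C++ and C# headings end up with the same search terms:

```python
import re


def ascii_word_tokens(text):
    """Keep runs of ASCII letters, lowercased; treat everything else as blank.

    Simplified stand-in for the tokenization in the ts_debug output above.
    It does not stem words and does not handle digits or other token types.
    """
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


headings = [
    "C (Computer program language)",
    "C++ (Computer program language)",
    "C# (Computer program language)",
]

# All three headings reduce to the same list of words, so a search on any
# one of them can match records that carry either of the other two.
for heading in headings:
    print(heading, "->", ascii_word_tokens(heading))
```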

In order to change how symbols are treated in Evergreen search, a new configuration would need to be created for PostgreSQL Full Text Search. This is left as an exercise for the reader and for those sites that wish to make such a customization.

Benjamin Kalish (bkalish) wrote :

It does look like some work on the PostgreSQL end would be helpful, but I'd like to point out that it is not the only possible approach.

We could work around the Full Text Search behavior by "normalizing" the user's search and the terms in the database in a way that preserves punctuation and won't be ignored by the Full Text Search, for example, changing a plus to something unlikely to appear otherwise such as egpunctplus. Obviously this isn't ideal, but if it is easier than making the change to PostgreSQL it might be worth considering. (We might consider limiting this behavior to subject fields and subject searches. Losing punctuation in a keyword search is to be expected, but being able to do an exact search for a subject heading is important.)
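
A minimal sketch of that placeholder idea, in Python for illustration only. The function names and the "egpuncthash" placeholder are hypothetical; only "egpunctplus" comes from the suggestion above, and none of this is existing Evergreen code. The tokenizer is a crude stand-in for a full text search that drops symbols. The same substitution would have to be applied both when indexing and when processing the user's query:

```python
import re

# Hypothetical placeholder words. Only "egpunctplus" is taken from the
# comment above; "egpuncthash" is invented here for the "#" symbol.
PLACEHOLDERS = {"+": "egpunctplus", "#": "egpuncthash"}


def protect_punctuation(text):
    """Replace each protected symbol with its placeholder word, in place."""
    for symbol, word in PLACEHOLDERS.items():
        text = text.replace(symbol, word)
    return text


def ascii_word_tokens(text):
    """Crude stand-in for a tokenizer that discards symbols (no stemming)."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


cpp = ascii_word_tokens(protect_punctuation("C++ (Computer program language)"))
csharp = ascii_word_tokens(protect_punctuation("C# (Computer program language)"))
plain_c = ascii_word_tokens(protect_punctuation("C (Computer program language)"))

# The first token now differs between the three headings.
print(cpp[0], csharp[0], plain_c[0])
```

Because the placeholder is glued onto the preceding letters, "C++" becomes a single term distinct from "C". Inserting the placeholder as a separate word would instead keep "C" searchable on its own; which behavior is preferable is a design decision this sketch does not settle.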

And while we can't completely solve the problem without addressing search, we could avoid the problem in the case where the user is clicking on a linked subject heading. Currently linked subject headings (and linked authors) link to a regular search. An alternative approach would be to handle this like a relational database: each subject heading (or author name) would have a unique ID, and tables in the database would provide links between records and the subject headings that appear there. The advantage of this approach would be that it would be completely immune to problems with search and would support pearl growing, an approach that can easily fail now. One of the disadvantages is that it would do nothing to fix the problem with search.
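
The relational idea can be sketched as follows, using Python's built-in sqlite3 module so the example is self-contained. The table names, column names, and sample titles are all hypothetical and are not Evergreen's actual schema; the point is only that following a heading by ID never passes through text search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables, not Evergreen's schema.
    CREATE TABLE subject_heading (
        id INTEGER PRIMARY KEY,
        heading TEXT UNIQUE NOT NULL
    );
    CREATE TABLE record (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE record_subject (
        record_id INTEGER NOT NULL REFERENCES record(id),
        subject_id INTEGER NOT NULL REFERENCES subject_heading(id),
        PRIMARY KEY (record_id, subject_id)
    );
    """
)

conn.executemany(
    "INSERT INTO subject_heading (id, heading) VALUES (?, ?)",
    [
        (1, "C (Computer program language)"),
        (2, "C++ (Computer program language)"),
        (3, "C# (Computer program language)"),
    ],
)
# Made-up sample records for illustration.
conn.executemany(
    "INSERT INTO record (id, title) VALUES (?, ?)",
    [(10, "Sample C title"), (11, "Sample C++ title"), (12, "Sample C# title")],
)
conn.executemany(
    "INSERT INTO record_subject (record_id, subject_id) VALUES (?, ?)",
    [(10, 1), (11, 2), (12, 3)],
)


def records_for_subject(connection, subject_id):
    """Return titles linked to a heading ID; no text matching is involved."""
    rows = connection.execute(
        """
        SELECT r.title
        FROM record AS r
        JOIN record_subject AS rs ON rs.record_id = r.id
        WHERE rs.subject_id = ?
        ORDER BY r.title
        """,
        (subject_id,),
    )
    return [title for (title,) in rows]


print(records_for_subject(conn, 2))
```

A linked subject heading in a record display would then carry the heading's ID rather than a search string, so punctuation in the heading text would not affect which records are returned.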

One way or another, I think this is an important issue. We know that the vast majority of subject headings will be drawn from LCSH so we should make an effort to make sure Evergreen works with that vocabulary.
