Live Search Not Producing Expected Result

Bug #430037 reported by Anthony
This bug affects 1 person
Affects | Status       | Importance | Assigned to | Milestone
KARL3   | Fix Released | Medium     | Chris Rossi | (none)

Bug Description

Bug reported by Jonathan Hooper:

When I type the search term "fgrep" I don’t see any results, but a blog comment clearly contains this term:

https://karl.soros.org/communities/hoops-snips/blog/recursive-grep

Do we know why this is happening?

Anthony (agalietti)
Changed in karl3:
assignee: nobody → Paul Everitt (paul-agendaless)
Paul Everitt (paul-agendaless) wrote :

Don't know if this is better for you or Shane. If you want, re-assign it.

Changed in karl3:
assignee: Paul Everitt (paul-agendaless) → Chris McDonough (chrism-plope)
importance: Undecided → Medium
milestone: none → m32
Changed in karl3:
milestone: m32 → m33
Chris McDonough (chrism-plope) wrote :

This object's HTML text is converted to plain text when it's indexed, and the title is also included in the text index:

>>> ob = root['communities']['hoops-snips']['blog']['recursive-grep']['comments']['002']
>>> ob.text
u'<div xmlns="http://www.w3.org/1999/xhtml">\r\n <p>recursive grep, then delete<br><br>fgrep -lir baltimore * | xargs rm</p>\r\n </div>\n'
>>> from karl.utilities.converters.stripogram import html2text
>>> html2text(ob.text)
u'\n\nrecursive grep, then delete\n\n\n\nfgrep -lir baltimore * | xargs rm'
>>> ob.title
u'Recursive GREP'

For this object we've indexed the following words:

>>> docid = root['communities']['hoops-snips']['blog']['recursive-grep']['comments']['002'].docid
>>> wids = root.catalog['texts'].index.get_words(docid)
>>> index = texts = root.catalog['texts'].index
>>> lexicon = index._lexicon
>>> map(lexicon.get_word, wids)
[u'recursive', u'grep', u'recursive', u'grep', u'delete']

I can't make much sense out of that yet.
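
One way to read the word list above is to compare it with the words the converted text and title would be expected to produce. The following is only a rough sketch in Python 3: the `tokenize` helper and the one-word `STOP_WORDS` set are hypothetical stand-ins, not the lexicon pipeline the catalog actually uses.

```python
import re

# Assumed stop-word set, for illustration only; the real lexicon's
# stop-word list is not shown in this report.
STOP_WORDS = {"then"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


# Values copied from the session above.
title = u'Recursive GREP'
converted = u'\n\nrecursive grep, then delete\n\n\n\nfgrep -lir baltimore * | xargs rm'
indexed = [u'recursive', u'grep', u'recursive', u'grep', u'delete']

expected = tokenize(title) + tokenize(converted)
missing = [w for w in expected if w not in indexed]
print(missing)  # ['fgrep', 'lir', 'baltimore', 'xargs', 'rm']
```

Under these assumptions, everything after "delete" is absent from the index, which suggests the indexer received less text than html2text produces.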

Chris McDonough (chrism-plope) wrote :

Found it. The stripogram stuff is a decoy. The actual implementation uses karl.content.models.adapters._html_cleaner, which has a bug:

>>> from karl.content.models.adapters import _html_cleaner
>>> _html_cleaner(u'<div xmlns="http://www.w3.org/1999/xhtml">\r\n <p>recursive grep, then delete<br><br>fgrep -lir baltimore * | xargs rm</p>\r\n </div>\n')

Returns:

'\r\n recursive grep, then delete'

It should return something more like the previous stripogram example (u'\n\nrecursive grep, then delete\n\n\n\nfgrep -lir baltimore * | xargs rm').
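
For illustration, here is a minimal sketch of a tag stripper that keeps the text following an unclosed <br>. It is written for Python 3 using the standard library's html.parser, and it is not the fix that was committed to karl; the `TextExtractor` class and `extract_text` function are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes while tolerating unclosed tags such as <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Emit a newline for <br> and <p> so neighbouring words stay separate.
        if tag in ("br", "p"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks)


sample = (
    '<div xmlns="http://www.w3.org/1999/xhtml">\r\n <p>recursive grep, then delete'
    '<br><br>fgrep -lir baltimore * | xargs rm</p>\r\n </div>\n'
)
text = extract_text(sample)
print("fgrep" in text)  # True
```

The exact whitespace differs from the stripogram output, but the words after the <br> tags survive, which is what the text index needs.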

Chris McDonough (chrism-plope) wrote :

The bug which caused improper input to the text indexer has been fixed on the trunk.

We'll need to run "bin/reindex_catalog" on the production system after the next release to fix existing content.

Changed in karl3:
status: New → Fix Committed
Paul Everitt (paul-agendaless) wrote :

Hi Chris. Looks like this didn't get done completely in production. Some old content with fgrep isn't searchable.

Changed in karl3:
assignee: Chris McDonough (chrism-plope) → Chris Rossi (chris-archimedeanco)
milestone: m33 → none
status: Fix Committed → Incomplete
Chris Rossi (chris-archimedeanco) wrote :

Hi Paul, Chris M's fix is from 9/22 and the last release of Karl was 9/8, so we're still waiting for this to go into production.

Changed in karl3:
status: Incomplete → Fix Committed
Changed in karl3:
status: Fix Committed → Fix Released