Comment 2 for bug 1169693

Mike Rylander (mrylander) wrote :

Kathy Lussier asked for some clarification on what all of that up there means. Here's what I said, which I think is clearer than my explanation above:

[13:32:19] <kmlussier> As someone who is very interested in improved search results, I would like to give feedback on https://bugs.launchpad.net/evergreen/+bug/1169693. But I'm getting stuck at the fact that I don't understand what combined search is. Or, more specifically, what issue combined search was meant to resolve. Is that something that can be easily explained?
[13:32:20] <pinesol_green> Launchpad bug 1169693 in Evergreen "keyword relevancy ranking in 2.4" (affected: 1, heat: 6) [Undecided,New]
[13:34:31] <eeevil> kmlussier: in 2.3 and before, imagine a subject search for "potter wizard", and a record with 2 subjects, "Harry Potter" and "Wizards". that search would not find the record
[13:34:56] <eeevil> because each subject is indexed separately, and neither contains both terms
[13:35:07] <kmlussier> ok, I'm following so far.
[13:35:42] <eeevil> in 2.4, all subjects are indexed as before, but now there's a secondary indexing that smooshes all the subjects together into one indexed string (more or less)
[13:36:02] <eeevil> but, for the keyword class, that causes bad ranking
[13:36:24] <eeevil> that's because the keyword|keyword index def is just the whole record (again, more or less) in one big blob
[13:37:08] <eeevil> it defeats the tactic of adding, say, keyword|title with a high weight value for pulling title matches on a keyword search up higher in the results
[13:37:22] <eeevil> which is a very powerful tactic
[13:38:51] <eeevil> IOW, for subject (definitely) and maybe authors or titles (less likely) one can benefit from the row-combining (shorthand: combined search), but we really don't want that for the keyword class
[13:38:57] <kmlussier> Two of our consortia added a title index (among others) for the keyword search to weight title words more heavily. We had come across a problem where a search like "twilight -meyer" wouldn't work correctly because meyer didn't appear in that title index. Was the combined search meant to resolve that problem too?
[13:39:36] <eeevil> it may have been, and (aside from the whole-record blob) would have been effective
[13:40:10] <eeevil> but the rel ranking is significantly less good (heh) with the blob in there.
[13:40:25] <eeevil> now, the offered change makes that optional
[13:41:20] <eeevil> so, one could dispense with the whole-record blob and instead have a pile of individual fields (say, /all/ of them that are elsewhere indexed), turn combined search on for keyword, and get what they want
[13:41:33] <kmlussier> eeevil: Sure, I was doing some testing with bshum yesterday and saw some of those problems with rel. I think we've been very pleased with being able to weight certain indexes more heavily than others in keyword searches.
[13:42:21] <eeevil> kmlussier: right. so, with configuration (specifically, remove the blob and replace it with weighted fields) you'd be in business with combined
[13:42:36] <eeevil> but out of the box, and for upgrades, it's a significant regression in ranking
[13:42:56] <kmlussier> eeevil: ok, thanks for the explanation.
[13:43:21] <eeevil> my idea was to give a path to the better solution without forcing big changes (that have to be done per-site, not via upgrade script) on everyone
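
To make the "potter wizard" example from the log concrete, here is a rough toy sketch of the difference between matching each indexed subject row separately (2.3 and before) and matching against the rows combined into one string (combined search in 2.4). This is not Evergreen code: the function names and the crude "strip a trailing s" normalization are assumptions for illustration only, and the real indexing is done in the database.

```python
def tokens(text):
    """Toy normalization (an assumption): lowercase and strip trailing 's'."""
    return {word.lower().rstrip("s") for word in text.split()}


def matches_separately(query, fields):
    """Pre-2.4 style: every query term must appear within a single indexed row."""
    wanted = tokens(query)
    return any(wanted <= tokens(field) for field in fields)


def matches_combined(query, fields):
    """Combined style: all rows of the class are joined into one string first."""
    wanted = tokens(query)
    return wanted <= tokens(" ".join(fields))


# A record with two separately indexed subjects, as in the log.
subjects = ["Harry Potter", "Wizards"]

print(matches_separately("potter wizard", subjects))  # False: no single row has both terms
print(matches_combined("potter wizard", subjects))    # True: the joined string has both
```

The sketch covers only matching. It does not model relevance ranking, which is where the whole-record keyword|keyword blob causes the regression described above.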