Facets - diacritic sensitive

Bug #744276 reported by George Duimovich
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
Evergreen   Fix Released   Medium       Unassigned

Bug Description

EG 2.0.4

Facets - diacritic sensitive?

Possible bug in Facets indexing / generation.

Sample listing below -- Three variations of "Québec (Province)" (2 with and one without accent) should facet under one heading.

41 Québec (Province)
32 Quebec (Province)
17 Ontario
12 Québec (Province)

The two accented headings appear to differ in how the "é" accent is encoded.

George Duimovich (george-duimovich) wrote :

Screencap of facet text pasted into Notepad (note that the last entry differs from the first), possibly because the accent is encoded differently?

Dan Scott (denials) wrote : Re: [Bug 744276] Re: Facets - diacritic sensitive

Without having access to the actual text, my guess would be that
one of the facets is encoded using composed character normalization
(é) while the other uses decomposed character normalization (e +
combining acute accent). I'm not sure that's so much a bug in Evergreen
as it is an inconsistency in your data.
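
To illustrate the composed vs. decomposed distinction guessed at above, here is a minimal Python sketch using the standard unicodedata module. The sample strings are assumptions made for illustration, not values taken from the reporter's database.

```python
import unicodedata

# Two renderings of the same heading (sample values, assumed for illustration).
composed = "Qu\u00e9bec (Province)"      # é as one code point, U+00E9
decomposed = "Que\u0301bec (Province)"   # e followed by U+0301 combining acute

# They display identically but are different code point sequences.
same_as_typed = composed == decomposed
length_gap = len(decomposed) - len(composed)

# Normalizing both to one form (NFC here) makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_as_typed, length_gap, same_after_nfc)  # False 1 True
```

Any exact-match comparison, such as grouping headings into facets, would treat the two strings as distinct until they are brought to a common form.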

George Duimovich (george-duimovich) wrote :

The difference between "Québec (Province)" and "Quebec (Province)" is the French heading vs. the English heading, so it's a legitimate "inconsistency". However, I don't see the benefit of a non-normalized presentation, especially when we can't yet multi-select facets (as in WorldCat) to search multiple similar facets in one search.

I'd still recommend that facets be insensitive to diacritics.

David J. Fiander (david-fiander) wrote :

I tend to agree with George, and this has nothing to do with French. For
example, consider the case of Kurt Gödel: users are just as likely to search
for "Godel", and the rules for anglicizing ü generally turn it into "ue", so
Müller becomes Mueller.

Of course, that way lies madness. But for languages where the diacritics do
not create new letters, like French and German, consolidating the variants
makes sense. Swedish is a different problem.
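
As a rough illustration of the consolidation being discussed, the following Python sketch folds diacritics by decomposing to NFD and dropping combining marks. The helper name fold_diacritics is hypothetical, and this is not how Evergreen does it. Note that it yields "Muller" rather than "Mueller": rules such as ü to "ue" are language-specific and would need a separate mapping.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Sample inputs taken from the names mentioned in this thread.
for name in ("G\u00f6del", "M\u00fcller", "Qu\u00e9bec (Province)"):
    print(fold_diacritics(name))
```

A blanket fold like this suits languages where the accented form is a variant of the base letter; it would be the wrong choice where the accented character is a distinct letter of the alphabet.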


Dan Scott (denials) wrote :

I'm suggesting that consistent normalization (NFC or NFD) be applied to the bibliographic and authority records themselves, rather than to the facets. Facets displaying different normalizations is a symptom of an underlying problem that should be addressed at the source - the record itself - otherwise, similar problems will show up in other contexts.

George, could you please run the following query to see how pervasive the problem is in your dataset?

SELECT value, COUNT(*)
  FROM metabib.subject_field_entry
  WHERE value LIKE 'Qu_bec (Province)'
  GROUP BY value
  ORDER BY COUNT(*) DESC;

For comparison, in our dataset we get:

       value       | count
-------------------+-------
 Quebec (Province) |  3759
 Québec (Province) |  2471
 Qu*bec (Province) |     2
(3 rows)

(Obviously the last entry is a throwback to simpler times - but it's also clear that our dataset is consistently normalized).
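
One caveat when reading results from a single-character wildcard such as the `_` in the query above: it matches exactly one character, so a decomposed "e + combining acute" (two code points) would presumably not match `Qu_bec` in a UTF-8 database. The sketch below is a Python analogue using a regular expression, not the SQL itself; the helper name and sample strings are assumptions for illustration.

```python
import re

# Python analogue of LIKE 'Qu_bec (Province)': "." matches exactly one code point.
PATTERN = re.compile(r"Qu.bec \(Province\)")

def matches_single_char_wildcard(value: str) -> bool:
    """True if exactly one code point sits between 'Qu' and 'bec (Province)'."""
    return PATTERN.fullmatch(value) is not None

plain = "Quebec (Province)"
composed = "Qu\u00e9bec (Province)"      # é as U+00E9
decomposed = "Que\u0301bec (Province)"   # e + U+0301

for label, value in (("plain", plain), ("composed", composed), ("decomposed", decomposed)):
    print(label, matches_single_char_wildcard(value))
```

If decomposed headings are suspected, a pattern that allows two characters in that position (for example `Qu__bec (Province)`) could be tried as well.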

Dan Scott (denials) wrote :

David, as far as your concern about searching Gödel vs Godel or Müller vs. Mueller, that's a related but different matter.

1.6 included various search and indexing normalizations that would support retrieving the same search results whether one searched for "Québec" or "Quebec", and I had extended that in our instance for sundry other characters that were appropriate for a primarily anglophone/francophone audience. If you're not simply opining on the subject and have found an actual regression in 2.0, please open a new bug for that concern, as indexing and search is different from the subject at hand of the display of facets.

Mike Rylander (mrylander) wrote :

Facets are 100% exact match, period -- no search normalizations.

Please see these tables:

  config.index_normalizer
  config.metabib_field_index_norm_map

and the documentation here about using them:

  http://open-ils.org/dokuwiki/doku.php?id=documentation:indexing#field_normalization_settings

This is a (simple?) matter of configuring the facet normalizer setup to do what you want for facets.
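
To make the "exact match" point concrete, here is a small Python sketch of facet counting with and without a normalizer applied first. It does not model config.index_normalizer or any Evergreen code; the function names and the sample headings are assumptions for illustration only.

```python
from collections import Counter
import unicodedata

def to_nfc(value: str) -> str:
    """Example normalizer: bring a heading to composed (NFC) form."""
    return unicodedata.normalize("NFC", value)

def facet_counts(headings, normalizer=None):
    """Count headings by exact string match, optionally normalizing first."""
    counts = Counter()
    for heading in headings:
        key = normalizer(heading) if normalizer else heading
        counts[key] += 1
    return counts

# Assumed sample data: two composed, one decomposed, one unaccented heading.
headings = [
    "Qu\u00e9bec (Province)",
    "Qu\u00e9bec (Province)",
    "Que\u0301bec (Province)",
    "Quebec (Province)",
]

exact = facet_counts(headings)            # three distinct facets
merged = facet_counts(headings, to_nfc)   # accented variants collapse into one

print(len(exact), len(merged))
```

With NFC alone, the unaccented English heading still remains a separate facet; merging it too would require a diacritic-folding normalizer.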

Mike Rylander (mrylander) wrote :

Also, you can "multi-select" facets if you click on them one at a time. A patch to put checkboxes next to facets and collect the selected ones all at once would be interesting.

George Duimovich (george-duimovich) wrote :

dbs: the result of the query is:

value               count
Québec (Province)   235
Quebec (Province)   156

And my facets say (on search keyword: quebec):
<snip>
283 Québec (Province)
279 Québec (Province)
223 Quebec (Province)
</snip>

-------------------
If I look at the two variations of é, we have:

42483 records where marc like '%&#x301;%'
and
23938 records where marc like '%&#xE9%'

BTW, is there any official preference for using NFD over NFC?
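
For reference, the two patterns above look for numeric character references: &#x301; is the combining acute accent (decomposed form) and &#xE9; is the precomposed é. A rough Python sketch for classifying a fragment of record text follows; the function name and the sample fragments are assumptions, not data from this bug.

```python
import html
import unicodedata

def normalization_form(marc_fragment: str) -> str:
    """Report which normalization form(s) a fragment already satisfies."""
    # Turn numeric character references such as &#xE9; into real characters.
    text = html.unescape(marc_fragment)
    is_nfc = text == unicodedata.normalize("NFC", text)
    is_nfd = text == unicodedata.normalize("NFD", text)
    if is_nfc and is_nfd:
        return "both"   # e.g. plain ASCII text
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"

for fragment in ("Qu&#xE9;bec", "Que&#x301;bec", "Quebec", "Qu&#xE9;bec / Que&#x301;bec"):
    print(fragment, "->", normalization_form(fragment))
```

A dataset in which both "NFC" and "NFD" fragments turn up is the kind of inconsistency that produces duplicate-looking facets.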

Galen Charlton (gmc) wrote :

The following patch by Mike should fix this:

  http://svn.open-ils.org/trac/ILS/changeset/19902/branches/rel_2_0

(Fix is in 2.1 and trunk as well).

George, if you want to try this out in your test database, you can run the following SQL which is part of the patch:

http://svn.open-ils.org/trac/ILS/browser/branches/rel_2_0/Open-ILS/src/sql/Pg/upgrade/0505.schema.force_facets_to_NFC.sql?rev=19902

Changed in evergreen:
status: New → Fix Committed
importance: Undecided → Medium
milestone: none → 2.0.5
Ben Shum (bshum)
Changed in evergreen:
status: Fix Committed → Fix Released