Evergreen

Facets - diacritic sensitive

Bug #744276 reported by George Duimovich on 2011-03-28

This bug affects 1 person

Affects    Status        Importance  Assigned to  Milestone
Evergreen  Fix Released  Medium      Unassigned   Evergreen 2.0.5

Bug Description

EG 2.0.4

Facets - diacritic sensitive?

Possible bug in Facets indexing / generation.

Sample listing below -- three variations of "Québec (Province)" (two with the accent and one without) should facet under one heading.

41 Québec (Province)
32 Quebec (Province)
17 Ontario
12 Québec (Province)

The two accented headings appear to differ in how the "é" accent is encoded.

George Duimovich (george-duimovich) wrote on 2011-03-28:

Sample Facet screencap (2.7 KiB, image/png)

George Duimovich (george-duimovich) wrote on 2011-03-28:

Screencap of facet text (4.6 KiB, image/png)

Screencap of the facet text pasted into Notepad (note that the last entry differs from the first). Possibly a different encoding of the accent?

Dan Scott (denials) wrote on 2011-03-28: Re: [Bug 744276] Re: Facets - diacritic sensitive

Without having access to the actual text, my guess would be that the
encoding of one of the facets uses composed character normalization
(é) while the other one uses decomposed character normalization (e +
combining acute accent). I'm not sure that's so much a bug in Evergreen
as it is an inconsistency in your data.
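
The composed/decomposed distinction described above can be illustrated with a short Python sketch. This is only an illustration using Python's standard unicodedata module; the sample strings are assumptions, since the actual facet text was not attached to the bug.

```python
import unicodedata

# Assumed sample values: the real facet text was not available in the report.
composed = "Qu\u00e9bec (Province)"     # "é" as a single code point (U+00E9)
decomposed = "Que\u0301bec (Province)"  # "e" followed by combining acute (U+0301)

# The two strings render identically but are not equal code point for code point.
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing both to one form (NFC here) makes them compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(nfc_equal)
```

An exact-match grouping would therefore treat the two spellings as separate values unless both are normalized to the same form first.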

George Duimovich (george-duimovich) wrote on 2011-03-28:

The "Québec (Province)" vs. "Quebec (Province)" difference is the difference between the French heading and the English heading, so it's a legitimate "inconsistency". However, I don't see the benefit of a non-normalized presentation, especially when we can't yet multi-select (like WorldCat facets) to search multiple similar facets in one search.

I'd still recommend the facets to be non-sensitive for diacritics.

David J. Fiander (david-fiander) wrote on 2011-03-28:

I tend to agree with George, and this has nothing to do with French. For
example, consider the case of Kurt Gödel: users are just as likely to search
for "Godel", and the rules for anglicizing ü generally turn it into "ue", so
Müller becomes Mueller.

Of course, that way lies madness. But for languages where the diacritics do
not create new letters, like French and German, consolidating the variants
makes sense. Swedish is a different problem.
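
A rough sketch of the kind of consolidation discussed above, in Python: decompose the text, drop the combining marks, and recompose. The helper name fold_diacritics is hypothetical and this is not Evergreen's normalizer; it also only strips marks, so it yields "Muller" rather than "Mueller", and language-specific rules such as ü to "ue" would need a separate mapping.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Hypothetical helper: remove combining marks so variants collapse."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(fold_diacritics("Qu\u00e9bec (Province)"))
print(fold_diacritics("G\u00f6del"))
print(fold_diacritics("M\u00fcller"))
```

Applying this blindly would also merge letters that are distinct in languages such as Swedish, which is the caveat raised above.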

Dan Scott (denials) wrote on 2011-03-28:

I'm suggesting that consistent normalization (NFC or NFD) be applied to the bibliographic and authority records themselves, rather than to the facets. Facets displaying different normalizations is a symptom of an underlying problem that should be addressed at the source - the record itself - otherwise, similar problems will show up in other contexts.

George, could you please run the following query to see how pervasive the problem is in your dataset?

SELECT value, COUNT(*)
  FROM metabib.subject_field_entry
  WHERE value LIKE 'Qu_bec (Province)'
  GROUP BY value
  ORDER BY COUNT(*) DESC;

For comparison, in our dataset we get:

       value       | count
-------------------+-------
 Quebec (Province) |  3759
 Québec (Province) |  2471
 Qu*bec (Province) |     2
(3 rows)

(Obviously the last entry is a throwback to simpler times - but it's also clear that our dataset is consistently normalized).
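
One caveat when reading results from a pattern like 'Qu_bec (Province)': if the single-character wildcard matches exactly one code point (an assumption about how the database treats "_"), a decomposed "e + combining acute" occupies two code points and would not match. The Python sketch below uses the regular expression "." as a stand-in for the SQL "_"; the sample values are assumptions.

```python
import re

# "." matches one code point here, standing in for SQL LIKE's "_".
pattern = re.compile(r"Qu.bec \(Province\)")

values = [
    "Quebec (Province)",        # unaccented
    "Qu\u00e9bec (Province)",   # composed é (one code point)
    "Que\u0301bec (Province)",  # decomposed é (two code points)
    "Qu*bec (Province)",        # placeholder character
]

matched = [v for v in values if pattern.fullmatch(v)]
missed = [v for v in values if not pattern.fullmatch(v)]

print(len(matched), "matched;", len(missed), "missed")
```

Under that assumption, a pattern with a multi-character wildcard between "Qu" and "bec" would be needed to catch both encodings.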

Dan Scott (denials) wrote on 2011-03-28:

David, as far as your concern about searching Gödel vs Godel or Müller vs. Mueller, that's a related but different matter.

1.6 included various search and indexing normalizations that would support retrieving the same search results whether one searched for "Québec" or "Quebec", and I had extended that in our instance for sundry other characters that were appropriate for a primarily anglophone/francophone audience. If you're not simply opining on the subject and have found an actual regression in 2.0, please open a new bug for that concern, as indexing and search is different from the subject at hand of the display of facets.

Mike Rylander (mrylander) wrote on 2011-03-28:

Facets are 100% exact match, period -- no search normalizations.

Please see these tables:

config.index_normalizer
config.metabib_field_index_norm_map

and the documentation here about using them:

http://open-ils.org/dokuwiki/doku.php?id=documentation:indexing#field_normalization_settings

This is a (simple?) matter of configuring the facet normalizer setup to do what you want for facets.

Mike Rylander (mrylander) wrote on 2011-03-28:

Also, you can "multi-select" facets if you click on them one at a time. A patch to put checkboxes next to facets and collect the selected ones all at once would be interesting.

George Duimovich (george-duimovich) wrote on 2011-03-28:

#10

dbs: the result of query is:

       value       | count
-------------------+-------
 Québec (Province) |   235
 Quebec (Province) |   156

And my facets say (on search keyword: quebec):
<snip>
283 Québec (Province)
279 Québec (Province)
223 Quebec (Province)
</snip>

-------------------
If I look at two variations of é, we have:

42483 records where marc like '%́%'
and
23938 records where marc like '%&#xE9%'

BTW, is there any official preference for using NFD over NFC?
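
For a quick audit of which form a set of headings uses, a sketch along these lines may help. It assumes the headings have already been decoded to Unicode text (not the entity-encoded MARCXML queried above) and that Python 3.8 or later is available for unicodedata.is_normalized; the function name and sample list are hypothetical.

```python
import unicodedata
from collections import Counter

def normalization_form(text: str) -> str:
    """Hypothetical helper: report which normalization form a string is in."""
    nfc = unicodedata.is_normalized("NFC", text)
    nfd = unicodedata.is_normalized("NFD", text)
    if nfc and nfd:
        return "both"   # e.g. plain ASCII, unaffected by either form
    if nfc:
        return "NFC"
    if nfd:
        return "NFD"
    return "mixed"      # composed and decomposed characters in one string

# Assumed sample headings, for illustration only.
headings = [
    "Quebec (Province)",
    "Qu\u00e9bec (Province)",
    "Que\u0301bec (Province)",
]
tally = Counter(normalization_form(h) for h in headings)
print(dict(tally))
```

A tally that shows both "NFC" and "NFD" entries would suggest the records themselves are inconsistently normalized.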

Galen Charlton (gmc) wrote on 2011-03-30:

#11