Normalize unicode

Bug #598204 reported by George
This bug affects 1 person
Affects       Status  Importance  Assigned to       Milestone
Open Library  New     Medium      Anand Chitipothu

Bug Description

2009/4/28 George <email address hidden>:
> - http://openlibrary.org/b/OL2008176M/Em%CC%A3.-T%CC%A3i.-Vi.-A%CC%84ca%CC%84rya
>
> Are we UTF-8 compliant? (Could be worth a new bug.)

The problem is the Unicode normalization. This is a Python string
representation of how the title is stored in the database:

u'Em\u0323. T\u0323i. Vi. A\u0304ca\u0304rya'

It should be stored like this:

u'E\u1e43. \u1e6ci. Vi. \u0100c\u0101rya'

New records are being added with this normalization, and it works with
the font we are using. I've been planning to fix existing records. I'll
move this to the top of my todo list.
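
To make the difference between the two forms concrete, here is a minimal
sketch using only the standard unicodedata module. The variable names are
illustrative and not taken from the Open Library code; the two strings are
the ones quoted above.

```python
import unicodedata

# Title as currently stored: base letters followed by combining marks
# (U+0323 COMBINING DOT BELOW, U+0304 COMBINING MACRON).
stored = u'Em\u0323. T\u0323i. Vi. A\u0304ca\u0304rya'

# Title as it should be stored: precomposed characters.
expected = u'E\u1e43. \u1e6ci. Vi. \u0100c\u0101rya'

# NFC composes each base letter with its combining mark where a
# precomposed character exists.
normalized = unicodedata.normalize('NFC', stored)

print(normalized == expected)        # the two forms agree after NFC
print(len(stored), len(normalized))  # the composed form has fewer code points
```

Both strings render as the same text; they differ only in how many code
points are used, which is why a font without good combining-mark support
can display the first form badly.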

===============

Anand Chitipothu wrote on 2009-04-29:

> I don't think we need to change the font if I fix the unicode
> normalization.

I don't think we should fix it in the database. What if such strings
come later? I think the best approach is to write code to handle all cases.

How about having a public function to normalize strings and call it
before displaying strings in the browser?

===============

Edward Betts wrote on 2009-04-29:

Good idea. Here is the code:

from unicodedata import normalize

def norm(s):
    """Return the unicode string s in NFC (composed) form."""
    return normalize('NFC', s)
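
A rough sketch of how such a function might be used at display time,
assuming a hypothetical display_title() hook (that name is an assumption,
not an existing Open Library function). It shows that old and new records
come out the same, and that normalizing an already normalized string is
harmless.

```python
from unicodedata import normalize

def norm(s):
    """Return the unicode string s in NFC (composed) form."""
    return normalize('NFC', s)

def display_title(title):
    # Hypothetical display-time hook: normalize just before rendering,
    # so it does not matter which form the database holds.
    return norm(title)

# An old record (decomposed) and a new record (already composed).
old_record = u'Em\u0323. T\u0323i. Vi. A\u0304ca\u0304rya'
new_record = u'E\u1e43. \u1e6ci. Vi. \u0100c\u0101rya'

# Both records produce the same text for the browser.
print(display_title(old_record) == display_title(new_record))

# NFC is idempotent: normalizing twice changes nothing.
print(norm(norm(old_record)) == norm(old_record))
```

Normalizing on output handles decomposed strings that arrive later, at
the cost of a small amount of work per displayed string.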

George (george-archive)
Changed in openlibrary:
assignee: nobody → Anand Chitipothu (anandology)
importance: Undecided → Medium
milestone: none → general-bucket