Open Library

Normalize unicode

Bug #598204 reported by George on 2010-06-24

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Open Library	New	Medium	Anand Chitipothu	Open Library general-bucket

Bug Description

2009/4/28 George <email address hidden>:
> - http://openlibrary.org/b/OL2008176M/Em%CC%A3.-T%CC%A3i.-Vi.-A%CC%84ca%CC%84rya
>
> Are we UTF-8 compliant? (Could be worth a new bug.)

The problem is Unicode normalization: the title is stored in decomposed
form, with base letters followed by combining marks. This is the Python
string representation of how the title is stored in the database:

u'Em\u0323. T\u0323i. Vi. A\u0304ca\u0304rya'

It should be stored in composed (NFC) form, like this:

u'E\u1e43. \u1e6ci. Vi. \u0100c\u0101rya'

New records are already being added in this normalized form, which works
with the font we are using. I've been planning to fix existing records.
I'll move this to the top of my todo list.
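
As a quick check of the two representations above, here is a minimal
sketch using only the standard library. The variable names are
illustrative and not taken from the Open Library code; the strings are
the ones quoted in this report.

```python
from unicodedata import normalize

# Title as currently stored: base letters followed by combining marks
# (U+0323 COMBINING DOT BELOW, U+0304 COMBINING MACRON).
stored = u'Em\u0323. T\u0323i. Vi. A\u0304ca\u0304rya'

# Title as it should be stored: precomposed characters.
expected = u'E\u1e43. \u1e6ci. Vi. \u0100c\u0101rya'

# NFC composes each base letter with its combining mark where a
# precomposed character exists.
composed = normalize('NFC', stored)

print(composed == expected)
print(len(stored), len(composed))
```

The composed string is shorter because each of the four base letter and
combining mark pairs collapses into a single code point.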

===============

Anand Chitipothu wrote on 2009-04-29:

> I don't think we need to change the font if I fix the unicode
> normalization.

I don't think we should fix it in the database. What if such strings
come in later? I think the best approach is to write code to handle all cases.

How about having a public function to normalize strings and call it
before displaying strings in the browser?

===============

Edward Betts wrote on 2009-04-29:

Good idea. Here is the code:

from unicodedata import normalize

def norm(s):
    # Compose base letters and combining marks into NFC form.
    return normalize('NFC', s)
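
If existing records are also to be fixed, the same function can be used
to find which stored titles differ from their NFC form. This is only a
sketch: `find_unnormalized` and the sample `titles` list are hypothetical
and do not reflect how Open Library actually reads its database.

```python
from unicodedata import normalize

def norm(s):
    # Compose base letters and combining marks into NFC form.
    return normalize('NFC', s)

def find_unnormalized(titles):
    """Return the titles whose stored form differs from their NFC form.

    Hypothetical helper for illustration; not part of Open Library.
    """
    return [t for t in titles if norm(t) != t]

# Sample data (assumed): one decomposed title, one already in NFC,
# and one plain ASCII title that normalization leaves unchanged.
titles = [
    u'Em\u0323. T\u0323i. Vi. A\u0304ca\u0304rya',
    u'E\u1e43. \u1e6ci. Vi. \u0100c\u0101rya',
    u'Plain ASCII title',
]

pending = find_unnormalized(titles)
print(len(pending))
```

Because NFC normalization is idempotent, calling `norm` at display time
is safe even for records that have already been fixed in the database.
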

George (george-archive) on 2010-06-24

Changed in openlibrary:
assignee:	nobody → Anand Chitipothu (anandology)
importance:	Undecided → Medium
milestone:	none → general-bucket
