Technical: check/refine OCR efficiency of UFF

Reported by frere on 2010-10-02
This bug affects 1 person
Affects: Ubuntu Font Family
Importance: Wishlist
Assigned to: Unassigned

Bug Description

I often use OCR to convert PDF documents into text files for translation. Fonts poorly designed for OCR processing will, for example, show systematic mismatches between "1", "l", "I", "i" and "j".
A quick glance at Ubuntu font shows that it should be reasonably OCR-proof, except for the lower-case "l" that looks quite similar to an upper-case "i", and the upper-case "o" looking much like a zero.
Maybe Ubuntu font could take a daring stance and be *very* OCR-proof. This would make it a resolutely modern font, with a technical incentive for wide adoption: it would just work better on computers.

Thank you so much for your beautiful work.

frere (frere) wrote :

Automatic Screenshot

Paul Sladen (sladen) on 2010-10-03
summary: - OCR efficiency
+ Technical: check/refine OCR efficiency of UFF
Changed in ubuntu-font-family:
importance: Undecided → Low
milestone: none → 1.00
status: New → Triaged
description: updated
Paul Sladen (sladen) wrote :

Frere: the stylistic direction of the Ubuntu Font Family has been mostly set, but there have been some recent changes to the design of glyphs (bug #623925, bug #621157).

Thank you for looking at the current fonts and giving an estimation. It would be very useful if you could undertake some of the research on the *real* impact of OCRing, so that it can be kept in mind as the design of future glyphs comes up. Could you try doing OCR in the same real-world setting that you currently use, and post the results you get with the types of texts that you encounter?

Changed in ubuntu-font-family:
importance: Low → Wishlist
frere (frere) wrote :

Hi Paul,

Here are some suggestions I made to Mark.

In my experience, OCR mismatches mainly concern two groups:
A)
- Lower case "l"
- Upper case "I"
- Number "1"
- Number "7"
[ Suggestions ]
* Definitely rework upper-case "I", which looks like a simple vertical bar and can be mistaken for a bunch of other characters.
* Clearly differentiate both top and bottom parts of number "1" and letter "l". The current Ubuntu Font would probably score about average here.
* Add a (thin) horizontal bar in the middle of number seven "7".

B)
- Lower case "o"
- Upper case "O"
- Number "0"
[ Suggestions ]
* Add a dot inside the zero "0" in order to distinguish it from upper-case "O". Another main source of OCR mismatches and easy to fix.
* Add a distinguishing feature in the top-right corner of lower case "o", which could recall the handwritten letter.

I took a standard text (the first chapter of Genesis), made a second version in all upper-case, and started systematically swapping the above-mentioned characters, transforming both texts into human-readable gibberish. The idea is to then print to PDF, convert to low-resolution JPG or PNG, and process with state-of-the-art OCR software (I am using FineReader in WinXP on VirtualBox). The more mismatches we get, the better. I would also like to test a technical document containing plenty of numbers and codes, which are tricky for OCRing. Unfortunately, I am extremely busy professionally and have had no time to finalize this properly. I will keep you informed.
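The character-swapping step could be sketched roughly as follows. This is only an illustration, not the script actually used for the test: the grouping into `GROUPS`, the rotation order and the names `build_rotation` and `scramble` are all assumptions made for the example.

```python
# Assumed confusable groups, taken from lists A and B above.
GROUPS = ["lI17", "oO0"]


def build_rotation(groups):
    """Map every character to the next one in its group (wrapping around)."""
    table = {}
    for group in groups:
        for i, ch in enumerate(group):
            table[ord(ch)] = group[(i + 1) % len(group)]
    return table


ROTATION = build_rotation(GROUPS)


def scramble(text):
    """Swap confusable characters; everything else is left untouched."""
    return text.translate(ROTATION)


if __name__ == "__main__":
    print(scramble("In the beginning God created the heaven and the earth. 1:1"))
```

Because each character is always replaced by a known partner, the scrambled text can later be compared against the OCR output character by character.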

Ideally, people working on OCR software should be consulted. Facing these problems is what they spend their time doing. They will possibly be delighted to point at problems and solutions for Ubuntu Font. You could try contacting the good people at GOCR, Ocropus and Tesseract.

I hope this helps.

Paul Sladen (sladen) wrote :

I guess OCR read-back is something that we could automate and run as part of a test-suite to get hard numbers out of. I don't know to what degree it will be possible to go back and redraw the core Latin set, but it would be interesting to graph the outputs numerically. It's one of the few things about a font that it would be possible to test pragmatically and that could provide some level of regression feedback.
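As a rough idea of what "hard numbers" could mean here, the sketch below compares a known reference text with whatever the OCR engine returned. It uses only Python's standard `difflib`; the function names `confusions` and `error_rate` are invented for this example, and the rendering and OCR steps themselves are assumed to happen elsewhere.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusions(expected, recognized):
    """Count (expected, recognized) character pairs that were substituted."""
    counts = Counter()
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements can be paired up one-to-one.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for want, got in zip(expected[i1:i2], recognized[j1:j2]):
                counts[(want, got)] += 1
    return counts


def error_rate(expected, recognized):
    """Fraction of expected characters that the matcher could not align."""
    if not expected:
        return 0.0
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / len(expected)


if __name__ == "__main__":
    print(confusions("Illegal 100", "lllegal 1OO"))
    print(error_rate("Illegal 100", "lllegal 1OO"))
```

Tracking the per-pair counts across font releases would show whether a glyph change made a particular confusion more or less frequent.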

I don't know how much need there will be for paper-OCRing, but I think OCR might have a use in, e.g., automatically OCRing and indexing screenshots in Launchpad. For instance, just yesterday somebody pasted an strace/stacktrace of a program as a bitmap screenshot.

We have some OCR software in Ubuntu called "Cuneiform" which (according to the documentation) supports "English, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Mixed Russian-English, Ukrainian, Danish, Swedish, Finnish, Serbian, Croatian, Polish and others".

You've actually raised an important philosophical issue about type
design. What you're identifying is actually a clean distinction between
the purposes of different fonts and different types of writing.

For prose and everyday text the ambiguity between characters is not
really an issue; while humans have strong expectations about
letterforms, they actually read whole words not individual letters, and
not all characters are equal when it comes to differentiating words. For
OCR software reading everyday text or prose, the same thing applies -
it's the words that matter, not the letters. Should someone actually
accidentally spell the word "alliance" as "aIIiance", you'd want the
software to do what a human would do - and recognize it as "alliance".
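The word-level behaviour described above might look something like this minimal sketch. It is not how any particular OCR engine works: the confusable mapping, the tiny word list and the names `skeleton` and `make_corrector` are assumptions made purely for illustration.

```python
# Assumed mapping of look-alike characters onto one canonical form.
CANONICAL = str.maketrans({"I": "l", "1": "l", "O": "o", "0": "o"})


def skeleton(word):
    """Collapse look-alike characters so confusable spellings compare equal."""
    return word.translate(CANONICAL).lower()


def make_corrector(words):
    """Build a function that repairs words using a known vocabulary."""
    known = {w.lower() for w in words}
    by_skeleton = {skeleton(w): w for w in known}

    def correct(word):
        if word.lower() in known:
            return word  # already a real word, leave it alone
        return by_skeleton.get(skeleton(word), word)

    return correct


if __name__ == "__main__":
    correct = make_corrector(["alliance", "iron"])
    print(correct("aIIiance"))
```

A check like this only helps for dictionary words, which is why it does nothing for listings, codes and numbers.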

Introducing unnecessary distinctive features to characters can actually
make them harder to recognize, and jarring to read. But for some kinds
of text, such as listings, every character is equally important and it's
absolutely vital that they are distinct. This is where monospace fonts
fit in - quite literally every character is equal, and we go out of our
way to make every glyph unambiguous, which is something you'll see when
the monospace font is delivered.

Dave

frere (frere) wrote :

I see your point Dave. However, I also would think that when publishing printable material such as the Full Circle Magazine or any work produced with Scribus, besides choosing a pleasant font, one would actually appreciate that the font also be OCR-friendly. Maybe we could manage to make Ubuntu Font OCR-friendlier with little extra work by slightly enhancing half a dozen characters at most. But then font design is definitely something I never actually dealt with, so I don't realize how challenging such an endeavour may be...

I would argue the exact opposite - that if a day-to-day text font is
readable, legible, and pleasant to the human eye, but OCR software has
difficulty with it, then it is the OCR software that is failing and at
fault, not the font.

Dave

Paul Sladen (sladen) wrote :

frere: re: 'Add a dot inside the zero "0" in order to distinguish it from upper-case "O".' Thank you for the suggestion; this has been added to the Ubuntu Mono beta. For pure-OCR work, Ubuntu Mono is likely to be much more suitable, I would have thought. I've started experimenting with printing address labels in Ubuntu Mono and think they look wonderful!

If you wanted to do some further OCR testing (so we could at least document what works and what doesn't) then the mono is probably the place to do it.
