no language codes for ImagePDF books

Bug #187108 reported by Hank Bromley
Affects: Deriver
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

All the ImagePDF books are being OCR'd as English because we have no bibliographic metadata for them. Is it possible to build acquisition of bibliographic metadata into the processing pathway? Or, short of full bibliographic data, can we introduce some sort of human-assisted addition of language info to the metadata? Otherwise, our OCR output (and the resulting text layer in the PDFs) is going to be junk on all non-English books.

For instance, here's a German one that was done recently:

ftp://ia360603.us.archive.org/0/items/mitteilungen01gescgoog/mitteilungen01gescgoog_djvu.txt
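
To make the request concrete, here is a rough sketch of the kind of check the processing pathway could run before OCR. It is only an illustration: the `language` metadata key, the `KNOWN_LANGUAGES` table, and the `choose_ocr_language` function are all hypothetical names, not part of the actual Deriver code. The idea is to use the item's language when one is recorded, and otherwise flag the item for human review instead of silently defaulting to English.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical lookup from metadata values to OCR language names.
# A real table would need to cover whatever codes the metadata uses.
KNOWN_LANGUAGES = {
    "en": "English", "eng": "English", "english": "English",
    "de": "German", "ger": "German", "deu": "German", "german": "German",
    "fr": "French", "fre": "French", "fra": "French", "french": "French",
}


def choose_ocr_language(metadata: Mapping[str, str]) -> Tuple[Optional[str], bool]:
    """Return (ocr_language, needs_human_review) for one item.

    Assumes the language, if present, is stored as a string under the
    hypothetical key "language".
    """
    raw = (metadata.get("language") or "").strip().lower()
    if raw in KNOWN_LANGUAGES:
        return KNOWN_LANGUAGES[raw], False
    # No usable language info: do not guess English, ask a human instead.
    return None, True


if __name__ == "__main__":
    print(choose_ocr_language({"language": "ger"}))
    print(choose_ocr_language({"title": "no language recorded"}))
```

Items that come back flagged for review could be queued for the human-assisted language entry suggested above, and only sent to OCR once a language has been filled in.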

Tags: imagepdf