Feature Request: Add 'Export to hOCR' in Save Dialog
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Inkscape | New | Wishlist | Unassigned | |
Bug Description
Related to Bug #179309: add an option in the 'Save As' dialog to 'Export as hOCR (.hocr)' (the OCR XHTML format, not the Hebrew OCR library), with the output constructed from the flowPara elements present in the Inkscape document. The options dialog that follows should offer to embed the hOCR file into an existing PDF so that (additional) PDF pages become text-searchable.
At the moment there appears to be no FLOSS-friendly way to generate hOCR files manually (as opposed to machine recognition) based on the starting graphic other than entirely by hand. Doing so is impractical for complex documents, yet it is necessary when machine recognition would fare poorly (e.g., historical handwritten documents, where even paragraph bounding boxes are problematic).
This is an important, underdeveloped area which Inkscape could address with limited development work. Import a full-size TIF/PDF page, use Inkscape drawing tools to graphically overlay the scanned text with XML-based text layers, save, rinse, repeat, and voila!
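To make the request more concrete, here is a rough Python sketch of the conversion described above. It is not Inkscape code and not a proposed implementation. It only assumes that each flowRoot holds a flowRegion with a single rect plus one or more flowPara children, that the SVG width and height are unitless pixel values, and that no transforms are applied. The function name `svg_to_hocr`, the `ocr-system` string, and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

SVG = "{http://www.w3.org/2000/svg}"  # SVG namespace prefix for ElementTree

# Hypothetical sample: one flowed-text box drawn over a 200x100 px scan.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <flowRoot>
    <flowRegion><rect x="10" y="20" width="100" height="50"/></flowRegion>
    <flowPara>Dear Sir &amp; Madam</flowPara>
  </flowRoot>
</svg>"""


def _num(value):
    """Parse a unitless SVG length; a missing attribute counts as 0 (assumption)."""
    return float(value) if value is not None else 0.0


def svg_to_hocr(svg_text):
    """Return a minimal hOCR document built from flowRoot/flowPara elements."""
    root = ET.fromstring(svg_text)
    page_w = round(_num(root.get("width")))
    page_h = round(_num(root.get("height")))
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<html xmlns="http://www.w3.org/1999/xhtml">',
        "<head>",
        '<meta name="ocr-system" content="manual transcription (sketch)"/>',
        '<meta name="ocr-capabilities" content="ocr_page ocr_par"/>',
        "</head>",
        "<body>",
        f'<div class="ocr_page" title="bbox 0 0 {page_w} {page_h}">',
    ]
    for flow in root.iter(SVG + "flowRoot"):
        rect = flow.find(f"{SVG}flowRegion/{SVG}rect")
        if rect is None:
            continue  # regions that are not simple rectangles are skipped here
        x, y = _num(rect.get("x")), _num(rect.get("y"))
        w, h = _num(rect.get("width")), _num(rect.get("height"))
        # hOCR bbox is "x0 y0 x1 y1" in pixels, origin at the top-left corner.
        bbox = f"bbox {round(x)} {round(y)} {round(x + w)} {round(y + h)}"
        for para in flow.iter(SVG + "flowPara"):
            # itertext() also collects text nested in child elements.
            text = "".join(para.itertext()).strip()
            if text:
                out.append(f'<p class="ocr_par" title="{bbox}">{escape(text)}</p>')
    out += ["</div>", "</body>", "</html>"]
    return "\n".join(out)
```

Calling `svg_to_hocr(SAMPLE_SVG)` produces a single `ocr_page` div containing one `ocr_par` paragraph with `bbox 10 20 110 70`. All paragraphs in a flowRoot share the bounding box of the region here; a real exporter would need per-line boxes, unit conversion, and transform handling.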
See also:
The hOCR Embedded OCR Workflow and Output Format / Thomas Breuel (editor)
http://
moz-hocr-edit (OCR proofreader)
http://
Distributed Proofreaders
http://
DPCustomMono (OCR-friendly proofreading font)
http://
tags: | added: exporting |
Changed in inkscape: | |
importance: | Undecided → Wishlist |
The statement
At the moment there appears to be no FLOSS-friendly way to generate hOCR files manually (as opposed to machine recognition) based on the starting graphic other than entirely by hand
requires a comment: there is Jakub Wilk's suite of DjVu tools, cf. http://jwilk.net or an overview at http://bc.klf.uw.edu.pl/298/.
Regards
Janusz