Feature Request: Add 'Export to hOCR' in Save Dialog

Bug #1069248 reported by George Chriss
This bug affects 1 person
Affects: Inkscape
Status: New
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

Related to Bug #179309, add an option in the 'Save As' dialog to 'Export as hOCR (.hocr)' --- the OCR XHTML format, not the Hebrew OCR library --- constructed from flowPara elements present in the Inkscape document. The options dialog that follows should provide an option to embed the hOCR file into an existing PDF such that (additional) PDF pages become text-searchable.
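
For reference, an hOCR text line is an XHTML element whose 'title' attribute carries a pixel bounding box. The sketch below shows one way such a line could be formatted; the helper name 'hocr_line' and the sample coordinates are made up for illustration and are not part of Inkscape or of any existing tool.

```python
from xml.sax.saxutils import escape


def hocr_line(text, x0, y0, x1, y1):
    """Return one hOCR line element for text inside the box (x0, y0)-(x1, y1).

    hocr_line is a hypothetical helper; coordinates are image pixels
    measured from the top-left corner of the page.
    """
    return "<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>" % (
        x0, y0, x1, y1, escape(text))


# Sample values only: one transcribed line and its box on the scan.
print(hocr_line("Dear Sir & Madam,", 120, 340, 980, 392))
```

An exporter would emit one such element per text box and wrap them in a page-level element.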

At the moment there appears to be no FLOSS-friendly way to generate hOCR files manually (as opposed to machine recognition) based on the starting graphic other than entirely by hand. Doing so is impractical for complex documents but necessary nevertheless when machine recognition would fare poorly (e.g., historical hand-written documents -- even paragraph bounding boxes are problematic).

This is an important, underdeveloped area that Inkscape could address with limited development work. Import a full-size TIF/PDF page, use Inkscape drawing tools to graphically overlay the scanned text with XML-based text layers, save, rinse, repeat, and voilà!

See also:

The hOCR Embedded OCR Workflow and Output Format / Thomas Breuel (editor)
http://docs.google.com/View?docid=dfxcv4vc_67g844kf

moz-hocr-edit (OCR proofreader)
http://jimgarrison.org/moz-hocr-edit/

Distributed Proofreaders
http://en.wikipedia.org/wiki/Distributed_Proofreaders

DPCustomMono (OCR-friendly proofreading font)
http://boingboing.net/2012/10/01/font-designed-for-proofreading.html

Tags: exporting
su_v (suv-lp)
tags: added: exporting
Changed in inkscape:
importance: Undecided → Wishlist
jsbien (jsbien) wrote :

The statement

At the moment there appears to be no FLOSS-friendly way to generate hOCR files manually (as opposed to machine recognition) based on the starting graphic other than entirely by hand

requires a comment: there is Jakub Wilk's suite of DjVu tools; cf. http://jwilk.net or the overview at http://bc.klf.uw.edu.pl/298/.

Regards

Janusz

George Chriss (gschriss) wrote :

Thank you for the pointer to DjVu. It's encouraging to see an independent, robust document format.

Unfortunately, DjVu doesn't address this issue. The Wilk toolset provides DjVu<->hOCR interoperability ('djvu2hocr' and 'hocr2djvused'), and djvusmooth can edit the text and resize the 'LINE' and 'WORD' bounding boxes in native DjVu 'HIDDENTEXT' syntax, but that brings us back to the basis of the original comment.

Having high-resolution tiff(s) with corresponding hOCR files seems to be the best starting point for both PDF- and DjVu-based document representation. Finding sensible ways to generate hOCR files seems to be the next step forward.

After setting 'Inkscape Preferences -> Transforms -> Store transformation' to 'Optimized', creating a script to build hOCR from Inkscape XML elements doesn't seem so scary, so I'm going to give it a whirl.
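
As a rough illustration of what such a script might do (this is not the attached extension, only a sketch), the code below walks the 'flowRoot' elements of an SVG document, reads the rectangle inside each 'flowRegion', and emits one hOCR line per text box. It assumes that the 'Optimized' setting has left coordinates directly on the 'rect' attributes (any 'transform' attribute is ignored) and that one SVG user unit equals one pixel of the linked image. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SVG = "{http://www.w3.org/2000/svg}"


def flowroots_to_hocr(svg_text):
    """Return one hOCR 'ocr_line' span per flowRoot found in svg_text.

    Assumptions: no transforms need applying, and SVG user units are
    image pixels with the origin at the top-left corner.
    """
    spans = []
    for flow in ET.fromstring(svg_text).iter(SVG + "flowRoot"):
        rect = flow.find(SVG + "flowRegion/" + SVG + "rect")
        if rect is None:
            continue  # no rectangular region to take a bounding box from
        x, y = float(rect.get("x", "0")), float(rect.get("y", "0"))
        w, h = float(rect.get("width", "0")), float(rect.get("height", "0"))
        # Join the text of all flowPara children, skipping empty ones.
        paras = ("".join(p.itertext()).strip()
                 for p in flow.iter(SVG + "flowPara"))
        text = " ".join(t for t in paras if t)
        bbox = tuple(int(round(v)) for v in (x, y, x + w, y + h))
        spans.append(
            "<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>"
            % (bbox + (escape(text),)))
    return spans


SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" width="1000" height="1400">
  <flowRoot>
    <flowRegion><rect x="100" y="200" width="800" height="50"/></flowRegion>
    <flowPara>First line</flowPara>
  </flowRoot>
</svg>"""

for span in flowroots_to_hocr(SAMPLE):
    print(span)
```

If the transformation preference is left at its other setting, the rectangles may carry a 'transform' attribute that this sketch would not account for, which is why the preference matters.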

George Chriss (gschriss) wrote :

Attached is code for a new extension, "Export Image Overlay Text as hOCR".

Usage directions:
 1) Copy 'export_hocr.py' and 'export_hocr.inx' to the Inkscape extension directory (e.g., '/usr/share/inkscape/extensions') with appropriate permissions (+x for 'export_hocr.py' and +r for 'export_hocr.inx').
 2) Open 'File -> Inkscape Preferences -> Transforms' (shortcut = Shift+Ctrl+P) and change the 'Store transformations' property to 'Optimized'.
 3) Select 'File -> Open', select an image file, click 'Open', and choose to 'Link' the file (not 'Embed'). Do not use 'File -> Import' unless you also resize the document to match the dimensions of the imported image.
 4) Select the 'Create and edit text objects' tool (shortcut = F8), create a text box around a text area in the image, one box per line of text, and, optionally, type out the corresponding text by hand. Repeat as necessary. Style formatting is currently discarded during export.
 5) Optionally, use the XML editor (shortcut = Shift+Ctrl+X) to verify text box placement and the corresponding text, and remove any duplicate or misplaced boxes (delete their 'flowRoot' nodes).
 6) Select 'File -> Save' (shortcut = Ctrl+S) to save the current editing session.
 7) Select 'File -> Save a Copy' (shortcut = Shift+Ctrl+Alt+S), select 'hOCR Metadata (*.html)' in the filetype pull-down menu, provide a file name, then click 'Save'. Verify the file contents in a web browser or text editor.
 8) Optionally, edit text regions using the 'moz-hocr-edit' Firefox extension. (PNG files were tested; other formats may fail to display correctly.)
 9) Optionally, use the exported hOCR file to create searchable PDFs via 'hocr2pdf', provided by the 'exactimage' package.
        Documentation: http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
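
The final, optional 'hocr2pdf' step can also be scripted. The sketch below assumes that 'hocr2pdf' is on the PATH and that it takes the page image via '-i', the output PDF via '-o', and the hOCR markup on standard input; check the documentation linked above for the installed version. The function names and file names are placeholders.

```python
import subprocess


def hocr2pdf_command(image_path, pdf_path):
    """Build the argument list for hocr2pdf (flags assumed: -i image, -o PDF)."""
    return ["hocr2pdf", "-i", image_path, "-o", pdf_path]


def make_searchable_pdf(image_path, hocr_path, pdf_path):
    """Feed the exported hOCR file to hocr2pdf on standard input."""
    with open(hocr_path, "rb") as hocr:
        subprocess.run(hocr2pdf_command(image_path, pdf_path),
                       stdin=hocr, check=True)


# Placeholder file names; this only prints the command, it does not run it.
print(" ".join(hocr2pdf_command("page.tif", "page.pdf")))
```

Calling 'make_searchable_pdf("page.tif", "page.html", "page.pdf")' would then run the tool on one page, raising an error if it exits with a non-zero status.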

Patches and error reports welcome. Additional comments provided as code comments.

George Chriss (gschriss) wrote :

Corresponding Extension Index file for Comment #3.

George Chriss (gschriss) wrote :

Support for additional hOCR elements (e.g., logical document structure, column, paragraph, float, etc.) would be better handled with a new editing view in Inkscape. The hOCR extension works for basic line identification; the next step is new GUI features...
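
To make the gap concrete: hOCR lets lines be nested inside larger elements such as paragraphs ('ocr_par'), each with its own bounding box. The sketch below groups already-identified lines into one paragraph and derives the paragraph box as the union of the line boxes. The grouping itself is what a GUI would have to supply; the function names and sample data are hypothetical.

```python
from xml.sax.saxutils import escape


def union_bbox(boxes):
    """Smallest box (x0, y0, x1, y1) that contains every box in boxes."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))


def hocr_paragraph(lines):
    """Wrap (text, bbox) line tuples in a single hOCR 'ocr_par' element."""
    box = union_bbox([bbox for _, bbox in lines])
    spans = "\n".join(
        "  <span class='ocr_line' title='bbox %d %d %d %d'>%s</span>"
        % (bbox + (escape(text),))
        for text, bbox in lines)
    return "<p class='ocr_par' title='bbox %d %d %d %d'>\n%s\n</p>" % (
        box + (spans,))


# Sample data: two lines that a user has marked as one paragraph.
print(hocr_paragraph([
    ("First line of the paragraph", (100, 200, 900, 250)),
    ("and its second line", (100, 260, 640, 310)),
]))
```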

George Chriss (gschriss) wrote :

Code now in Gitorious: https://gitorious.org/export-text-as-hocr/inkscape-hocr

I'll open a new bug for the GUI feature request.
