Comment 60 for bug 623438

Rudolf (rk-com) wrote :

Many thanks to George Chriss! (see above)

My workaround based on his description:
Modify the generated hOCR file with XSLT (see below), then run hocr2pdf 0.8.9 on the result; the text boxes are then placed (almost) correctly.

$ tesseract image.tif ocr_file hocr
$ xsltproc -html -nonet -novalid -o ocr_fixed.hocr fix-hocr.xsl ocr_file.hocr
$ hocr2pdf -i image.tif -o searchable.pdf <ocr_fixed.hocr

See attached file fix-hocr.xsl.
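
If you have to do this for more than one image, the three commands above could be wrapped in a small shell function. This is only a rough sketch: the names run, make_searchable_pdf and DRY_RUN are made up for illustration, and it assumes that tesseract, xsltproc and hocr2pdf are on the PATH, that fix-hocr.xsl is in the current directory, and that tesseract writes its output to <base>.hocr as in the commands above.

```shell
#!/bin/sh
# Sketch only: chains the three steps from this comment for one image.
# run, make_searchable_pdf and DRY_RUN are hypothetical names, not part
# of tesseract, xsltproc or hocr2pdf.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

make_searchable_pdf() {
    image=$1
    base=${image%.*}    # image.tif -> image

    # Step 1: OCR the image (assumed to produce $base.hocr)
    run tesseract "$image" "$base" hocr || return 1

    # Step 2: fix the hOCR with the attached stylesheet
    run xsltproc -html -nonet -novalid -o "${base}_fixed.hocr" \
        fix-hocr.xsl "$base.hocr" || return 1

    # Step 3: hocr2pdf reads the fixed hOCR from standard input
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "hocr2pdf -i $image -o $base.pdf <${base}_fixed.hocr"
    else
        hocr2pdf -i "$image" -o "$base.pdf" <"${base}_fixed.hocr"
    fi
}
```

After sourcing this, make_searchable_pdf image.tif would write image.pdf next to the image; with DRY_RUN=1 set beforehand it only prints the commands, which is handy for checking the file names before running anything.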