calibre

Bug #822744
Comment #3

helour (helourx) wrote on 2011-08-24:

"pdftohtml tends to break non-ascii characters into parts" - yes, you are right, that's the problem. But what about UTF-8 (i.e. non-ASCII) characters in multilingual documents?

"I need to know what the character is broken into" - Simply put, every UTF-8 character that is not listed in the source code (the PDFTOHTML list in preprocess.py) causes a problem.

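As a rough illustration of the point about enumerated characters (this is not calibre's code, and the names `NON_ASCII_LETTER` and `join_after_non_ascii` are made up for the example), a single rule could cover every non-ASCII letter instead of a fixed list. It assumes the false breaks show up as a blank line (`\n\n`) right after such a letter:

```python
import re

# Any letter outside ASCII: the lookahead rejects ASCII characters,
# and [^\W\d_] keeps only letters (word characters minus digits and "_").
NON_ASCII_LETTER = r"(?![\x00-\x7F])[^\W\d_]"


def join_after_non_ascii(text):
    """Replace a blank-line break that directly follows a non-ASCII letter with a space."""
    return re.sub("(" + NON_ASCII_LETTER + r")\n\n", r"\1 ", text)


if __name__ == "__main__":
    print(join_after_non_ascii("Straße ist groß\n\nund breit"))
```

A break after an ASCII character (for example after a full stop) is left alone, so ordinary paragraph ends are not affected by this rule.
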
I have written a simple postprocessing application in Python that unwraps the problematic parts of Czech and Slovak *.fb2 (FictionBook) documents. Maybe this part of it will help you understand what is wrong:

import re

inside = re.sub(r'([áéíýúäôľščťžňďěřůÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ])\n\n', r'\1 ', inside)  # non-ASCII Central European characters
inside = re.sub(r'(,“)\n\n([a-záéíýúäôľščťžňďěřů„“])', r'\1 \2', inside)  # quotation mark after a comma
inside = re.sub(r'(“)\n\n([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)  # quotation mark in direct speech
inside = re.sub(r'([A-Z])\n\n([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)  # uppercase letter before the break

if remove_hyphen:
    # join words hyphenated across a false paragraph break
    inside = re.sub(r'([\sa-žA-Ž])-\n\n([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)
    # remove hyphens left inside words
    inside = re.sub(r'(\s|„|“)([a-žA-Ž]+)-([a-záéíýúäôľščťžňďěřů])', r'\1\2\3', inside)
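
For anyone who wants to try the rules above on a sample string, here is a minimal self-contained sketch. The function name `unwrap_paragraphs` and the constants `CE_LOWER`, `CE_UPPER` and `LOWER` are only for this example; they are not part of calibre or of the full script. It assumes the text is already a decoded Python string and that the false breaks appear as `\n\n`:

```python
import re

# Czech and Slovak accented letters, as used in the rules above.
CE_LOWER = "áéíýúäôľščťžňďěřů"
CE_UPPER = "ÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ"
LOWER = "a-z" + CE_LOWER


def unwrap_paragraphs(text, remove_hyphen=False):
    """Apply the unwrapping rules to text and return the result."""
    # Accented letter directly followed by a false paragraph break.
    text = re.sub("([" + CE_LOWER + CE_UPPER + r"])\n\n", r"\1 ", text)
    # Quotation mark after a comma.
    text = re.sub(r"(,“)\n\n([" + LOWER + "„“])", r"\1 \2", text)
    # Quotation mark in direct speech.
    text = re.sub(r"(“)\n\n([" + LOWER + "])", r"\1 \2", text)
    # Uppercase ASCII letter before the break, lowercase letter after it.
    text = re.sub(r"([A-Z])\n\n([" + LOWER + "])", r"\1 \2", text)
    if remove_hyphen:
        # Word hyphenated across a false paragraph break.
        text = re.sub(r"([\sa-žA-Ž])-\n\n([" + LOWER + "])", r"\1\2", text)
        # Hyphen left inside a word.
        text = re.sub(r"(\s|„|“)([a-žA-Ž]+)-([" + LOWER + "])", r"\1\2\3", text)
    return text


if __name__ == "__main__":
    print(unwrap_paragraphs("Příliš\n\nžluťoučký kůň"))
    print(unwrap_paragraphs("roz-\n\ndělení", remove_hyphen=True))
```

Note that the last hyphen rule also joins words that are legitimately hyphenated when they follow whitespace or an opening quotation mark, so `remove_hyphen` is best left off unless the document is known to contain only line-break hyphens.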

PS: I am not able to build calibre from the source code :( (bad dependency on poppler configured with --enable-xpdf-headers). If I manage to do that, I can simply rewrite preprocess.py and post a patch.