Comment 3 for bug 822744

Revision history for this message
helour (helourx) wrote :

"pdftohtml tends to break non-ascii characters into parts" - yes, you are right - that is exactly the problem. But what about UTF-8 (i.e. non-ASCII) characters in multilingual documents?

"I need to know what the character is broken into" - simply put: every UTF-8 character that is not listed in the source code (the PDFTOHTML list in preprocess.py) causes problems.

I have written a simple post-processing application in Python which unwraps the problematic parts of Czech & Slovak *.fb2 (Fiction Book) documents. Maybe this part of it will help you understand what goes wrong:

inside = re.sub(r'([áéíýúäôľščťžňďěřůÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ])</p>\n\n<p>', r'\1 ', inside) #non-ASCII Central European characters
inside = re.sub(r'(,“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů„“])', r'\1 \2', inside) #quotation mark after comma
inside = re.sub(r'(“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside) #quotation mark in direct speech
inside = re.sub(r'([A-Z])</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside) #uppercase letter

if remove_hyphen:
  inside = re.sub(r'([\sa-žA-Ž])-</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)
  inside = re.sub(r'(\s|„|“)([a-žA-Ž]+)-([a-záéíýúäôľščťžňďěřů])', r'\1\2\3', inside)
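For reference, the substitutions above can be wrapped into a self-contained function so they are easy to try out. This is only a sketch: the function name `unwrap_broken_paragraphs` and the sample strings are mine, not from calibre, and it assumes pdftohtml emits the paragraph break as a literal `</p>\n\n<p>` sequence:

```python
import re

def unwrap_broken_paragraphs(inside, remove_hyphen=True):
    # Rejoin paragraphs that pdftohtml split in the middle of a sentence.
    # The character classes cover Czech and Slovak accented letters.

    # break directly after a non-ASCII Central European letter
    inside = re.sub(r'([áéíýúäôľščťžňďěřůÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ])</p>\n\n<p>', r'\1 ', inside)
    # closing quotation mark after a comma
    inside = re.sub(r'(,“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů„“])', r'\1 \2', inside)
    # closing quotation mark in direct speech
    inside = re.sub(r'(“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)
    # break after an uppercase ASCII letter
    inside = re.sub(r'([A-Z])</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)

    if remove_hyphen:
        # hyphenated word split across the paragraph break
        inside = re.sub(r'([\sa-žA-Ž])-</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)
        # leftover in-word hyphen before an accented letter
        inside = re.sub(r'(\s|„|“)([a-žA-Ž]+)-([a-záéíýúäôľščťžňďěřů])', r'\1\2\3', inside)
    return inside

print(unwrap_broken_paragraphs('slovenská</p>\n\n<p>kniha'))  # → slovenská kniha
print(unwrap_broken_paragraphs('slo-</p>\n\n<p>venská'))      # → slovenská
```

Note that the ranges like `a-ž` rely on Python 3's Unicode string semantics: the range runs over code points from U+0061 to U+017E, which happens to include the Czech/Slovak letters.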

PS: I am not able to build calibre from source :( (a dependency problem: poppler has to be configured with --enable-xpdf-headers). If I manage to do that, I can simply rewrite preprocess.py and post a patch.