"pdftohtml tends to break non-ascii characters into parts" - yes you are right - that's the problem. But what's about utf8 (= "non-asci")characters in multilingual documents?
"I need to know what the character is broken into" - I can say it simple - every utf8 character which is not inside the source code (preprocess.py - PDFTOHTML list) makes problem.
I have written a simple postprocessing script in Python that unwraps the problematic parts of Czech and Slovak *.fb2 (FictionBook) documents. Maybe this part of it will help you understand what is wrong:
inside = re.sub(r'([áéíýúäôľščťžňďěřůÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ])</p>\n\n<p>', r'\1 ', inside)  # non-ASCII Central European characters
inside = re.sub(r'(,“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů„“])', r'\1 \2', inside)  # quotation mark after a comma
inside = re.sub(r'(“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)  # quotation mark in direct speech
inside = re.sub(r'([A-Z])</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)  # uppercase letter
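To make the effect of these substitutions easier to see, here is a minimal sketch that wraps them in a function and runs them on one made-up line. The function name `unwrap_paragraphs` and the sample text are assumptions for illustration only; they are not part of calibre or of the original script.

```python
import re


def unwrap_paragraphs(inside):
    """Join paragraphs that were split in the wrong place (hypothetical helper)."""
    # Paragraph break right after a non-ASCII Czech/Slovak letter.
    inside = re.sub(r'([áéíýúäôľščťžňďěřůÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ])</p>\n\n<p>', r'\1 ', inside)
    # Paragraph break after a comma followed by a closing quotation mark.
    inside = re.sub(r'(,“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů„“])', r'\1 \2', inside)
    # Paragraph break after a closing quotation mark in direct speech.
    inside = re.sub(r'(“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)
    # Paragraph break after an uppercase ASCII letter.
    inside = re.sub(r'([A-Z])</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)
    return inside


# Made-up sample: the paragraph was wrongly split after "keď".
sample = '<p>Videl som ho na ulici, keď</p>\n\n<p>sa vracal domov.</p>'
print(unwrap_paragraphs(sample))
# prints: <p>Videl som ho na ulici, keď sa vracal domov.</p>
```

A paragraph that ends with ordinary punctuation, such as a full stop, is left untouched, so genuine paragraph breaks survive.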
"pdftohtml tends to break non-ascii characters into parts" - yes you are right - that's the problem. But what's about utf8 (= "non-asci" )characters in multilingual documents?
"I need to know what the character is broken into" - I can say it simple - every utf8 character which is not inside the source code (preprocess.py - PDFTOHTML list) makes problem.
if remove_hyphen:
    inside = re.sub(r'([\sa-žA-Ž])-</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)  # hyphen at a paragraph break
    inside = re.sub(r'(\s|„|“)([a-žA-Ž]+)-([a-záéíýúäôľščťžňďěřů])', r'\1\2\3', inside)  # hyphen left inside a word
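The hyphen handling can be tried in isolation with a sketch like the one below. The function name `join_hyphenated` and the sample strings are assumptions made for illustration; only the two regular expressions come from the script.

```python
import re


def join_hyphenated(inside):
    """Rejoin words that were hyphenated by the PDF layout (hypothetical helper)."""
    # Hyphen at the end of a paragraph, with the word continuing in the next one.
    inside = re.sub(r'([\sa-žA-Ž])-</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)
    # Hyphen left in the middle of a word that follows whitespace or a quotation mark.
    inside = re.sub(r'(\s|„|“)([a-žA-Ž]+)-([a-záéíýúäôľščťžňďěřů])', r'\1\2\3', inside)
    return inside


# Made-up sample: "zaujímavý" was hyphenated across a paragraph break.
sample = '<p>Bol to veľmi zaují-</p>\n\n<p>mavý príbeh.</p>'
print(join_hyphenated(sample))
# prints: <p>Bol to veľmi zaujímavý príbeh.</p>
```

Note that the second pattern also joins words that are legitimately written with a hyphen, which is presumably why the step sits behind the remove_hyphen switch.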
PS: I am not able to build calibre from source :( (bad dependency on poppler configured with --enable-xpdf-headers). If I manage to build it, I can simply rewrite preprocess.py and post a patch.