Unwrap problem - some central european chars missing

Bug #822744 reported by helour
This bug affects 1 person
Affects: calibre
Status: Fix Released
Importance: Undecided
Assigned to: John Schember
Milestone: (none)

Bug Description

I have found a problem with the "automatic" and heuristic unwrap in the conversion of Central European (CE) documents: some CE characters are missing. Please look at the second character table here: http://biega.com/special-char.html

Many thanks.

Kovid Goyal (kovid)
Changed in calibre:
assignee: nobody → John Schember (user-none)
status: New → Triaged
helour (helourx) wrote :

I have found incomplete character tables (with some CE characters missing) in these files: preprocess.py (the PDFTOHTML list), unsmarten.py, utils.py.
Perhaps only the first one is critical for automatic line unwrapping of PDF documents.

John Schember (user-none) wrote :

The table doesn't provide much help. I need examples of the output and the correct character. pdftohtml tends to break non-ascii characters into parts. I need to know what the character is broken into.

helour (helourx) wrote :

"pdftohtml tends to break non-ascii characters into parts" - yes, you are right, that is the problem. But what about UTF-8 (i.e. non-ASCII) characters in multilingual documents?

"I need to know what the character is broken into" - put simply, every UTF-8 character that is not in the source code (preprocess.py, PDFTOHTML list) causes a problem.

I have written a simple post-processing application in Python that unwraps the problematic parts of Czech and Slovak *.fb2 (FictionBook) documents. Maybe this part of it will help you understand what is wrong:

import re

# Excerpt: `inside` holds the document text and `remove_hyphen` is a flag, both set elsewhere in the script.
inside = re.sub(r'([áéíýúäôľščťžňďěřůÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ])</p>\n\n<p>', r'\1 ', inside)  # non-ASCII CE character at the end of a paragraph
inside = re.sub(r'(,“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů„“])', r'\1 \2', inside)  # quotation mark after a comma
inside = re.sub(r'(“)</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)  # quotation mark in direct speech
inside = re.sub(r'([A-Z])</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)  # uppercase letter

if remove_hyphen:
  # rejoin a word hyphenated across a paragraph break
  inside = re.sub(r'([\sa-žA-Ž])-</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)
  # drop a hyphen left inside a word
  inside = re.sub(r'(\s|„|“)([a-žA-Ž]+)-([a-záéíýúäôľščťžňďěřů])', r'\1\2\3', inside)
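
For readers who want to try the idea on a small input, here is a minimal, self-contained sketch of the same approach. The function name `unwrap`, the constant `CE_LOWER` and the sample text are made up for illustration, and the hyphen rule uses `\w` as a simplification of the character ranges in the excerpt above. It is not code from calibre.

```python
import re

# Lowercase Czech/Slovak letters taken from the patterns in the excerpt above.
CE_LOWER = "áéíýúäôľščťžňďěřů"


def unwrap(text, remove_hyphen=False):
    """Join <p> blocks that were split in the middle of a sentence (illustrative only)."""
    # A paragraph that ends in an accented lowercase letter is treated as unfinished.
    text = re.sub(r'([' + CE_LOWER + r'])</p>\n\n<p>', r'\1 ', text)
    if remove_hyphen:
        # A word hyphenated across the paragraph break is rejoined without the hyphen.
        # Assumption: \w stands in for the wider letter ranges used in the excerpt.
        text = re.sub(r'(\w)-</p>\n\n<p>([a-z' + CE_LOWER + r'])', r'\1\2', text)
    return text


sample = "<p>Bol to veľmi dlhý</p>\n\n<p>deň a všetci boli una-</p>\n\n<p>vení.</p>"
result = unwrap(sample, remove_hyphen=True)
print(result)
```

With `remove_hyphen=False` only the first break is joined and the hyphenated word stays split, which mirrors the optional branch in the excerpt.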

PS: I am not able to build calibre from the source code :( (a dependency problem with poppler configured with --enable-xpdf-headers). If I manage to do that, I can simply rewrite preprocess.py and post a patch.

Kovid Goyal (kovid) wrote :

You do not need to build calibre from source code; see the instructions at http://manual.calibre-ebook.com/develop.html

helour (helourx) wrote :

Many thanks. I will try :)

John Schember (user-none) wrote :

Look at calibre.ebooks.conversion.preprocess.HTMLPreProcessor.PDFTOHTML. You will see a list of characters that are broken into components. For example: ä becomes ¨a. For the incomplete list in PDFTOHTML you need to specify whether characters are missing from one of the character blocks or whether a character type is not represented at all. If it is not represented, I need to know how pdftohtml is breaking the character. The list of characters there is only for ones that pdftohtml reads improperly and outputs incorrectly.
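
To make the "broken into components" idea concrete, here is a rough sketch of a repair table of that kind. Only the ä example comes from this thread; the caron entry is a pure guess at how such a character might be broken, and the names `BROKEN_CHARS` and `repair` are hypothetical. The actual PDFTOHTML list may be structured differently.

```python
import re

# Hypothetical repair table: (pattern for the broken output, intended character).
BROKEN_CHARS = [
    # From the thread: ä comes out as a diaeresis followed by a plain "a".
    (re.compile(r'¨\s*a'), 'ä'),
    # Assumption only: a caron emitted before the base letter. How pdftohtml
    # really breaks this character would have to be checked against real output.
    (re.compile(r'ˇ\s*s'), 'š'),
]


def repair(text):
    """Replace each known broken sequence with the single intended character."""
    for pattern, replacement in BROKEN_CHARS:
        text = pattern.sub(replacement, text)
    return text


print(repair("M¨archen"))
```

A table like this can only repair sequences it already knows about, which is why a sample of the actual broken output is needed for each missing character.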

calibre.ebooks.conversion.utils.HeuristicProcessor.punctuation_unwrap works on line length and punctuation. In this case, adding the additional characters to the lookahead variable should work. I'll add the characters (ôľščťžňďěřů) that are referenced by your example and missing from the expression. That should fix this part of the issue.
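
As a rough illustration of why the character list matters for unwrapping based on line length and punctuation, here is a simplified sketch. It is not the expression used in punctuation_unwrap: the function `build_unwrap_regex`, the base letter set and the length threshold are all assumptions chosen for the demonstration.

```python
import re


def build_unwrap_regex(length, extra_chars=""):
    """Match a line break that follows a long line ending in a 'soft' character."""
    # Assumed base set: ASCII lowercase plus a few accented letters.
    letters = "a-záéíýúä" + extra_chars
    # Fixed-width lookbehind: `length` characters on the same line, then one
    # final letter, comma or colon directly before the newline.
    lookbehind = r"(?<=.{%d}[%s,:])" % (length, letters)
    return re.compile(lookbehind + r"\n(?=\S)")


text = "Potom sa rozhodol, že už nebude čakať\na odišiel domov."

basic = build_unwrap_regex(20)
extended = build_unwrap_regex(20, "ôľščťžňďěřů")

print(basic.sub(" ", text))     # line break kept: "ť" is not in the base set
print(extended.sub(" ", text))  # line break replaced by a space
```

The first line ends in "ť", so the break is only recognised as a soft wrap once that letter is part of the character class.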

calibre.ebooks.txt.unsmarten is specific to converting certain unicode characters to their representation in Textile format. Any changes here would also need to be reflected in calibre.ebooks.textile.functions.Textile. You would need to see what these characters would convert to in Textile.

helour (helourx) wrote :

Thanks,

- the first problem (CE character at the end of a line):
re.sub(r'([ľščťžňďěřů])</p>\n\n<p>', r'\1 ', inside)
I have solved this in the calibre source code. You don't need to add the missing characters. I will send you my patch.

- the second problem (CE character before a hyphen):
re.sub(r'([áéíýúäôľščťžňďěřů])-</p>\n\n<p>([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)
still persists; I will have to take a closer look at the sources.

Please look at the attached files.

helour (helourx) wrote :

I have made some patches. Please take a look at them.

John Schember (user-none) wrote :

Patches look fine. Merging them. Next time you submit a patch please use either the src or the root directory as the base path. This makes it easier to integrate the patches.

Changed in calibre:
status: Triaged → Fix Committed
Kovid Goyal (kovid) wrote : Fixed in lp:calibre

Fixed in branch lp:calibre. The fix will be in the next release. calibre is usually released every Friday.

 status fixreleased

Changed in calibre:
status: Fix Committed → Fix Released