calibre

Unwrap problem - some central european chars missing

Bug #822744 reported by helour on 2011-08-08

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	calibre	Fix Released	Undecided	John Schember

Bug Description

I have found a problem with the "automatic" and heuristic unwrap in the conversion of Central European (CE) documents: some CE characters are missing. Please look at the second character table here: http://biega.com/special-char.html

Many thanks.

Related branches

lp:calibre

Kovid Goyal (kovid) on 2011-08-08

Changed in calibre:
assignee:	nobody → John Schember (user-none)
status:	New → Triaged

helour (helourx) wrote on 2011-08-09:

I have found incomplete character tables (with some CE characters missing) in the files preprocess.py (the PDFTOHTML list), unsmarten.py and utils.py.
Probably only the first one is critical for automatic line unwrapping of PDF documents.

John Schember (user-none) wrote on 2011-08-23:

The table doesn't provide much help. I need examples of the output and the correct character. pdftohtml tends to break non-ascii characters into parts. I need to know what the character is broken into.

helour (helourx) wrote on 2011-08-24:

"pdftohtml tends to break non-ascii characters into parts" - yes, you are right, that's the problem. But what about UTF-8 (= "non-ascii") characters in multilingual documents?

"I need to know what the character is broken into" - put simply, every UTF-8 character that is not in the source code (preprocess.py, PDFTOHTML list) causes a problem.

I have written a simple postprocessing application in Python which unwraps the problematic parts of *.fb2 (FictionBook) Czech and Slovak documents. Maybe this part of it will help you understand what is wrong:

inside = re.sub(r'([áéíýúäôľščťžňďěřůÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ])\n\n', r'\1 ', inside)  # non-ASCII CE character at end of line
inside = re.sub(r'(,“)\n\n([a-záéíýúäôľščťžňďěřů„“])', r'\1 \2', inside)  # quotation mark after comma
inside = re.sub(r'(“)\n\n([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)  # quotation mark in direct speech
inside = re.sub(r'([A-Z])\n\n([a-záéíýúäôľščťžňďěřů])', r'\1 \2', inside)  # uppercase letter at end of line

if remove_hyphen:
    inside = re.sub(r'([\sa-žA-Ž])-\n\n([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)  # hyphen at a false paragraph break
    inside = re.sub(r'(\s|„|“)([a-žA-Ž]+)-([a-záéíýúäôľščťžňďěřů])', r'\1\2\3', inside)  # hyphen left inside a word
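
For illustration, the first substitution above can be tried in isolation. This is a minimal sketch, assuming the input is plain text in which a false paragraph break appears as a blank line; the names `CE_ANY` and `unwrap_after_ce_letter` are hypothetical and not part of the original script or of calibre.

```python
import re

# Accented Czech/Slovak letters taken from the first expression in the post.
CE_ANY = "áéíýúäôľščťžňďěřůÁÉÍÝÚĽŠČŤŽŇĎĚŘŮ"


def unwrap_after_ce_letter(text: str) -> str:
    """Join two paragraphs when the first one ends with an accented CE letter.

    Hypothetical helper: a blank line directly after such a letter is
    treated as a false paragraph break and replaced by a single space.
    """
    return re.sub(rf"([{CE_ANY}])\n\n", r"\1 ", text)


sample = "Prišiel domov neskoro večer a bol veľmi unavený\n\na hladný."
joined = unwrap_after_ce_letter(sample)
```

A paragraph that ends with a full stop is left alone, because the character before the blank line is not in the character class.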

PS: I am not able to build calibre from source :( (a dependency problem: poppler has to be configured with --enable-xpdf-headers). If I manage to do that, I can simply rewrite preprocess.py and post a patch.

Kovid Goyal (kovid) wrote on 2011-08-24:

You do not need to build calibre from source code, see instructions at http://manual.calibre-ebook.com/develop.html

helour (helourx) wrote on 2011-08-24:

Many thanks. I will try :)

John Schember (user-none) wrote on 2011-08-24:

Look at calibre.ebooks.conversion.preprocess.HTMLPreProcessor.PDFTOHTML. You will see a list of characters that are broken into components. For example: ä becomes ¨a. For the incomplete list in PDFTOHTML, you need to specify either which characters are missing from one of the character blocks or whether a character type is not represented at all. If it is not represented, I need to know how pdftohtml is breaking the character. The list of characters here is only for ones that pdftohtml reads improperly and outputs incorrectly.
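
As an illustration of the kind of repair described above, here is a minimal sketch. It is not calibre's actual PDFTOHTML list: the table `BROKEN_PAIRS` and the function `repair` are hypothetical names, and the caron entries only assume that pdftohtml splits those letters the same way it splits ä. That assumption would have to be confirmed against real pdftohtml output.

```python
import re

# Hypothetical table: (spacing diacritic, base letter) -> precomposed letter.
# Only the first entry comes from the example in this thread; the caron
# entries are assumptions about how Czech/Slovak letters might be split.
BROKEN_PAIRS = {
    ("¨", "a"): "ä",
    ("ˇ", "s"): "š",
    ("ˇ", "c"): "č",
    ("ˇ", "z"): "ž",
}

# One compiled rule per pair: the diacritic, optional whitespace, the base letter.
RULES = [
    (re.compile(re.escape(mark) + r"\s*" + re.escape(base)), fixed)
    for (mark, base), fixed in BROKEN_PAIRS.items()
]


def repair(text: str) -> str:
    """Replace each split diacritic + base letter with the single correct character."""
    for pattern, fixed in RULES:
        text = pattern.sub(fixed, text)
    return text
```

A table like this only helps for characters whose broken form is known, which is why examples of the actual output are needed.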

calibre.ebooks.conversion.utils.HeuristicProcessor.punctuation_unwrap works on line length and punctuation. In this case, adding the additional characters to the lookahead variable should work. I'll add the characters (ôľščťžňďěřů) that are referenced by your example and missing from the expression. That should fix this part of the issue.
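
To show why the extra characters matter, here is a simplified sketch of unwrapping based on line length and the last character of a line. It is not calibre's actual expression: `build_unwrap_pattern`, `unwrap` and the `extra_chars` parameter are hypothetical, and the real implementation handles many more cases.

```python
import re


def build_unwrap_pattern(length: int, extra_chars: str = "ôľščťžňďěřů"):
    """Build a simplified unwrap pattern (hypothetical, for illustration only).

    A line break is treated as a soft wrap when the line has more than
    `length` characters, ends in a lowercase letter (ASCII or one of
    `extra_chars`) or a comma, and the next line starts with a lowercase letter.
    """
    # Fixed-width lookbehind: `length` characters on the same line, then the
    # final character of the line. "." never matches a newline here.
    lookbehind = rf"(?<=.{{{length}}}[a-z{extra_chars},])"
    return re.compile(lookbehind + rf"\s*\n\s*(?=[a-z{extra_chars}])")


def unwrap(text: str, length: int, extra_chars: str = "ôľščťžňďěřů") -> str:
    """Replace every soft line break found by the pattern with a single space."""
    return build_unwrap_pattern(length, extra_chars).sub(" ", text)


wrapped = "Bol to veľmi dlhý deň a každý už chcel ísť\ndomov."
```

With the default `extra_chars`, the line ending in "ť" is joined to the next one. With `extra_chars=""` the same line is left wrapped, which mirrors the behaviour reported in this bug.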

calibre.ebooks.txt.unsmarten is specific to converting certain unicode characters to their representation in Textile format. Any changes here would also need to be reflected in calibre.ebooks.textile.functions.Textile. You would need to see what these characters would convert to in Textile.

helour (helourx) wrote on 2011-08-25:

Test files - unwrap problem (9.0 KiB, application/zip)

Thanks,

- the first problem (CE character at the end of a line):
re.sub(r'([ľščťžňďěřů])\n\n', r'\1 ', inside)
I have solved in the calibre sources. You don't need to add the missing characters. I will send you my patch.

- the second problem (CE character before a hyphen):
re.sub(r'([áéíýúäôľščťžňďěřů])-\n\n([a-záéíýúäôľščťžňďěřů])', r'\1\2', inside)
still persists; I will have to take a closer look at the sources.
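
The second case can be reproduced outside calibre with a small sketch. It reuses the expression quoted above unchanged; the names `CE_LOWER` and `dehyphenate` are hypothetical, and the input is assumed to be plain text where the false break appears as a blank line.

```python
import re

# Lowercase accented letters from the expression quoted in the post.
CE_LOWER = "áéíýúäôľščťžňďěřů"


def dehyphenate(text: str) -> str:
    """Rejoin a word split by a hyphen and a false paragraph break.

    Matches only when the letter before the hyphen is an accented CE letter,
    exactly as in the quoted expression.
    """
    return re.sub(rf"([{CE_LOWER}])-\n\n([a-z{CE_LOWER}])", r"\1\2", text)


sample = "Bolo to naozaj prekvapujú-\n\nce zistenie."
fixed = dehyphenate(sample)
```

A hyphen that follows a plain ASCII letter is not touched by this expression, so it covers only the CE-specific part of the problem.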

Please look at attached files.

helour (helourx) wrote on 2011-08-25:

calibre-0.8.15-patches.zip (5.4 KiB, application/zip)

I have made some patches. Please take a look at them.

John Schember (user-none) wrote on 2011-08-25:

Patches look fine. Merging them. Next time you submit a patch please use either the src or the root directory as the base path. This makes it easier to integrate the patches.

Changed in calibre:
status:	Triaged → Fix Committed

Kovid Goyal (kovid) wrote on 2011-08-25: Fixed in lp:calibre

Fixed in branch lp:calibre. The fix will be in the next release. calibre is usually released every Friday.

status fixreleased

Changed in calibre:
status:	Fix Committed → Fix Released
