calibre

Bug #822744
Comment #6

John Schember (user-none) wrote on 2011-08-24:

Look at calibre.ebooks.conversion.preprocess.HTMLPreProcessor.PDFTOHTML. You will see a list of characters that are broken into components. For example, ä becomes ¨a. The list in PDFTOHTML is incomplete, so you need to specify either which characters are missing from one of the character blocks, or whether a character type is not represented at all. If it is not represented, I need to know how pdftohtml is breaking the character. The list is only for characters that pdftohtml reads improperly and outputs incorrectly.
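To make the ä / ¨a example concrete, here is a minimal sketch of the general idea of recombining a spacing accent with the base letter that follows it. This is not the actual PDFTOHTML list: the names SPACING_TO_COMBINING and recompose are hypothetical, and the three accents shown are only assumptions about how pdftohtml might break a character.

```python
import re
import unicodedata

# Hypothetical mapping (not calibre's actual rules): spacing accent emitted
# before the base letter -> the matching Unicode combining mark.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # diaeresis
    "\u00b4": "\u0301",  # acute accent
    "\u02c7": "\u030c",  # caron
}

# A spacing accent, optional whitespace, then an ASCII base letter.
ACCENT_THEN_LETTER = re.compile(
    "([" + "".join(SPACING_TO_COMBINING) + r"])\s*([A-Za-z])"
)


def recompose(text):
    """Rejoin 'accent + letter' pairs into a single precomposed character."""
    def fix(match):
        accent, letter = match.group(1), match.group(2)
        # NFC normalization merges letter + combining mark where possible.
        return unicodedata.normalize("NFC", letter + SPACING_TO_COMBINING[accent])
    return ACCENT_THEN_LETTER.sub(fix, text)


print(recompose("M\u00a8adchen"))
```

A report of a missing character would then amount to saying which accent and letter pair shows up in the pdftohtml output and which single character it should become.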

calibre.ebooks.conversion.utils.HeuristicProcessor.punctuation_unwrap works on line length and punctuation. In this case, adding the additional characters to the lookahead variable should work. I'll add the characters (ôľščťžňďěřů) that are referenced by your example and missing from the expression. That should fix this part of the issue.
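The effect of extending the lookahead can be sketched as follows. This is a simplified illustration, not the expression used in punctuation_unwrap: it ignores line length entirely, and build_unwrap and BASE_LOWER are hypothetical names chosen for the example.

```python
import re

# Assumption: a deliberately small base set of lowercase letters.
BASE_LOWER = "a-z"
# The characters from the report that were missing from the expression.
EXTRA_LOWER = "ôľščťžňďěřů"


def build_unwrap(lower_chars):
    """Return a function that joins a line break between two lowercase letters.

    The lookbehind checks how the line ends (a letter or a comma, so no
    sentence-ending punctuation); the lookahead checks that the next line
    starts with a lowercase letter.
    """
    pattern = re.compile(r"(?<=[%s,])\n(?=[%s])" % (lower_chars, lower_chars))
    return lambda text: pattern.sub(" ", text)


unwrap_ascii = build_unwrap(BASE_LOWER)
unwrap_extended = build_unwrap(BASE_LOWER + EXTRA_LOWER)

sample = "pes\nštěká.\nKonec"
print(repr(unwrap_ascii(sample)))     # break before 'š' is left alone
print(repr(unwrap_extended(sample)))  # break before 'š' is joined
```

With only a-z in the lookahead, a line starting with š is never treated as a continuation, which matches the symptom in the report. Once the extra characters are included, the first break is unwrapped while the break after the full stop is kept.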

calibre.ebooks.txt.unsmarten is specific to converting certain Unicode characters to their representation in Textile format. Any changes here would also need to be reflected in calibre.ebooks.textile.functions.Textile. You would need to see what these characters convert to in Textile.
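For a sense of what "representation in Textile format" means, here is a minimal sketch using a few well-known Textile shorthands. It is not the contents of unsmarten: the table TEXTILE_EQUIVALENTS and the function to_textile_markup are hypothetical, and only these five characters are covered.

```python
# Hypothetical table: typographic character -> plain-text Textile shorthand.
TEXTILE_EQUIVALENTS = {
    "\u2014": "--",    # em dash
    "\u2026": "...",   # horizontal ellipsis
    "\u00a9": "(c)",   # copyright sign
    "\u00ae": "(r)",   # registered sign
    "\u2122": "(tm)",  # trade mark sign
}


def to_textile_markup(text):
    """Replace each listed character with its Textile shorthand."""
    for char, shorthand in TEXTILE_EQUIVALENTS.items():
        text = text.replace(char, shorthand)
    return text


print(to_textile_markup("calibre\u2122 \u2026"))
```

Accented letters such as š or ř have no shorthand of this kind, which is why each new character needs to be checked against what Textile itself would produce.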