Error in reflowing of lines from PDFs

Bug #2089436 reported by Seb Bacon
This bug affects 1 person
Affects   Status         Importance   Assigned to      Milestone
calibre   Fix Released   Undecided    Alan Pettigrew

Bug Description

Given the attached test PDF, when running with the pdftotext engine:

    ebook-convert example.pdf output.epub --pdf-engine=pdftotext

I get the following, as expected:

    Patterned chiffon," she sang it off. "You must love to craft how to
    galvanise the right smell.”

But when I use the calibre engine:

    ebook-convert example.pdf output.epub --pdf-engine=calibre

I get:

    right Patternedchiffon," shesang it off. "You mustlove to craft howto
    galvanise the smell.”

Seb Bacon (seb-bacon) wrote :

Kovid Goyal (kovid) wrote :

Changing the component for this bug.

Changed in calibre:
assignee: nobody → Alan Pettigrew (linux-k)
status: New → Triaged

Alan Pettigrew (linux-k) wrote :

That is a tricky one to decode.
I see:
right Patterned chiffon,” she sang it off. “You must love to craft how to galvanise the smell.”
i.e. the spaces between words are there.

We have:
    <text top="822" left="615" width="71" height="25" font="0">galvanise</text>
    <text top="827" left="686" width="5" height="16" font="1"> </text>
    <text top="822" left="690" width="24" height="25" font="0">the</text>
    <text top="843" left="108" width="37" height="25" font="0">right</text>
    <text top="848" left="144" width="5" height="16" font="1"> </text>
    <text top="843" left="149" width="53" height="25" font="0">smell.”</text>

So top=822 plus height=25 puts the bottom of the first line, and the expected start of the next line, at 847.
The space at top=827 plus height=16 ends at 843, so it lies within the first line.
But the second line starts at top=843, which is less than 847, so it could also be read as part of the first line (as a subscript).
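
The arithmetic above can be reproduced with a short Python sketch. This is only an illustration, not calibre's code: the `<page>` wrapper element and the `vertical_extents` helper are assumptions added here so the fragment parses as a standalone XML document.

```python
import xml.etree.ElementTree as ET

# The six <text> elements quoted above, wrapped in a hypothetical
# <page> root so that the snippet is well-formed XML on its own.
SNIPPET = """<page>
<text top="822" left="615" width="71" height="25" font="0">galvanise</text>
<text top="827" left="686" width="5" height="16" font="1"> </text>
<text top="822" left="690" width="24" height="25" font="0">the</text>
<text top="843" left="108" width="37" height="25" font="0">right</text>
<text top="848" left="144" width="5" height="16" font="1"> </text>
<text top="843" left="149" width="53" height="25" font="0">smell.”</text>
</page>"""


def vertical_extents(xml_text):
    """Return (text, top, bottom) for each <text> element, bottom = top + height."""
    root = ET.fromstring(xml_text)
    extents = []
    for el in root.iter("text"):
        top = int(el.get("top"))
        bottom = top + int(el.get("height"))
        extents.append((el.text, top, bottom))
    return extents


extents = vertical_extents(SNIPPET)
for text, top, bottom in extents:
    print(f"{text!r:12} top={top} bottom={bottom}")

# The first line ends at 847, but "right" already starts at 843,
# so a plain "does the next top fall before the previous bottom?" test
# would treat the second line as a continuation of the first.
first_bottom = extents[0][2]
second_line_top = extents[3][1]
print(second_line_top < first_bottom)
```

The final comparison is the overlap that makes the two lines look like one.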

It appears that the 2nd line overlaps the first, so is this 2 separate lines, or 1 oddly positioned one?

I don't know whether extending the test for 'where is the next top' to fix this will cause problems in the more general case. I will do some testing.

Isn't PDF formatting wonderful!

Charles Haley (cbhaley) wrote :

The superscripted spaces are "inside" their containing line. Their "top" is greater than the top of the nearby text and their "bottom" (top + height) is less than the bottom of the nearby text. If a text line is defined by two borders, top and top + max(height), then it seems that any text that is inside those borders is part of that line, offset vertically.
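
A minimal Python sketch of that containment idea, using the top and height values from the XML quoted earlier in this thread. It is not the fix that went into calibre; `group_lines` is a hypothetical helper, and it assumes that fragments arrive in document order and that the first fragment of each line has the full line height.

```python
def group_lines(fragments):
    """Group (text, top, height) fragments into lines.

    A fragment joins the current line only if it lies entirely between
    that line's top and bottom borders; otherwise it starts a new line.
    """
    lines = []
    for text, top, height in fragments:
        bottom = top + height
        if lines:
            current = lines[-1]
            if current["top"] <= top and bottom <= current["bottom"]:
                current["text"] += text
                continue
        lines.append({"top": top, "bottom": bottom, "text": text})
    return [line["text"] for line in lines]


# (text, top, height) taken from the <text> elements in the earlier comment.
FRAGMENTS = [
    ("galvanise", 822, 25),
    (" ", 827, 16),
    ("the", 822, 25),
    ("right", 843, 25),
    (" ", 848, 16),
    ("smell.”", 843, 25),
]

print(group_lines(FRAGMENTS))
```

Under these assumptions the offset spaces stay with their own line, while "right" (top 843, bottom 868) is not contained in the first line (822 to 847) and so starts a new one. What else might fall between those borders in other PDFs is not covered by this sketch.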

I admit that I have no idea what else might fall between those borders.

And why are they using superscripted spaces in the first place?

Kovid Goyal (kovid) wrote :

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

Changed in calibre:
status: Triaged → Fix Released