Evince doesn't handle columns properly

Bug #33288 reported by Pascal de Bruijn
232
This bug affects 29 people
Affects Status Importance Assigned to Milestone
Poppler
Unknown
Medium
poppler (Ubuntu)
Fix Released
Medium
Unassigned
Lucid
Fix Released
Medium
Unassigned

Bug Description

So, now that RC is here, let's propose it as an SRU.
I've pushed it in lucid-proposed. The debdiff is poppler_0.12.4-0ubuntu4_2_0.12.4-0ubuntu5.debdiff attached there for information. I'm removing old debdiff to avoid confusion.

poppler (0.12.4-0ubuntu5) lucid-proposed; urgency=low

  * debian/patches/11_column_selection.patch:
    - backport from upstream git commit to fix wrong selection in pdf when
      containing tables, long text, broken flow and so on.
      (fixing most of known issues with selection in pdf) (LP: #33288)

----------------------------
----------------------------

When making a multi column selection from a PDF like this:

http://www.specialist-games.com/mordheim/assets/lrb/1Rules.pdf

And pasting the result into OpenOffice.org the columns are not maintained. The results unusable because the text from both columns becomes mixed.

Please note, this is not a PDF problem, using Adobe Acrobat Reader 7.x under Windows does properly copy-paste columned text over to OpenOffice.org.

Regards,
Pascal de Bruijn

Revision history for this message
In , Steve Fox (drfickle2) wrote :
Revision history for this message
Pascal de Bruijn (pmjdebruijn) wrote :

When making a multi column selection from a PDF like this:

http://www.specialist-games.com/mordheim/assets/lrb/1Rules.pdf

And pasting the result into OpenOffice.org the columns are not maintained. The results unusable because the text from both columns becomes mixed.

Please note, this is not a PDF problem, using Adobe Acrobat Reader 7.x under Windows does properly copy-paste columned text over to OpenOffice.org.

Regards,
Pascal de Bruijn

Revision history for this message
Sitsofe Wheeler (sitsofe) wrote : Openoffice document with multicolumn PDF pastes from evince and acrobat

I can confirm this.

I have attached an Openoffice document demonstrating the difference in paste techniques between evince and Acrobat 7.x on Ubuntu (as installed from multiverse).

As Pascal said, evince's output is not as neat as or coherent as Acrobat 7.x's.

Revision history for this message
Sebastien Bacher (seb128) wrote :
Changed in evince:
assignee: nobody → desktop-bugs
Changed in poppler:
status: Unconfirmed → Confirmed
Revision history for this message
Pascal de Bruijn (pmjdebruijn) wrote :

It seems 'pdftotext' can properly handle columns.

Revision history for this message
In , Daniel Holbach (dholbach) wrote :
Changed in poppler:
status: Unconfirmed → Confirmed
Changed in poppler:
status: Confirmed → Triaged
Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

I took a look at this today because its pretty old and the downstream evince bug has a lot of dupes:
http://bugzilla.gnome.org/show_bug.cgi?id=325189

Some of the comments in the evince bug about mouseover modes/images are irrelevant, since the selection is just using poppler_page_get_text anyway.

After reading the code, I am wondering if the bug is caused by TextPage::visitSelection iterating over blocks directly, rather than using TextFlow order?
http://cgit.freedesktop.org/poppler/poppler/tree/poppler/TextOutputDev.cc?id=e3e4113c73128f49f99289b592446d4382b5d65c#n3898

So the fix might be fairly simple, just to iterate over blocks in flow order in TextPage::visitSelection, and only use the selection rect when visiting the first and last selected block - for intermediate blocks, the whole block should be selected so PDFRectangle passed in should be the block's own bbox.

I might have a go at a patch along these lines, but if its the wrong approach let me know before I go too far.

Revision history for this message
Waldir Leoncio (wleoncio) wrote :

I can also confirm this. It's very annoying having to reopen a document on Adobe Acrobat for Linux whenever I realize it has multiple columns.

Revision history for this message
mercutio22 (macabro22) wrote :

This sux. Also, It'd be cool to open PDFs inside Firefox just like when using Adobe's program.

Revision history for this message
Anton Goloborodko (antalar) wrote :

I confirm it.
Evince doesn't work properly with two-column documents. If I select some text in one column and there is a line break in another one, text is not selected properly.

See attachment.

Revision history for this message
In , David S. (d-sylva) wrote :

Brian Ewins: did you ever create/submit the patch you proposed? Though I know nothing about Poppler, and thus can't comment on whether it's the "right" way to fix the bug, any solution would be better than none. This major bug has gone unfixed for way too long.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=23584)
UNTESTED fix for the bug.

David: I hadn't and now I feel guilty :) Not at a machine I can build poppler on right now but I've typed up the change I intended; ymmv, may not even compile. Hope this nudges things along a bit.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=23585)
fix typo in patch

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=23691)
a patch that actually compiles

While this is still untested (jhbuild of evince is failing for me at the moment), at least the patched poppler compiles with no new warnings. I'll get this tested asap but posting this so no-one wastes time on the previous patch.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

I managed to test this a bit today, and for what I was able to test, can confirm that the patch works.

With the latest GnomeDeveloperKit iso, jhbuild of the Gnome 2.26 moduleset, I couldn't get evince to open pdfs - dbus issue I think - so I moved on to trying epdfview 0.1.7 with the jhbuilt poppler. That epdfview was able to select but not copy, (looked like an epdfview bug) but selection in multicolumn docs exhibits the bug described here and in bug 4006, with the selection expanding in odd ways. So I was able to go on and try the patch.

I patched the copy of poppler that had been built (0.10.2); the patch above doesn't apply to that older tree but its just a matter of replacing TextPage::visitSelection with the code in the patch I posted. Rebuilt, ran epdfview, hey presto - the selection predictably follows the column text, which appears to fix this and bug 4006. I'll leave a comment over there too, in case anyone downstream wants to take a look and test it better.

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

Hi Brian, thank you very much for the patch, this has been a very important issue for a long time now, with lots of duplicates.

The patch works great with evince, the selection follows columns perfectly. However, the selected text is not correcly pasted. See this screenshot:

http://people.freedesktop.org/~carlosgc/ev-selection-columns.png

The text on the second column is pasted before.

I'm looking forward to seeing this finally fixed. Thanks!.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Thanks Carlos, looks like this just fixes bug 4006 so far then. I'll dig into that further later today, seems like we're close.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Ok the problem appears to be that TextSelectionDumper::getText does a top-down qsort on the fragments it was passed, when they were in reading order already. The qsort here is just a relic from before the 'flow' code centralized reading-order decisions, and getText should more closely resemble the section commented 'output the page "undoing" the layout' in TextPage::dump, complicated a little by the visitor infrastructure. This looks like I can wrap it up in a unit test.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=24739)
prevent re-sorting of selection fragments

I managed to test this one with evince. Carlos, your screenshot in comment #9 shows what I presume was the intended old behaviour: its actually pasting the text with its layout intact; I'd agree this isn't what I'd want as a user. With this second patch applied, the selected parts of each column are presented in reading order instead of side-by-side.

I think the output has excess whitespace - the columns should be directly below each other instead of leaving space for previous columns. Also it might also be better if TextPage::dump effectively selected the whole page and dumped its text by visiting the selection, to avoid duplication of this logic. However, I went for the simple fix first.

This seems to work except for those cases where poppler's reading-order heuristics break down, eg, on paragraphs that flow around a diagram that contains text labels (the labels are viewed as part of the para because the fall within the flow bbox, I think), pasting tables (which unlike text columns, are best kept tabular), and most likely on rtl text.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=24748)
fix typo

Oops. Replacing that last patch.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=24749)
a more agressive fix, paste unformatted text

In this version of the patch, I remove the formatting entirely from text selected in reading order, except for line breaks. This is closer to what is done by Acrobat: there there are several selection modes. 'Text' selection mode produces unformatted text in reading order; 'Table' selection mode corresponds to TextPage::getText(x1,y1,x2,y2) - formatted text from within a rectangular selection. The poppler code previously confused the two.

I've got a couple of cleanup patches to follow on from this, to reduce duplication.

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

Hi Brian,

I've noticed other problem with your patch. See this screenshot:

http://people.freedesktop.org/~carlosgc/poppler-selection-problem.png

It's always reproducible with the hig document (http://developer.gnome.org/projects/gup/hig/2.0/hig-2.0.pdf) and evince.

Thanks!

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=24922)
replacement patch

Thanks for checking that one Carlos. It seems the logic I'd used to find the end of the selection was working in part by chance. Here's a different approach, which should be more deterministic.

As before, we find the first block in reading order whose bottom right is below and right of either of the two corner points of the selection; start selecting text here. The other corner becomes the 'stop point'.

Expand the selection from there until we find a block containing the stop point. If there is no such block, the selection ends before the first block that is below the stop point (the 'low_block' in the code).

This makes a much better job of the gnome hig doc, and works fine on the multicolumn docs I've tried (at least, when poppler determines reading order correctly).

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=24928)
sigh - fixed again

I'd introduced a small bug in the last one tidying up the patch. There should never be more than two partially selected blocks (the first and last).

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

Brian, it seems there are still problems with the bullets. See:

http://people.freedesktop.org/~carlosgc/poppler-selection-problem2.png

This document is available here:

http://research.sun.com/techrep/1994/smli_tr-94-29.pdf

And even with the hig there is still a weird behaviour. See:

http://people.freedesktop.org/~carlosgc/poppler-selection-problem3.png

Note that only the first two bullets symbols are selected.

Thanks!

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=24938)
further improvements

With the previous version of the patch, when the stop point lay outside any block, the choice of stop block was sometimes unnatural - this is very noticeable when doing a multicolumn select and moving the mouse into whitespace between paragraphs in the rightmost column.

In this latest version of the patch, I've improved this quite a bit, so the selection expands much more smoothly. The 'natural' block to select is found like this:
1. if there are blocks directly above the stop point, use the lowest one that is still above the stop point
2. otherwise, find the nearest column, and choose the lowest block in that column that is still above the stop point.

Step 1 selects the blocks you select when dragging down a column. Step 2 guesses which block you meant when you drag outside a column. Occasionally step 2 goes wrong: drag past the end of a narrow paragraph, and it will no longer be selected. However you can always get the desired effect, of selecting this whole para, by moving the mouse under the para. Its not ideal but it is reasonably discoverable, and could be fixed by better column identification. Its way better than the old behaviour anyway.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

We've overlapped there Carlos, but the latest version doesn't fix those two issues.

They are both down to problems with the column identification in poppler (a bit outside the scope of this bug, but there's more robust algorithms in ocropus).

In the hig, poppler does not correctly identify the bullets as being part of their paragraphs, and then does not think that all the bullets form a single column; hence what you see. (this is a bit clearer when you look at the text you get from a paste.)

In the sun report, if you drag down in the first column, you'll see that the second column gets highlighted as soon as you hit those bulleted items: this is because poppler thinks that the bulleted items are after the right hand column in reading order!

I do think there's a problem with my code here though with the way the selection behaves when the start or end point are in those bullet pointed items. They should definitely always select when the mouse moves over them, but it seems to depend on the starting point of the selection. I'll take a look at that.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=24983)
another stab at it

Here's a new version. Whats different this time is how much better the selection follows the mouse. For example, if you click in the margin then drag into the text, the selection is just the lines you have moused across; current evince grabs either the whole beginning or whole end of the paragraph.

I've got rid of all of the 'flashing' I could see - where the selection suddenly expands then contracts again as you move - except for those instances where its caused by mistakes in the reading order; the hig document above is a good example of that. In order to do this I had to fix up TextBlock::visitSelection as well, since it often didn't find the end line and would suddenly cause a whole para to be selected (and then deselected again a few pixels further)

I don't think this can be pushed much further without fixing up the reading order issues. But that's a pile of work ( the algorithm I'm considering is here: http://pubs.iupr.org/cache/2002-breuel-das.pdf.dir/m-p-00001.html ), although maybe the current code could just be tweaked to deal better with these bullet lists. I'd rather not sink time into that if there's no chance of this getting in.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=25026)
simpler patch

This behaves much the same as the previous patch, but the algorithm is simpler and there's a lot less code, so it should be easier to review.

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

(In reply to comment #22)
> Created an attachment (id=25026) [details]
> simpler patch
>
> This behaves much the same as the previous patch, but the algorithm is simpler
> and there's a lot less code, so it should be easier to review.
>

Does this patch depend on any of the previous patches? it applies cleanly on master, but it doesn't build:

TextOutputDev.cc: In member function 'void TextPage::visitSelection(TextSelectionVisitor*, PDFRectangle*, SelectionStyle)':
TextOutputDev.cc:3980: error: 'begin' was not declared in this scope
TextOutputDev.cc:3980: error: 'end' was not declared in this scope
TextOutputDev.cc: In function 'void TextOutputDev_outputToFile(void*, char*, int)':

I'm sorry, but I don't have time to debug it now.

Thanks!

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Sorry about that Carlos, must've messed up again transferring the patch from the VM. Yes, it was supposed to apply on its own and compile. This line:
+ visitor->visitBlock (this, begin, end, selection);
Should be:
+ visitor->visitBlock (this, best_block[start], best_block[stop], selection);

But when I'm back tomorrow I'll respin the patch and make sure its the right one. Still thinking about how to improve the reading order.

Revision history for this message
Amr Bekhit (amrbekhit) wrote :

I have a similar problem. When trying to select a paragraph of text in a multicolumn document, evince selects text from other columns too and so the whole lot is copied. I am using Ubuntu 9.04 with Evince Document Viewer 2.26.0.

You can try the bug using the attached PDF. Try selecting any text in a column and the other columns will be selected. I know this is not a problem with the PDF as text selection works fine using Foxit Reader for Linux.

All the PDF documents I read are papers which are written using multiple columns, so this is a very frustrating bug for me!

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

Hey Brian, any progress on this? it would be really nice to have this finally fixed for poppler 0.12

Thanks!

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Sorry - have been struggling to find free time to tackle this, though I should really have posted the corrected patch by now. I'll try to push things on in the next couple of days. I think this might need some discussion on-list before committing, since I won't have a fix for reading order errors by the June 8 code freeze - is this worth taking without further improvements there?

Revision history for this message
Andres Monroy-Hernandez (andresmh) wrote :

I'm also experiencing this problem. Looking forward to having a solution for this problem that has been around for a couple of years now!

Revision history for this message
Mark Emerick (ciremem) wrote :

Yes. It is very difficult to untangle the desired text from the 'selected' text. This is a basic issue that I'm sure all would like resolved. Thanks.

Revision history for this message
In , Nicolò Chieffo (yelo3) wrote :

Brian, have you got any news on this patch?
It would be a real improvement!

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

Hi Brian,

I've been looking at your patch again and even after applying the change you mentioned in comment #24 it still doesn't build. Could you attach the last version of the patch that worked for you, please? If you are busy, we'll try to finish your great work, but we need a working patch first.

Thanks!

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=30821)
patch against 0.10.2

Sorry for being away for so long. This patch is exactly what I was using previously, hopefully without any errors this time; built and works for me on the gnome dev kit based on poppler 0.10.2, but this time diff generated directly against git rather than transferring back to the host vm first, so there's less chance this is messed up. A patch that applies to current poppler follows.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created an attachment (id=30822)
patch against 0.10.2

Agh...the previous patch was the one rebased against master: poppler-0.12.0-46-g7670cc4. This one is the old pristine patch in case it is needed.

This does copy & paste fine for me, but as I've previously mentioned 2 other fixes are needed: the code needs to take account of the primary writing direction on the page, shouldn't be too hard as this is already recorded; and the column selection needs improved, much harder.

description: updated
tags: added: patch
Nigel Babu (nigelbabu)
tags: added: patch-needswork
removed: patch
Nigel Babu (nigelbabu)
tags: added: patch-accepted-upstream
removed: patch-needswork
Changed in poppler (Ubuntu):
assignee: Ubuntu Desktop Bugs (desktop-bugs) → nobody
status: Triaged → Fix Committed
Changed in poppler (Ubuntu):
assignee: nobody → Didier Roche (didrocks)
status: Fix Committed → In Progress
Changed in poppler (Ubuntu):
assignee: Didier Roche (didrocks) → nobody
status: In Progress → Triaged
description: updated
Steve Kowalik (stevenk)
tags: added: verification-needed
Changed in poppler (Ubuntu Lucid):
status: Triaged → Fix Committed
76 comments hidden view all 156 comments
Revision history for this message
Martin Erik Werner (arand) wrote :

Do note that this bug and fix only concerns column "selection flow handling" e.g. the common issue that selecting one column will select the other simultaneously.

The (also very common) issue of selected text disappearing and displaying random characters is a separate issue NOT addressed by this patch.

The correct one for that issue is Bug #39321 which IS correctly marked as not fixed.

Revision history for this message
Mark Emerick (ciremem) wrote : Re: [Bug 33288] Re: Evince doesn't handle columns properly

This document does have the column selection issue. Each row of four numbers
in A1, ..., A5 behaves as a table that includes some of the text above it as
well as the single-character line below.

There is an interesting behavior in slowly selecting one character at a time
from left to right along the row of numbers. When you reach the space
between each entry (say between "1) 30" and "2) 60" in A1), the selection
skips to the line above, then skips back down when the next number is
scanned.

On Fri, May 7, 2010 at 7:54 AM, arand <email address hidden> wrote:

> Do note that this bug and fix only concerns column "selection flow
> handling" e.g. the common issue that selecting one column will select
> the other simultaneously.
>
> The (also very common) issue of selected text disappearing and
> displaying random characters is a separate issue NOT addressed by this
> patch.
>
> The correct one for that issue is Bug #39321 which IS correctly marked
> as not fixed.
>
> --
> Evince doesn't handle columns properly
> https://bugs.launchpad.net/bugs/33288
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Poppler: Confirmed
> Status in “poppler” package in Ubuntu: Triaged
> Status in “poppler” source package in Lucid: Fix Committed
>
> Bug description:
> So, now that RC is here, let's propose it as an SRU.
> I've pushed it in lucid-proposed. The debdiff is
> poppler_0.12.4-0ubuntu4_2_0.12.4-0ubuntu5.debdiff attached there for
> information. I'm removing old debdiff to avoid confusion.
>
> poppler (0.12.4-0ubuntu5) lucid-proposed; urgency=low
>
> * debian/patches/11_column_selection.patch:
> - backport from upstream git commit to fix wrong selection in pdf when
> containing tables, long text, broken flow and so on.
> (fixing most of known issues with selection in pdf) (LP: #33288)
>
>
> ----------------------------
> ----------------------------
>
> When making a multi column selection from a PDF like this:
>
> http://www.specialist-games.com/mordheim/assets/lrb/1Rules.pdf
>
> And pasting the result into OpenOffice.org the columns are not maintained.
> The results unusable because the text from both columns becomes mixed.
>
> Please note, this is not a PDF problem, using Adobe Acrobat Reader 7.x
> under Windows does properly copy-paste columned text over to OpenOffice.org.
>
> Regards,
> Pascal de Bruijn
>
>
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/poppler/+bug/33288/+subscribe
>

Revision history for this message
Dmitry Shachnev (mitya57) wrote :

arand wrote:
The correct one for that issue is Bug #39321 which IS correctly marked as not fixed.

I commented there at first, but it is written in that bug description: "This bug should only be about the disappearing-text-on selection issue ("bug a" below), for the other selection issues see Bug #33288", so I commented here too :)

Revision history for this message
Dmitry Shachnev (mitya57) wrote :

Also it behaves interesting when I try to select the text in the first frame.

Revision history for this message
Dmitry Shachnev (mitya57) wrote :

Pascal de Bruijn wrote:
>I just tried that document too... But I noticed it was generated by Cairo... which is a graphical library, not a text processor...

If you are interested, here is the original file: http://window.edu.ru/window_catalog/redir?id=59183&file=%D0%98%D0%9D%D0%A4_%D0%B4%D0%B5%D0%BC%D0%BE_2009.pdf

Revision history for this message
Brian Ewins (brian-ewins) wrote :

Mark: what's going on there is that the bounding boxes of the (1) (2) (3) ... text regions are tightly around the text you can see, and when the mouse is between them, its not over any text region at all. Hence, the best guess of what you were trying to select is the nearest text region (by manhattan distance), which will be the one above or below the current line, because these columns are very short and widely spaced. When the mouse is close to what you intended to select, and not in the middle of blank space, this usually does the right thing. Acrobat makes the same guess.

Revision history for this message
Dmitry Shachnev (mitya57) wrote :

One more screenshot displaying page 15 of original file (http://window.edu.ru/window_catalog/redir?id=59183&file=%D0%98%D0%9D%D0%A4_%D0%B4%D0%B5%D0%BC%D0%BE_2009.pdf, page 15, C4).

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

Hi Marek, thank you very much for the patch, do you have a test case where your patch makes a difference to try with?

Revision history for this message
Mark Emerick (ciremem) wrote : Re: [Bug 33288] Re: Evince doesn't handle columns properly

Yeah. It's a bugger. That's reasonable behavior. I imagine it's
impracticable to include the history of the mouse trajectory in your the
guess of which line to select on. Anyway. This version is just an enormous
improvement, so thanks to you all.

On Fri, May 7, 2010 at 10:02 AM, Brian Ewins <email address hidden> wrote:

> Mark: what's going on there is that the bounding boxes of the (1) (2)
> (3) ... text regions are tightly around the text you can see, and when
> the mouse is between them, its not over any text region at all. Hence,
> the best guess of what you were trying to select is the nearest text
> region (by manhattan distance), which will be the one above or below the
> current line, because these columns are very short and widely spaced.
> When the mouse is close to what you intended to select, and not in the
> middle of blank space, this usually does the right thing. Acrobat makes
> the same guess.
>
> --
> Evince doesn't handle columns properly
> https://bugs.launchpad.net/bugs/33288
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Poppler: Confirmed
> Status in “poppler” package in Ubuntu: Triaged
> Status in “poppler” source package in Lucid: Fix Committed
>
> Bug description:
> So, now that RC is here, let's propose it as an SRU.
> I've pushed it in lucid-proposed. The debdiff is
> poppler_0.12.4-0ubuntu4_2_0.12.4-0ubuntu5.debdiff attached there for
> information. I'm removing old debdiff to avoid confusion.
>
> poppler (0.12.4-0ubuntu5) lucid-proposed; urgency=low
>
> * debian/patches/11_column_selection.patch:
> - backport from upstream git commit to fix wrong selection in pdf when
> containing tables, long text, broken flow and so on.
> (fixing most of known issues with selection in pdf) (LP: #33288)
>
>
> ----------------------------
> ----------------------------
>
> When making a multi column selection from a PDF like this:
>
> http://www.specialist-games.com/mordheim/assets/lrb/1Rules.pdf
>
> And pasting the result into OpenOffice.org the columns are not maintained.
> The results unusable because the text from both columns becomes mixed.
>
> Please note, this is not a PDF problem, using Adobe Acrobat Reader 7.x
> under Windows does properly copy-paste columned text over to OpenOffice.org.
>
> Regards,
> Pascal de Bruijn
>
>
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/poppler/+bug/33288/+subscribe
>

Revision history for this message
In , Marek Kašík (mkasik) wrote :

Created an attachment (id=35559)
reproducer for table cell overlapping

Hi Carlos,

this is a reproducer of the problem with overlapping cells. I can not post the original one, so I created this one. Select the Headline 1, Text 1, Headline 2 and Text 2. Paste it somewhere then and you'll see the problem.
Now I see that it is not problem of not checking for overlapping but wrong check of overlapping. I'll post an updated patch (which do the same, but sooner in the code). The actual code doesn't check for overlapping of lower left cell and upper right cell in x and y.

Regards

Marek

Revision history for this message
In , Marek Kašík (mkasik) wrote :

Created an attachment (id=35560)
a patch improving checking of overlapping of cells

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

(In reply to comment #55)
> Created an attachment (id=35560) [details]
> a patch improving checking of overlapping of cells

pushed to git master, thank you very much!

Revision history for this message
Ali Shtarbanov (ametedinov) wrote :

Here is another two column PDF which I am also having selection problems with. http://www3.lehigh.edu/academics/catalog/printable.asp (click the "Lehigh University Course catalog" link to open it)

I am using the default version of evince on Ubuntu 10.04 - version 2.30.0, so can someone post a link to the LATEST package with fixes. (A lot of links to patches and packages were posted above so I don't really know which one is the latest)

Revision history for this message
Pascal de Bruijn (pmjdebruijn) wrote :

Just enable Lucid Proposed (in your Updates Sources) in Synaptic.

Revision history for this message
Christophe Sauthier (christophe.sauthier) wrote :

The proposed package is acting correctly on my system.

Revision history for this message
tshirtman (gabriel-pettier) wrote :

Works for me too :). Seems there are still a few border cases, but way better than before :).

Martin Pitt (pitti)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package poppler - 0.12.4-0ubuntu5

---------------
poppler (0.12.4-0ubuntu5) lucid-proposed; urgency=low

  * debian/patches/11_column_selection.patch:
    - backport from upstream git commit to fix wrong selection in pdf when
      containing tables, long text, broken flow and so on.
      (fixing most of known issues with selection in pdf) (LP: #33288)
 -- Didier Roche <email address hidden> Mon, 19 Apr 2010 16:41:30 +0200

Changed in poppler (Ubuntu Lucid):
status: Fix Committed → Fix Released
Changed in poppler (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

I've just added a selections demo to poppler-glib-demo, I hope it makes easier to test selections.

Changed in poppler:
importance: Unknown → Medium
Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

(In reply to comment #42)
> Created an attachment (id=31406) [details]
> 3/5 reading order (bug fixed)

It seems this bug introduced an important performance regression, see the detailed analysis in poppler mailing list:

http://lists.freedesktop.org/archives/poppler/2010-October/006566.html

Revision history for this message
In , Julian Sikorski (belegdol) wrote :

Correctness trumps performance, right?

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

It should be possible to improve this substantially. When I wrote the patch I was being very conservative with the existing poppler data structures, so essentially that method is traversing an unordered list. If the block list was in isBeforeByRule1 order most of those comparisons would go away. I can't remember if this would break clients wanting access to the text in physical order-it's been a while since I looked at the code and I'm reading this on a phone. Can take a deeper look tomorrow.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created attachment 40056
patch to fix performance regression

Here's what I've got so far. On my very slow VM, this renders the paris bus map reported on the mailing list in ~15.2s, compared to ~60s without the patch. YX-sorting alone got the time down to ~16.2s. Rendering on other documents is as fast as ever.

Almost all of the time rendering the bus map is prior to the sort, so there must still be some quadratic algorithms in there unrelated to reading order. There is one obvious fix on my list that I didn't implement (track the first unvisited block, start loops there) but I don't think this will make much difference for the effort it requires.

I'll be offline until Monday 8 Nov, but I'd be grateful if some more eyes could look at this to make sure I haven't regressed anything.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created attachment 40061
improved patch

Had another look and tidied the code a bit removing repeated page orientation checks, and a redundant test for overlap in rule(2). This is noticeably faster rendering the bus map. (down to ~14.8s)

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

Can you please attach a patch without unnecessary spacing changes like

- before = gTrue;
+ before = gTrue;

Thanks :-)

Revision history for this message
In , Dsheil (dsheil) wrote :

(In reply to comment #62)
> Created an attachment (id=40061) [details]
> improved patch
>
> Had another look and tidied the code a bit removing repeated page orientation
> checks, and a redundant test for overlap in rule(2). This is noticeably faster
> rendering the bus map. (down to ~14.8s)

I tested this in glib. It improved rendering for me significantly for the PDF's prone to be affected. I tested other PDFs as well and didn't notice anything.

I did a little bit of testing with test selection as well, but not as much, text selection seemed OK.

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

Created attachment 40124
patch without extraneous whitespace changes

Oops! Ok, here's the patch without the whitespace changes.

Revision history for this message
In , Dsheil (dsheil) wrote :

(In reply to comment #65)
> Created an attachment (id=40124) [details]
> patch without extraneous whitespace changes
>
> Oops! Ok, here's the patch without the whitespace changes.

Tested with the new whitespace patch - renders the map PDFs much faster in cairo, other PDFs aven't changed much.

Revision history for this message
In , Marek Kašík (mkasik) wrote :

There are also 2 nested "for" cycles in the block preceded by this comment:

  /* set extended bounding boxes of all other blocks
   * so that they extend in x without hitting neighbours
   */

I'm working on an optimization of it.

Marek

Revision history for this message
In , Marek Kašík (mkasik) wrote :

Created attachment 40308
optimization of search for tables

Hi,

this patch improves efficiency of searching for blocks which belong to a table. In the worst case it is still quadratic but it should be O(n*sqrt(n)) in average case.
It creates a list of all y coordinates and then sort it. After that it goes through this list and for each y coordinate which begins a block it starts a local while cycle. This while cycle searches for blocks which overlaps with the actual block in y. It finds closest blocks from the left and from the right during this search.
It does the same for the x coordinate and finds up and down adjacent blocks.
After that, it uses this information for computing of ExMin, .., EyMax and for determining whether the actual block is part of a table.
(You need to have the Brian's patch applied already.)

Regards

Marek

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

Brian, your patch changes the output of pdftotext in a file, is this to be expected?

Changed in poppler:
importance: Medium → Unknown
Changed in poppler:
importance: Unknown → Medium
Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

While working on bug #71160 I've found another regression introduced by this fix. In some cases, additional lines are added to the selection. For example, open the hig document and go to the first page. Start selecting the second line, but dragging from the margin, and you will see that the first line is selected too. This is because the second line is more indented than the first one. This fix changed the way blocks and lines are included in the selection by using the manhattan distance, and in this case, the distance of the first line is less than the second line, but the first line doesn't even intersect with the selection rectangle. If you start the selection closer to the beginning of the second line, then the first line is not included because distance to the second line is less in such case.

You can play with it now using the text demo of poppler-glib-demo. I've added an area selector to get the text of a given area. Try using X1=0, Y1=122, which should discard the first line, but it doesn't. However using X1=257, Y1=122 discards the first line entirely.

So, I think that we need to somehow discard blocks and lines that don't intersect with the selection rectangle even if the manhattan distance is less than any other block/line.

Revision history for this message
Brian Ewins (brian-ewins) wrote :

Carlos: it isn't a regression that lines outside a rectangle formed by the start and endpoints are included, it's the intent.

Consider selecting in a document with two columns, starting in the 1st column 2/3 down the page, ending in the 2nd column 1/3 down the page. In this case, the correct selection consists entirely of lines that lie outside the rectangle formed by the start and endpoints (ie, the bottom 1/3 of the 1st column and the top 1/3 of the 2nd column).

The motivation for this patch was that text selection by rectangles is fundamentally wrong. The correct approach is to reconstruct the reading order of text; then from two points on the page, find the nearest insertion points (where an edit cursor would go); swap the insertion points if necessary; then return the characters between them. The difficulties lie in inferring the reading order, and determining what 'nearest insertion point' means.

Clicking inside a word, the nearest insertion point is obvious; it's the nearest character boundary. Click in a blank area, and it's less clear. In Breuel's algorithms that I used for determining reading order, there is something that helps here. There, line width is determined by expanding the line left and right to fit the column it contains. So the line 'box' contains the initial indent if it is the first line of a paragraph, or the trailing space in the last line; or the ragged space for left- or right- justified text.

Poppler doesn't have columns as such, but blocks instead, and as I recall the line boxes are the tight bounding box of the words contained in the line. So we can try to determine insertion point by looking for the nearest block (horizontally and vertically), then the nearest line (vertically ONLY, so that we ignore indents/ragged space), then nearest character (horizontally). I mean these to be three different comparisons, discarding blocks, line and character candidates at each stage, not some single distance you sum up. The upshot would be that clicking in blank areas of a line that lie within its block's bounding box - or even nearby - will choose that line, not the one above or below.

(It's been ages since I looked at the poppler code, I can't remember if this heuristic is what the patches do already)

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

(oops, replied on launchpad, not sure if Carlos reads there. Repeating for fdo)

Carlos: it isn't a regression that lines outside a rectangle formed by the start and endpoints are included, it's the intent.

Consider selecting in a document with two columns, starting in the 1st column 2/3 down the page, ending in the 2nd column 1/3 down the page. In this case, the correct selection consists entirely of lines that lie outside the rectangle formed by the start and endpoints (ie, the bottom 1/3 of the 1st column and the top 1/3 of the 2nd column).

The motivation for this patch was that text selection by rectangles is fundamentally wrong. The correct approach is to reconstruct the reading order of text; then from two points on the page, find the nearest insertion points (where an edit cursor would go); swap the insertion points if necessary; then return the characters between them. The difficulties lie in inferring the reading order, and determining what 'nearest insertion point' means.

Clicking inside a word, the nearest insertion point is obvious; it's the nearest character boundary. Click in a blank area, and it's less clear. In Breuel's algorithms that I used for determining reading order, there is something that helps here. There, line width is determined by expanding the line left and right to fit the column it contains. So the line 'box' contains the initial indent if it is the first line of a paragraph, or the trailing space in the last line; or the ragged space for left- or right- justified text.

Poppler doesn't have columns as such, but blocks instead, and as I recall the line boxes are the tight bounding box of the words contained in the line. So we can try to determine insertion point by looking for the nearest block (horizontally and vertically), then the nearest line (vertically ONLY, so that we ignore indents/ragged space), then nearest character (horizontally). I mean these to be three different comparisons, discarding blocks, line and character candidates at each stage, not some single distance you sum up. The upshot would be that clicking in blank areas of a line that lie within its block's bounding box - or even nearby - will choose that line, not the one above or below.

(It's been ages since I looked at the poppler code, I can't remember if this heuristic is what the patches do already)

Revision history for this message
In , Brian Ewins (brian-ewins) wrote :

(In reply to comment #70)

(I originally replied on launchpad, which is supposed to copy it through to here, but it hasn't.)

Carlos: it isn't a regression that lines outside a rectangle formed by the start and endpoints are included, it's the intent.

Consider selecting in a document with two columns, starting in the 1st column 2/3 down the page, ending in the 2nd column 1/3 down the page. In this case, the correct selection consists entirely of lines that lie outside the rectangle formed by the start and endpoints (ie, the bottom 1/3 of the 1st column and the top 1/3 of the 2nd column).

You get situations like this even for single column text; just choose start and end points vertically above each other.

The motivation for this patch was that text selection by rectangles is fundamentally wrong. The correct approach is to reconstruct the reading order of text; then from two points on the page, find the nearest insertion points (where an edit cursor would go); swap the insertion points if necessary; then return the characters between them. The difficulties lie in inferring the reading order, and determining what 'nearest insertion point' means.

Clicking inside a word, the nearest insertion point is obvious; it's the nearest character boundary. Click in a blank area, and it's less clear. In Breuel's algorithms that I used for determining reading order, there is something that helps here. There, line width is determined by expanding the line left and right to fit the column it contains. So the line 'box' contains the initial indent if it is the first line of a paragraph, or the trailing space in the last line; or the ragged space for left- or right- justified text.

Poppler doesn't have columns as such, but blocks instead, and as I recall the line boxes are the tight bounding box of the words contained in the line. So we can try to determine insertion point by looking for the nearest block (horizontally and vertically), then the nearest line (vertically ONLY, so that we ignore indents/ragged space), then nearest character (horizontally). I mean these to be three different comparisons, discarding blocks, line and character candidates at each stage, not some single distance you sum up. The upshot would be that clicking in blank areas of a line that lie within its block's bounding box - or even nearby - will choose that line, not the one above or below.

(It's been ages since I looked at the poppler code, I can't remember if this heuristic is what the patches do already)

Revision history for this message
In , Marek Kašík (mkasik) wrote :

Created attachment 97356
Improve efficiency of searching for tables

(In reply to comment #68)
> Created attachment 40308 [details] [review]
> optimization of search for tables

This is an updated version of the patch. It needs Brian's patch to be already applied. (see #77087 for additional info)

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

I just realized we never really followed up on the last two patches.

My concern is that they change the pdftotext output.

I though they where for a) speed b) text selection so i'd prefer if they did not change pdftotext output.

I've checked a few files of the changed ones and the changes can be argued not to be for better or worse, but then again the problem is that 1 out of 3 files i have has a changed output in pdftotext and having 1600 files in the test suite it makes it impossible for me to go over them all and verify the changes are "not better nor worse".

Is there any chance we get patches that don't influence pdftotext output or at least not such drastically?

And yes, i know it's ages ago since you wrote the patches, sorry.

Revision history for this message
In , Marek Kašík (mkasik) wrote :

I went through the patch written by me and unfortunately I can not make it so that it returns the same result as before. I've separated axes in which it searches for immediate up/down/left/right neighbours, sorted them and found the neighbours by single pass (+ number of possible neighbour candidates in the other axis for given block, which should be sqrt(n) in average case).

The difference is that the patch searches for right-down neighbour by looking at down neighbour of its right neighbour and at the right neighbour of its down neighbour. If they match then it selects it as the right-bottom neighbour.
Previous version just searched for the closest block which is to the right of the block and below the block (and looking at the code, the result could depend on order of the blocks in the searched array).

Modifying the patch so that it would return the same results as without it would cost the whole efficiency.
Anyway, the efficiency improvement of my patch is not as big as the one from Brian so you can reject it (but I think that it improves the conversion :) ).

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 121848
Cache result of inner loop in visitDepthFirst

This is an alternative to Brian's patch in comment 65. This speeds up the visitDepthFirst function by caching the result in the inner loop. This provides a similar speedup without changing the output of pdftotext.

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

Comment on attachment 121848
Cache result of inner loop in visitDepthFirst

Looks good to me, pushed. Thanks!

Revision history for this message
In , Gitlab-migration (gitlab-migration) wrote :

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/516.

Changed in poppler:
status: Confirmed → Unknown
Displaying first 40 and last 40 comments. View all 156 comments or add a comment.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.