Certain characters are not rendered correctly when selected (highlighted)

Bug #808894 reported by Thibault D
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Poppler
Unknown
Medium
poppler (Ubuntu)
Triaged
Low
Unassigned

Bug Description

1) lsb_release -rd
Description: Ubuntu Vivid Vervet (development branch)
Release: 15.04

2) apt-cache policy evince
evince:
  Installed: 3.14.1-0ubuntu1
  Candidate: 3.14.1-0ubuntu1
  Version table:
 *** 3.14.1-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
        100 /var/lib/dpkg/status

3) What is expected to happen via https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/+attachment/2202502/+files/testfile.pdf is when one highlights the first three lines, it doesn't mis-highlight the words.

What happens instead is certain letters are not visible as per https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/+attachment/2202506/+files/screenshot.png .

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: evince 2.32.0-0ubuntu12.2
ProcVersionSignature: Ubuntu 2.6.38-8.42-generic 2.6.38.2
Uname: Linux 2.6.38-8-generic i686
Architecture: i386
CheckboxSubmission: 9e6554c36969a101b9e0e3075c8ffbe0
CheckboxSystem: b8f3ec504801f13fc208edb5c785b099
Date: Mon Jul 11 18:38:00 2011
InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Release i386 (20110427.1)
ProcEnviron:
 LANGUAGE=fr_FR:en
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
ProcVersionSignature_: Ubuntu 2.6.38-8.42-generic 2.6.38.2
SourcePackage: evince
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Thibault D (thibdrev) wrote :
Revision history for this message
Thibault D (thibdrev) wrote :
Revision history for this message
Walter Garcia-Fontes (walter-garcia) wrote :

Thank you for your bug report. This bug has been reported upstream for evince. You can track it and make comments at:

https://bugzilla.gnome.org/show_bug.cgi?id=439070

Changed in evince (Ubuntu):
status: New → Invalid
Victor Vargas (kamus)
Changed in evince (Ubuntu):
importance: Undecided → Medium
status: Invalid → Triaged
Changed in evince:
importance: Unknown → Medium
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Still present in Oneiric

summary: - bad selected text rendering
+ Certain characters are not rendered correctly when selected
+ (highlighted)
Changed in poppler:
importance: Unknown → Medium
status: Unknown → New
Changed in poppler:
importance: Medium → Critical
status: New → Confirmed
Revision history for this message
Thibault D (thibdrev) wrote :

Still present in todays' Precise daily build (evince 3.3.5 | poppler/cairo 0.18.3)

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 57622
test pdf

Forwarding from gnome bugzilla: https://bugzilla.gnome.org/show_bug.cgi?id=654473

When selecting text in this pdf, some glyphs are not visible. Text is displayed correctly when not selected.

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 57623
incorrect display in evince

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 57624
correct display in evince

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 57625
Fixed display for selected glyph in ActualText span

This patch seems to correct the issue. It sets the correct CharCode.

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

I don't think this patch is correct, you are only setting the charcode once in ActualText::addChar so if multiple ActualText::addChar calls happen before ActualText::end the other charcodes are lost, no?

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

The patch doesn't work when the ActualText span contains more than one glyph. There is a test case in the test suite that demonstrates the problem:

http://cgit.freedesktop.org/poppler/test/tree/unittestcases/WithActualText.pdf

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 57989
Enable displayed chars to map to any number of text chars

It's tricky when the length of the ActualText does not match the number of displayed glyphs. This first patch modifies the TextWord, TextLine, TextLineFrag, TextBlock, and TextPage classes to suport displayed characters that can be mapped to any number of text characters.

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 57990
Fixes display for selected glyphs in ActualText span

This sets the correct CharCode for each glyph in an ActualText span. Attempts to match one text character to each glyph. If there are more glyphs, they are added without matching text. If there is more text, the remaining is added to the last glyph.

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Created attachment 57993
fix selection of glyphs in actualtext

Thanks for these patches. I have started looking at some of the text related bugs and the inability of TextOuputDev to understand the difference between glyphs and characters is the cause of some of these bugs. The first patch is very similar to the solution I had in mind. The first patch also fixes bug 9001.

Some comments on the first patch:

The following code in TextBlock::coalesce() needs fixing:
   if (word2->len == word0->len &&
       !memcmp(word2->text, word0->text,
        word0->len * sizeof(Unicode))) {
len need to be replaced with textLen.

I don't think addChar should be renamed to addChars. My understanding of the code is that 'Char' is referring to the CharCodes and only one CharCode is added per call.

I would move the surrogate decoding to TextWord:addChar() and do the decoding as the unicode values are copied into the text array. This avoids having to make a copy of the unicode array in TextPage::addChar().

Some comments on the second patch:

I don't like the way the second patch is mapping the replacement text to the charcodes. I am attaching an alternative patch that distributes the replacement text evenly across the charcodes.

Revision history for this message
In , Jason Crain (jcrain) wrote :

Comment on attachment 57993
fix selection of glyphs in actualtext

Review of attachment 57993:
-----------------------------------------------------------------

::: poppler/TextOutputDev.cc
@@ +5331,5 @@
> + // If this is the last glyph ensure all remaining text is included
> + // as pos may be < length due to rounding errors.
> + if (i == lenGlyphs - 1)
> + count = length - first;
> + text->addChar(state, glyphs[i].x, glyphs[i].y, glyphs[i].dx, glyphs[i].dy,

This needs to make sure that a surrogate pair is not split

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 58013
Enable displayed chars to map to any number of text chars

updated patch

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

Guys, i'm a bit lost, sorry, are both patches supposed to fix the same issue in a different way?

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Both patches are required.

The "fix selection of glyphs in actualtext" ensures all charcodes are passed through to TextPage. I need to update this to correctly handle surrogates.

The "enable displayed chars to map to any number of text chars" is a prerequisite for the "fix selection of glyphs in actualtext" patch. It also fixes text selection of ligatures (bug 9001).

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

When using Jason's patch on a pdf that i will be attaching in a moment, pdftotext changes from extracting

 Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing
through 𝑥. We want to define an order on the set ℒ 𝑥 ∖ {𝐿} in the same way as we
did in the proof of Proposition 5. In the disk model ̃
𝑃 with boundary circle 𝐿, every
line 𝐻 ∈ ℒ 𝑥 ∖{𝐿} separates ̃
𝑃 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may
 be disconnected). Since we know from Propositions 6 and 7 that lines always intersect
 transversally, it follows that for two such lines, one of the respective lower parts is entirely

to

 Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing
through 𝑥. We want to define an order on the set ℒ𝑥 ∖ {𝐿} in the same way as we
 with boundary circle 𝐿, every
did in the proof of Proposition 5. In the disk model 𝑃
 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may
line 𝐻 ∈ ℒ𝑥 ∖{𝐿} separates 𝑃
 be disconnected). Since we know from Propositions 6 and 7 that lines always intersect
 transversally, it follows that for two such lines, one of the respective lower parts is entirely

You can see that in the second case some weird/unwanted reordering happened

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

Created attachment 58045
The said pdf

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Created attachment 58178
convert utf-16 to ucs-4 when reading ToUnicode

The next two patches fix the problem of "fix selection of glyphs in actualtext" not handling surrogates. The "Unicode" type is meant to be UCS-4 so the solution is to convert UTF-16 to UCS-4 when it the ToUnicode cmap is parsed.

This patch does the UTF-16 conversion in CharCodeToUnicode.cc. As a result the special surrogate handling in TextOutputDev and HtmlOutputDev can be removed.

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Created attachment 58179
move text string to unicode conversion into a separate function

This patch adds a new function for converting PDF text strings to UCS-4. As a result the duplicated code in TextOutputDev and pdfinfo can be replaced by a call to this function.

This patch is to ensure that my updated "fix selection of glyphs in actualtext" does not have to care about surrogates.

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Created attachment 58180
fix selection of glyphs in actualtext

The updated version of "fix selection of glyphs in actualtext".

The patch order is:
1 - convert utf-16 to ucs-4 when reading ToUnicode
2 - move text string to unicode conversion into a separate function
3 - Enable displayed chars to map to any number of text chars
4 - fix selection of glyphs in actualtext

Patch 3 needs to be updated to remove the surrogate handling and fix the regression.

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Created attachment 58643
fix regressions

This patch fixes the regressions in "Enable displayed chars to map to any number of text chars".

The problem is the changes now allow glyphs that map to zero length unicode strings to be added to TextWords. Often these glyphs have overlapping bounding boxes or are not on the same baseline. This confuses TextOutputDev when trying to determine the layout of the text.

This patch does two things:
- it avoids breaking words when one of these glyphs with an empty mapping is encountered
- it increases the tolerance for overlapping bounding boxes.

With the attached PDF the result the text output is still different but checking the differences it is actually an improvement.

However I suspect the changes could potentially break other PDFs. If this patch causes problems, plan B is to change TextOutputDev to ignore the glyphs with zero mapping when determining the text layout (but still add these glyphs to the words to make text selection work correctly). This should emulate the old behavior as closely as possible.

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Created attachment 58644
Enable displayed chars to map to any number of text chars

This is Jason's patch rebased so it applies on top of my first two patches.

The patch order is:
1 - convert utf-16 to ucs-4 when reading ToUnicode
2 - move text string to unicode conversion into a separate function
3 - Enable displayed chars to map to any number of text chars
4 - fix selection of glyphs in actualtext
5 - fix regressions

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

With all the patches applied the pdftotext extraction of https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459
changes from

  •
Patches may be grouped with other patches to test the whole of a

to

  •  P
 atches may be grouped with other patches to test the whole of a

The correct extraction would be having no newline, but if we are going to have a newline i very much prefer the old way than the new one that breaks the word after the P

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Created attachment 58860
Don't reverse order of words with same xMin

This fixes the regression. Output is now:

•  Patches may be grouped with other patches to test the whole of a

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

Tehre's still some problems, in the file that i will be attaching


white X in patch b:

gets changed to

•w
 hite X in patch b:

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

Created attachment 58892
The pdf with the "white X in patch" problem

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

Created attachment 59003
don't start a new word if the previous char is a control char

Problem is caused by a ^G that overlaps the first letter of the word. This patch avoids started a new word if a control characters overlaps other character. Although it would probably be better to strip out control characters from the extracted text. Acroread does not include the ^G in extracted text.

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

With this last patch now we get
       P
atch creation date
  instead of
Patch creation date
again in https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459

Revision history for this message
João Matos (matosmeister) wrote :

confirmed on precise

Changed in poppler:
status: Confirmed → Unknown
penalvch (penalvch)
description: updated
no longer affects: evince (Ubuntu)
affects: poppler → evince (Ubuntu)
Changed in evince (Ubuntu):
importance: Critical → Undecided
status: Unknown → New
no longer affects: evince (Ubuntu)
affects: evince → poppler (Ubuntu)
Changed in poppler (Ubuntu):
importance: Medium → Undecided
status: Unknown → New
importance: Undecided → Low
status: New → Triaged
Revision history for this message
In , penalvch (penalvch) wrote :

1) lsb_release -rd
Description: Ubuntu Vivid Vervet (development branch)
Release: 15.04

2) apt-cache policy evince
evince:
  Installed: 3.14.1-0ubuntu1
  Candidate: 3.14.1-0ubuntu1
  Version table:
 *** 3.14.1-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
        100 /var/lib/dpkg/status

3) What is expected to happen via https://bugs.launchpad.net/ubuntu/+source/evince/+bug/808894/+attachment/2202502/+files/testfile.pdf is when one highlights the first three lines, it doesn't mis-highlight the words.

What happens instead is certain letters are not visible as per https://bugs.launchpad.net/ubuntu/+source/evince/+bug/808894/+attachment/2202506/+files/screenshot.png .

description: updated
Revision history for this message
In , penalvch (penalvch) wrote :

Created attachment 110940
testfile.pdf

Changed in poppler:
importance: Unknown → Low
status: Unknown → Confirmed
Revision history for this message
In , Jason Crain (jcrain) wrote :

*** This bug has been marked as a duplicate of bug 46603 ***

Revision history for this message
In , Jason Crain (jcrain) wrote :

*** Bug 87401 has been marked as a duplicate of this bug. ***

Changed in poppler:
status: Confirmed → Invalid
Changed in poppler:
importance: Low → Unknown
status: Invalid → Unknown
Changed in poppler:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
In , Gitlab-migration (gitlab-migration) wrote :

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/350.

Changed in poppler:
status: Confirmed → Unknown
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.