evince can not find ü in attached PDF

Bug #116453 reported by lherrmann
Affects            Status         Importance   Assigned to   Milestone
Poppler            Fix Released   Low
poppler (Ubuntu)   Fix Released   Low          Unassigned

Bug Description

Binary package hint: evince

1) lsb_release -rd
Description: Ubuntu Vivid Vervet (development branch)
Release: 15.04

2) apt-cache policy evince
evince:
  Installed: 3.14.1-0ubuntu1
  Candidate: 3.14.1-0ubuntu1
  Version table:
 *** 3.14.1-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
        100 /var/lib/dpkg/status

3) What is expected to happen with the attached document is when one searches for:
über

it is found:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453/+attachment/102979/+files/example.pdf

4) What happens instead is it does not return any matches.

WORKAROUND: Use the built-in PDF viewer+search with chromium-browser or chrome (doesn't work in Firefox).

apt-cache policy chromium-browser
chromium-browser:
  Installed: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Candidate: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Version table:
 *** 39.0.2171.65-0ubuntu0.14.04.1.1064 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/universe amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ trusty-security/universe amd64 Packages
        100 /var/lib/dpkg/status
     34.0.1847.116-0ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages

apt-cache policy google-chrome-stable:i386
google-chrome-stable:i386:
  Installed: 39.0.2171.95-1
  Candidate: 39.0.2171.95-1
  Version table:
 *** 39.0.2171.95-1 0
        500 http://dl.google.com/linux/chrome/deb/ stable/main i386 Packages
        100 /var/lib/dpkg/status

ProblemType: Bug
Architecture: i386
Date: Wed May 23 18:22:27 2007
DistroRelease: Ubuntu 7.04
ExecutablePath: /usr/bin/evince
Package: evince 0.8.1-0ubuntu1
PackageArchitecture: i386
ProcEnviron:
 LANGUAGE=en_US:en
 PATH=~/local/bin:~/local/lib:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: evince
Uname: Linux copper 2.6.20-15-generic #2 SMP Sun Apr 15 07:36:31 UTC 2007 i686 GNU/Linux

Revision history for this message
In , Guillaume-desmottes (guillaume-desmottes) wrote :

This bug may be related to bug 7064 and bug 7065.

Revision history for this message
lherrmann (lherrmann) wrote : evince can not find special characters in pdfs

Binary package hint: evince

When using the CTRL+F search function to find a string with special characters (e.g. "über"), evince does not return any matches.

ProblemType: Bug
Architecture: i386
Date: Wed May 23 18:22:27 2007
DistroRelease: Ubuntu 7.04
ExecutablePath: /usr/bin/evince
Package: evince 0.8.1-0ubuntu1
PackageArchitecture: i386
ProcCmdline: evince file:///home/lherrmann/uni/semester_8/pn/skript/pn-2006.pdf
ProcCwd: /home/lherrmann
ProcEnviron:
 LANGUAGE=en_US:en
 PATH=~/local/bin:~/local/lib:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: evince
Uname: Linux copper 2.6.20-15-generic #2 SMP Sun Apr 15 07:36:31 UTC 2007 i686 GNU/Linux

Revision history for this message
lherrmann (lherrmann) wrote :
lherrmann (lherrmann)
description: updated
Revision history for this message
Sebastien Bacher (seb128) wrote :

Thank you for your bug. Could you attach an example?

Changed in evince:
assignee: nobody → desktop-bugs
importance: Undecided → Low
status: Unconfirmed → Needs Info
Revision history for this message
lherrmann (lherrmann) wrote :
Revision history for this message
Daniel Holbach (dholbach) wrote :

Confirmed with evince and xpdf. Does searching work for you in any PDF viewer?

Revision history for this message
Daniel Holbach (dholbach) wrote :

I mean searching 'Über'.

Revision history for this message
Sebastien Bacher (seb128) wrote :

There is a poppler bug upstream, https://bugs.freedesktop.org/show_bug.cgi?id=7063

Changed in evince:
status: Needs Info → Confirmed
Revision history for this message
lherrmann (lherrmann) wrote :

I just tried it with kpdf and it doesn't work there either.
In fact, I also tried it with the MacOS PDF viewer and, surprisingly, it doesn't even work there.

Changed in poppler:
status: Unknown → Confirmed
Changed in poppler:
status: Confirmed → Triaged
Revision history for this message
hdante (hdante) wrote :

(forwarding duplicate bug 224702)
can't search accented letters

In certain PDF documents, searching for words with accented letters gives no results.
Steps to reproduce:
1. Open attached file
2. Search for "implementação" or "exercícios"
3. Text is not found

There is the same problem in some other files.

The problem is especially evil, because "tracker" doesn't work either.

Revision history for this message
hdante (hdante) wrote :

BTW, the importance of this bug is not "low". It should be treated as if it caused "information loss".

Changed in poppler:
importance: Unknown → Medium
Changed in poppler:
importance: Medium → Unknown
Changed in poppler:
importance: Unknown → Medium
Revision history for this message
In , Freedesktopbug-20-k-d (freedesktopbug-20-k-d) wrote :

Just found this old bug with status "NEW". It works for me with Evince 2.30.3 [poppler/cairo (0.12.4)]. Please reopen this bug if it doesn't work for you.

Changed in poppler:
status: Confirmed → Invalid
penalvch (penalvch)
summary: - evince can not find special characters in pdfs
+ evince can not find ü in attached PDF
Revision history for this message
In , penalvch (penalvch) wrote :

Downstream report:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453

1) lsb_release -rd
Description: Ubuntu Vivid Vervet (development branch)
Release: 15.04

2) apt-cache policy evince
evince:
  Installed: 3.14.1-0ubuntu1
  Candidate: 3.14.1-0ubuntu1
  Version table:
 *** 3.14.1-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
        100 /var/lib/dpkg/status

3) What is expected to happen with the attached document is when one searches for:
über

it is found:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453/+attachment/102979/+files/example.pdf

4) What happens instead is it does not return any matches. This is reproducible as far back as Ubuntu 7.04 with evince 0.8.1-0ubuntu1, with the older poppler version built for it, so probably not a regression.

WORKAROUND: Use the built-in PDF viewer+search with chromium-browser or chrome (doesn't work in Firefox).

apt-cache policy chromium-browser
chromium-browser:
  Installed: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Candidate: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Version table:
 *** 39.0.2171.65-0ubuntu0.14.04.1.1064 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/universe amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ trusty-security/universe amd64 Packages
        100 /var/lib/dpkg/status
     34.0.1847.116-0ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages

apt-cache policy google-chrome-stable:i386
google-chrome-stable:i386:
  Installed: 39.0.2171.95-1
  Candidate: 39.0.2171.95-1
  Version table:
 *** 39.0.2171.95-1 0
        500 http://dl.google.com/linux/chrome/deb/ stable/main i386 Packages
        100 /var/lib/dpkg/status

description: updated
tags: added: fiesty trusty vivid
Changed in poppler:
importance: Medium → Unknown
status: Invalid → Unknown
Changed in poppler (Ubuntu):
assignee: Ubuntu Desktop Bugs (desktop-bugs) → nobody
Changed in poppler:
importance: Unknown → Low
status: Unknown → Confirmed
Revision history for this message
In , Jason Crain (jcrain) wrote :

If you look at the text copied and pasted from Adobe Reader and Chrome, the word 'Über' is not actually in that document. The diaeresis is separate from the 'U'. We could make search looser by stripping out combining characters. It looks like that's what Adobe Reader and Chrome do. Is that the kind of thing that people would want? Being able to find 'uber' by searching for 'über' or 'ubér' or 'ũb̏ȇ̱r̽'?

That might make some people upset. We already have bug #85702 requesting to make search stricter. Though personally I think a looser search makes sense.
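
For illustration only, the looser search proposed in this comment can be sketched in Python with the standard unicodedata module. This is not poppler code: the function names are hypothetical, and the sample text assumes that the spacing diaeresis is extracted just before the plain 'U'.

```python
import unicodedata

def strip_combining(text):
    """Decompose text (NFKD) and drop combining marks, keeping base characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def loose_find(haystack, needle):
    """Case-insensitive substring test that ignores combining marks on both sides."""
    return strip_combining(needle).casefold() in strip_combining(haystack).casefold()

# Assumed extraction result: a spacing diaeresis (U+00A8) followed by a plain 'U'.
page_text = "\u00a8Uber den Wolken"

print(loose_find(page_text, "über"))
print(loose_find(page_text, "ubér"))
```

Because the marks are stripped from both the query and the page text, 'über' and 'ubér' become indistinguishable, which is exactly the trade-off raised against the stricter search request.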

Revision history for this message
In , penalvch (penalvch) wrote :

Jason Crain, thank you for your response.

I appreciate keeping a lean code base to ease maintenance. However, I'm a strong proponent of functionality compatibility expectations, to ease the transition for folks from an alternative operating system, and/or PDF viewer.

Hence, in this particular case, matching the search behavior of Reader/Chrome makes sense.

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 112107
Remove combining characters from normalized text

This patch changes normalization so that combining characters are removed from the normalized text. This makes searching through TextPage::findText insensitive to these characters.

Also, renames unicodeNormalizeNFKC to unicodeNormalizeSearch to make it clear it's no longer doing a regular NFKC normalization.

Renames decomp_compat to decomp_compat_base because it now strips combining characters, leaving only base characters, in addition to compatibility decomposition.

Removes UnicodeCompTables.h and some compose functions. They're no longer needed since we're not recomposing the characters.

I'm not sure if UnicodeTypeTable.h and UnicodeCompTables.h are considered part of the public interface. They're included in the xpdf headers. Albert, is it OK to change these files in this way?

Revision history for this message
In , Adrian Johnson (ajohnson-redneon) wrote :

I'm not sure that removing this functionality is a good idea. Can't we just add an option to findText to enable a looser search and leave it to the front ends to decide if/how to expose this option?

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

I'm with Adrian; I don't think changing this at such a low level is a good idea.

Revision history for this message
hdante (hdante) wrote :

I don't really understand what's going on. Unicode has a collation algorithm; can't it be used?

Revision history for this message
In , Jason Crain (jcrain) wrote :

I suppose if I add an option to findText, I should also add a flag (POPPLER_FIND_IGNORE_COMBINING?) to PopplerFindFlags, for the glib front end's poppler_page_find_text_with_options(). It would be nice if someone could confirm that evince would actually use this option.

Revision history for this message
Jason Crain (jcrain) wrote :

hdante: the problem is that, despite appearances, the PDF in the bug description does not contain the word 'Über'. It contains the word 'Uber', without a diaeresis. You can see this if you copy and paste from the document using any PDF reader, including Adobe Reader, Google Chrome, Foxit, etc. There is a diaeresis, but it is not really attached to the 'U'.

Even so, Adobe Reader and Chrome can still find something if you search the document for 'über'. What they seem to be doing is ignoring any diacritic marks, so if you search for 'über' (or even 'ubér') it will find 'Uber'. I was proposing similar behavior for poppler.
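
Anyone who wants to check this for a document of their own can paste the copied text into a small script like the following. It is only a sketch: the describe helper is a hypothetical name, and the pasted string is an assumed example of a spacing diaeresis followed by a plain 'U'.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

pasted = "\u00a8U"    # assumed paste result: spacing diaeresis, then plain U
typed = "\u00dc"      # precomposed Ü, as typed into a search box

for label, text in (("pasted", pasted), ("typed", typed)):
    print(label, describe(text))

# NFC does not merge the pasted pair: U+00A8 is a spacing character,
# not a combining mark, so the comparison below is not equal.
print(unicodedata.normalize("NFC", pasted) == typed)
```

Listing the code points makes the mismatch visible even when both strings look alike on screen.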

Revision history for this message
In , Carlos Garcia Campos (carlosgc) wrote :

(In reply to Jason Crain from comment #6)
> I suppose if I add an option to findText, I should also add a flag
> (POPPLER_FIND_IGNORE_COMBINING?) to PopplerFindFlags, for the glib front
> end's poppler_page_find_text_with_options(). It would be nice if someone
> could confirm that evince would actually use this option.

I don't see a reason why someone might want to search for ü and not find a word containing ü. So, if there are two methods in poppler core, I would change the glib bindings to use the one correctly finding combining characters.

Revision history for this message
hdante (hdante) wrote :

I understand now. In this case, if über is searched, the reasonable, easy solution is to match uber.

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 113036
[draft] combine characters

I might be able to fix this in a better way by combining letters with nearby diacritic marks so that this document *would* contain ü. It seems to be a nice improvement for some latex documents. Attached patch can give you a rough idea of what I mean. It still needs a lot of work though.

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

I certainly remember we already did that combination somewhere, either in okular or in poppler, but I can't find it and of course the document does not work, so it may be a fake memory :D

I think this may make sense, though then again preserving the old behaviour via a flag (even if not default) in the TextOutputDev may make sense if someone (not sure who though) would be depending on it.

Revision history for this message
In , Jason Crain (jcrain) wrote :

Created attachment 114485
Combine base characters and diacritical marks

My attempt to improve this.

When you make a diacriticized character with LaTeX, ü for example, it will make a PDF with separate u and ¨ characters and draw them over each other. This patch detects when this happens and converts it to a combining character sequence so that pdftotext and the search function will see a ü and not separate characters. It also refactors some code (TextWord::ensureCapacity and TextWord::setInitialBounds) to avoid duplication.

Limitations:

It doesn't handle some of LaTeX's diacritic commands, such as \b for bar under letter or \d for dot under letter, because they are positioned differently and \d would be easy to confuse with a period. They don't seem to be used very often though.

If the base character is unusual, such as a math symbol or number, adding a combining character can make the result of pdftotext look a bit odd. I think this is because if the font or rendering engine doesn't know how to draw the character sequence, it will place the diacritic in a strange position, such as to the right of the letter. In these cases, the output of pdftotext is technically correct; it just looks odd when drawn on screen.

When selecting text in evince, you can separately select the character and diacritic. If that's a problem, I think I could fix it by adding clustering support so that a group of glyphs and characters are treated as a single unit. It would make this a much more invasive change, but maybe I should try it anyway. It would be nice to also fix the assumption that one glyph always matches one character.
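
The idea behind the patch can be sketched roughly as follows. This is an illustrative Python model, not the actual poppler implementation: the Glyph type, the overlap threshold, and the mapping table are all assumptions made for the example, and it assumes the diacritic glyph is emitted immediately before its base letter.

```python
import unicodedata
from dataclasses import dataclass
from typing import List

# Spacing diacritics and their combining equivalents (small illustrative subset).
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u0060": "\u0300",  # GRAVE ACCENT -> COMBINING GRAVE ACCENT
    "\u02c6": "\u0302",  # MODIFIER LETTER CIRCUMFLEX ACCENT -> COMBINING CIRCUMFLEX ACCENT
    "\u02dc": "\u0303",  # SMALL TILDE -> COMBINING TILDE
}

@dataclass
class Glyph:
    char: str
    x_min: float
    x_max: float

def overlaps(a: Glyph, b: Glyph) -> bool:
    """True if more than half of the narrower glyph's width is shared horizontally."""
    shared = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    narrower = min(a.x_max - a.x_min, b.x_max - b.x_min)
    return narrower > 0 and shared > 0.5 * narrower

def combine_diacritics(glyphs: List[Glyph]) -> str:
    """Merge a spacing diacritic drawn over the following glyph into one character."""
    out = []
    i = 0
    while i < len(glyphs):
        glyph = glyphs[i]
        following = glyphs[i + 1] if i + 1 < len(glyphs) else None
        mark = SPACING_TO_COMBINING.get(glyph.char)
        if mark and following is not None and overlaps(glyph, following):
            out.append(following.char + mark)  # base letter first, then the mark
            i += 2
        else:
            out.append(glyph.char)
            i += 1
    # NFC turns sequences such as U + COMBINING DIAERESIS into the precomposed Ü.
    return unicodedata.normalize("NFC", "".join(out))

word = [Glyph("\u00a8", 1.0, 6.0), Glyph("U", 0.0, 7.0), Glyph("b", 7.0, 12.0),
        Glyph("e", 12.0, 16.0), Glyph("r", 16.0, 20.0)]
print(combine_diacritics(word))
```

A purely horizontal overlap test like this one would also misfire in the cases listed under Limitations, for example a dot-under accent that sits where a period could be.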

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

I think it looks good as it is.

If no one disagrees, I'll commit in a week.

Revision history for this message
In , Albert Astals Cid (aacid) wrote :

Pushed.

Changed in poppler:
status: Confirmed → Fix Released
Revision history for this message
Sebastien Bacher (seb128) wrote :

The fix has been committed upstream; we should get it in the next cycle.

Changed in poppler (Ubuntu):
status: Triaged → Fix Committed
Revision history for this message
In , Nelson Benitez (gnel) wrote :

Hi Jason, thank you very much for the patch, btw, today I was reading this pdf:

http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/lecture_notes/GTK_textview.pdf

and noticed that a lot of words with double f, like 'buffer', are not found[1] when searching for them; also, when copied to gedit, the word shows the Unicode "not found" glyph in place of the 'ff'.

So, is your patch covering this double f case?

If so, please ignore this comment, but from a quick reading of this bug I thought this double f case was not handled, as it isn't an accented word or diacritic.

Thank you.

[1] Some 'buffer' words are found, the ones in a code block, but the ones in the normal text are not. Eg. the 5th paragraph of the fourth page, that starts with "Locations within a text buffer are represented..."

Revision history for this message
In , Jason Crain (jcrain) wrote :

(In reply to Nelson Benitez from comment #13)
> Hi Jason, thank you very much for the patch, btw, today I was reading this
> pdf:
>
> http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/
> lecture_notes/GTK_textview.pdf
>
> and noticed that lot of words with double f, like 'buffer', are not
> found[1] when searching for it, also when copied to gedit it shows the
> unicode not found glyph inplace of the 'ff' in the word.
>
> So, is your patch covering this double f case?

No, it does not fix that. That file has a different problem and I don't see a way of fixing it. The PDF creator would need to add some extra information before we could guess that character code 27 should be a double f.
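
To illustrate the difference, here is a small Python sketch. It assumes that an unmapped glyph surfaces in the extracted text as the raw character code 27 (U+001B), and it uses NFKC normalization as a stand-in for the search normalization discussed earlier in this bug; the searchable helper is a hypothetical name.

```python
import unicodedata

def searchable(text):
    """Apply compatibility normalization (NFKC) before searching."""
    return unicodedata.normalize("NFKC", text)

with_mapping = "bu\ufb00er"    # glyph mapped to U+FB00 LATIN SMALL LIGATURE FF
without_mapping = "bu\x1ber"   # assumed: glyph left as raw character code 27

print("buffer" in searchable(with_mapping))
print("buffer" in searchable(without_mapping))
```

When the PDF maps the ligature glyph to U+FB00, normalization expands it to 'ff' and the word is searchable. With only a bare character code there is nothing for normalization to work from.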

Revision history for this message
In , Jason Crain (jcrain) wrote :

*** Bug 66569 has been marked as a duplicate of this bug. ***

Revision history for this message
In , Nelson Benitez (gnel) wrote :

(In reply to Jason Crain from comment #14)
> (In reply to Nelson Benitez from comment #13)
> > Hi Jason, thank you very much for the patch, btw, today I was reading this
> > pdf:
> >
> > http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/
> > lecture_notes/GTK_textview.pdf
> >
> > and noticed that lot of words with double f, like 'buffer', are not
> > found[1] when searching for it, also when copied to gedit it shows the
> > unicode not found glyph inplace of the 'ff' in the word.
> >
> > So, is your patch covering this double f case?
>
> No, it does not fix that. That file has a different problem and I don't see
> a way of fixing it. The PDF creator would need to add some extra
> information before we could guess that character code 27 should be a double
> f.

Thanks, Jason, for the explanation; indeed it was a problem in the PDF creator. Just for completeness, I'm posting a link describing the problem and the solution in pdfTeX:

http://tex.stackexchange.com/questions/31113/enable-searching-in-a-pdflatex-generated-document

Regards,

madbiologist (me-again)
Changed in poppler (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
madbiologist (me-again) wrote :

The word Über is now found when searching the attached PDF within Evince on Ubuntu 18.04.5 "Bionic Beaver" using evince 3.28.4-0ubuntu1.2 and poppler 0.62.0-2ubuntu2.12.
