evince can not find ü in attached PDF

Bug #116453 reported by lherrmann on 2007-05-23

Affects		Status	Importance	Assigned to	Milestone
	Poppler	Fix Released	Low	freedesktop-bugs #87215
	poppler (Ubuntu)	Fix Released	Low	Unassigned

Bug Description

Binary package hint: evince

1) lsb_release -rd
Description: Ubuntu Vivid Vervet (development branch)
Release: 15.04

2) apt-cache policy evince
evince:
  Installed: 3.14.1-0ubuntu1
  Candidate: 3.14.1-0ubuntu1
  Version table:
*** 3.14.1-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
        100 /var/lib/dpkg/status

3) What is expected to happen with the attached document is when one searches for:
über

it is found:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453/+attachment/102979/+files/example.pdf

4) What happens instead is it does not return any matches.

WORKAROUND: Use the built-in PDF viewer+search with chromium-browser or chrome (doesn't work in Firefox).

apt-cache policy chromium-browser
chromium-browser:
  Installed: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Candidate: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Version table:
*** 39.0.2171.65-0ubuntu0.14.04.1.1064 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/universe amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ trusty-security/universe amd64 Packages
        100 /var/lib/dpkg/status
     34.0.1847.116-0ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages

apt-cache policy google-chrome-stable:i386
google-chrome-stable:i386:
  Installed: 39.0.2171.95-1
  Candidate: 39.0.2171.95-1
  Version table:
*** 39.0.2171.95-1 0
        500 http://dl.google.com/linux/chrome/deb/ stable/main i386 Packages
        100 /var/lib/dpkg/status

ProblemType: Bug
Architecture: i386
Date: Wed May 23 18:22:27 2007
DistroRelease: Ubuntu 7.04
ExecutablePath: /usr/bin/evince
Package: evince 0.8.1-0ubuntu1
PackageArchitecture: i386
ProcEnviron:
LANGUAGE=en_US:en
PATH=~/local/bin:~/local/lib:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
LANG=en_US.UTF-8
SHELL=/bin/bash
SourcePackage: evince
Uname: Linux copper 2.6.20-15-generic #2 SMP Sun Apr 15 07:36:31 UTC 2007 i686 GNU/Linux

In freedesktop.org Bugzilla #7063, Guillaume-desmottes (guillaume-desmottes) wrote on 2006-05-29:

This bug is maybe related to bug 7064 and bug 7065.

lherrmann (lherrmann) wrote on 2007-05-23: evince can not find special characters in pdfs

Binary package hint: evince

When using the Ctrl+F search function to find a string with special characters (e.g. "über"), evince does not return any matches.

ProblemType: Bug
Architecture: i386
Date: Wed May 23 18:22:27 2007
DistroRelease: Ubuntu 7.04
ExecutablePath: /usr/bin/evince
Package: evince 0.8.1-0ubuntu1
PackageArchitecture: i386
ProcCmdline: evince file:///home/lherrmann/uni/semester_8/pn/skript/pn-2006.pdf
ProcCwd: /home/lherrmann
ProcEnviron:
LANGUAGE=en_US:en
PATH=~/local/bin:~/local/lib:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
LANG=en_US.UTF-8
SHELL=/bin/bash
SourcePackage: evince
Uname: Linux copper 2.6.20-15-generic #2 SMP Sun Apr 15 07:36:31 UTC 2007 i686 GNU/Linux

lherrmann (lherrmann) wrote on 2007-05-23:

Dependencies.txt (4.2 KiB, text/plain; charset="utf-8")
ProcMaps.txt (26.4 KiB, text/plain; charset="utf-8")
ProcStatus.txt (661 bytes, text/plain; charset="utf-8")

lherrmann (lherrmann) on 2007-05-23

description:	updated

Sebastien Bacher (seb128) wrote on 2007-05-23:

Thank you for your bug. Could you attach an example?

Changed in evince:
assignee:	nobody → desktop-bugs
importance:	Undecided → Low
status:	Unconfirmed → Needs Info

lherrmann (lherrmann) wrote on 2007-05-23:

short pdf-file containing some words with umlauts. (8.8 KiB, application/pdf)

Daniel Holbach (dholbach) wrote on 2007-05-24:

Confirmed with evince and xpdf. Does searching work for you in any PDF viewer?

Daniel Holbach (dholbach) wrote on 2007-05-24:

I mean searching 'Über'.

Sebastien Bacher (seb128) wrote on 2007-05-24:

There is a poppler bug upstream, https://bugs.freedesktop.org/show_bug.cgi?id=7063

Changed in evince:
status:	Needs Info → Confirmed

lherrmann (lherrmann) wrote on 2007-05-24:

I just tried it with kpdf and it doesn't work there either.
In fact, I also tried it with the macOS PDF viewer and, surprisingly, it doesn't even work there.

Bug Watch Updater (bug-watch-updater) on 2007-05-25

Changed in poppler:
status:	Unknown → Confirmed

Sebastien Bacher (seb128) on 2008-02-06

Changed in poppler:
status:	Confirmed → Triaged

hdante (hdante) wrote on 2008-04-30:

#10

(forwarding duplicate bug 224702)
can't search accented letters

In certain PDF documents, searching for words with accented letters gives no results.
Steps to reproduce:
1. Open attached file
2. Search for "implementação" or "exercícios"
3. Text is not found

There is the same problem in some other files.

The problem is especially evil, because "tracker" doesn't work either.

hdante (hdante) wrote on 2008-04-30:

#11

BTW, the importance of this bug is not "low". It should be treated as if it caused "information loss".

Bug Watch Updater (bug-watch-updater) on 2010-09-10

Changed in poppler:
importance:	Unknown → Medium

Bug Watch Updater (bug-watch-updater) on 2011-01-24

Changed in poppler:
importance:	Medium → Unknown

Bug Watch Updater (bug-watch-updater) on 2011-02-03

Changed in poppler:
importance:	Unknown → Medium

In freedesktop.org Bugzilla #7063, Freedesktopbug-20-k-d (freedesktopbug-20-k-d) wrote on 2011-11-27:

#12

Just found this old bug with status "NEW". It works for me with Evince 2.30.3 [poppler/cairo (0.12.4)]. Please reopen this bug if it doesn't work for you.

Bug Watch Updater (bug-watch-updater) on 2011-11-28

Changed in poppler:
status:	Confirmed → Invalid

penalvch (penalvch) on 2014-12-11

summary:

- evince can not find special characters in pdfs
+ evince can not find ü in attached PDF

In freedesktop.org Bugzilla #87215, penalvch (penalvch) wrote on 2014-12-11:

#13

Downstream report:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453

1) lsb_release -rd
Description: Ubuntu Vivid Vervet (development branch)
Release: 15.04

3) What is expected to happen with the attached document is when one searches for:
über

it is found:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453/+attachment/102979/+files/example.pdf

4) What happens instead is it does not return any matches. This is reproducible as far back as Ubuntu 7.04 with evince 0.8.1-0ubuntu1, with the older poppler version built for it, so probably not a regression.

WORKAROUND: Use the built-in PDF viewer+search with chromium-browser or chrome (doesn't work in Firefox).

description:	updated
tags:	added: fiesty trusty vivid
Changed in poppler:
importance:	Medium → Unknown
status:	Invalid → Unknown
Changed in poppler (Ubuntu):
assignee:	Ubuntu Desktop Bugs (desktop-bugs) → nobody

Bug Watch Updater (bug-watch-updater) on 2014-12-11

Changed in poppler:
importance:	Unknown → Low
status:	Unknown → Confirmed

In freedesktop.org Bugzilla #87215, Jason Crain (jcrain) wrote on 2014-12-15:

#14

If you look at the copy and paste from Adobe Reader and Chrome, the word 'Über' is not actually in that document. The diaeresis is separate from the 'U'. We could make search looser by stripping out combining characters. Looks like that's what Adobe Reader and Chrome do. Is that the kind of thing that people would want? Being able to find 'uber' by searching for 'über' or 'ubér' or 'ũb̏ȇ̱r̽'?

That might make some people upset. We already have bug #85702 requesting to make search stricter. Though personally I think a looser search makes sense.
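
For readers following along, the "looser search" idea can be illustrated with a minimal Python sketch. This is not poppler's code: the helper names `strip_marks` and `loose_find` are made up for illustration, and the sketch assumes that decomposing the text and dropping every Unicode mark character is an acceptable approximation of "stripping out combining characters".

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose text (NFKD) and drop all mark characters, keeping base characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Unicode general categories Mn, Mc and Me are the "mark" categories.
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


def loose_find(haystack: str, needle: str) -> bool:
    """Case-insensitive substring search that ignores diacritics (hypothetical helper)."""
    return strip_marks(needle).casefold() in strip_marks(haystack).casefold()
```

With these helpers, searching a page whose extracted text is 'Uber' for 'über' or 'ubér' would report a match, which is the trade-off being discussed: the search becomes more forgiving but can no longer distinguish 'uber' from 'über'.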

In freedesktop.org Bugzilla #87215, penalvch (penalvch) wrote on 2014-12-16:

#15

Jason Crain, thank you for your response.

I appreciate keeping a lean code base to ease maintenance. However, I'm a strong proponent of functionality compatibility expectations, to ease the transition for folks from an alternative operating system, and/or PDF viewer.

Hence, in this select case, allowing for feature compatibility expectations with Reader/Chrome makes sense here.

In freedesktop.org Bugzilla #87215, Jason Crain (jcrain) wrote on 2015-01-12:

#16

Created attachment 112107
Remove combining characters from normalized text

This patch changes normalization so that combining characters are removed from the normalized text. This makes searching through TextPage::findText insensitive to these characters.

Also, renames unicodeNormalizeNFKC to unicodeNormalizeSearch to make it clear it's no longer doing a regular NFKC normalization.

Renames decomp_compat to decomp_compat_base because it now strips combining characters, leaving only base characters, in addition to compatibility decomposition.

Removes UnicodeCompTables.h and some compose functions. They're no longer needed since we're not recomposing the characters.

I'm not sure if UnicodeTypeTable.h and UnicodeCompTables.h are considered part of the public interface. They're included in the xpdf headers. Albert, is it OK to change these files in this way?

In freedesktop.org Bugzilla #87215, Adrian Johnson (ajohnson-redneon) wrote on 2015-01-12:

#17

I'm not sure that removing this functionality is a good idea. Can't we just add an option to findText to enable a looser search and leave it to the front ends to decide if/how to expose this option?

In freedesktop.org Bugzilla #87215, Albert Astals Cid (aacid) wrote on 2015-01-12:

#18

I'm with Adrian, don't think changing this at such low level is a good idea.

hdante (hdante) wrote on 2015-01-14:

#19

I don't really understand what's going on. Unicode has a collation algorithm; can't it be used?

In freedesktop.org Bugzilla #87215, Jason Crain (jcrain) wrote on 2015-01-21:

#22

I suppose if I add an option to findText, I should also add a flag (POPPLER_FIND_IGNORE_COMBINING?) to PopplerFindFlags, for the glib front end's poppler_page_find_text_with_options(). It would be nice if someone could confirm that evince would actually use this option.

Jason Crain (jcrain) wrote on 2015-01-21:

#20

hdante: the problem is that, despite appearances, the PDF in the bug description does not contain the word 'Über'. It contains the word 'Uber', without a diaeresis. You can see this if you copy and paste from the document using any PDF reader, including Adobe Reader, Google Chrome, Foxit, etc. There is a diaeresis, but it is not really attached to the 'U'.

Even so, Adobe Reader and Chrome can still find something if you search the document for 'über'. What they seem to be doing is ignoring any diacritic marks, so if you search for 'über' (or even 'ubér') it will find 'Uber'. I was proposing similar behavior for poppler.

In freedesktop.org Bugzilla #87215, Carlos Garcia Campos (carlosgc) wrote on 2015-01-21:

#23

(In reply to Jason Crain from comment #6)
> I suppose if I add an option to findText, I should also add a flag
> (POPPLER_FIND_IGNORE_COMBINING?) to PopplerFindFlags, for the glib front
> end's poppler_page_find_text_with_options(). It would be nice if someone
> could confirm that evince would actually use this option.

I don't see a reason why someone might want to search for ü and not find a word containing ü. So, if there are two methods in poppler core, I would change the glib bindings to use the one correctly finding combining characters.

hdante (hdante) wrote on 2015-01-23:

#21

I understand now. In this case, if über is searched, the reasonable easy solution is to match uber.

In freedesktop.org Bugzilla #87215, Jason Crain (jcrain) wrote on 2015-02-02:

#24

Created attachment 113036
[draft] combine characters

I might be able to fix this in a better way by combining letters with nearby diacritic marks so that this document *would* contain ü. It seems to be a nice improvement for some LaTeX documents. The attached patch can give you a rough idea of what I mean. It still needs a lot of work though.

In freedesktop.org Bugzilla #87215, Albert Astals Cid (aacid) wrote on 2015-02-02:

#25

I certainly remember we already did that combination somewhere, either in okular or in poppler, but I can't find it and of course the document does not work, so it may be a fake memory :D

I think this may make sense, though then again preserving the old behaviour via a flag (even if not default) in the TextOutputDev may make sense if someone (not sure who though) would be depending on it.

In freedesktop.org Bugzilla #87215, Jason Crain (jcrain) wrote on 2015-03-20:

#26

Created attachment 114485
Combine base characters and diacritical marks

My attempt to improve this.

When you make a diacriticized character with LaTeX, ü for example, it will make a PDF with separate u and ¨ characters and draw them over each other. This patch detects when this happens and converts it to a combining character sequence so that pdftotext and the search function will see a ü and not separate characters. Also refactors some code (TextWord::ensureCapacity and TextWord::setInitialBounds) to avoid duplication.

Limitations:

It doesn't handle some of LaTeX's diacritic commands, such as \b for bar under letter or \d for dot under letter, because they are positioned differently and \d would be easy to confuse with a period. They don't seem to be used very often though.

If the base character is unusual, such as a math symbol or number, adding a combining character can make the result of pdftotext look a bit odd. I think this is because if the font or rendering engine doesn't know how to draw the character sequence, it will place the diacritic in a strange position, such as to the right of the letter. In these cases, the output of pdftotext is technically correct, it just looks odd when drawn on screen.

When selecting text in evince, you can separately select the character and diacritic. If that's a problem, I think I could fix it by adding clustering support so that a group of glyphs and characters is treated as a single unit. It would make this a much more invasive change, but maybe I should try it anyway. It would be nice to also fix the assumption that one glyph always matches exactly one character.
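
The approach described in this comment can be sketched in Python. This is only an illustration under stated assumptions, not the patch itself: the `Glyph` class, the horizontal-overlap test, and the small `SPACING_TO_COMBINING` table are all hypothetical, and the real detection logic in poppler may differ considerably.

```python
import unicodedata
from dataclasses import dataclass

# Spacing diacritics and their combining counterparts (small illustrative subset).
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # diaeresis
    "\u00b4": "\u0301",  # acute accent
    "\u0060": "\u0300",  # grave accent
    "\u02c6": "\u0302",  # circumflex
    "\u02dc": "\u0303",  # tilde
}


@dataclass
class Glyph:
    """A drawn character with its horizontal extent (hypothetical structure)."""
    char: str
    x_min: float
    x_max: float


def overlaps(a: Glyph, b: Glyph) -> bool:
    """True when the horizontal extents of the two glyphs intersect."""
    return a.x_min < b.x_max and b.x_min < a.x_max


def merge_glyphs(glyphs: list[Glyph]) -> str:
    """Fold a spacing diacritic into the base glyph it is drawn over."""
    out = []
    i = 0
    while i < len(glyphs):
        cur = glyphs[i]
        nxt = glyphs[i + 1] if i + 1 < len(glyphs) else None
        if nxt and cur.char in SPACING_TO_COMBINING and overlaps(cur, nxt):
            # Diacritic drawn first, base character second.
            out.append(unicodedata.normalize(
                "NFC", nxt.char + SPACING_TO_COMBINING[cur.char]))
            i += 2
        elif nxt and nxt.char in SPACING_TO_COMBINING and overlaps(cur, nxt):
            # Base character drawn first, diacritic second.
            out.append(unicodedata.normalize(
                "NFC", cur.char + SPACING_TO_COMBINING[nxt.char]))
            i += 2
        else:
            out.append(cur.char)
            i += 1
    return "".join(out)
```

In this sketch a ¨ glyph drawn over a U yields the single character Ü, while a ¨ that does not overlap its neighbour is left alone. A period is deliberately absent from the table, which mirrors the limitation mentioned above: a dot-under accent would be hard to tell apart from ordinary punctuation.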

In freedesktop.org Bugzilla #87215, Albert Astals Cid (aacid) wrote on 2015-03-27:

#27

I think it looks good as it is.

If no one disagrees I'll commit in a week.

In freedesktop.org Bugzilla #87215, Albert Astals Cid (aacid) wrote on 2015-04-04:

#28

Pushed.

Bug Watch Updater (bug-watch-updater) on 2015-04-08

Changed in poppler:
status:	Confirmed → Fix Released

Sebastien Bacher (seb128) wrote on 2015-04-08:

#29

The fix has been committed upstream; we should get it in the next cycle.

Changed in poppler (Ubuntu):
status:	Triaged → Fix Committed

In freedesktop.org Bugzilla #87215, Nelson Benitez (gnel) wrote on 2015-04-11:

#30

Hi Jason, thank you very much for the patch, btw, today I was reading this pdf:

http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/lecture_notes/GTK_textview.pdf

and noticed that a lot of words with a double f, like 'buffer', are not found[1] when searching for them; also, when copied to gedit, the Unicode "not found" glyph is shown in place of the 'ff' in the word.

So, is your patch covering this double f case?

If so, please ignore this comment, but from a quick reading of this bug I thought this double f case was not handled, as it isn't an accented word or diacritic.

Thank you.

[1] Some 'buffer' words are found, the ones in a code block, but the ones in the normal text are not. Eg. the 5th paragraph of the fourth page, that starts with "Locations within a text buffer are represented..."

In freedesktop.org Bugzilla #87215, Jason Crain (jcrain) wrote on 2015-04-13:

#31

(In reply to Nelson Benitez from comment #13)
> Hi Jason, thank you very much for the patch, btw, today I was reading this
> pdf:
>
> http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/
> lecture_notes/GTK_textview.pdf
>
> and noticed that a lot of words with double f, like 'buffer', are not
> found[1] when searching for it, also when copied to gedit it shows the
> unicode not found glyph in place of the 'ff' in the word.
>
> So, is your patch covering this double f case?

No, it does not fix that. That file has a different problem and I don't see a way of fixing it. The PDF creator would need to add some extra information before we could guess that character code 27 should be a double f.
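
To make the point about missing information concrete, here is a small Python sketch. Everything in it apart from "character code 27 is a double f" is an assumption made up for illustration: the other code assignments, the `decode_codes` helper, and the idea of modelling the font's code-to-Unicode table as a plain dictionary.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # shown by many applications as the "not found" glyph


def decode_codes(codes: list[int], to_unicode: dict[int, str]) -> str:
    """Translate font character codes to text using a code-to-Unicode table."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


# Hypothetical character codes for the word "buffer" set with an ff ligature.
codes = [98, 117, 27, 101, 114]

# Table without an entry for code 27: the ligature cannot be recovered.
incomplete = {98: "b", 117: "u", 101: "e", 114: "r"}

# Table where the PDF creator supplied the extra information for code 27.
complete = {**incomplete, 27: "\ufb00"}  # U+FB00 LATIN SMALL LIGATURE FF

broken_text = decode_codes(codes, incomplete)
# Compatibility normalization expands the ligature into two plain letters.
good_text = unicodedata.normalize("NFKC", decode_codes(codes, complete))
```

Here `broken_text` contains a replacement character where the ligature should be, so a search for 'buffer' cannot match, whereas `good_text` is plain 'buffer'. No amount of normalization on the reader side can recover the ligature from the incomplete table.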

In freedesktop.org Bugzilla #87215, Jason Crain (jcrain) wrote on 2015-04-17:

#32

*** Bug 66569 has been marked as a duplicate of this bug. ***

In freedesktop.org Bugzilla #87215, Nelson Benitez (gnel) wrote on 2015-09-03:

#33

(In reply to Jason Crain from comment #14)
> (In reply to Nelson Benitez from comment #13)
> > Hi Jason, thank you very much for the patch, btw, today I was reading this
> > pdf:
> >
> > http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/
> > lecture_notes/GTK_textview.pdf
> >
> > and noticed that a lot of words with double f, like 'buffer', are not
> > found[1] when searching for it, also when copied to gedit it shows the
> > unicode not found glyph in place of the 'ff' in the word.
> >
> > So, is your patch covering this double f case?
>
> No, it does not fix that. That file has a different problem and I don't see
> a way of fixing it. The PDF creator would need to add some extra
> information before we could guess that character code 27 should be a double
> f.

Thanks Jason for the explanation; indeed it was a problem in the PDF creator. Just for completeness, I'm posting a link describing the problem and solution in pdfTeX:

http://tex.stackexchange.com/questions/31113/enable-searching-in-a-pdflatex-generated-document

Regards,

madbiologist (me-again) on 2021-04-01

Changed in poppler (Ubuntu):
status:	Fix Committed → Fix Released

madbiologist (me-again) wrote on 2021-04-01:

#34

The word Über is now found when searching the attached PDF within Evince on Ubuntu 18.04.5 "Bionic Beaver" using evince 3.28.4-0ubuntu1.2 and poppler 0.62.0-2ubuntu2.12.