Tracker ignores contents of djvu files

Bug #428599 reported by kakaz
This bug affects 4 people
Affects: tracker (Ubuntu)
Status: Confirmed
Importance: Low
Assigned to: Unassigned
Milestone: (none)
Declined for Maverick by Sebastien Bacher

Bug Description

Binary package hint: tracker

Ubuntu 8.10 Intrepid Ibex
Tracker 0.6.6
Gnome Version: 2.24.1

I have a couple of djvu files. Their MIME type is recognised as image/vnd.djvu. By default their content is not indexed by the tracker search system, although there is a filter, text/djvu_filter, in the /usr/lib/tracker/filters directory. Since I know that tracker uses MIME types, I created a directory named image in /usr/lib/tracker/filters and, inside it, a symbolic link named vnd.djvu_filter pointing to text/djvu_filter. I also added image/vnd.djvu to tracker's default.services. This did not help.
Then I tried several other things: I changed the MIME definition for the djvu type from image/vnd.djvu to application/djvu, and some other variants. This did not help either.
Tracker still, for no apparent reason, always uses image/vnd.djvu, even though there is no such definition left on the system (not in /etc/mime, not in the GNOME definitions, and not in the ~/.local/share/mime directory). GNOME sees application/djvu, file -i sees application/djvu, but TRACKER uses image/vnd.djvu.

Please: either make the dependency on MIME types clear, or do not use them at all. Trying to use them while in fact probably hardcoding them has strange effects.

The djvu_filter from the /usr/lib/tracker/filters directory does produce a text file with the correct and full content for every single djvu file I have, but tracker never launches this filter! It always uses tracker-extract with image/vnd.djvu!

Josh Wachuta (modernfoyers) wrote :

This bug still exists for me in Ubuntu 10.04 Lucid Lynx using tracker 0.6.95-1ubuntu6. Tracker does not index text from djvu files out of the box, although djvu_filter is present in /usr/lib/tracker/filters/text. The filter requires the package djvulibre-bin, which is installed on my system.

After much searching, I was able to get tracker to index my djvu files by following the instructions at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=460260#30 . In short, it is necessary to:

1. Create a symbolic link at /usr/lib/tracker/filters/image/vnd.djvu_filter pointing to /usr/lib/tracker/filters/text/djvu_filter
2. Edit the file /usr/share/tracker/services/default.service to add "image/vnd.djvu;" to the "Mimes" line under the [Documents] section.
3. Remove the file at ~/.local/share/tracker/data/common.db
4. Restart tracker and reindex.
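
For anyone who wants to script this, here is a rough sketch of steps 1 to 3 as a shell function. It is not taken from the Debian report. The variable names (FILTERS_DIR, SERVICE_FILE, TRACKER_DB) are made up for illustration, the default paths follow the original bug description and should be overridden if your installation differs, and the script assumes the [Documents] section holds a line beginning with "Mimes=". Step 4 is left out on purpose because the restart and reindex commands depend on the tracker version.

```shell
#!/bin/sh
# Sketch of workaround steps 1-3 for tracker 0.6.x.
# The variable names are illustrative, not something tracker defines.
# Override them if your system uses different paths.
FILTERS_DIR="${FILTERS_DIR:-/usr/lib/tracker/filters}"
SERVICE_FILE="${SERVICE_FILE:-/usr/share/tracker/services/default.service}"
TRACKER_DB="${TRACKER_DB:-$HOME/.local/share/tracker/data/common.db}"

apply_djvu_workaround() {
    # Step 1: expose the text filter under the image/vnd.djvu MIME type.
    mkdir -p "$FILTERS_DIR/image" || return 1
    ln -sf "$FILTERS_DIR/text/djvu_filter" \
        "$FILTERS_DIR/image/vnd.djvu_filter" || return 1

    # Step 2: prepend image/vnd.djvu; to the Mimes line of [Documents] only.
    # Assumes the line starts with "Mimes="; skipped if already listed there.
    awk '
        /^\[/ { in_docs = ($0 == "[Documents]") }
        in_docs && /^Mimes=/ && index($0, "image/vnd.djvu;") == 0 {
            sub(/^Mimes=/, "Mimes=image/vnd.djvu;")
        }
        { print }
    ' "$SERVICE_FILE" > "$SERVICE_FILE.new" || return 1
    # Write back in place so the original file keeps its owner and mode.
    cat "$SERVICE_FILE.new" > "$SERVICE_FILE" || return 1
    rm -f "$SERVICE_FILE.new"

    # Step 3: drop the old index database so the files get indexed again.
    rm -f "$TRACKER_DB"
}
```

Calling apply_djvu_workaround needs root for steps 1 and 2. When running it through sudo, set TRACKER_DB explicitly, since $HOME may then point at root's home directory rather than yours. It is worth trying it on a copy of the two directories first by pointing FILTERS_DIR and SERVICE_FILE at the copy.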

DJVU is a good open alternative to PDF for scanned documents, so it would be nice if tracker could recognize text in djvu files by default. Without the above procedure, the djvu_filter shipped with tracker does absolutely nothing.

Atanas Atanasov (thenasko) wrote :

Thanks for the instructions you gave above. I recently upgraded to Maverick, and to my surprise djvu_filter is not present in any package:

nasko@serre:~$ dpkg -S djvu_filter
dpkg: *djvu_filter* not found.
nasko@serre:~$ dpkg -l | grep djvulibre-bin
ii djvulibre-bin 3.5.22-1ubuntu4 Utilities for the DjVu image format

Thinking about it, I am not convinced djvulibre-bin is the right package to contain the tracker filter. Maybe the best solution would be to create a new package, e.g. tracker-extra, and add all "untraditional" filters to it. I nominated this bug for Maverick.

Changed in tracker (Ubuntu):
status: New → Confirmed
Changed in tracker (Ubuntu):
importance: Undecided → Low
sybille (sybillel) wrote :

I think a DJVU filter for tracker would be an important addition. The DJVU format is becoming popular amongst people doing DIY book scanning, thanks to these excellent tools:

1) Scan Tailor, for image post-processing
http://scantailor.sourceforge.net/

2) djvubind, a script for Linux that creates a DJVU file with searchable, positioned OCR text from Scan Tailor's output
http://code.google.com/p/djvubind/
deb here: http://code.google.com/p/djvubind/downloads/list

It seems that eventually the goal is to incorporate the functions of djvubind into the Scan Tailor GUI. The tools can be used on any images of pages, whether generated by a flatbed scanner (which I use) or one of the homebrew camera-based scanners developed here:
http://www.diybookscanner.org/

All of that is to say that I think DJVU may become somewhat less obscure - the Internet Archive uses the format, too.

The following thread from the Tracker mailing list states that what is needed is an extractor module for Tracker (as opposed to a filter - filters are no longer used in tracker >= 0.7.x):
http://mail.gnome.org/archives/tracker-list/2010-August/msg00022.html

Here's the howto for writing an extractor module:
http://library.gnome.org/devel/libtracker-extract/unstable/libtracker-extract-How-to-use-libtracker-extract.html

I think the command needed for the extractor module is djvutxt, which is part of the djvulibre-bin package.
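
To show what I mean, here is a minimal sketch of pulling the text layer out of a file with djvutxt. This is only a shell wrapper, not an extractor module, and the names extract_djvu_text and DJVUTXT are made up for illustration. It relies on djvutxt accepting an input file followed by an optional output file.

```shell
#!/bin/sh
# DJVUTXT is a hypothetical override hook; by default the djvutxt
# binary from djvulibre-bin is looked up on PATH.
DJVUTXT="${DJVUTXT:-djvutxt}"

# Write the hidden text layer of the DjVu file $1 to the text file $2.
# Returns 127 if djvutxt is missing, otherwise djvutxt's own exit status.
extract_djvu_text() {
    input=$1
    output=$2
    if ! command -v "$DJVUTXT" >/dev/null 2>&1; then
        echo "djvutxt not found; is djvulibre-bin installed?" >&2
        return 127
    fi
    "$DJVUTXT" "$input" "$output"
}
```

For example, extract_djvu_text book.djvu book.txt should leave the OCR text in book.txt, which an extractor module would then have to hand over to tracker instead of writing it to disk.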

This is as far as I've gotten with things so far. Maybe I will be able to figure out how to make an extractor module for text and metadata or maybe this information can help someone else with more skills.
