Beagle

Beagle uses wrong mime-type

Bug #365426 reported by Williams Christ on 2009-04-23

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Beagle	New	Undecided	Unassigned

Bug Description

Beagle is not able to index or find any of the major office file types, such as MS Word, Excel, or OpenOffice documents.
Here is the output of the beagle-extract-content program:

--start --
~/Dokumente/search$ beagle-extract-content ./test_search.doc
Filename: file:///home/tv/Dokumente/search/test_search.doc
Warn: bibparse is not found; bibtex files will not be indexed
Debug: Loaded 64 filters from /usr/lib/beagle/Filters/Filters.dll
Debug: Verifying filter_cache at /home/tv/.beagle/filterver.dat ... cache is dirty ? False
Debug: No filter for file:///home/tv/Dokumente/search/test_search.doc (/home/tv/Dokumente/search/test_search.doc) [application/octet-stream]
Filter: (determined in ,29s)
MimeType: application/octet-stream
-- end --

The MIME type reported for all office file types is always application/octet-stream.

With xdg-mime, the type is recognized correctly:

~/Dokumente$ xdg-mime query filetype test_search.doc
application/msword

Thank you for any help
Thomas
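
The mismatch between the two tools can be checked file by file. The following is only a minimal sketch, not part of Beagle: `beagle_mimetype` and `compare_mimetypes` are hypothetical helper names, and the parsing assumes the `MimeType:` line looks exactly as in the output quoted above. stderr is merged into stdout because it is not known which stream the tool writes that line to.

```shell
#!/bin/sh
# Print the MIME type found in beagle-extract-content output read from stdin.
# Assumes a line of the form "MimeType: <type>", as in the report above.
beagle_mimetype() {
    sed -n 's/^MimeType:[[:space:]]*//p' | head -n 1
}

# Compare the type Beagle reports with the one xdg-mime reports for one file.
# Prints one line per file and returns non-zero when the two disagree.
compare_mimetypes() {
    file=$1
    beagle_type=$(beagle-extract-content "$file" 2>&1 | beagle_mimetype)
    xdg_type=$(xdg-mime query filetype "$file")
    if [ "$beagle_type" = "$xdg_type" ]; then
        echo "OK       $file: $xdg_type"
    else
        echo "MISMATCH $file: beagle=$beagle_type xdg-mime=$xdg_type"
        return 1
    fi
}

# Example use (requires both tools to be installed):
#   for f in ~/Dokumente/search/*; do compare_mimetypes "$f"; done
```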

Andrej Kazakov (andrejserafim) wrote on 2009-05-21:

I have a very similar problem, only with PDF files.

When I run:
$ beagle-extract-content mockus.pdf
Filename: file:///tmp/mockus.pdf
Debug: Loaded 64 filters from /usr/lib/beagle/Filters/Filters.dll
Debug: Verifying filter_cache at /home/serafim/.beagle/filterver.dat ... cache is dirty ? False
Debug: No filter for file:///tmp/mockus.pdf (/tmp/mockus.pdf) [application/octet-stream]
Filter: (determined in ,36s)
MimeType: application/octet-stream

Properties:
Timestamp = 2009-05-19 11:51:49 (Utc)

On the other hand, when I run:
$ beagle-extract-content --mimetype=application/pdf mockus.pdf
Filename: file:///tmp/mockus.pdf
Debug: Loaded 64 filters from /usr/lib/beagle/Filters/Filters.dll
Debug: Verifying filter_cache at /home/s/.beagle/filterver.dat ... cache is dirty ? False
Filter: Beagle.Filters.FilterPdf (determined in ,40s)
MimeType: application/pdf

Properties:
  Timestamp = 2009-05-19 11:51:49 (Utc)
  beagle:FileType = document
  dc:appname = Acrobat Distiller 3.0 for Power Macintosh
  fixme:page-count = 38

Content:

........................

Text extracted in 63,79s

Thanks,
Andrej

Andrej Kazakov (andrejserafim) wrote on 2009-05-21:

Forgot to say:

$ xdg-mime query filetype mockus.pdf
application/pdf

Andrej
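
The runs above suggest a stopgap for inspecting single files: let xdg-mime detect the type and hand it to beagle-extract-content through the `--mimetype=` option shown above. This is a sketch under assumptions, and it only affects that one command, not the Beagle indexer itself. `extract_with_xdg_type` is a hypothetical helper name, and `XDG_MIME` and `BEAGLE_EXTRACT` are hypothetical override variables introduced here so the commands can be swapped out; neither tool defines them.

```shell
#!/bin/sh
# Run beagle-extract-content with the MIME type that xdg-mime detects.
# XDG_MIME and BEAGLE_EXTRACT are illustrative overrides, not real settings
# of either tool; they default to the commands used in this report.
extract_with_xdg_type() {
    file=$1
    mimetype=$("${XDG_MIME:-xdg-mime}" query filetype "$file") || return 1
    if [ -z "$mimetype" ]; then
        echo "could not detect a MIME type for $file" >&2
        return 1
    fi
    "${BEAGLE_EXTRACT:-beagle-extract-content}" "--mimetype=$mimetype" "$file"
}

# Example use (requires both tools to be installed):
#   extract_with_xdg_type mockus.pdf
```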

Williams Christ (wchrist) wrote on 2009-07-27:

Are the two of us the only people who have problems with Beagle indexing? Is anyone else using Beagle? If it works without problems for you, how did you manage that?

-- Thomas

Stephan Ritscher (stephan-ritscher) wrote on 2009-08-25:

Hi guys,
I think I had the same problem. After some debugging I found out that it was a problem with the mime cache. After I removed ~/.local/share/mime/mime.cache at least beagle-extract-content worked properly on my pdfs. Now I'm gonna rebuild the index...
Btw, also re-installing the packages shared-mime-info and xdg-utils could help (at least these are the names in Gentoo).
Hope that helps.

Cheers, Stephan
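
The cache removal described above can be done a little more cautiously by moving the file aside instead of deleting it. This is a sketch, not the exact steps Stephan took: `reset_user_mime_cache` is a hypothetical helper name, keeping a `.bak` copy is an addition, and the rebuild step assumes that `update-mime-database` from shared-mime-info is installed (it is skipped otherwise).

```shell
#!/bin/sh
# Move the per-user MIME cache aside and try to rebuild it.
# The optional argument is the MIME directory; it defaults to the
# location named in the comment above.
reset_user_mime_cache() {
    mime_dir=${1:-$HOME/.local/share/mime}
    cache="$mime_dir/mime.cache"
    if [ ! -f "$cache" ]; then
        echo "no cache at $cache, nothing to do"
        return 0
    fi
    # Keep a backup rather than deleting, so the change can be undone.
    mv "$cache" "$cache.bak" || return 1
    echo "moved $cache to $cache.bak"
    # Rebuild only if the tool is available; a failed rebuild is reported
    # but not treated as fatal, since the stale cache is already gone.
    if command -v update-mime-database >/dev/null 2>&1; then
        update-mime-database "$mime_dir" || echo "rebuild failed for $mime_dir" >&2
    fi
    return 0
}

# Example use:
#   reset_user_mime_cache
```

After this, beagle-extract-content can be run again on a test file to see whether the reported MimeType has changed.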

Williams Christ (wchrist) wrote on 2009-08-25:

Hi Stephan,
indexing PDFs was never a problem for me. But what about office files like Word, Excel, PowerPoint, or OpenOffice files? Can you index and find them?
What is your output from:
beagle-extract-content ./test_search.doc (or some other test document)?

-- Thomas

Stephan Ritscher (stephan-ritscher) wrote on 2009-08-26:

Hi Thomas,

Openoffice files work flawlessly for me. Since I don't use Microsoft Office, I compiled beagle without the support for those documents (running Gentoo ;-)).

Running it on a test document (Office 2003, I guess) looks something like this:

Filename: file:///home/user/test_search.doc
Warn: bibparse is not found; bibtex files will not be indexed
Debug: Loaded 61 filters from /usr/lib64/beagle/Filters/Filters.dll
Debug: Verifying filter_cache at /home/user/.beagle/filterver.dat ... cache is dirty ? False
Debug: No filter for file:///home/user/test_search.doc (/home/user/test_search.doc) [application/msword]
Filter: (determined in .15s)
MimeType: application/msword

Properties:
Timestamp = 2009-08-26 21:36:20 (Utc)

So the mime type is correct.
Good luck anyways.

Stephan

Williams Christ (wchrist) wrote on 2009-08-26:

Hi Stephan,
but your Beagle uses the right MIME type for MS Word files: application/msword
My Beagle determines: application/octet-stream

And that is the point. I use the Beagle packages from the Ubuntu repositories. They produce the wrong MIME type.

I will try to compile Beagle from source and see what I can find out.

Thank you.

-- Thomas

John Baptist (jepst79) wrote on 2010-02-07:

My problem is related to this. I have several old, archived mbox files stored in zip files. Beagle correctly recognizes the zip files and correctly identifies the mbox files inside them, but refuses to index the contents of the mbox files. It says:

Debug: No filter for file:///home/foobar/Documents/test.zip#saved-messages (/tmp/tmp13cc8770) [application/mbox]

I think all of our problems could be solved if Beagle let the user set the mimetype for a particular file. I would be happy if I could persuade Beagle to index my mbox files as plain text.
