Beagle uses wrong mime-type

Bug #365426 reported by Williams Christ
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Beagle  | New    | Undecided  | Unassigned  |

Bug Description

Beagle is not able to index or find any of the major office file types, such as MS Word, Excel, or OpenOffice documents.
Here is the output of the beagle-extract-content program:

--start --
~/Dokumente/search$ beagle-extract-content ./test_search.doc
Filename: file:///home/tv/Dokumente/search/test_search.doc
Warn: bibparse is not found; bibtex files will not be indexed
Debug: Loaded 64 filters from /usr/lib/beagle/Filters/Filters.dll
Debug: Verifying filter_cache at /home/tv/.beagle/filterver.dat ... cache is dirty ? False
Debug: No filter for file:///home/tv/Dokumente/search/test_search.doc (/home/tv/Dokumente/search/test_search.doc) [application/octet-stream]
Filter: (determined in ,29s)
MimeType: application/octet-stream
-- end --

The MIME type reported for all office file types is always application/octet-stream.

With xdg-mime everything is recognized well:

~/Dokumente$ xdg-mime query filetype test_search.doc
application/msword

Thank you for any help
Thomas

Andrej Kazakov (andrejserafim) wrote :

I have a very similar problem, only with PDF files.

When I run:
$ beagle-extract-content mockus.pdf
Filename: file:///tmp/mockus.pdf
Debug: Loaded 64 filters from /usr/lib/beagle/Filters/Filters.dll
Debug: Verifying filter_cache at /home/serafim/.beagle/filterver.dat ... cache is dirty ? False
Debug: No filter for file:///tmp/mockus.pdf (/tmp/mockus.pdf) [application/octet-stream]
Filter: (determined in ,36s)
MimeType: application/octet-stream

Properties:
  Timestamp = 2009-05-19 11:51:49 (Utc)

On the other hand, when I run:
$ beagle-extract-content --mimetype=application/pdf mockus.pdf
Filename: file:///tmp/mockus.pdf
Debug: Loaded 64 filters from /usr/lib/beagle/Filters/Filters.dll
Debug: Verifying filter_cache at /home/s/.beagle/filterver.dat ... cache is dirty ? False
Filter: Beagle.Filters.FilterPdf (determined in ,40s)
MimeType: application/pdf

Properties:
  Timestamp = 2009-05-19 11:51:49 (Utc)
  beagle:FileType = document
  dc:appname = Acrobat Distiller 3.0 for Power Macintosh
  fixme:page-count = 38

Content:

........................

Text extracted in 63,79s

Thanks,
Andrej

Andrej Kazakov (andrejserafim) wrote :

Forgot to say:

$ xdg-mime query filetype mockus.pdf
application/pdf

Andrej

Williams Christ (wchrist) wrote :

Are the two of us the only people who have problems with Beagle indexing? Is anyone else using Beagle? If it works for you, how did you set it up?

-- Thomas

Stephan Ritscher (stephan-ritscher) wrote :

Hi guys,
I think I had the same problem. After some debugging I found out that it was a problem with the mime cache. After I removed ~/.local/share/mime/mime.cache at least beagle-extract-content worked properly on my pdfs. Now I'm gonna rebuild the index...
Btw, also re-installing the packages shared-mime-info and xdg-utils could help (at least these are the names in Gentoo).
Hope that helps.

Cheers, Stephan
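
The workaround Stephan describes can be sketched in Python. This is only an illustration under stated assumptions: the helper name `move_mime_cache_aside` and the `.bak` suffix are made up for this sketch, and it renames the per-user cache instead of deleting it so the file can be restored. The default path is the one named in the comment above; whether moving the cache fixes detection on a particular system is not guaranteed.

```python
from pathlib import Path
from typing import Optional


def move_mime_cache_aside(cache: Optional[Path] = None) -> Optional[Path]:
    """Rename the per-user MIME cache to '<name>.bak'.

    Returns the backup path, or None if no cache file exists.
    The '.bak' suffix is an arbitrary choice for this sketch.
    """
    if cache is None:
        # Path mentioned in the comment above.
        cache = Path.home() / ".local" / "share" / "mime" / "mime.cache"
    if not cache.is_file():
        return None
    backup = cache.with_name(cache.name + ".bak")
    cache.replace(backup)  # overwrites an older backup, if one exists
    return backup
```

Calling `move_mime_cache_aside()` with no argument acts on the default path. Afterwards, re-running beagle-extract-content on a test file shows whether the reported MimeType has changed.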

Williams Christ (wchrist) wrote :

Hi Stephan,
Indexing PDFs was never a problem for me. But what about office files like Word, Excel, PowerPoint, or OpenOffice files? Can you index and find them?
What is your output from:
beagle-extract-content ./test_search.doc (or some other test document) ?

-- Thomas

Stephan Ritscher (stephan-ritscher) wrote :

Hi Thomas,

Openoffice files work flawlessly for me. Since I don't use Microsoft Office, I compiled beagle without the support for those documents (running Gentoo ;-)).

Running it on a test document (Office 2003, I guess) gives something like this:

Filename: file:///home/user/test_search.doc
Warn: bibparse is not found; bibtex files will not be indexed
Debug: Loaded 61 filters from /usr/lib64/beagle/Filters/Filters.dll
Debug: Verifying filter_cache at /home/user/.beagle/filterver.dat ... cache is dirty ? False
Debug: No filter for file:///home/user/test_search.doc (/home/user/test_search.doc) [application/msword]
Filter: (determined in .15s)
MimeType: application/msword

Properties:
  Timestamp = 2009-08-26 21:36:20 (Utc)

So the mime type is correct.
Good luck anyways.

Stephan

Williams Christ (wchrist) wrote :

Hi Stephan,
but your beagle uses the right mime-Type for MS Word files: application/msword
My beagle determines: application/octet-stream

And that is the point. I use the Beagle packages from the Ubuntu repositories, and they produce the wrong MIME type.

I will try to compile Beagle from source and see what I can find out.

Thank you.

-- Thomas

John Baptist (jepst79) wrote :

My problem is related to this. I have several old, archived mbox files stored in zip files. Beagle correctly recognizes the zip files and correctly identifies the mbox files inside them, but refuses to index the contents of the mbox files. It says:

Debug: No filter for file:///home/foobar/Documents/test.zip#saved-messages (/tmp/tmp13cc8770) [application/mbox]

I think all of our problems could be solved if Beagle let the user set the mimetype for a particular file. I would be happy if I could persuade Beagle to index my mbox files as plain text.
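
For single files, something close to this is already possible with the `--mimetype` option of beagle-extract-content shown in Andrej's transcript. Below is a minimal Python sketch that takes the type from `xdg-mime query filetype` and passes it on. The function names `detect_mime` and `extract_command` are hypothetical, and the sketch only covers beagle-extract-content; it does not change what the Beagle indexing daemon does.

```python
import subprocess
from typing import List


def detect_mime(path: str) -> str:
    """Ask xdg-mime for the file type, as in the transcripts above."""
    result = subprocess.run(
        ["xdg-mime", "query", "filetype", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def extract_command(path: str, mime: str) -> List[str]:
    """Build a beagle-extract-content call with the MIME type forced."""
    return ["beagle-extract-content", f"--mimetype={mime}", path]
```

A call such as `subprocess.run(extract_command(p, detect_mime(p)))` would then run the extraction with the type that xdg-mime reports. For the mbox case one could try passing `text/plain` as the type, but whether Beagle's filters then handle the file is untested here.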
