PDF metadata extraction is broken

Bug #47651 reported by Ryszard Szopa
This bug affects 1 person
Affects: libextractor (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

The new version doesn't support PDF metadata as it should. The former version (to be exact, *yesterday's* version ;) would nicely extract the author, subject, keywords, and so on. Now it can only do this:

bies@quine:~/Opole$ extract Dziobak.pdf
software - This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) kpathsea version 3.5.4
creation date - 20060528155926+02'00'
format - PDF 1,0
mimetype - application/pdf

At first I thought this was a feature ;), but I also checked with the BibTeX option:
bies@quine:~/Opole$ extract -b Dziobak.pdf
% BiBTeX file
@misc{ thisdziob,
    title = "Dziobak.pdf",
    year = "This",
    month = " i"
}

That means that the program now fails to recognize any data apart from the creation date and format...

The file itself isn't broken, because when I use pdftk everything seems to work fine:

bies@quine:~/Opole$ pdftk Dziobak.pdf dump_data
InfoKey: Creator
InfoValue: LaTeX with hyperref package
InfoKey: Title
InfoValue: Meaning, Hintikka's thesis, and computational complexity
InfoKey: Producer
InfoValue: pdfeTeX-1.21a
InfoKey: Author
InfoValue: Ryszard Szopa
InfoKey: Keywords
InfoValue: theory of meaning, P vs. NP, Hintikka's thesis, Edmonds' thesis
InfoKey: PTEX.Fullbanner
InfoValue: This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) kpathsea version 3.5.4
InfoKey: Subject
InfoValue: I discuss some natural language constructions that turn to be very difficult from the computational point of view and consider what this means for the theory of meaning.
InfoKey: CreationDate
InfoValue: D:20060528155926+02'00'
PdfID0: b159ce5e5e9d4ea86e562db5ccbefca0
PdfID1: b159ce5e5e9d4ea86e562db5ccbefca0
NumberOfPages: 8

I suspect that extract has a rigid idea of which data should be in which place, and when it finds something different, it breaks. For example, it seems to find PTEX.Fullbanner in the place where it expects the year and month...
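
The BibTeX output is consistent with this guess: "This" and " i" are exactly the first four and the next two characters of the PTEX.Fullbanner value. Below is a minimal Python sketch of that kind of fixed-offset slicing. It is only an assumption about the failure mode, not libextractor's actual code, and the name naive_year_month is hypothetical.

```python
def naive_year_month(value):
    """Slice a string at fixed offsets, as if it always started with YYYYMM.

    Hypothetical helper: it illustrates the suspected behaviour only.
    """
    return value[0:4], value[4:6]


# Values copied from the extract and pdftk output in this report.
banner = ("This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) "
          "kpathsea version 3.5.4")
creation_date = "20060528155926+02'00'"

print(naive_year_month(banner))         # ('This', ' i')
print(naive_year_month(creation_date))  # ('2006', '05')
```

If the slicing were applied to the creation date instead of the banner, the year and month would come out as expected, which suggests the wrong field is being picked rather than the date itself being malformed.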

Jignesh Borad (jigneshborad) wrote :

I think I am able to get the metadata properly.

>>extract output.pdf
modification date - D:20080714110537
creation date - 20080714110537
title - output.pdf
format - PDF 1.3
mimetype - application/pdf

>> extract -b output.pdf
% BiBTeX file
@misc{ d_20outpu,
    title = "output.pdf",
    year = "D:20",
    month = "08"
}

I am using the following versions to test this:
Hardy 8.04.1
extract 0.5.18a-2

If this problem is still there, please provide the actual PDF, the output, and the version for all commands.
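
Note that the year = "D:20" and month = "08" values in the BibTeX output above match the first six characters of the modification date string, "D:" prefix included, so the date does not appear to be parsed there either. The following is a minimal Python sketch of a more tolerant reading of a PDF date string, assuming the usual D:YYYYMMDD... layout with an optional "D:" prefix. The name pdf_year_month is hypothetical, and this is not how libextractor is implemented.

```python
import re

# Optional "D:" prefix, a four-digit year, then an optional two-digit month.
PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?")


def pdf_year_month(value):
    """Return (year, month) from a PDF date string, or None if it is not one.

    Hypothetical helper; a missing month falls back to "01".
    """
    match = PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month = match.group(1), match.group(2)
    return year, month or "01"


print(pdf_year_month("D:20080714110537"))       # ('2008', '07')
print(pdf_year_month("20060528155926+02'00'"))  # ('2006', '05')
print(pdf_year_month("This is pdfeTeX"))        # None
```

Returning None for a non-date string, instead of slicing it anyway, would let the caller skip the year and month fields rather than emit values such as "This" or "D:20".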

Changed in libextractor:
status: New → Incomplete
nglnx (nglnx) wrote :

We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information, and don't hesitate to submit bug reports in the future. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New". Thanks again!

Changed in libextractor (Ubuntu):
status: Incomplete → Invalid
mark (mark-carpaij) wrote :

I would like to reopen this bug report, since I seem to have the same problem.

I followed the instructions given on http://freshmeat.net/projects/libextractor/ and downloaded an example PDF by

wget -q http://www.copyright.gov/legislation/dmca.pdf

Next I ran the extract command on this PDF (as described online), and got

format - PDF 1.4
mimetype - application/pdf

Unfortunately, nothing more. When I check the available metadata in the document viewer, it becomes clear that the metadata is filled out correctly. Any idea?

Changed in libextractor (Ubuntu):
status: Invalid → New
mark (mark-carpaij) wrote :

Correction: I followed the instructions published on this website, not the one mentioned before:

http://www.gnu.org/software/libextractor/documentation.html

Leo (leorolla) wrote :

If this problem is still there, please provide the actual PDF, the output, and the version for all commands.

Changed in libextractor (Ubuntu):
status: New → Incomplete
mark (mark-carpaij) wrote :

The version of the extract tool: 0.5.21
PDF file attached.

mark (mark-carpaij) wrote :

Ubuntu 9.10

mark (mark-carpaij) wrote :

$ extract dmca.pdf
format - PDF 1.4
mimetype - application/pdf

Leo (leorolla) wrote :

And the expected output would contain Subject, Author, Producer, etc.?

Could you install it from the repositories and test again?
Just undo what you did to install it and install extractor.

rusivi2 (rusivi2-deactivatedaccount) wrote :

We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information, and don't hesitate to submit bug reports in the future. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New". Thanks again!

Changed in libextractor (Ubuntu):
status: Incomplete → Invalid
rusivi2 (rusivi2-deactivatedaccount) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. My apologies as I should not have marked this Invalid. The issue that you reported is one that should be reproducible with the live environment of the Desktop CD of the development release - Maverick Meerkat. It would help us greatly if you could test with it so we can work on getting it fixed in the next release of Ubuntu. You can find out more about the development release at http://www.ubuntu.com/testing/ . Thanks again and we appreciate your help.

Changed in libextractor (Ubuntu):
status: Invalid → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for libextractor (Ubuntu) because there has been no activity for 60 days.]

Changed in libextractor (Ubuntu):
status: Incomplete → Expired