Ubuntu
libextractor package

PDF metadata extraction is broken

Bug #47651 reported by Ryszard Szopa on 2006-05-31

This bug affects 1 person

Affects                  Status     Importance   Assigned to   Milestone
libextractor (Ubuntu)    Expired    Medium       Unassigned

Bug Description

The new version doesn't support PDF metadata as it should. The former version (to be exact, *yesterday's* version ;) would nicely extract the author, subject, keywords, and so on. Now it can only do this:

bies@quine:~/Opole$ extract Dziobak.pdf
software - This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) kpathsea version 3.5.4
creation date - 20060528155926+02'00'
format - PDF 1,0
mimetype - application/pdf

At first I thought this was a feature ;), but I also checked with the BibTeX option:
bies@quine:~/Opole$ extract -b Dziobak.pdf
% BiBTeX file
@misc{ thisdziob,
    title = "Dziobak.pdf",
    year = "This",
    month = " i"
}

That means that the program now fails to recognize any data apart from the creation date and format...

The file itself isn't broken, because when I use pdftk everything seems to work fine:

bies@quine:~/Opole$ pdftk Dziobak.pdf dump_data
InfoKey: Creator
InfoValue: LaTeX with hyperref package
InfoKey: Title
InfoValue: Meaning, Hintikka's thesis, and computational complexity
InfoKey: Producer
InfoValue: pdfeTeX-1.21a
InfoKey: Author
InfoValue: Ryszard Szopa
InfoKey: Keywords
InfoValue: theory of meaning, P vs. NP, Hintikka's thesis, Edmonds' thesis
InfoKey: PTEX.Fullbanner
InfoValue: This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) kpathsea version 3.5.4
InfoKey: Subject
InfoValue: I discuss some natural language constructions that turn to be very difficult from the computational point of view and consider what this means for the theory of meaning.
InfoKey: CreationDate
InfoValue: D:20060528155926+02'00'
PdfID0: b159ce5e5e9d4ea86e562db5ccbefca0
PdfID1: b159ce5e5e9d4ea86e562db5ccbefca0
NumberOfPages: 8

I suspect that extract has a rigid idea of what data should be in what place, and when it finds something different, it breaks. For example, it seems to find PTEX.Fullbanner in the place where it expects the year and month...
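
The BibTeX output above is consistent with that guess: year = "This" and month = " i" are exactly the first four and the next two characters of the PTEX.Fullbanner string. A minimal Python sketch of the suspected behaviour follows. The function name naive_year_month is hypothetical, and this only mimics the symptom; it is not libextractor's actual code.

```python
def naive_year_month(value):
    """Slice a string at fixed offsets, as if it always started with YYYYMM.

    Hypothetical helper: it reproduces the symptom from the report,
    not the real libextractor implementation.
    """
    return value[0:4], value[4:6]


# Strings copied from the report above.
banner = "This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) kpathsea version 3.5.4"
creation = "20060528155926+02'00'"

print(naive_year_month(banner))    # ('This', ' i')
print(naive_year_month(creation))  # ('2006', '05')
```

If the slicing were applied to the creation date instead of the banner, the year and month would come out as expected, which suggests the wrong field is being picked up rather than the date itself being malformed.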

Jignesh Borad (jigneshborad) wrote on 2008-11-13:

I think I am able to get the metadata properly.

>>extract output.pdf
modification date - D:20080714110537
creation date - 20080714110537
title - output.pdf
format - PDF 1.3
mimetype - application/pdf

>> extract -b output.pdf
% BiBTeX file
@misc{ d_20outpu,
    title = "output.pdf",
    year = "D:20",
    month = "08"
}

I am using the following versions to test this:
Hardy 8.04.1
extract 0.5.18a-2

If this problem is still there, please provide the actual PDF, the output of all commands, and the versions used.
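
Note that year = "D:20" in the BibTeX output above still looks wrong: it is what fixed-offset slicing would produce from a date string that carries the "D:" prefix PDF uses for dates (D:YYYYMMDDHHmmSS, optionally followed by a timezone). A minimal Python sketch of a more tolerant parse, assuming the prefix is optional and only year and month are needed; pdf_year_month is a hypothetical name, not part of libextractor:

```python
import re

# Optional "D:" prefix, then a 4-digit year and an optional 2-digit month.
_PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?")


def pdf_year_month(value):
    """Return (year, month) from a PDF-style date string.

    Returns None when the string does not start like a date;
    month is None when the string stops after the year.
    """
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(pdf_year_month("D:20080714110537"))       # ('2008', '07')
print(pdf_year_month("20060528155926+02'00'"))  # ('2006', '05')
print(pdf_year_month("This is pdfeTeX"))        # None
```

Rejecting strings that do not start like a date would also avoid emitting a year such as "This" when a non-date field ends up in the date slot.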

Changed in libextractor:
status:	New → Incomplete

nglnx (nglnx) wrote on 2009-04-02:

We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information, and don't hesitate to submit bug reports in the future. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New". Thanks again!

Changed in libextractor (Ubuntu):
status:	Incomplete → Invalid

mark (mark-carpaij) wrote on 2010-01-18:

I would like to reopen this bug report, since I seem to have the same problem.

I followed the instructions given on http://freshmeat.net/projects/libextractor/ and downloaded an example PDF with

wget -q http://www.copyright.gov/legislation/dmca.pdf

Next I ran the extract command on this PDF (as described online) and got:

format - PDF 1.4
mimetype - application/pdf

Unfortunately, nothing more. When I check the available metadata in the document viewer, it is clear that the metadata is filled out correctly. Any ideas?

Changed in libextractor (Ubuntu):
status:	Invalid → New

mark (mark-carpaij) wrote on 2010-01-18:

I followed the instructions published on this website, not the one mentioned before:

http://www.gnu.org/software/libextractor/documentation.html

Leo (leorolla) wrote on 2010-06-01:

If this problem is still there, please provide the actual PDF, the output of all commands, and the versions used.

Changed in libextractor (Ubuntu):
status:	New → Incomplete

mark (mark-carpaij) wrote on 2010-06-01:

dmca.pdf (71.1 KiB, application/pdf)

The version of the extract tool: 0.5.21
PDF file attached.

mark (mark-carpaij) wrote on 2010-06-01:

Ubuntu 9.10

mark (mark-carpaij) wrote on 2010-06-01:

$ extract dmca.pdf
format - PDF 1.4
mimetype - application/pdf

Leo (leorolla) wrote on 2010-06-02:

And the expected output would contain Subject, Author, Producer, etc.?

Could you install it from the repositories and test again?
Just undo what you did to install it and install extractor.

rusivi2 (rusivi2-deactivatedaccount) wrote on 2010-09-15:

#10

Changed in libextractor (Ubuntu):
status:	Incomplete → Invalid

rusivi2 (rusivi2-deactivatedaccount) wrote on 2010-09-19:

#11

Thank you for taking the time to report this bug and helping to make Ubuntu better. My apologies as I should not have marked this Invalid. The issue that you reported is one that should be reproducible with the live environment of the Desktop CD of the development release - Maverick Meerkat. It would help us greatly if you could test with it so we can work on getting it fixed in the next release of Ubuntu. You can find out more about the development release at http://www.ubuntu.com/testing/ . Thanks again and we appreciate your help.

Changed in libextractor (Ubuntu):
status:	Invalid → Incomplete

Launchpad Janitor (janitor) wrote on 2010-12-23:

#12

[Expired for libextractor (Ubuntu) because there has been no activity for 60 days.]

Changed in libextractor (Ubuntu):
status:	Incomplete → Expired
