PDF metadata extraction is broken
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libextractor (Ubuntu) | Expired | Medium | Unassigned |
Bug Description
The new version doesn't extract PDF metadata as it should. The previous version (to be exact, *yesterday's* version ;) would nicely extract the author, subject, keywords, and so on. Now it can only do this:
bies@quine:~/Opole$ extract Dziobak.pdf
software - This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) kpathsea version 3.5.4
creation date - 20060528155926+
format - PDF 1,0
mimetype - application/pdf
At first I thought this was a feature ;), but I also checked with the BibTeX option:
bies@quine:~/Opole$ extract -b Dziobak.pdf
% BiBTeX file
@misc{ thisdziob,
title = "Dziobak.pdf",
year = "This",
month = " i"
}
That means that the program now fails to recognize any data apart from the creation date and format...
The file itself isn't broken, because when I use pdftk everything seems to work fine:
bies@quine:~/Opole$ pdftk Dziobak.pdf dump_data
InfoKey: Creator
InfoValue: LaTeX with hyperref package
InfoKey: Title
InfoValue: Meaning, Hintikka's thesis, and computational complexity
InfoKey: Producer
InfoValue: pdfeTeX-1.21a
InfoKey: Author
InfoValue: Ryszard Szopa
InfoKey: Keywords
InfoValue: theory of meaning, P vs. NP, Hintikka's thesis, Edmonds' thesis
InfoKey: PTEX.Fullbanner
InfoValue: This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) kpathsea version 3.5.4
InfoKey: Subject
InfoValue: I discuss some natural language constructions that turn to be very difficult from the computational point of view and consider what this means for the theory of meaning.
InfoKey: CreationDate
InfoValue: D:2006052815592
PdfID0: b159ce5e5e9d4ea
PdfID1: b159ce5e5e9d4ea
NumberOfPages: 8
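For anyone who wants to compare the two tools field by field, here is a minimal Python sketch that collects the InfoKey/InfoValue pairs from a dump like the one above into a dictionary. The function name `parse_dump_data` is made up for illustration, and the sketch assumes the dump has exactly the layout shown here, with each InfoKey line directly followed by its InfoValue line.

```python
def parse_dump_data(text):
    """Pair each InfoKey line with the InfoValue line that follows it.

    Hypothetical helper: assumes the 'Name: value' layout shown in the
    pdftk dump_data output above.
    """
    info = {}
    key = None
    for line in text.splitlines():
        # Split on the first ': ' only, so values may contain colons.
        name, sep, value = line.partition(": ")
        if not sep:
            continue
        if name == "InfoKey":
            key = value
        elif name == "InfoValue" and key is not None:
            info[key] = value
            key = None
        # Other lines (PdfID0, NumberOfPages, ...) are ignored here.
    return info


sample = """InfoKey: Title
InfoValue: Meaning, Hintikka's thesis, and computational complexity
InfoKey: Author
InfoValue: Ryszard Szopa
NumberOfPages: 8
"""

info = parse_dump_data(sample)
print(info["Author"])
```

The resulting dictionary can then be checked against the keywords that extract prints (author, title, subject, and so on) to see which fields go missing.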
I suspect that extract has a rigid idea of which data should be in which place, and when it finds something different, it breaks. For example, it seems to find PTEX.Fullbanner in the place where it expects the year and month...
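To make that guess concrete: if the BibTeX printer simply takes the first four characters of some field as the year and the next two as the month, the banner string alone is enough to reproduce the odd values. This is only a sketch of the suspected behaviour, not libextractor's actual code, and `naive_year_month` is a hypothetical name.

```python
def naive_year_month(value):
    # Hypothetical: fixed offsets, no check that the field is a date at all.
    return value[:4], value[4:6]


banner = ("This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) "
          "kpathsea version 3.5.4")

year, month = naive_year_month(banner)
print(repr(year), repr(month))  # 'This' ' i'
```

Those are exactly the year and month values in the BibTeX output above, which is at least consistent with the wrong field being fed into a fixed-position date reader.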
I think I am able to get the metadata properly.
>>extract output.pdf
modification date - D:20080714110537
creation date - 20080714110537
title - output.pdf
format - PDF 1.3
mimetype - application/pdf
>> extract -b output.pdf
% BiBTeX file
@misc{ d_20outpu,
title = "output.pdf",
year = "D:20",
month = "08"
}
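One detail in the BibTeX output above may be worth a second look: PDF date strings have the form D:YYYYMMDDHHmmSS, so for the modification date shown the year and month would be 2008 and 07, while the year and month fields print "D:20" and "08". A minimal Python sketch of reading the two fields with the prefix skipped follows; `pdf_date_year_month` is a hypothetical helper for illustration, not part of extract or libextractor.

```python
def pdf_date_year_month(value):
    """Return (year, month) from a PDF date string, or None if it is not one.

    Hypothetical helper: accepts the value with or without the 'D:' prefix.
    The month is None when the string stops after the year.
    """
    digits = value[2:] if value.startswith("D:") else value
    year = digits[:4]
    if len(year) != 4 or not year.isdigit():
        return None
    month = digits[4:6]
    if len(month) != 2 or not month.isdigit():
        month = None
    return year, month


print(pdf_date_year_month("D:20080714110537"))
print(pdf_date_year_month("This is pdfeTeX"))
```

With the prefix skipped, the first call yields the year 2008 and month 07, and a non-date string such as the pdfeTeX banner is rejected rather than sliced.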
I am using the following versions to test this.
Hardy 8.04.1
extract 0.5.18a-2
If this problem still occurs, please provide the actual PDF, the output of all commands, and the versions you are using.