[Enhancement] configure metadata import when importing pdf file in calibre

Bug #1440304 reported by iostrym on 2015-04-04
This bug affects 1 person
Affects: calibre
Importance: Undecided
Assigned to: Unassigned

Bug Description

In calibre 2.23 on 64-bit Windows 7, importing a PDF reads some common metadata from the file into calibre's metadata.
For example, title, author and tags are imported, and the PDF subject is put in the comments.

By testing I saw that:

- the first line of the subject is put in the calibre tags (some PDF editors allow a multi-line subject)
- the full subject (including the other lines) is put in the calibre comments
- tags must be separated by commas.

Is this import behavior documented somewhere?

It would be great to be able, for example,
- to configure the separator used between tags, because some PDF editors don't support commas and use ";" instead
- to disable importing the first line of the subject as tags
- to customize which PDF metadata is written to which calibre metadata field,
   e.g.: published date is the first line of the subject
         ISBN is the second line of the subject
         the other lines of the subject are the comments

I would be happy to help if someone shows me where this is done in the code...

Kovid Goyal (kovid) wrote :

I don't see much point in this. PDF supports the XMP metadata standard.
Simply use a PDF metadata editor that supports XMP, such as calibre
itself (the ebook-meta command line tool from calibre). calibre prefers
XMP metadata over the Info dict, unless the latter has a newer mod date.

See the metadata_from_xmp_packet() function in the calibre source code
for how exactly XMP metadata is mapped to calibre metadata.

 status wontfix

Changed in calibre:
status: New → Won't Fix

iostrym (armandooooo) wrote :

Thanks a lot for answering this report so quickly. I don't even know if this answer will be logged somewhere...

I found what causes the strange behavior. When I save the PDF with PDF-XChange, the tags are imported incorrectly in calibre; when I open that "wrong" PDF and save it again with Adobe Reader, the tags are imported correctly.

Both PDFs are version 1.6.

After extracting the metadata with exiftool, I notice that:

import KO PDF: linearized (no) and XMP Toolkit = XMP Core 4.1.1
import OK PDF: linearized (yes) and XMP Toolkit = Adobe XMP Core 5.4

I don't know whether the XMP Toolkit version matters to calibre, but there is certainly something in the metadata written by PDF-XChange that calibre doesn't like.

By the way, do you know which XMP namespace calibre reads when importing a PDF (when no calibre XMP metadata is available)?

The http://ns.adobe.com/* ones?
or
http://purl.org/dc/elements/1.1

As I understand it, after an export calibre uses its own XMP namespace, http://calibre-ebook.com/xmp-namespace, for "custom" metadata, while common metadata uses standard XMP (Adobe's or DC, I don't know which), because I don't see any title, author or tags in the calibre-namespace metadata.

There is something strange anyway, because:
- add tags in calibre: toto, titi
- export the PDF from calibre
- re-import the PDF in calibre => the tags are concatenated with "_": toto_titi (one tag)

Best regards,

Armandooooo


Calibre uses the standard Dublin Core metadata fields where they are
available and only uses its own namespaced fields for metadata not
defined in the Dublin Core standard.

iostrym (armandooooo) wrote :

Thanks for the clarification.

Attached are two identical PDFs: one imports correctly in calibre (OK) and the other does not (KO).

The OK one had all its metadata cleaned by PDF-ShellTools, with new metadata then written by PDF-XChange.
The KO one has a lot of metadata, some of it also written by PDF-XChange.
The metadata extracted by exiftool is also included, for comparison with WinMerge...

"Import KO" means that calibre imports the tags wrongly: the tags are stuck together and the first line of the subject is used as a tag.

I ran a lot of tests: the PDF version (1.4 or 1.6) has nothing to do with it, and neither does the XMP Toolkit version. The only reliable way I found to get a correct import is to clean up all the existing tags before writing my own tags with PDF-XChange.

Would you prefer that I open a new bug for this specific problem?

By the way, exporting PDFs out of calibre and re-importing them also gives strange results with tags. Do you want me to open a separate bug for that as well?

Concerning the original purpose of this enhancement request: the goal is to have more flexibility when importing PDFs. I understand that calibre has its own XMP metadata, but when importing a lot of PDFs that were not originally stored in calibre, it would be useful to have some flexibility in how calibre reads Dublin Core or calibre metadata fields and uses them to fill in calibre metadata (the .opf file).

For example, someone who has put a version number in the Description field of the DC metadata may want calibre to keep this value when importing the PDF. There is a filename import feature that is powerful and configurable using regexps, but nothing is configurable for metadata import.

Another example would be mixing filename extraction and metadata extraction, because currently, when filename import is checked, the PDF's metadata is ignored and only the filename is used.

Kovid Goyal (kovid) wrote :

Configuring metadata read from individual file formats is not something
that belongs in the core calibre program. Fortunately, calibre has an
extensive plugin framework. You can write a calibre metadata reader
plugin of your own to override the builtin one, if you want.
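
As a rough illustration of the mapping requested in the bug description (published date from the first subject line, ISBN from the second, remaining lines as comments), a custom reader plugin could delegate to a small helper like the one below. This is only a sketch: `map_subject_lines` and the keys it returns are hypothetical names, not part of calibre, and the wiring into an actual metadata reader plugin is deliberately left out.

```python
from typing import Dict


def map_subject_lines(subject: str) -> Dict[str, str]:
    """Split a multi-line PDF subject into separate metadata fields.

    Hypothetical helper: line 1 -> 'pubdate', line 2 -> 'isbn',
    any further lines -> 'comments'. Blank lines are ignored.
    """
    lines = [line.strip() for line in subject.splitlines()]
    lines = [line for line in lines if line]

    fields: Dict[str, str] = {}
    if len(lines) >= 1:
        fields["pubdate"] = lines[0]
    if len(lines) >= 2:
        fields["isbn"] = lines[1]
    if len(lines) > 2:
        fields["comments"] = "\n".join(lines[2:])
    return fields


if __name__ == "__main__":
    example = "2015-04-04\n9780000000002\nScanned manual\nSecond edition"
    print(map_subject_lines(example))
```

The values are returned as plain strings; a real plugin would still have to parse the date and validate the ISBN before storing them.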

As for your OK and KO PDF files, as I said before, XMP metadata is always
used in preference to the Info metadata, *except* when the last
modified date of the Info dictionary is newer than the XMP metadata
block. That is the case in your KO PDF. You need to make sure that
whatever program you are using to edit XMP metadata updates the last
modified date in the XMP block correctly. Use the calibre ebook-meta
program if you can't find any others.

Kovid Goyal (kovid) wrote :

To be precise, calibre compares the ModDate from the PDF Info dictionary to the MetadataDate in the XMP block. In your problem PDF, the ModDate is Mon Apr 6 23:24:42 2015 and the MetadataDate is 2014-04-22T00:53:01+02:00

so calibre will use the information from the Info block rather than the XMP, since the Info block is marked as being newer.
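
The comparison described above can be sketched in a few lines of Python. This is not calibre's actual code: `choose_metadata_source` is a hypothetical name, and the behavior when one of the two dates is missing (falling back to XMP) is an assumption rather than something stated in this thread. The time zone on the ModDate is taken from the exiftool output for the same file.

```python
from datetime import datetime
from typing import Optional


def choose_metadata_source(info_mod_date: Optional[datetime],
                           xmp_metadata_date: Optional[datetime]) -> str:
    """Return 'info' or 'xmp' following the rule described in this thread."""
    # Assumption: with either date missing, keep the default XMP preference.
    if info_mod_date is None or xmp_metadata_date is None:
        return "xmp"
    # The Info dictionary wins only when it is strictly newer than the
    # XMP MetadataDate; equal dates keep the XMP preference.
    return "info" if info_mod_date > xmp_metadata_date else "xmp"


# Dates reported for the problem (KO) PDF.
ko_source = choose_metadata_source(
    datetime.fromisoformat("2015-04-06T23:24:42+02:00"),  # Info ModDate
    datetime.fromisoformat("2014-04-22T00:53:01+02:00"),  # XMP MetadataDate
)
print(ko_source)  # info
```

Note that the XMP Modify Date plays no part in this comparison; only the MetadataDate is compared against the Info ModDate.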

iostrym (armandooooo) wrote :

Thanks a lot for your time. I had not understood that calibre also uses the PDF Info dict. So calibre uses the PDF Info dict, XMP Dublin Core, and also calibre's own XMP metadata (only for custom metadata not available in DC)?
Is the Info dict not OK in the wrong PDF file? PDF-XChange changes both the Info dict and the XMP Dublin Core in the same operation, so even when reading the Info dict, the result should be OK...
I think I am starting to understand: in the KO file the Info dict is read, and in the OK file the DC XMP is read, even though the Info dict is identical in both. But there is something in the Info dict that calibre doesn't like. But what...


iostrym (armandooooo) wrote :

Hi,

From exiftool I have the following information about the wrong PDF file:

---- XMP-xmp ----
Create Date : 2014:04:22 00:43:38+02:00
Modify Date : 2015:04:06 23:24:42+02:00
Creator Tool : PFU ScanSnap Manager 6.2.22 #SV600
Metadata Date : 2014:04:22 00:53:01+02:00
Caption Writer : ATT

"metadata date" is 2014-04-22T00:53:01+02:00 (as you said) but "Modify date" is 2015:04:06 23:24:42+02:00. And Modify date from XMP metadata is same date than info dict modify date.

Anyway, both metadata should be ok because both contains correct information. I don't understand why PDF info should be a problem as adobe reader and also pdf xchange are able to read info dict correctly.

and if you compare both xmp and info dict date, in the two different document, there are always the same because info dict and xmp metadata are written in same time by the same programme (pdf x change) :

OK PDF :
---- PDF ----
Modify Date : 2015:04:06 23:56:21+02:00
---- XMP-xmp ----
Modify Date : 2015:04:06 23:56:21+02:00

KO PDF :
---- PDF ----
Modify Date : 2015:04:06 23:24:42+02:00
---- XMP-xmp ----
Modify Date : 2015:04:06 23:24:42+02:00

Which metadata is used, then, when the dates are the same? And why would either the XMP metadata or the Info dict give an incorrect result in calibre?

Best regards,


Kovid Goyal (kovid) wrote :

It is simple:

XMP metadata is always used in preference to Info metadata, unless the
MetadataDate is LESS THAN the modify date in the Info block. And when
reading Info metadata, tags are assumed to be separated by commas, not
semicolons.
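
To make the effect of the comma assumption concrete, here is a minimal sketch of comma-based tag splitting. It is not calibre's implementation; `split_info_keywords` and its `separator` parameter are hypothetical and exist only to show why a semicolon-separated keyword string ends up as a single tag.

```python
from typing import List


def split_info_keywords(keywords: str, separator: str = ",") -> List[str]:
    """Split a keywords string into tags, dropping empty entries."""
    return [tag.strip() for tag in keywords.split(separator) if tag.strip()]


print(split_info_keywords("toto, titi"))       # ['toto', 'titi']
print(split_info_keywords("toto; titi"))       # ['toto; titi']
print(split_info_keywords("toto; titi", ";"))  # ['toto', 'titi']
```

The third call shows what a configurable separator would do; no such option is described in this thread, so rewriting the keywords with commas before import is the route that matches the behavior stated above.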

Like I said before, use a decent program to edit your PDF's XMP metadata,
one that changes the metadata date, and you will be fine.

iostrym (armandooooo) wrote :

Thanks,
In the OK PDF, there is only an XMP 'modify date' and no 'metadata date'.
So if the XMP 'modify date' were what is compared, rather than the 'metadata date', then XMP should be used for both files (OK and KO PDF), because in both the XMP 'modify date' equals the Info block modify date... (Have a look at the text files supplied with the PDF files; they contain the metadata from exiftool.)

Regards,


iostrym (armandooooo) wrote :

Hello, sorry to insist, but I double-checked the KO PDF and its XMP 'modify date' is up to date. Only the XMP 'metadata date' is not.
Is this the root of the problem? If so, it is strange, because there is no XMP 'metadata date' in the OK PDF. So how does the OK PDF work correctly in calibre?

I understand my problem is not in the main part of calibre, but I don't know whether there is something tricky in calibre or whether my PDFs have corrupted metadata.
