PDF Import: spaces removed from font names

Bug #179589 reported by khovorka
Affects: Inkscape
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

 I import PDF files daily from PDF printer drivers such as the Adobe
 PDF printer driver or PrimoPDF. As I start working with Inkscape, I
 have noticed that font names are not imported properly; perhaps this
 is a known bug. If the font used to print to the driver is, for
 example, Times New Roman, then on import to SVG in Inkscape the name
 becomes TimesNewRoman. The missing spaces are the main problem. If I
 open the SVG in a text editor and replace TimesNewRoman with
 Times New Roman, the font is used properly. I hope this can be
 easily corrected.
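
Editor's sketch of the manual workaround described above, assuming the SVG is available as a string. The function name `restore_font_name` and the sample markup are illustrative only and are not part of Inkscape.

```python
import re

def restore_font_name(svg_text: str, squeezed: str, spaced: str) -> str:
    """Replace a space-stripped font name with its spaced form."""
    # Word boundaries keep "TimesNewRoman" from matching inside a longer
    # name such as "TimesNewRomanPSMT".
    pattern = re.compile(r"\b" + re.escape(squeezed) + r"\b")
    return pattern.sub(spaced, svg_text)

# Hypothetical fragment of an imported SVG.
sample = '<text style="font-family:TimesNewRoman;font-size:12px">Hi</text>'
fixed = restore_font_name(sample, "TimesNewRoman", "Times New Roman")
print(fixed)
```

This only automates the search and replace; it still requires knowing the correct spaced name in advance.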

 I have another question, and I don't know whether this is already
 possible somewhere. For my type of work it would be nice if the x
 attribute could be limited to a single X coordinate rather than a
 list, so that the text is positioned as a single string.
 Example: x="-3.296843 1.747382 2.3432245" could
 just be x="-3.296843".
 Is that a possibility?

 Sincerely,

 Kim Hovorka

Bryce Harrington (bryce) wrote :

I can confirm the font issue, having tested a variety of PDF files.

For the second question please either file a separate bug or ask on answers.launchpad.net/inkscape.

Changed in inkscape:
importance: Undecided → Critical
milestone: none → 0.46.1
status: New → Confirmed
Aubanel (aubanel) wrote :

Please attach an example.

Mark Everitt (mark-s-everitt) wrote : A (sort of) solution

A workaround I've come across for another issue should work here too. The side effect is that Inkscape will interpret the imported PDF as normal shapes, curves, etc. rather than text. See this page:

https://bugs.launchpad.net/inkscape/+bug/199689/comments/4

KoRi (koen-ribus) wrote :

After some testing with PDFs created in different ways, I think there is no general way to recreate the original font names from the names specified in the PDF. Spaces are removed, but names can also be shortened: for example, Times New Roman can become Times, ...

A possible solution is to scan the available fonts and compare their names with the names in the PDF to see which one matches best, taking into account that spaces can be removed, ... Fonts that are not available will not get a reasonable match and can be replaced by a default font.
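
Editor's sketch of the best-match idea described above, not the attached patch (which is not reproduced in this thread). The scoring by shared prefix length, the `min_chars` threshold and the `"sans-serif"` default are all assumptions made for illustration.

```python
def squeeze(name: str) -> str:
    """Lower-case a font name and drop spaces so both sides compare alike."""
    return name.replace(" ", "").lower()

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_match(pdf_name, installed, default="sans-serif", min_chars=4):
    """Return the installed family with the longest shared prefix, else default."""
    target = squeeze(pdf_name)
    best, best_len = default, 0
    for family in installed:
        n = common_prefix_len(target, squeeze(family))
        if n > best_len:
            best, best_len = family, n
    # Assumed threshold: very short overlaps are treated as "no match".
    return best if best_len >= min_chars else default

installed = ["Arial", "Times New Roman", "Trajan"]  # hypothetical font list
print(best_match("TimesNewRoman", installed))
print(best_match("Times", installed))
print(best_match("Helvetica", installed))
```

Both the space-stripped and the shortened form resolve to the same installed family here, while a font that is not installed falls back to the default.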

The attached patch can be used to test this idea ... However when this solution is chosen, some points should be clarified to ensure that everything is handled in the correct way:

- For now I only search for a match if the pdf specifies no 'font family' ... Maybe the best match to the specified family should be found when it is specified? (The only application I tested that seems to specify the family in the pdf was Adobe Illustrator and it includes the spaces ... as such it does not need matching ... but that's no guarantee of course).

- What value should '-inkscape-font-specification' get?

- If no reasonable match is found (i.e. the font is probably not available), a default font is now chosen, but maybe (?) the font specification from the PDF should be kept in the SVG for possible substitution with the correct font later. However, this would require a similar substitution on the "SVG side" ... A more general thought on this: are SVG files guaranteed to specify font names correctly (also when they come from other applications)? If not, shouldn't font substitution also occur on loading (and importing) SVG files?

I would appreciate some feedback on whether and how to proceed on this track. Thanks.

KoRi (koen-ribus) wrote :

Updated the patch to also do the matching when a family name is specified in the pdf.

The imported PDF shows the correct fonts (provided that they are available).

The imported text gets its x-position specified per character. This ensures correct placement of the characters but limits editability. However, I don't think this should be fixed in the PDF import or within the scope of this bug. (Removing all but the first x-position and setting xml:space to preserve could work, but it also discards any special spacing/kerning that was present.) This should probably be handled more generally by allowing text editing of text that has position info for more than just the start character(?).
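
Editor's sketch of the "keep only the first x-position" idea mentioned in parentheses above, using Python's standard `xml.etree.ElementTree`. The function name `keep_first_x` and the sample document are illustrative; as noted in the comment, this discards per-character spacing and kerning.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Serialize SVG elements without an "ns0:" prefix.
ET.register_namespace("", SVG_NS)

def keep_first_x(svg_text: str) -> str:
    """Reduce multi-value x attributes on text/tspan to their first value."""
    root = ET.fromstring(svg_text)
    for tag in ("text", "tspan"):
        for node in root.iter(f"{{{SVG_NS}}}{tag}"):
            xs = node.get("x", "").replace(",", " ").split()
            if len(xs) > 1:
                node.set("x", xs[0])               # drops per-glyph positions
                node.set(XML_SPACE, "preserve")    # keep literal spaces in the text
    return ET.tostring(root, encoding="unicode")

sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<text x="-3.296843 1.747382 2.3432245" y="0">abc</text>'
    "</svg>"
)
result = keep_first_x(sample)
print(result)
```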

bbyak (buliabyak) wrote :

KoRi: you are right, there's no general solution for PS or PDF files that do not embed their fonts. This is because PS uses its own naming convention, which does not allow spaces. However, both TTF and PFB formats usually contain both names: the one with spaces (Family Name) and the one without (FontName). Looks like only AI is smart enough to supply the Family Name to the PDFs it creates.

Your approach is good I think, indeed there's nothing better we can do if the font is not embedded. Ideally it should also warn the user about substitutions, but this should be done not only for PDF but for SVG and all other formats as well, so this needs to be done in a different place. For now, I think your patch is better than nothing, I will now test it.

For kerning info, we have a special command: Text > Remove manual kerns, after which the text becomes editable.

bbyak (buliabyak) wrote :

KoRi: looking into this, I see that at least sometimes, the current code somehow manages to figure out the correct family name with spaces. I attach one such PDF where font-family is "Times New Roman" while -inkscape-font-specification (which stores the exact and complete name of the font used in the original, not broken into CSS properties) is "TimesNewRomanPSMT". How is this possible? Can we make it work everywhere (that would be preferable, if possible), or does it only work for this specific PDF?

KoRi (koen-ribus) wrote :

Whenever the PDF specifies a font-family, it is used to set the SVG font-family. The PDF you attached (as well as the PDF I created with AI) seems to specify the family name correctly (with spaces), so just copying the name into the SVG, as the current code does, is OK.

Whenever the PDF does not specify the font-family (which was the case with most PDFs I tested), the SVG font-family is based on the font's stripped-down (Orig)Name (without spaces).

The patch did the font name matching in both these cases, but since it seems that the pdf font family is a way to specify the correct name, including spaces, I removed the matching in this case in the attached patch again. This speeds up things as matching is skipped when font-family is specified (*).

Concerning -inkscape-font-specification, I was (and still am) not sure what value it should get and whether the current code is correct. The patch does not change anything here. Is the current behaviour correct (without spaces, ...)?

(*) I'm aware that the matching in the patch is slow, since all available font names are fetched for every comparison (code from the Text and Font dialog). Especially when text is split into small portions, as in some PDFs, this causes much overhead. I don't think it's worth optimizing much until font substitution is handled in a general way.

bbyak (buliabyak) wrote :

KoRi: the meaning of -inkscape-font-specification is to record the EXACT specification of the font, in case it is not uniquely representable in other standard CSS properties. For example, a font Ababa Swoosh from family Ababa would not be representable in CSS because it has no property for the swoosh variant. In that case "Ababa Swoosh" is placed in -inkscape-font-specification and is tried before everything else. So, this was not specifically designed for import, but it is logical to use it to store the exact font name as specified in PDF, before we substitute and/or break it into properties. It is useful meta-information which, although not used much yet, may become crucial later on (for example for exporting PDF back).

As for your patch, yes, let's not try to do any substitution when the font family is explicitly given. However, there are real speed and accuracy issues with your patch at the moment. I attach a PDF which takes many minutes to import, and it replaces "Triodion-UCS" with "Trajan" - really not very similar at all! Can you try to simplify and streamline your code so it works faster and returns a match only if the whole first word of the two names coincides?

KoRi (koen-ribus) wrote :

Updated patch:
- now checks that at least the first word of the font name matches.
  -> Would it be better if in case no matching font is found to just copy the (stripped) font-name from the pdf instead of setting the default (Arial) as is done now? (However this causes the font-selector to default to "sans", which seems impossible to change manually to another font easily).
  -> Should there be a minimal number of matching characters in case the first 'word' of the font name is very short (eg. MV Boli)?

- Improved the speed by only fetching the list with font names one time for each SvgBuilder instance. Should be reasonable now.

- as for '-inkscape-font-specification' ... I didn't change anything there (do not understand enough of how it works with PangoFontDescription and the changes related to Bug #169973). However when you say "... to store the exact font name as specified in PDF ..." that means without spaces and with style suffixes (eg.: TimesNewRomanPS-BoldItalicMT) as it is now, right? Does it also need to include the prefixes that might be present in the pdf (eg.: DAAAAA+TimesNewRomanPS-BoldItalicMT) which is currently not the case?
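
Editor's sketch of two points from the list above: the whole-first-word requirement and fetching the font list only once per builder. `FontMatcher` is a hypothetical stand-in for the SvgBuilder instance, and `list_fonts` is an assumed callable returning installed family names; neither reflects the actual Inkscape code.

```python
class FontMatcher:
    """Hypothetical per-import object that caches the installed font list."""

    def __init__(self, list_fonts):
        self._list_fonts = list_fonts  # assumed callable, possibly expensive
        self._families = None          # filled on first use only

    def families(self):
        if self._families is None:
            self._families = list(self._list_fonts())
        return self._families

    def match(self, pdf_name):
        """Return the first family whose whole first word starts pdf_name."""
        target = pdf_name.replace(" ", "").lower()
        for family in self.families():
            words = family.split()
            if words and target.startswith(words[0].lower()):
                return family
        return None  # caller decides what to do when nothing matches

calls = []

def fake_list_fonts():
    """Stand-in for the real font enumeration; records each invocation."""
    calls.append(1)
    return ["Trajan", "Times New Roman"]

matcher = FontMatcher(fake_list_fonts)
print(matcher.match("TimesNewRomanPS-BoldItalicMT"))
print(matcher.match("Triodion-UCS"))
print(len(calls))
```

With the whole-word requirement, "Triodion-UCS" no longer resolves to "Trajan", and the font list is fetched once however many names are matched.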

Hope I don't take too much of your time with all my questions ...

bbyak (buliabyak) wrote : Re: [Bug 179589] Re: PDF Import: spaces removed from font names

On Thu, Feb 12, 2009 at 4:18 PM, KoRi <email address hidden> wrote:
> Updated patch:
> - now checks that at least the first word of the font name matches.
> -> Would it be better if in case no matching font is found to just copy the (stripped) font-name from the pdf instead of setting the default (Arial) as is done now? (However this causes the font-selector to default to "sans", which seems impossible to change manually to another font easily).

that unchangeability of sans is a bug, certain to be fixed before
0.47, so yes, this is a more logical approach - let's not force
microsoft's arial on people, and try to preserve information as much
as possible

> -> Should there be a minimal number of matching characters in case the first 'word' of the font name is very short (eg. MV Boli)?

good idea - for example if first word is 3 chars or less (such as
ITC), require two words for a match

it will still break on Adobe Garamond though...

> - Improved the speed by only fetching the list with font names one time
> for each SvgBuilder instance. Should be reasonable now.

thanks, now much faster

> - as for '-inkscape-font-specification' ... I didn't change anything
> there (do not understand enough of how it works with
> PangoFontDescription and the changes related to Bug #169973). However
> when you say "... to store the exact font name as specified in PDF ..."
> that means without spaces and with style suffixes (eg.:
> TimesNewRomanPS-BoldItalicMT) as it is now, right?

yes

> Does it also need to include the
> prefixes that might be present in the pdf (eg.:
> DAAAAA+TimesNewRomanPS-BoldItalicMT) which is currently not the case?

no, i don't think these prefixes are proper part of the font name
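
Editor's sketch of dropping such a prefix. In PDF, a font subset is tagged with six uppercase letters followed by a plus sign; the helper name `strip_subset_tag` is illustrative and not taken from Inkscape.

```python
import re

# PDF subset tag: exactly six uppercase letters, then "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

def strip_subset_tag(font_name: str) -> str:
    """Return the font name without a leading subset tag, if one is present."""
    return SUBSET_TAG.sub("", font_name)

print(strip_subset_tag("DAAAAA+TimesNewRomanPS-BoldItalicMT"))
print(strip_subset_tag("TimesNewRomanPS-BoldItalicMT"))
```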

> Hope I don't take too much of your time with all my questions ...

not at all, you're very welcome!

--
bulia byak
Inkscape. Draw Freely.
http://www.inkscape.org

KoRi (koen-ribus) wrote :

Updated patch:
- At least the words containing the first 4 characters of the font's name should match. In other words, if the first word is very short, the next one is required to match as well.
- If no match is found, use the font name as it was in the PDF.
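
Editor's sketch of the two rules above, under the assumption that "the words containing the first 4 characters" means whole words of the installed name are accumulated until at least 4 characters are covered. The function names are hypothetical, and unlike the real patch this returns the first family that qualifies rather than ranking candidates.

```python
def required_prefix(family: str, min_chars: int = 4) -> str:
    """Concatenate whole words of the installed name until min_chars is reached."""
    prefix = ""
    for word in family.split():
        prefix += word.lower()
        if len(prefix) >= min_chars:
            break
    return prefix

def match_or_keep(pdf_name: str, installed, min_chars: int = 4) -> str:
    """Return a matching installed family, or the PDF name unchanged."""
    target = pdf_name.replace(" ", "").lower()
    for family in installed:
        prefix = required_prefix(family, min_chars)
        if prefix and target.startswith(prefix):
            return family
    return pdf_name  # no match: preserve the name from the PDF

print(required_prefix("MV Boli"))
print(required_prefix("Times New Roman"))
print(match_or_keep("MVBoli", ["MV Boli"]))
print(match_or_keep("Triodion-UCS", ["Trajan"]))
```

A short first word such as "MV" pulls in the next word, so the required prefix becomes "mvboli" instead of just "mv".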
--
The more I look at it, the more complicated it seems to do it the right way ... For example, depending on how the PDF is created (OpenOffice -> pdf or OpenOffice -> GhostScript -> pdf), the same font can be named differently: "AvantGardeITCbyBT-Book" = "AvantGarde Bk BT", ... of which only the latter is guaranteed to match correctly by the patch. It may need rethinking ...
--

bbyak (buliabyak) wrote :

On Fri, Feb 13, 2009 at 10:34 AM, KoRi <email address hidden> wrote:
> The more I look at it, the more complicated it seems to do it the right way ... For example depending on how the pdf is created (OpenOffice -> pdf or OpenOffice -> GhostScript -> pdf) the same font can be named different: "AvantGardeITCbyBT-Book" = "AvantGarde Bk BT", ... of which only the latter is guaranteed to match correctly by the patch.

But why? The "complete word" requirement should only apply to the
names on Inkscape side, which have spaces, not the PDF names. So if
you have a font "AvantGarde Blah Blah" installed, both these names
would match its first word completely, and thus have the same rating.

Also, if you have name with spaces in PDF, like "AvantGarde Bk BT",
doesn't it mean that it is already non-PS and does not need replacing?

KoRi (koen-ribus) wrote :

> But why? The "complete word" requirement should only apply to the
> names on Inkscape side, which have spaces, not the PDF names.

Yes, that's how it works.

> So if
> you have a font "AvantGarde Blah Blah" installed, both these names
> would match its first word completely, and thus have the same rating.

My concern was that in case there are multiple AvantGarde fonts
installed ("AvantGarde Bk BT" and "AvantGarde Md BT") both match equally
well to "AvantGardeITCbyBT-Book" or "AvantGardeITCbyBT-Medium" and then
just the first one found is chosen which of course is not guaranteed to
be the right one (cases 1 and 3 in the attached file; "AvantGarde Bk BT"
is chosen in both cases because it's the first best match found, while
it is only correct for case 1).

In case the pdf uses "AvantGardeBkBT" and "AvantGardeMdBT", as is the
case with the pdf created from PS via GhostScript, there is no problem
and the correct font is chosen (cases 2 and 4 in the attached file).

> Also, if you have name with spaces in PDF, like "AvantGarde Bk BT",
> doesn't it mean that it is already non-PS and does not need replacing?

Yes, when it is specified in the pdf as the family name it is used as
is, else it will fully match the name of the installed font. Both these
cases are ok.

I was not sure whether we could live with these inaccuracies for now?
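
Editor's sketch of the ambiguity described in this comment. Scoring by shared prefix length of the space-stripped names is an assumption about how "match equally well" could be measured; it is not the patch's actual scoring.

```python
def score(pdf_name: str, family: str) -> int:
    """Length of the shared prefix of the two names, ignoring spaces and case."""
    a = pdf_name.replace(" ", "").lower()
    b = family.replace(" ", "").lower()
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def top_candidates(pdf_name: str, installed):
    """Return every installed family that ties for the best score."""
    scores = {family: score(pdf_name, family) for family in installed}
    best = max(scores.values(), default=0)
    return [family for family in installed if scores[family] == best]

installed = ["AvantGarde Bk BT", "AvantGarde Md BT"]
print(top_candidates("AvantGardeITCbyBT-Medium", installed))
print(top_candidates("AvantGardeMdBT", installed))
```

For the "ITCbyBT" names both installed fonts tie on the shared "avantgarde" prefix, so whichever is found first wins; the GhostScript-style names match one font in full.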

bbyak (buliabyak) wrote :

On Sat, Feb 14, 2009 at 7:32 AM, KoRi <email address hidden> wrote:
> Was not sure if it we could live with these inaccuracies for now?

Yes, certainly! So long as it matches AvantGarde at all, it is already
an improvement. Requiring it to always guess that "Bk" means "Book"
would be too much I think :)

On the other hand, looking at various fonts in FontForge, I can see
that both regular name (with spaces) and the PS name are recorded in
the font file itself. So, the Right Way to solve this problem would be
to find a way to access this information and to compare the font names
from PDF with this PS name of the font, then if a match is found, put
into SVG the normal-with-spaces name from the same font. I wonder if
there's some API for accessing that name - in fontconfig, pango? Can
you check this idea out if it makes sense to you?
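
Editor's sketch of the lookup proposed above. How to obtain the PS name of each installed font is exactly the open question in this comment, so the table below is hand-written for illustration and no real font API is implied; its two entries reuse names quoted earlier in this thread.

```python
# Hypothetical table: PostScript name recorded in the font file -> family name.
# In a real implementation this would be built from the installed fonts.
PS_NAME_TO_FAMILY = {
    "TimesNewRomanPSMT": "Times New Roman",
    "TimesNewRomanPS-BoldItalicMT": "Times New Roman",
}

def family_for_ps_name(pdf_name: str, table=PS_NAME_TO_FAMILY):
    """Return the spaced family name for an exact PS-name match, else None."""
    return table.get(pdf_name)

print(family_for_ps_name("TimesNewRomanPSMT"))
print(family_for_ps_name("Triodion-UCS"))
```

An exact lookup like this avoids guessing from prefixes, at the cost of needing the PS name of every installed font.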

--
bulia byak
Inkscape. Draw Freely.
http://www.inkscape.org

bbyak (buliabyak) wrote :

Better late than never: I was still having doubts about this patch, but then I decided to add a UI checkbox which allows you to disable it on PDF import dialog, which should be good enough if people dislike this behavior, and committed it with this change. Also I made it return the original PDF name instead of "Arial" if no matching installed font is found. Enjoy rev 21158!

Changed in inkscape:
status: Confirmed → Fix Released
bbyak (buliabyak) wrote :

KoRi: if you plan to work more on Inkscape (I hope you do!) please give me your sourceforge ID and I will give you svn commit access

su_v (suv-lp)
tags: added: importing