PDF import, text import issues

Bug #965463 reported by David Mathog
This bug affects 1 person
Affects: Inkscape
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

There are at least two issues with simple text import (current trunk, on Windows). To reproduce:

1. Start with the PDFtest.svg (attached)
2. Save as PDF
3. Verify that PDFtest.pdf is OK (in various PDF viewers).
4. Open PDFtest.pdf in Inkscape (trunk).

The following issues are apparent:

1. The Symbol font is not imported correctly: all characters map to the "unknown" Unicode glyph.
2. The kerning is peculiar. This can be seen after "select all" (PDFtest1.png), where the text boxes are too tall. If one then applies "remove manual kerns", all of the spaces are removed (PDFtest2.png).

David Mathog (mathog) wrote :

Note the "unknown character" glyphs for the Symbol font characters and the oversized text boxes for the other text.

David Mathog (mathog) wrote :

Note that all spaces have been removed and there is a lot of text displacement.

On further exploration, immediately after PDFtest.pdf is imported it is not possible to select the spaces. I believe that the PDF export wrote the spaces as actual space characters: if PDFtest.pdf is opened in PDF-XChange Viewer, a region of text including spaces can be copied and pasted into an editor (Notepad or WordPad), and the spaces are pasted as spaces. It looks like, for whatever reason, PDF import is converting space characters to kerning operations.
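
One way to check the spaces-to-kerning hypothesis is to look at the imported SVG directly. The sketch below is a minimal Python example, not Inkscape code. It assumes the import writes text as <text>/<tspan> elements carrying per-glyph x or dx lists, and the function name summarize_text_runs is made up for illustration. A run with several glyph positions but no space character in its content would be consistent with spaces having been replaced by positioning.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
TEXT_TAGS = {f"{{{SVG_NS}}}text", f"{{{SVG_NS}}}tspan"}


def _count_positions(el):
    """Return the longest per-glyph coordinate list found in x or dx."""
    counts = []
    for attr in ("x", "dx"):
        values = el.get(attr, "").replace(",", " ").split()
        counts.append(len(values))
    return max(counts)


def summarize_text_runs(svg_source):
    """List each text run with its content, space flag, and position count.

    Hypothetical helper: only the element's own leading character data is
    inspected, which is enough for simple one-run-per-tspan files.
    """
    root = ET.fromstring(svg_source)
    runs = []
    for el in root.iter():
        if el.tag not in TEXT_TAGS:
            continue
        content = el.text or ""
        if not content.strip():
            continue  # container element with no character data of its own
        runs.append({
            "text": content,
            "has_space": " " in content,
            "positions": _count_positions(el),
        })
    return runs


if __name__ == "__main__":
    sample = (
        '<svg xmlns="http://www.w3.org/2000/svg">'
        '<text><tspan x="0 8 24 32">abcd</tspan></text>'
        '<text x="0">a b</text>'
        '</svg>'
    )
    for run in summarize_text_runs(sample):
        print(run)
```

Running this on the SVG produced by the import, and on the original PDFtest.svg, should show whether the space characters survive the round trip.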

David Mathog (mathog) wrote :

Here is another instance of text mangling. Save the attached example as PDF and import it: the text characters appear out of order, and some characters are changed.

   Arial 40 px, 32 pt

becomes

  Arial POpx7GF pt

In this case PDF import has some help in mangling the text, from what looks like an export bug. If the text which looks like "Arial 40px, 32pt" is selected in PDF-XChange Viewer and then pasted into an editor, it comes out as:

  Arial 4P pxJ 32 pt

I do that sort of thing all the time with that PDF viewer, and have never seen it do that before.

jazzynico (jazzynico)
tags: added: importing pdf text
jazzynico (jazzynico) wrote :

Confirmed on Windows XP, Inkscape trunk revision 11141, 0.48.2 and 0.48.3.1.
Note that the 0.48 branch versions don't render the unknown glyph but leave it blank (the glyph exists in the SVG code).

Changed in inkscape:
importance: Undecided → Medium
status: New → Confirmed
jazzynico (jazzynico) wrote :

Reproduced again on Windows XP, Inkscape trunk revision 12836 with the test files attached here and in Bug #1046170 "text font (imported from pdf) not shown in status line and font selection dialogue".

Changed in inkscape:
status: Confirmed → Triaged
David Mathog (mathog) wrote :

This seems as good a place as any to add this. In the attached PDF, the letters E, m, and c in the formula are in different colors. When the file is read into Inkscape they are assembled into a single text string, all with the color of the first letter.

David Mathog (mathog) wrote :

Correction: the color is from the last letter, not the first.
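
The expected behavior can be sketched as follows. This is a minimal Python illustration, not the importer's actual logic: the glyph list, the colors, and the function name merge_runs are all hypothetical. Adjacent glyphs are merged into one run only while their fill color stays the same, so each differently colored letter keeps its own color.

```python
from itertools import groupby


def merge_runs(glyphs):
    """Merge consecutive (char, color) pairs that share a color.

    Returns a list of (text, color) runs in the original order.
    """
    runs = []
    for color, group in groupby(glyphs, key=lambda glyph: glyph[1]):
        text = "".join(char for char, _ in group)
        runs.append((text, color))
    return runs


if __name__ == "__main__":
    # Made-up colors standing in for the differently colored formula letters.
    glyphs = [("E", "red"), ("=", "black"), ("m", "green"),
              ("c", "blue"), ("2", "blue")]
    print(merge_runs(glyphs))
```

A single merged string that takes one color for every glyph, as described above, corresponds to skipping the color comparison when runs are assembled.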

David Mathog (mathog) wrote :

There is a little C program pdf2svg

   http://www.cityinthesky.co.uk/opensource/pdf2svg/

that does all of the example conversions PDF->SVG cited here beautifully. It is nothing more than a thin layer over Poppler and Cairo. The pdf-parser.cpp file in Inkscape says:

 * PDF parsing using libpoppler.
 *
 * Derived from poppler's Gfx.cc

So I grabbed poppler 0.24.5 and built it, which among other things produced the pdftocairo program, and then used that to do:

  utils/pdftocairo -svg /tmp/reassemble_decorate.pdf /tmp/pop_reassemble_decorate.svg
  utils/pdftocairo -svg /tmp/PDFtest.pdf /tmp/pop_PDFtest.svg

and those SVG files too, not surprisingly, were just fine.

I googled to find some discussion of the history of the pdf conversion code in Inkscape, but the keywords were too common, so I didn't find it.

Why is Inkscape "rolling its own" for PDF to SVG conversion, rather than just using poppler and cairo like these two utilities do? Is there some functionality missing from the poppler/cairo route that the Inkscape version implements? I am all for custom code when it adds functions, or works better, but here the simple utilities and standard libraries are handling simple text conversion tasks that Inkscape's code botches.

David Mathog (mathog) wrote :

Ah, I see one reason: with the poppler/cairo route the text is imported as drawn glyphs, not as <text> and <tspan>.
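
A quick way to see this difference in an SVG file is to count its elements. The sketch below is a minimal Python example under stated assumptions: the helper names count_svg_elements and keeps_editable_text are invented here, and the sample markup only imitates glyph-outline output (outlines defined once and placed with <use>); it is not taken from real pdftocairo or pdf2svg output.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_svg_elements(svg_source):
    """Count elements by local name, ignoring XML namespaces."""
    root = ET.fromstring(svg_source)
    return Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())


def keeps_editable_text(svg_source):
    """True if the SVG contains at least one <text> element."""
    return count_svg_elements(svg_source)["text"] > 0


if __name__ == "__main__":
    # Illustrative stand-in for output that draws glyphs as paths.
    drawn = (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'xmlns:xlink="http://www.w3.org/1999/xlink">'
        '<defs><symbol id="g1"><path d="M0 0L1 1"/></symbol></defs>'
        '<use xlink:href="#g1" x="0" y="0"/>'
        '</svg>'
    )
    # Illustrative stand-in for output that keeps text as text.
    editable = (
        '<svg xmlns="http://www.w3.org/2000/svg">'
        '<text><tspan x="0">Arial</tspan></text>'
        '</svg>'
    )
    print(count_svg_elements(drawn), keeps_editable_text(drawn))
    print(count_svg_elements(editable), keeps_editable_text(editable))
```

Output with no <text> elements renders correctly but cannot be edited as text, which is presumably why Inkscape does its own conversion.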
