PDF Export glyph-character mapping is odd
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Inkscape | New | Undecided | Unassigned |
Bug Description
It appears that there is something strange in the CIDMap used with PDF export in Inkscape 0.48.4 r9939. I've seen hints of this bug in older reports (that led to the PDF+LaTeX feature), but nothing used the keywords that seemed relevant to me, so I'm opening a new one.
The initially observed behavior was that some PDF exports containing text had one or more character substitutions, so that even though the PDF displayed properly, copy-paste of text gave incorrect results. The glyph/character mismatches differed between files, and changed within the same file when it was edited. Sometimes there was no mismatch at all.
I don't know the Inkscape code/libraries well enough to be sure where the blame for the bug should lie.
I would be tempted to call this a _reader_ error, except that the mismatches reproduce both when reading with Adobe Acrobat and when using the Python text-extraction module pdfminer. I am not bold or knowledgeable enough to call it a spec error. So I'm reporting here.
At least I can offer a workaround.
So:
The font I'm using is plain old Arial, *without* conversion of text to paths.
After initial inquiry (pdfminer to examine full file contents), the behavior can be explained as a double table mapping.
If you have a CIDMap entry (in a PDF file made by Inkscape) of:
<01> <0020>
then it should map glyph 1 to a space (0x20). If the table has no entry for <20>, it does. However, if there is, later down the list, an entry
<20> <0021>
then it appears as though the 0x20 is remapped into the table, and the resulting mapping of glyph-to-character produces 0x21 -- an exclamation point. The same behavior reproduces at least for other ASCII/Unicode values below 127; I don't have any tables long enough to test other values.
The "remapping" theory is based on behavior, and not any analysis or understanding of the underlying mechanics.
And I note that if the CIDMap table is actually _supposed_ to work like that, then I am very surprised. Nevertheless, it interferes with the use of Inkscape-produced PDF files.
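To make the "remapping" theory concrete, here is a minimal Python sketch of it. It only models the observed behavior; it is not how any PDF reader is known to be implemented, and the function names (`parse_entries`, `lookup_expected`, `lookup_observed`) are made up for illustration.

```python
import re

def parse_entries(lines):
    """Parse table lines such as '<01> <0020>' into {glyph: code point}."""
    table = {}
    for line in lines:
        m = re.fullmatch(r"\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*", line)
        if m:
            table[int(m.group(1), 16)] = int(m.group(2), 16)
    return table

def lookup_expected(table, glyph):
    """Single lookup: what I would expect a reader to do."""
    return chr(table[glyph])

def lookup_observed(table, glyph):
    """Hypothesized behavior: the result is fed back into the table once."""
    code = table[glyph]
    if code in table:
        code = table[code]
    return chr(code)

table = parse_entries(["<01> <0020>", "<20> <0021>"])
print(repr(lookup_expected(table, 0x01)))  # ' '
print(repr(lookup_observed(table, 0x01)))  # '!'
```

With only the first entry in the table, both functions return a space, which matches the cases where no mismatch shows up.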
WORKAROUND:
The reason for the file-content-dependent behavior appears to be that Inkscape sends characters to be mapped in the Z-order in which they are encountered. So editing text can change the mapping, and moving text objects up and down in the Z-order does the same thing. This suggests a workaround: put in an appropriate hidden text item to control the CIDMap order. Hidden means camouflaged against the background, or under another object; making it transparent or off-page will cause the item to be dropped. The hidden item needs to be bottom-most in the Z-order of the objects exported to the PDF.
A string that works in my hands (for the characters it contains) is:
`abcdefghijklmn
Basically, the space gets pushed out to position 32 (0x20; position 0x01 is first), and all characters after it are likewise positioned ordinally at their own code points. The table is short enough that the lowercase letters don't get corresponding table entries to overwrite them unless some further character is added to the table by another text element. In my experiments, other characters got added to a different CIDMap, so this workaround should cover many situations where higher-code-point characters are used.
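The reasoning behind the workaround can be checked with a small simulation. This sketch assumes a single table, glyph numbers handed out in first-encounter order starting at 1, and the hypothesized double lookup. The padding string here is a hypothetical one built to satisfy the description (31 filler characters, then the space at position 32), not necessarily the exact string quoted above, and all function names are invented for the example.

```python
def assign_glyphs(text):
    """Number characters 1, 2, 3, ... in first-encounter order (assumed)."""
    glyph_of = {}
    for ch in text:
        if ch not in glyph_of:
            glyph_of[ch] = len(glyph_of) + 1
    return glyph_of

def extract_observed(text):
    """Simulate copy-paste under the hypothesized double lookup."""
    glyph_of = assign_glyphs(text)
    table = {gid: ord(ch) for ch, gid in glyph_of.items()}
    out = []
    for ch in text:
        code = table[glyph_of[ch]]
        if code in table:          # result is fed back into the table once
            code = table[code]
        out.append(chr(code))
    return "".join(out)

# Hypothetical padding: 0x60..0x7E fill positions 1..31, then 0x20..0x5F
# land on positions 32..95, i.e. each at its own code point.
padding = "".join(map(chr, range(0x60, 0x7F))) + "".join(map(chr, range(0x20, 0x60)))

visible = "abcdefghijklmnopqrstuvwxyz ABCDEFG"
print(extract_observed(visible))                             # space comes out as 'E'
print(extract_observed(padding + visible)[len(padding):])    # text survives intact
```

In the unpadded case the 32nd distinct character ('E') lands on position 0x20, so every space is extracted as 'E'. With the padding in front, the second lookup is an identity for 0x20 through 0x5F, and the filler characters sit above the end of the 95-entry table, so nothing is overwritten unless a 96th character joins the same table.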
But this is All So Wrong.
Any ideas?
EDIT: I meant: "the space gets pushed out to decimal position 32." Hex is 0x20.