[EMF to SVG conversion]: Chinese characters

Bug #1336753 reported by Kanstantsin
This bug affects 1 person
Affects: Inkscape
Status: Fix Released
Importance: Medium
Assigned to: David Mathog

Bug Description

Hi,

For better understanding, some workflow details are below:
1. Export Excel range or chart to emf image
2. Convert emf file from step 1 to svg using Inkscape through command line. Command line pattern is -l "<source>.emf" "<target>.svg"
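The conversion step above can be scripted. This is only a sketch: it assumes the 0.48-era Inkscape options `-f` (`--file`) and `-l` (`--export-plain-svg`), and the helper names `build_convert_command` and `convert` are hypothetical. The reporter's exact argument order may differ from what is shown here.

```python
import subprocess


def build_convert_command(source, target, inkscape="inkscape"):
    """Return the argument list for one EMF -> plain SVG conversion.

    Assumption: 0.48-era options, -f (--file) for the input file and
    -l (--export-plain-svg) for the output file.
    """
    return [inkscape, "-f", str(source), "-l", str(target)]


def convert(source, target, inkscape="inkscape"):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_convert_command(source, target, inkscape), check=True)


# Show the command that would be run, without running it.
print(build_convert_command("original_emf.emf", "svg_processed.svg"))
```

Keeping command construction separate from execution makes it possible to log or inspect the exact command line before a large batch is run.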

In most cases it works perfectly, without any issues. But whenever the table contains Chinese characters, their positions are incorrect in the resulting SVG file. Please compare original_emf.emf and svg_processed.svg from the attached archive. Could you please take a look? Please let me know if you need more details, examples, the original Excel files, etc.

Just a note: if the Excel font is set to Microsoft JhengHei UI, the .svg file is fine (check original_MicrosoftJhengHeiUI.emf and svg_MicrosoftJhengHeiUI.svg in the archive).

Inkscape version - Inkscape 0.48+devel r12992
OS - Windows 7 Enterprise SP1

Thanks in advance!

Best Regards,
Kanstantsin Krasouski

Tags: emf importing
Kanstantsin (kanstantsin-krasouski) wrote :
description: updated
jazzynico (jazzynico)
tags: added: importing
removed: svg
Steven Eastman (steven-eastman) wrote :

Hello,

I am a colleague of Kanstantsin. Would it be possible to have an initial review of this bug? It is urgent for a solution (large-scale deployment) at a top 10 banking client. If more details or clarifications are needed, please let us know and we will be happy to provide whatever is needed.

Thanks -- Steven

su_v (suv-lp) wrote :

@David Mathog - any chance you could take a closer look at this EMF-related report for trunk?

David Mathog (mathog) wrote :

This appears to be a font substitution issue. The text which uses MS Gothic moves around, but the text which uses JhengHeiUI does not. My Windows system does not have the JhengHeiUI font, so the Chinese characters in the second EMF only show up as squares. My Linux system uses "Microsoft YaHei UI" (is that the same font, perhaps with a Cantonese versus Mandarin name?) and I can see the characters, but do not know if they are positioned properly. I can see when Inkscape loads the two EMFs that the text is in different positions for MS Gothic and JhengHeiUI, but I do not know that either is where it is supposed to be.

To help me work on this please construct the following much simpler examples and post them:

1. EMF file with one line of horizontal Chinese text in the MS Gothic font, with a red rectangular outline fitted tightly around it (to show me where the text is supposed to be). Please put a "1" near the lexically first character. (I do not read this language, and do not know if the characters are R->L or L->R.) You can make this in PowerPoint. Select everything in the slide, right click, choose "Save as Picture...", and save to EMF.
2. Ditto, but with the JhengHeiUI font.
3. PNGs of these two images, so I know what they looked like on your machine. (In case my machine does not display them correctly.)

Please test some other Chinese fonts and see what happens. Is this an issue specific to MS Gothic (which might, for instance, list font parameters that do not actually correspond to what its Chinese Glyphs need) or is it a general problem, affecting many different fonts?

In the meantime, since you seem to be pressed for time, if JhengHeiUI works, is there some reason you cannot use that?

David Mathog (mathog) wrote :

After a bit more exploration it seems that even on my Linux system there is font substitution going on for Chinese characters. The font identified as "Microsoft YaHei UI" was actually something else. (I don't know what, there is just a little symbol next to the font name that shows it has been substituted in Inkscape when text is selected with the text tool). Similarly, MS Gothic was substituted. I made a tiny test file with a couple of characters pulled from your example converted to "Times New Roman", "Arial", "Gothic", and "YaHei". Saved as EMF and SVG. Read back in. Nothing moved around - but that is because the same font substitutions are going on at all steps, as this was all on one machine.

The issue you are going to run into is that whatever target platform must display the final SVG will need to have the right font(s) installed; otherwise the result will not be quite as you intended. In European languages, if one sticks with Arial and the other common fonts, that isn't a problem, as those are always present or map nearly 1:1. Presumably there is a similar "common" Chinese font for web display, and that is the font you need to use. Not MS Gothic.

In any case, one way around this is to use the "Text to Path" option on export. In the attached example I first made the Chinese4.emf file, then saved to Chinese4.svg. Then I saved it to Chinese4b.emf using the "text to path" option. Opened that back up again in Inkscape (everything still in the right place) and saved as Chinese4b.svg. On my XP system, I can see the Chinese text in the two 4b versions (because they are drawings, not real text), whereas the Chinese4.emf comes up with empty squares for all characters. Chinese4.svg looks OK in Firefox but I can see some slight offsets on the text, probably due to font substitutions. Chinese4b.svg looks exactly like it did in Inkscape on Linux, with no offsets. Of course it is no longer text, just a drawing of text, so people cannot select, copy, and paste anything from it into an editor.

Dzmitry Matveyev (dzmitry-matveyev) wrote :

Here are attached examples with emf, svg and png files for MS Gothic, Arial, Microsoft JhengHei UI and Microsoft YaHei UI fonts.

David Mathog (mathog) wrote :

This is an odd one. I had to install MS Gothic on my XP machine to see this example. Once I did it loaded in Inkscape as you had said (above the expected position). However, if I moved it down to where it should have been, then tightened up the red rectangle so that it just touched the edges of the characters, I could save it to SVG and it didn't move. Moreover the modified EMF read back in without shifting.

Windows XP Preview of both of these have the text lined up properly in the red box.

That should NOT be happening. The key values from the original EMF are:

rclbounds: (-3,-6)->(1153,1142)
characters drawn at coordinates (*,31)
transform on rectangle (0.0625,0,0,0.0625,0,0)
rectangle (before transform) : (456,408) -> (9800,408) -> (9800,1768) -> (456,1768) -> (start)
rectangle (after transform): (28.5,25.5) -> (612.5,25.5) -> (612.5,110.5) -> (28.5,110.5) -> (start)
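The transformed rectangle can be checked by applying the six-value matrix to each corner. The sketch below assumes the usual (a, b, c, d, e, f) affine convention, x' = a*x + c*y + e and y' = b*x + d*y + f; the function name `apply_affine` is hypothetical. With a scale of 0.0625, the corner (456,408) maps to (28.5,25.5).

```python
def apply_affine(matrix, point):
    """Apply an (a, b, c, d, e, f) affine matrix to an (x, y) point."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# Values quoted from the original EMF.
MATRIX = (0.0625, 0, 0, 0.0625, 0, 0)
RECT = [(456, 408), (9800, 408), (9800, 1768), (456, 1768)]

transformed = [apply_affine(MATRIX, p) for p in RECT]
print(transformed)
# [(28.5, 25.5), (612.5, 25.5), (612.5, 110.5), (28.5, 110.5)]
```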

The key values from the EMF I made are:

rclbounds: (0,0) -> (2314,282)
characters drawn at coordinates (*,197)
(no transform)
rectangle: (63,63) -> (1230,63) -> (1230,233) -> (63, 233) -> (start)

Coordinates for Y are positive DOWN, and that coordinate is the lower left corner of the character. Normally. You can see that the coordinates for the one I made place the lower left corner well down in the rectangle, where it is actually observed, and the ones from the original have it high up in the rectangle, where it was also observed.
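To put numbers on "high up" versus "well down", the text Y coordinate can be expressed as a fraction of the rectangle's height. This is a small sketch with a hypothetical helper, `relative_depth`; it assumes the text coordinates are in the same space as the (transformed) rectangle, as the comparison above implies.

```python
def relative_depth(text_y, rect_top, rect_bottom):
    """Fraction of the rectangle height at which text_y sits (0 = top, 1 = bottom)."""
    return (text_y - rect_top) / (rect_bottom - rect_top)


# Original EMF: text at y=31, rectangle spans y=25.5..110.5 after the transform.
original = relative_depth(31, 25.5, 110.5)
# EMF remade in Inkscape: text at y=197, rectangle spans y=63..233.
remade = relative_depth(197, 63, 233)

print(f"{original:.2f} {remade:.2f}")  # 0.06 0.79
```

In the original file the drawing point sits about 6% of the way down the box, which is consistent with a top-aligned reference point; in the remade file it sits about 79% of the way down, which is consistent with a baseline reference point.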

So what is going on here? Well, there is a record type EMR_SETTEXTALIGN which can change the alignment point of the text. No such record is present in your file, so the alignment defaults. But defaults to what? Microsoft doesn't actually say in the EMF specification. We can get a clue, though, from the Arial.emf example, which when loaded into my copy of Inkscape has the Latin characters "(1/)" in the proper position, but all of the Chinese characters are shifted up. The coordinates of those characters are within a few pixels of each other vertically (although all the Latin ones are at 28, and all the Chinese ones at 31, which is actually lower in the image).

In the EMF preview on XP, both MS Gothic EMF files, mine (which specifies SETTEXTALIGN) and yours (which does not), are drawn correctly.

My best guess is that when not specified Microsoft expects the client to figure out which SETTEXTALIGN mode to use from the Unicode value of the characters. Having only worked previously with Latin/European characters, this never came up before. I will see if doing that resolves the issue.

David Mathog (mathog) wrote :

Solved it. The SETTEXTALIGN was defaulting correctly, as it turns out, to UPPER LEFT. The problem was that within a single truetype font the units per em was 2048 for Latin characters and 256 for Chinese ones. There was an implicit assumption in the code that all TrueType characters were 2048, and that caused the offset.
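The effect of a hard-coded units-per-em value can be illustrated with a little arithmetic. The sketch below is not Inkscape's actual code: the function names and the ascent, font size, and top-Y numbers are hypothetical, chosen only to show how dividing by 2048 when the glyph metrics are really expressed in a 256-unit em underestimates the top-to-baseline distance by a factor of 8 and leaves the text too high.

```python
ASSUMED_UNITS_PER_EM = 2048  # the implicit assumption described above


def units_to_pixels(font_units, units_per_em, font_size_px):
    """Scale a metric given in font design units to pixels."""
    return font_units / units_per_em * font_size_px


def baseline_from_top(top_y, ascent_units, units_per_em, font_size_px):
    """Baseline Y for text whose reference point is its upper-left corner."""
    return top_y + units_to_pixels(ascent_units, units_per_em, font_size_px)


# Hypothetical numbers: ascent of 220 units in a 256-unit em, 16 px text.
top_y, ascent_units, font_size_px = 31.0, 220, 16.0

correct = baseline_from_top(top_y, ascent_units, 256, font_size_px)
wrong = baseline_from_top(top_y, ascent_units, ASSUMED_UNITS_PER_EM, font_size_px)

print(correct, wrong, correct - wrong)  # 44.75 32.71875 12.03125
```

Because Y is positive downward, the smaller (wrong) baseline value places the glyphs above where they belong, matching the upward shift seen for the Chinese characters.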

The attached patch has been submitted to trunk and is revision 13467.

Changed in inkscape:
status: New → Fix Committed
assignee: nobody → David Mathog (mathog)
su_v (suv-lp)
Changed in inkscape:
importance: Undecided → Medium
milestone: none → 0.91
Kanstantsin (kanstantsin-krasouski) wrote :

Hi,

The issue is fixed, but it seems to me that a new one has been introduced: data are partially lost after conversion. I can't say for sure that it is caused by this particular fix. Could you please advise whether I should file a new bug or reopen this one?

Source EMF and target SVG files for this new issue are attached.

Kanstantsin (kanstantsin-krasouski) wrote :

Inkscape revision we've checked on is 13502.

David Mathog (mathog) wrote :

Tracked it down, it is due to bug #1348417. If your test file is sent through a "debug" version of inkscape that dumps the constructed SVG to a file, and that file is then read in, then it is displayed properly. Bug #1348417 is scaling SOME parts of an SVG (text) but not others (clipping rectangles), and the mess you see is the result. Look at that bug for further information. The problem came in at revision 13468 which implemented EMF input clipping correctly to resolve bug #1340683, but which unfortunately revealed bug #1348417, which is in an unrelated part of the code.

David Mathog (mathog) wrote :

The test files seem to open properly in revision 13544 of Inkscape.

Changed in inkscape:
status: Fix Committed → Fix Released