Mapping errors from some Unicode character codes to groff entities

Bug #1008115 reported by Alkis Georgopoulos
This bug affects 1 person
Affects: groff (Ubuntu)
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

While running `man help2man` I noticed that the Greek small letter phi (u03C6) was displayed as the Greek phi symbol (u03D5) instead, for example:
"που εμϕανίζονται στο αρχείο για να συμπεριληϕθούν"
instead of this:
"που εμφανίζονται στο αρχείο για να συμπεριληφθούν".

`zcat /usr/share/man/el/man1/help2man.1.gz` verifies that the problem is in `man` and not in the help2man manpage.

Using man-db 2.6.1-2.
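
The `zcat` check above can also be done programmatically. The following is a minimal sketch, not part of the original report: it counts the two phi code points in a piece of text, and the helper names `count_phi` and `count_phi_in_gz` are made up for illustration. It assumes the page source is UTF-8 encoded.

```python
import gzip
import unicodedata

# The two code points discussed in this report.
PHI_VARIANTS = ("\u03c6", "\u03d5")  # small letter phi, phi symbol


def count_phi(text):
    """Return how often each phi code point occurs in text."""
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch)}": text.count(ch)
        for ch in PHI_VARIANTS
    }


def count_phi_in_gz(path, encoding="utf-8"):
    """Same count for a gzip-compressed page source (UTF-8 is an assumption)."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return count_phi(handle.read())


if __name__ == "__main__":
    displayed = "εμ" + "\u03d5" + "ανίζονται"  # as shown by man
    in_source = "εμ" + "\u03c6" + "ανίζονται"  # as written in the page
    print(count_phi(displayed))
    print(count_phi(in_source))
```

Running `count_phi_in_gz` on the compressed page and `count_phi` on the text that `man` displays should show whether the substitution happens before or after the page source is read.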

Colin Watson (cjwatson) wrote :

groff_char(7) seems to document some oddities around phi:

  "These glyphs are intended for technical use, not for real Greek; normally, the uppercase letters have upright shape, and the lowercase ones are slanted. There is a problem with the mapping of letter phi to Unicode. Prior to Unicode version 3.0, the difference between U+03C6, GREEK SMALL LETTER PHI, and U+03D5, GREEK PHI SYMBOL, was not clearly described; only the glyph shapes in the Unicode book could be used as a reference. Starting with Unicode 3.0, the reference glyphs have been exchanged and described verbally also: In mathematical context, U+03D5 is the stroked variant and U+03C6 the curly glyph. Unfortunately, most font vendors didn't update their fonts to this (incompatible) change in Unicode. At the time of this writing (January 2006), it is not clear yet whether the Adobe Glyph Names `phi' and `phi1' also change its meaning if used for mathematics, thus compatibility problems are likely to happen – being conservative, groff currently assumes that `phi' in a PostScript symbol font is the stroked version.

  In groff, symbol `\[*f]' always denotes the stroked version of phi, and `\[+f]' the curly variant."

It might be useful if somebody familiar with Greek could take this up directly with groff upstream, since it seems that the needs of mathematical Greek may conflict with those of real Greek? It certainly seems odd that the definitions in glyphuni.cpp and uniglyph.cpp aren't mirror images of each other.
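
To illustrate what "mirror images" means here, the sketch below checks whether a glyph-to-Unicode table and a Unicode-to-glyph table are inverses of each other. The table entries are toy values chosen to match the symptom described in this report; they are NOT copied from glyphuni.cpp or uniglyph.cpp, and the function name `mirror_mismatches` is hypothetical.

```python
# Toy tables: hypothetical entries, NOT copied from groff's source files.
# Glyph names follow the groff_char(7) quote: *f is stroked phi, +f is curly phi.
GLYPH_TO_UNICODE = {"*f": "03D5", "+f": "03C6", "*a": "03B1"}
UNICODE_TO_GLYPH = {"03C6": "*f", "03D5": "+f", "03B1": "*a"}


def mirror_mismatches(glyph_to_code, code_to_glyph):
    """Return (code, glyph, round_trip_code) for every code point that does
    not map back to itself when sent through both tables."""
    mismatches = []
    for code, glyph in sorted(code_to_glyph.items()):
        back = glyph_to_code.get(glyph)
        if back != code:
            mismatches.append((code, glyph, back))
    return mismatches


if __name__ == "__main__":
    for code, glyph, back in mirror_mismatches(GLYPH_TO_UNICODE, UNICODE_TO_GLYPH):
        print(f"u{code} -> \\[{glyph}] -> u{back}")
```

With these toy entries, only the two phi code points fail the round trip, each one coming back as the other.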

affects: man-db (Ubuntu) → groff (Ubuntu)
Changed in groff (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Alkis Georgopoulos (alkisg) wrote :

> "being conservative, groff currently assumes that `phi' in a PostScript symbol font is the stroked version."

Ouch, yeah if there are printers out there with embedded fonts that follow Unicode < 3.0, that's a good point. But on the other hand it breaks Greek man pages.
Since that was written in 2006 though, they may now decide to switch the default to Unicode >= 3.0, or at least to introduce some option to select the targeted Unicode version.

Thank you Colin, I'll take it up directly with groff upstream (after a couple of weeks) and update this bug report accordingly.

Alkis Georgopoulos (alkisg) wrote :

I sent an email to the bug-groff mailing list: http://lists.gnu.org/archive/html/bug-groff/2012-06/msg00002.html

Werner Lemberg (wl) wrote :

The problem was not related to PS output; it was a mapping bug from Unicode character codes to groff entities.

This is fixed now in the CVS. Thanks for the report.

Changed in groff (Ubuntu):
status: Triaged → Fix Committed
Brian Murray (brian-murray) wrote :

Although this may be fixed in the upstream project's revision control system, it is not actually Fix Committed for the Ubuntu package of that software. Consequently, I am setting the bug task back to Triaged.

Changed in groff (Ubuntu):
status: Fix Committed → Triaged
Alkis Georgopoulos (alkisg) wrote :

The problem appears in other characters too, e.g. in all the accented lowercase letters (άέήίόύώ) and possibly in more:

Existing wrong encoding:
Επιπλέον υλικό μπορεί να συμπεριληϕθεί στο παραγόμενο αποτέλεσμα
Hexdump:
00000000 ce 95 cf 80 ce b9 cf 80 ce bb e1 bd b3 ce bf ce
00000010 bd 20 cf 85 ce bb ce b9 ce ba e1 bd b9 20 ce bc
00000020 cf 80 ce bf cf 81 ce b5 e1 bd b7 20 ce bd ce b1
00000030 20 cf 83 cf 85 ce bc cf 80 ce b5 cf 81 ce b9 ce
00000040 bb ce b7 cf 95 ce b8 ce b5 e1 bd b7 20 cf 83 cf
00000050 84 ce bf 20 cf 80 ce b1 cf 81 ce b1 ce b3 e1 bd
00000060 b9 ce bc ce b5 ce bd ce bf 20 ce b1 cf 80 ce bf
00000070 cf 84 e1 bd b3 ce bb ce b5 cf 83 ce bc ce b1 0a

Correct encoding:
Επιπλέον υλικό μπορεί να συμπεριληφθεί στο παραγόμενο αποτέλεσμα
Hexdump:
00000000 ce 95 cf 80 ce b9 cf 80 ce bb ce ad ce bf ce bd
00000010 20 cf 85 ce bb ce b9 ce ba cf 8c 20 ce bc cf 80
00000020 ce bf cf 81 ce b5 ce af 20 ce bd ce b1 20 cf 83
00000030 cf 85 ce bc cf 80 ce b5 cf 81 ce b9 ce bb ce b7
00000040 cf 86 ce b8 ce b5 ce af 20 cf 83 cf 84 ce bf 20
00000050 cf 80 ce b1 cf 81 ce b1 ce b3 cf 8c ce bc ce b5
00000060 ce bd ce bf 20 ce b1 cf 80 ce bf cf 84 ce ad ce
00000070 bb ce b5 cf 83 ce bc ce b1 0a

I'll try to make a list of all the problematic characters in a couple of days.
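
The two hexdumps above can be compared code point by code point rather than byte by byte. The sketch below is an illustration only: it decodes the bytes of the first word from each dump and names the characters that differ. The helper names are made up, and the position-by-position comparison works only because each substitution here replaces one code point with exactly one other.

```python
import unicodedata

# Bytes of the first word of each dump, copied from the hexdumps above.
WRONG_HEX = "ce 95 cf 80 ce b9 cf 80 ce bb e1 bd b3 ce bf ce bd"
RIGHT_HEX = "ce 95 cf 80 ce b9 cf 80 ce bb ce ad ce bf ce bd"


def decode_hex(hex_bytes):
    """Turn a space-separated hex string into text, assuming UTF-8."""
    return bytes.fromhex(hex_bytes).decode("utf-8")


def differing_code_points(wrong, right):
    """Pair characters by position and keep the pairs that differ."""
    return [
        (f"U+{ord(w):04X} {unicodedata.name(w)}",
         f"U+{ord(r):04X} {unicodedata.name(r)}")
        for w, r in zip(wrong, right)
        if w != r
    ]


if __name__ == "__main__":
    pairs = differing_code_points(decode_hex(WRONG_HEX), decode_hex(RIGHT_HEX))
    for got, want in pairs:
        print(f"got {got}, expected {want}")
```

The same two functions can be applied to the full dumps to list every substituted character in the sentence.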

Alkis Georgopoulos (alkisg) wrote :

The affected characters from the Greek Unicode ranges are:

ʹ (u0374) gets transformed to ʹ (u02B9)
; (u037E) gets stripped
΅ (u0385) gets transformed to ΅ (u1FEE)
Ά (u0386) gets transformed to Ά (u1FBB)
· (u0387) gets transformed to · (u00B7)
Έ (u0388) gets transformed to Έ (u1FC9)
Ή (u0389) gets transformed to Ή (u1FCB)
Ί (u038A) gets transformed to Ί (u1FDB)
Ό (u038C) gets transformed to Ό (u1FF9)
Ύ (u038E) gets transformed to Ύ (u1FEB)
Ώ (u038F) gets transformed to Ώ (u1FFB)
ΐ (u0390) gets transformed to ΐ (u1FD3)
ά (u03AC) gets transformed to ά (u1F71)
έ (u03AD) gets transformed to έ (u1F73)
ή (u03AE) gets transformed to ή (u1F75)
ί (u03AF) gets transformed to ί (u1F77)
ΰ (u03B0) gets transformed to ΰ (u1FE3)
φ (u03C6) gets transformed to ϕ (u03D5)
ό (u03CC) gets transformed to ό (u1F79)
ύ (u03CD) gets transformed to ύ (u1F7B)
ώ (u03CE) gets transformed to ώ (u1F7D)
ϕ (u03D5) gets transformed to φ (u03C6)
ι (u1FBE) gets transformed to ι (u03B9)
` (u1FEF) gets transformed to ` (u0060)
´ (u1FFD) gets transformed to ´ (u00B4)
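
As a way to see the pattern in the list above, the sketch below compares each reported pair against Unicode canonical normalization (NFC) using Python's `unicodedata` module. It only describes the shape of the reported mappings and does not claim to explain how groff produced them. The pairs are taken from the list; u037E is left out because it is stripped rather than mapped, and the function name `classify` is made up for illustration.

```python
import unicodedata
from collections import Counter

# (source, result) code points from the list above; U+037E is omitted
# because it is reported as stripped rather than transformed.
REPORTED = [
    (0x0374, 0x02B9), (0x0385, 0x1FEE), (0x0386, 0x1FBB), (0x0387, 0x00B7),
    (0x0388, 0x1FC9), (0x0389, 0x1FCB), (0x038A, 0x1FDB), (0x038C, 0x1FF9),
    (0x038E, 0x1FEB), (0x038F, 0x1FFB), (0x0390, 0x1FD3), (0x03AC, 0x1F71),
    (0x03AD, 0x1F73), (0x03AE, 0x1F75), (0x03AF, 0x1F77), (0x03B0, 0x1FE3),
    (0x03C6, 0x03D5), (0x03CC, 0x1F79), (0x03CD, 0x1F7B), (0x03CE, 0x1F7D),
    (0x03D5, 0x03C6), (0x1FBE, 0x03B9), (0x1FEF, 0x0060), (0x1FFD, 0x00B4),
]


def classify(source, result):
    """Describe how a reported pair relates to NFC normalization."""
    src, res = chr(source), chr(result)
    if unicodedata.normalize("NFC", src) == res:
        return "result is NFC of source"
    if unicodedata.normalize("NFC", res) == src:
        return "source is NFC of result"
    return "not canonically equivalent"


if __name__ == "__main__":
    tally = Counter(classify(s, r) for s, r in REPORTED)
    for label, count in tally.items():
        print(f"{count:2d}  {label}")
```

Under this classification the accented letters are replaced by characters that normalize back to the original, while the two phi entries are the only pairs with no canonical equivalence between them.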

summary: - man displays φ (u03C6) as ϕ (u03D5)
+ Mapping errors from some Unicode character codes to groff entities