"man man > man.txt" produces invalid characters
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
bsdmainutils (Debian) | Fix Released | Unknown | |
bsdmainutils (Ubuntu) | Fix Released | High | Unassigned |
man-db (Ubuntu) | Won't Fix | Low | Unassigned |
Bug Description
Binary package hint: man-db
The file man.txt, produced by the command "man man > man.txt" (other man pages are affected, too), displays invalid characters in several text editors (gedit, nano, abiword). These invalid characters include the continuation hyphen, and examining the file in a hex editor shows that they are Unicode.
This does not happen when using man's "--encoding" option (tested with UTF-8 and ISO-8859-1).
gedit auto-detects the file's character set as ISO-8859-15, but refuses to open the file when the UTF-8 character set is explicitly selected in the File Open dialog.
In gnome-terminal, the man page is displayed correctly (with no "--encoding" option, or converted to UTF-8).
Ubuntu 8.10
man-db 2.5.2-2
$ locale
LANG=en_US.UTF-8
LC_CTYPE=
LC_NUMERIC=
LC_TIME=
LC_COLLATE=
LC_MONETARY=
LC_MESSAGES=
LC_PAPER=
LC_NAME=
LC_ADDRESS=
LC_TELEPHONE=
LC_MEASUREMENT=
LC_IDENTIFICATI
LC_ALL=
Changed in bsdmainutils: status: Unknown → New
Changed in bsdmainutils (Debian): status: New → Fix Committed
Changed in bsdmainutils (Debian): status: Fix Committed → Fix Released
Thanks for your report. This is primarily a bug in the col program (in the bsdmainutils package), which is used by man to filter some special characters out of groff output when writing to a file. Unfortunately col does not deal correctly with UTF-8, and the result is a file containing invalid UTF-8, which editors will quite reasonably refuse to treat as UTF-8, let alone detect automatically as UTF-8 (although some editors provide a way to force the issue). This is another symptom of the same problem reported in Debian as http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=319952.
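For anyone who wants to confirm that a file really does contain invalid UTF-8, here is a minimal Python sketch. The helper name first_invalid_utf8_offset and the sample bytes are made up for illustration; they are not part of man, col, or any editor.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset where UTF-8 decoding first fails, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte of the undecodable sequence.
        return exc.start
    return None


# Hypothetical sample: "con" followed by a truncated hyphen and a complete one.
broken = b"con\xe2\x80\xe2\x80\x90\n"
# The same text with a single, complete U+2010 hyphen.
intact = "con\u2010\n".encode("utf-8")

broken_offset = first_invalid_utf8_offset(broken)
intact_offset = first_invalid_utf8_offset(intact)
print(broken_offset, intact_offset)
```

To check a real file, pass it the result of open("man.txt", "rb").read().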
In this case, groff outputs the UTF-8-encoded sequence of Unicode codepoints U+2010 U+0008 U+2010 to represent an overstruck (i.e. bold) continuation hyphen. col mangles that into the byte sequence E2 80 E2 80 90, constructed by removing the last byte from the UTF-8 representation of U+2010 and then appending the full representation of that same character. Correct behaviour would be for the U+0008 (backspace) character to backspace over the whole first character, not just part of it.
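The difference between the two behaviours can be reproduced with a small Python sketch. This is only a model of the overstrike handling described above, not col's actual code; the function name overstrike_filter is invented for the example. It keeps the last unit written to each column and treats backspace as "move left one column", once with bytes as the unit and once with whole characters.

```python
HYPHEN = "\u2010"  # U+2010 HYPHEN, encoded in UTF-8 as E2 80 90

# What groff emits for a bold continuation hyphen: hyphen, backspace, hyphen.
groff_output = (HYPHEN + "\b" + HYPHEN).encode("utf-8")


def overstrike_filter(units, backspace):
    """Keep the last unit written to each column; backspace moves left one column."""
    line = []
    col = 0
    for unit in units:
        if unit == backspace:
            col = max(col - 1, 0)
        elif col < len(line):
            line[col] = unit  # overwrite whatever was in this column
            col += 1
        else:
            line.append(unit)
            col += 1
    return line


# Byte-wise model: each byte occupies its own column, so backspace only
# steps back over the last byte of the first hyphen.
bytewise = bytes(overstrike_filter(groff_output, 0x08))

# Character-wise model: decode first, so backspace steps back over the
# whole first hyphen.
charwise = "".join(overstrike_filter(groff_output.decode("utf-8"), "\b")).encode("utf-8")

print(bytewise.hex(" "))  # e2 80 e2 80 90
print(charwise.hex(" "))  # e2 80 90
```

The byte-wise result matches the mangled sequence above and cannot be decoded as UTF-8, while the character-wise result is a single valid U+2010.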
I'm leaving a bug task open on man-db at a lower importance because I do think man-db bears some responsibility for the tools it uses, even if they're clearly buggy. Given the historical problems with col, I have been wondering for a while if I shouldn't produce a miniature implementation of it and embed it into man. Normally duplication is bad, and it makes me feel uncomfortable, but in this case col's implementation is pretty stable and unlikely to need to vary significantly among systems.