Some Japanese manpages are not displayed correctly

Bug #301312 reported by Mitsuya Shibata
This bug affects 1 person
Affects: Ubuntu Manpage Repository
Status: Expired
Importance: Low
Assigned to: Unassigned

Bug Description

Because of several interacting problems, some Japanese manpages are not displayed correctly.
 e.g. http://manpages.ubuntu.com/manpages/intrepid/ja/man1/ls.html
 * The title contains a wrong character, "‹".
 * "引た瑤砲弔い董" should be displayed as "引き数について、".
 * The "NAME" heading should be "名前" in Japanese, but it is shown as "前".
 * and so on...

These problems have the following causes:
1. Almost all Japanese manpages in /usr/share/man/ja/ are encoded in EUC-JP, not UTF-8.
2. man's behaviour depends on the locale. The following two commands produce different output:
 * LANG=en_GB.utf8 w3mman -l usr/share/man/ja/man1/ls.1.gz
 * LANG=ja_JP.utf8 w3mman -l usr/share/man/ja/man1/ls.1.gz
3. The col command, which seems to be called by w3mman, does not support UTF-8.
 ref. http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=319952
4. CGI::escapeHTML defaults to the ISO-8859-1 charset.
 To handle multibyte characters, the charset has to be set explicitly.
 ref. http://perldoc.perl.org/CGI.html#AUTOESCAPING-HTML
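
Reasons 1 and 4 both come down to bytes being interpreted in a charset other than the one they were written in. A minimal Python sketch of that effect follows. It is not taken from the repository scripts; `decode_as` and the sample heading are hypothetical and chosen only for illustration.

```python
def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset, replacing undecodable bytes."""
    return raw.decode(charset, errors="replace")


title = "名前"                 # the Japanese heading for NAME
raw = title.encode("euc_jp")   # the page as stored on disk (EUC-JP)

right = decode_as(raw, "euc_jp")            # matching charset: text survives
wrong_utf8 = decode_as(raw, "utf-8")        # treated as UTF-8: damaged
wrong_latin1 = decode_as(raw, "iso-8859-1")  # treated as ISO-8859-1: damaged

print(right == title, wrong_utf8 == title, wrong_latin1 == title)
```

Only the decode that uses the original charset gives back the heading. The ISO-8859-1 decode never fails, it simply turns every byte into one unrelated character, which is why this kind of damage tends to go unnoticed until someone reads the page.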

To resolve these problems, I have created patches for two scripts in lp:ubuntu-manpage-repository.
In main/bin/fetch-man-pages.sh:
 * Use the man command instead of w3mman, and set the device option "-Tutf8".
 * When converting manpages for "ja", set the locale option "-L ja_JP.utf8".
The latter is very much a one-off solution, but I couldn't find a smarter way.

In main/bin/w3mman-to-html.pl:
 * Set the charset for CGI::escapeHTML to "UTF-8".
 * Handle backspaces (\x08) correctly for multibyte UTF-8 characters.
  For example, "NAME" appears as "N\x08NA\x08AM\x08ME\x08E" in a formatted manpage.
  The script now deletes each \x08 together with the one character before it (not the one byte before it).
  The escaping of pipes and left/right quotes is also modified.
 * Modify the h3 tagging to handle non-ASCII section titles (.SH).
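
The backspace handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the attached patch: `strip_overstrike`, `is_bold_heading` and `to_html` are hypothetical names, and the heading test (a line that starts in column 0 and contains overstrikes) is an assumed heuristic rather than what w3mman-to-html.pl does.

```python
import html
import re

# One character followed by a backspace. The pattern is applied to decoded
# text (str), so "." consumes a whole character, not a single byte.
_OVERSTRIKE = re.compile(r".\x08")


def strip_overstrike(line: str) -> str:
    """Remove each backspace together with the character before it."""
    return _OVERSTRIKE.sub("", line)


def is_bold_heading(line: str) -> bool:
    """Assumed heuristic: headings start in column 0 and are overstruck."""
    if not line or line[0].isspace():
        return False
    return "\x08" in line


def to_html(line: str) -> str:
    """Escape one formatted line and wrap assumed headings in <h3>."""
    text = html.escape(strip_overstrike(line), quote=False)
    return f"<h3>{text}</h3>" if is_bold_heading(line) else text


print(to_html("N\x08NA\x08AM\x08ME\x08E"))
print(to_html("名\x08名前\x08前"))
```

Running the same substitution on undecoded UTF-8 bytes would remove only the last byte of a multibyte character and leave an invalid sequence behind, which is the byte-versus-character distinction the patch description points at.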

These patches are not well tested.
Please check them, especially in other locales.

This problem was reported by Nazo.
 https://bugs.launchpad.net/ubuntu-jp-improvement/+bug/300724

Mitsuya Shibata (cosmos-door) wrote :
Mitsuya Shibata (cosmos-door) wrote :
description: updated
Dustin Kirkland  (kirkland) wrote :

Hi there. Thanks for the patches. I'll apply them and regenerate a test repository. Since I don't speak Japanese, I'll need you to test them for me :-)

:-Dustin

Mitsuya Shibata (cosmos-door) wrote :

Sure. I will be glad to test the newly generated man pages.
Once you have regenerated them, feel free to let me know.

Changed in ubuntu-manpage-repository:
importance: Undecided → Low
status: New → Triaged
Colin Watson (cjwatson) wrote :

I don't think these patches should be applied as they stand, although I wouldn't be surprised to find that output improvements are indeed needed in various places. Manual page encoding is a complex subject and it is *not* usually amenable to quick fixes and special-case hacks, at least not if you want to preserve your sanity. :-) man-db has a lot of intelligence in this area nowadays, and I would strongly advise making use of it wherever possible rather than attempting to override it.

In more recent versions of Ubuntu you'll find that Japanese manual pages have been gradually moving over to UTF-8, since dh_installman now automatically recodes to UTF-8 at package build time (with some assistance from man-db to figure out what the original encoding was). Thus, it's absolutely necessary for manpages.ubuntu.com to handle both cases. Furthermore, this class of problem is not specific to Japanese and it is incorrect to special-case only Japanese; the necessary special-casing should already be present in man-db if you call it correctly.
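
To illustrate why pages in either encoding can be told apart at all, here is a rough Python sketch that tries UTF-8 first and falls back to a legacy charset. It is only an illustration: the advice in this comment is to let man-db make that decision, and `decode_manpage` and its `legacy_charset` parameter are hypothetical names, with EUC-JP as an assumed default.

```python
def decode_manpage(raw: bytes, legacy_charset: str = "euc_jp") -> str:
    """Decode page source as UTF-8 if it is valid UTF-8, else as legacy_charset."""
    try:
        # Strict decoding: text in a legacy multibyte charset is very
        # unlikely to also be well-formed UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy_charset)


sample = "引き数について、"
print(decode_manpage(sample.encode("utf-8")) == sample)
print(decode_manpage(sample.encode("euc_jp")) == sample)
```

A hard-coded fallback like this is exactly the kind of per-language special case the comment warns against; a real fallback would have to depend on the page's language directory, which man-db already knows about.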

Please don't use '-Tutf8' or '-L ja_JP.utf8'; you'll get wrong results in many situations. Use '-E UTF-8' instead to force man to generate UTF-8 output. (It's possible you'll need to be running in *some* UTF-8 locale for CJK pages, but it shouldn't make too much difference which one.)

col is actually called by man directly, not by w3mman. To avoid this, the best solution is probably to set MAN_KEEP_FORMATTING=1 in the environment. Some adjustments to the filter script may be needed to cope with the resulting differences in output.
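
The two recommendations above ('-E UTF-8' and MAN_KEEP_FORMATTING=1) could be combined roughly as in the following Python sketch. It only builds the command line and environment and does not run man. `build_man_invocation` is a hypothetical helper, and the `C.UTF-8` default is an assumption standing in for "some UTF-8 locale"; any installed UTF-8 locale could be substituted.

```python
import os


def build_man_invocation(page_path, base_env=None, utf8_locale="C.UTF-8"):
    """Return (argv, env) for formatting one local page as UTF-8 text."""
    env = dict(os.environ if base_env is None else base_env)
    # Keep overstrike formatting even though output is not a terminal.
    env["MAN_KEEP_FORMATTING"] = "1"
    # Assumption: any UTF-8 locale will do; which one is a local choice.
    env["LC_ALL"] = utf8_locale
    # -E forces the output encoding; -l formats a local file by path.
    argv = ["man", "-E", "UTF-8", "-l", page_path]
    return argv, env


argv, env = build_man_invocation("usr/share/man/ja/man1/ls.1.gz", base_env={})
print(argv)
# To actually run it, something like:
#   subprocess.run(argv, env=env, capture_output=True, check=True).stdout
```

Because the formatting is kept, the output still contains backspace sequences for bold and underline, so the filter script has to handle them itself.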

Colin Watson (cjwatson) wrote :

Regarding MAN_KEEP_FORMATTING=1, see also bug 353900.

David Britton (dpb) wrote :

I realize this bug is old, but the links no longer even work, nor can I find the /ja/ manpages referenced. Will mark this as incomplete.

Changed in ubuntu-manpage-repository:
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for Ubuntu Manpage Repository because there has been no activity for 60 days.]

Changed in ubuntu-manpage-repository:
status: Incomplete → Expired