automatic hyphenation uses Unicode HYPHEN character

Bug #47011 reported by Mark Rose
Affects: groff (Ubuntu)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

Open up Konsole. Open a manual page that contains quotes, e.g. man find. The straight quote ' and backtick ` are rendered as other characters. Setting the LANG environment variable to C fixes this, e.g. LANG=C man find. My default LANG is en_CA.UTF-8.
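To illustrate why changing LANG alters the output: under POSIX, the locale that selects the character set is taken from LC_ALL, then LC_CTYPE, then LANG, with unset or empty variables skipped. The sketch below models only that lookup order. The helper names effective_ctype and wants_utf8 are hypothetical, and whether man decides on UTF-8 output in exactly this way is an assumption, not something this report confirms.

```python
def effective_ctype(env):
    """Return the locale name that selects the character set (POSIX lookup order)."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # unset and empty values both fall through to the next variable
            return value
    return "C"  # assumed default when nothing is set


def wants_utf8(env):
    """Hypothetical check: does the effective locale name carry a UTF-8 codeset?"""
    name = effective_ctype(env)
    # Locale names look like language_TERRITORY.codeset@modifier
    codeset = name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


print(wants_utf8({"LANG": "en_CA.UTF-8"}))
print(wants_utf8({"LANG": "C"}))
```

With LANG=en_CA.UTF-8 the check selects UTF-8 and with LANG=C it does not, which matches the observation that LANG=C man find shows plain ASCII quotes.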

This also applies to a regular console outside of X, not just in Konsole, so it may be something different than Bug #44231.

--

It appears the issue is that many man pages are using old glyphs as defined in ASCII. These need to be updated.

Graeme Hewson (ghewson) wrote :

I see the same problem. For instance, line 299 (in an 80-character wide Konsole session) and the following lines for "man man" display as:

              some parts of it may only be displayed properly when using GNU
              nroffâs latin1(7) device.

              Description Octal latin1 ascii
              ---------------------------------------------
              continuation hyphen 255 ­ -
              bullet (middle dot) 267 · o
              acute accent 264 ´ â
              multiplication sign 327 Ã x

This is on a fresh install from CD of 6.06 (the release version). In my environment I have:

LANG=en_GB.UTF-8
LANGUAGE=en_GB:en

If I unset LANG, the problem doesn't appear (I haven't experimented with LANGUAGE).

Yet on another system, the problem doesn't appear. This system had an alpha version of Dapper installed, and has been continually updated since, so it should be the same as the release version. The two systems have the same versions of /usr/share/man/man1/man.1.gz and /etc/manpath.config (checked with ls -l and cksum). What's different is the Encoding setting of Konsole. On the system without the problem the setting is Default. On the system with the problem, the setting is Western European (iso 8859-1). Changing the setting to Default on the latter causes the problem to disappear.
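The "â" and "Ã" in the output above are consistent with UTF-8 bytes being shown by a terminal that decodes them as ISO 8859-1. A minimal sketch of that mismatch, assuming this is the mechanism at work (the function name misdecode and the sample set are made up for illustration):

```python
def misdecode(text):
    """Encode text as UTF-8, then decode the same bytes as ISO 8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


samples = {
    "right single quotation mark": "\u2019",
    "middle dot": "\u00b7",
    "acute accent": "\u00b4",
    "multiplication sign": "\u00d7",
}

for name, char in samples.items():
    # repr() makes the invisible control characters visible as \x.. escapes
    print(f"{name}: U+{ord(char):04X} -> {misdecode(char)!r}")
```

Under this assumption the right single quotation mark turns into "â" followed by two invisible control characters, and the multiplication sign into "Ã" followed by one.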

Mark Rose (markrose) wrote :

Changing my Konsole Encoding setting to Default fixed Konsole over here, too (mine was also set at iso8859-1).

That doesn't explain why the Linux consoles also experience the problem, however.

Graeme Hewson (ghewson) wrote :

OK, I think I've worked out what's going on. The iso-8859-1 setting was probably left over from my previous Breezy Badger installation (I kept my home drive for the "fresh" install), so that's a red herring.

I created a new account for testing, and Konsole works fine. LANG is en_GB.UTF-8, Settings/Encoding is Default, and all punctuation marks that I can see in man pages display fine.

There is a problem with the console display, though, because LANG is also en_GB.UTF-8 here. The display does the "right" thing in rendering undisplayable characters as rectangles, but apart from that there are no "wrong" characters. If I unset LANG, all is well. Clearly LANG should not be set for consoles.

I also found another problem with my original Konsole settings, though. The default font in a new account is DejaVu Sans Mono, but I was using Bitstream Vera Sans Mono. In this font, word breaks are displayed as boxes, such as in gcc(1), where "remain-der" is split across two lines. If I change my font to DejaVu (actually I can't see any difference between the two fonts), the problem is resolved.

Mark Rose (markrose) wrote :

Hmm. I also did the upgrade from Breezy, and also had Bitstream instead of DejaVu selected.

Unsetting LANG isn't an elegant solution to the console man page problem. It seems that the manual pages are following bad practice and using old glyphs as defined in ASCII/ANSI X3.4/ISO 646. There's a big discussion on the topic at http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html .

Mark Rose (markrose)
description: updated
Graeme Hewson (ghewson) wrote :

Still, as far as I can see, all characters are readable in Konsole. However, ` and ' (U+0060 and U+0027) are displayed as left and right single quotation marks with my now default settings (corresponding, AFAICS, to the default Kubuntu settings), which I think means some fiddling is going on. Readable they might be, but I don't think the rendering is correct.

I think this "fiddling" is partly to blame for the problem with the console, where on mine the quotation marks are displayed as rectangles. Yes, I agree that unsetting LANG wouldn't be an elegant solution. I said, in haste, that LANG shouldn't be set for consoles, but I hadn't appreciated that consoles can support UTF8. I've now discovered unicode_start, consolechars, vt-is-UTF8 and friends.

vt-is-UTF8 tells me the console is in UTF8 mode by default. It can only display a subset of glyphs, though (ISTR the default is those of ISO 8859), and I suppose excluded from that default subset are the quotation marks. Also excluded, it seems, is the en dash, U+2013, which is used by man in UTF8 mode to split a word across lines. I don't think there are any other characters incorrectly displayed.

Kees Cook (kees) wrote :

Assigning this to man-db.

Marco Rodrigues (gothicx) wrote :

Do you still have the same problem on Feisty?

Changed in man-db:
assignee: nobody → gothicx
status: Confirmed → Incomplete
Mark Rose (markrose) wrote :

Appears to be solved :)

Mark Rose (markrose) wrote :

A clean install of Feisty no longer has this issue.

Changed in man-db:
status: Incomplete → Fix Released
Marco Rodrigues (gothicx) wrote :

That's nice :-) thanks!

Graeme Hewson (ghewson) wrote :

In practical terms, it's fixed for me. Some problems remain, however. In the following, LANG=en_GB.UTF-8.

With the Konsole font set to Bitstream Vera Sans Mono, word breaks are rendered as rectangles. See, for instance, "man gcc", where in an 80-character wide terminal "remainder" in the sentence under the synopsis is split across lines as "remain-der". If that's because the font doesn't have an en dash, I believe the dash should be rendered as a hyphen instead.
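The fallback suggested here can be sketched as follows. This is an illustration under stated assumptions, not how Konsole or groff actually behaves: render_with_fallback, FALLBACKS and the covered set are hypothetical names, and a font's coverage is modelled as a plain set of characters.

```python
# Hypothetical fallback table: typographic dashes degrade to ASCII hyphen-minus.
FALLBACKS = {
    "\u2010": "-",  # HYPHEN
    "\u2013": "-",  # EN DASH
}


def render_with_fallback(text, covered):
    """Replace characters missing from `covered` when a fallback is known."""
    out = []
    for ch in text:
        if ch in covered or ch not in FALLBACKS:
            out.append(ch)  # glyph is available, or no substitute is defined
        else:
            out.append(FALLBACKS[ch])
    return "".join(out)


# Model a font that has ASCII letters and hyphen-minus but no U+2010.
ascii_only = set("abcdefghijklmnopqrstuvwxyz-\n")
print(render_with_fallback("remain\u2010\nder", ascii_only))
```

A font whose coverage does include U+2010 leaves the text unchanged, so the substitution only applies where a rectangle would otherwise be drawn.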

Indeed, a text console displays word breaks as hyphens. However, it displays inverted commas as lower-case Greek mu and gamma. It also incorrectly renders some of the characters in the table under the description of the --ascii option in "man man".

Changed in man-db:
status: Fix Released → Confirmed
Colin Watson (cjwatson) wrote :

The rendering of hyphens as the Unicode HYPHEN character is a groff bug. I'd decided not to do this a while back, but evidently missed the case of automatic hyphenation.

groff (1.18.1.1-7) unstable; urgency=low

  * Too many fonts are missing the Unicode HYPHEN character, so I give up.
    Render "-" as HYPHEN-MINUS (ASCII 0x2D) by default. (Of course, manual
    pages using "-" when they should be using "\-" should still be fixed.)

 -- Colin Watson <email address hidden> Fri, 18 Mar 2005 17:57:51 +0000

Your problems at the Linux console are probably due to a couple of interwoven bugs in console-setup, which have been fixed in Gutsy. In the meantime, try running 'sudo setupcon' at the console to set it up properly.

The discussion about quote marks in this bug is really quite confused. If groff were using ASCII quotes, you couldn't possibly be seeing unreadable characters! All characters in ASCII are in every font you might reasonably choose to use. groff already follows Markus Kuhn's recommendations where it can (i.e. when using the UTF-8 device and when manual pages haven't been poorly written such that it's been explicitly instructed not to do so). As far as the rendering of ` and ' is concerned, note that groff is a typographical markup language and not something that just passes through whatever characters it receives; groff_char(7) documents that ` renders as a left single quotation mark and ' renders as a right single quotation mark, and that you can use \` and \(aq if you want the corresponding literal ASCII characters. Note that Markus Kuhn's web page even explicitly mentions troff as a program where ` can continue to be used as before.
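The input-to-output mapping described in this comment can be written down as a small table. The sketch below restates only what the comment says about the UTF-8 device. The dictionary UTF8_DEVICE and the function describe are illustrative names, not part of groff, and the table is not a complete description of groff_char(7).

```python
import unicodedata

# Input as written in the manual page source -> character emitted,
# as described in the comment above (assumed, not checked against groff).
UTF8_DEVICE = {
    "`": "\u2018",     # renders as a left single quotation mark
    "'": "\u2019",     # renders as a right single quotation mark
    "\\`": "\u0060",   # literal ASCII grave accent
    "\\(aq": "\u0027", # literal ASCII apostrophe
}


def describe(token):
    """Return a readable line for one input token."""
    out = UTF8_DEVICE[token]
    return f"{token} -> U+{ord(out):04X} {unicodedata.name(out)}"


for token in UTF8_DEVICE:
    print(describe(token))
```

Seen this way, a rectangle in place of a quote mark points at a font or console that lacks U+2018 and U+2019, not at ASCII quotes in the manual page.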

Changed in man-db:
assignee: gothicx → nobody
Graeme Hewson (ghewson) wrote :

I'm using Feisty. /usr/share/man/man1/man.1.gz uses ` and ', such as in:

Format the manual page referenced by
.RI ` alias ',

With "man man" the characters are displayed as highlighted Greek letters on my console. I confirm that after running setupcon, they're displayed as inverted commas (albeit highlighted).

David Wilson (mcs6502) wrote :

The bug is still present in Hardy (8.04.1). Single quotes, bullets and hyphens are all mis-displayed.

For example, from "man sshd_config":

     sshd(8) reads configuration data from /etc/ssh/sshd_config (or the file specified with -f on the command line).
     The file contains keyword-argument pairs, one per line. Lines starting with â#â and empty lines are interâ[m
     preted as comments. Arguments may optionally be enclosed in double quotes (") in order to represent arguments
     containing spaces.

and

           Â· Protocol 2
           Â· ChallengeResponseAuthentication no
           Â· X11Forwarding yes
           Â· PrintMotd no
           Â· AcceptEnv LANG LC_*
           Â· Subsystem sftp /usr/lib/openssh/sftp-server
           Â· UsePAM yes

This is with LANG=en_AU.UTF-8

As above, the workaround is to use the C locale.

Olivier Duclos (odc) wrote :