char-upcase, char-downcase misbehave on some characters

Bug #1906584 reported by Paul F. Dietz
Affects  Status  Importance  Assigned to  Milestone
SBCL     New     Undecided   Unassigned

Bug Description

(defparameter *c* '(#\LATIN_CAPITAL_LETTER_D_WITH_SMALL_LETTER_Z_WITH_CARON
                    #\LATIN_CAPITAL_LETTER_L_WITH_SMALL_LETTER_J
                    #\LATIN_CAPITAL_LETTER_N_WITH_SMALL_LETTER_J
                    #\LATIN_CAPITAL_LETTER_D_WITH_SMALL_LETTER_Z))
(mapcar #'upper-case-p *c*) ==> (nil nil nil nil)
(mapcar #'char-downcase *c*) ==> (#\LATIN_SMALL_LETTER_DZ_WITH_CARON #\LATIN_SMALL_LETTER_LJ
 #\LATIN_SMALL_LETTER_NJ #\LATIN_SMALL_LETTER_DZ)
(mapcar #'lower-case-p (mapcar #'char-downcase *c*)) ==> (t t t t)

char-downcase is supposed to return a different character than its argument only when that argument is an upper case character.
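
The invariant stated above can be checked mechanically. The following is a minimal sketch using only standard Common Lisp functions; the names violates-downcase-invariant-p and find-downcase-violations are hypothetical helpers introduced here for illustration and are not part of SBCL.

```lisp
;; Hypothetical helpers (not part of SBCL) for checking the invariant:
;; CHAR-DOWNCASE may return a different character only when its
;; argument satisfies UPPER-CASE-P.

(defun violates-downcase-invariant-p (char)
  "True when CHAR-DOWNCASE changes CHAR although CHAR is not UPPER-CASE-P."
  (and (not (upper-case-p char))
       (char/= char (char-downcase char))))

(defun find-downcase-violations (&optional (limit char-code-limit))
  "Collect every character with code below LIMIT that violates the invariant."
  (loop for code from 0 below limit
        for char = (code-char code)
        ;; CODE-CHAR may return NIL for codes with no character.
        when (and char (violates-downcase-invariant-p char))
          collect char))
```

On the SBCL build described in this report, (find-downcase-violations) would be expected to include the four characters in *c*; the exact result depends on the implementation and its Unicode data.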

Douglas Katzman (dougk) wrote :

It's debatable. We're trying to accord with Unicode, not with an obsolete spec that didn't fully anticipate characters that themselves are neither upper nor lower-case, but have a way to convert to upper or lower case.
To take one of your examples, https://www.compart.com/en/unicode/U+01F2 says that
(code-char #x1f2) has an upper-case character (#x1f1) and a lower-case character (#x1f3).
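
To see how the standard case functions treat such a character, a small inspection helper can be used. This is a sketch only: describe-case is a hypothetical name, it calls nothing beyond standard Common Lisp functions, and the values it returns depend on the implementation.

```lisp
;; Hypothetical helper (not part of SBCL): summarise how the standard
;; Common Lisp case predicates and conversions treat one character.

(defun describe-case (char)
  "Return a plist describing the case behaviour of CHAR."
  (list :char char
        :upper-case-p (upper-case-p char)
        :lower-case-p (lower-case-p char)
        :both-case-p (both-case-p char)
        :upcase (char-upcase char)
        :downcase (char-downcase char)))

;; Example call for the character discussed above. This assumes the
;; implementation supports code point #x1f2.
;; (describe-case (code-char #x1f2))
```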

Christophe Rhodes (csr21-cantab) wrote :

Douglas Katzman <email address hidden> writes:

> It's debatable. We're trying to accord with Unicode, not with an
> obsolete spec that didn't fully anticipate characters that themselves
> are neither upper nor lower-case, but have a way to convert to upper
> or lower case.
> To take one of your examples, https://www.compart.com/en/unicode/U+01F2 says that
> (code-char #x1f2) has an upper-case character (#x1f1) and a lower-case character (#x1f3).

I think we have the ability to distinguish between the Unicode
operations, which probably should never be done character-by-character
anyway, and the specified Lisp operations, and given that we have to do
that I think we should try to follow the invariants in the specified
lisp operations if we can.

I think we do already support treating various characters as not having
case even though Unicode says they do:

  (lower-case-p #\ß) ; => NIL
  (sb-unicode:lowercase-p #\ß) ; => T

and I think that we should probably do the same thing for these
characters and case changing, to preserve the specified invariants.
(Yes, they might do a different thing from Unicode-specified case
functions; that's fine, we have exported functions in SB-UNICODE for
that.) This affects the (currently) four titlecase characters with case
mappings to lower and upper case.
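
The behaviour proposed above can be approximated in user code without touching the case tables. The following is a sketch under that assumption, not the eventual SBCL fix: strict-char-downcase and strict-char-upcase are hypothetical wrappers that convert only when the standard predicate says the character has the relevant case.

```lisp
;; Hypothetical wrappers (not part of SBCL) that preserve the specified
;; invariants by consulting the standard predicates before converting.

(defun strict-char-downcase (char)
  "Return the lowercase form of CHAR only if CHAR is UPPER-CASE-P;
otherwise return CHAR unchanged."
  (if (upper-case-p char)
      (char-downcase char)
      char))

(defun strict-char-upcase (char)
  "Return the uppercase form of CHAR only if CHAR is LOWER-CASE-P;
otherwise return CHAR unchanged."
  (if (lower-case-p char)
      (char-upcase char)
      char))
```

With these wrappers, any character for which both upper-case-p and lower-case-p return NIL, such as the titlecase characters in the report, is returned unchanged, while ordinary letters convert as usual.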

So the remaining problem is that tools-for-build/ucd.lisp is an
unreadable pile of magic, and every time I upgrade the version of
Unicode I say to myself that I need to rewrite it completely so that
it's understandable, and every time I run out of energy (and lately I've
run out of energy even to start the Unicode upgrade process, so we're
substantially out of date). There is code in tools-for-build/ucd.lisp
that builds a cases table; it probably "just" needs a small modification
to hold this additional data.
