Comment 2 for bug 9027

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Wed, 13 Oct 2004 00:55:10 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: wrong behavior of [:upper:] and/or [:lower:] if IGNORECASE is set and locale's mb_cur_max > 1

Package: gawk
Version: 1:3.1.4-1
Severity: grave
Tags: patch

In all locales where mb_cur_max > 1, such as CJK or UTF-8 locales,
[:upper:] and [:lower:] do not work as expected when IGNORECASE is set.

For example,
 % echo aaa | LANG=C gawk 'BEGIN { IGNORECASE=1 } /[[:upper:]]+/ { print }'
 aaa
 # correct, a matches [:upper:] when IGNORECASE=1

 % echo aaa | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE=1 } /[[:upper:]]+/ { print }'
 %
 # wrong, a doesn't match [:upper:] when IGNORECASE=1

With GAWK_NO_DFA=1 set, it works correctly, just as it does with LANG=C.

Checking the source code, I found this chunk in
regcomp.c:build_charclass()
(the same code is found in glibc):

  if ((syntax & RE_ICASE)
      && (strcmp (class_name, "upper") == 0 || strcmp (class_name, "lower") == 0))
    class_name = "alpha";
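
The effect of that remapping can be sketched as a small standalone C
fragment. This is only an illustration, not code from gawk or glibc: the
helper names effective_class_name() and in_class() are hypothetical, and
only the standard wctype() and iswctype() calls are real library functions.

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not in gawk or glibc): pick the class name to
   look up.  When case is ignored, "upper" and "lower" both widen to
   "alpha", mirroring the regcomp.c chunk quoted above.  */
static const char *
effective_class_name (const char *class_name, int ignore_case)
{
  if (ignore_case
      && (strcmp (class_name, "upper") == 0
          || strcmp (class_name, "lower") == 0))
    return "alpha";
  return class_name;
}

/* Hypothetical helper: nonzero if WC belongs to [:class_name:],
   taking the ignore-case remapping into account.  */
static int
in_class (wint_t wc, const char *class_name, int ignore_case)
{
  wctype_t wt = wctype (effective_class_name (class_name, ignore_case));

  /* wctype() returns zero for an unknown class name.  */
  return wt != (wctype_t) 0 && iswctype (wc, wt) != 0;
}
```

Without the remapping, in_class (L'a', "upper", 0) is false, because 'a' is
not an uppercase letter. With ignore_case set, the lookup uses "alpha"
instead, so the same character is accepted.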

However, dfa.c does not apply the same mapping when it looks up the
character class with wctype(). This patch fixes the problem:

--- dfa.c.orig 2004-10-12 19:38:48.000000000 +0900
+++ dfa.c 2004-10-12 19:38:11.000000000 +0900
@@ -596,6 +596,9 @@
   {
     wctype_t wt;
     /* Query the character class as wctype_t. */
+    if (case_fold && (strcmp(str, "upper") == 0 || strcmp(str, "lower") == 0)) {
+      strcpy(str, "alpha");
+    }
     wt = wctype (str);

     if (ch_classes_al == 0)

Regards,
Fumitoshi UKAI