Message-ID: <email address hidden>
Date: Wed, 13 Oct 2004 00:55:10 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: wrong behavior of [:upper:] and/or [:lower:] if IGNORECASE is set and locale's mb_cur_max > 1
Package: gawk
Version: 1:3.1.4-1
Severity: grave
Tags: patch
In all locales where mb_cur_max > 1, such as CJK or UTF-8 locales,
[:upper:] and/or [:lower:] don't work as expected if IGNORECASE is set.
For example,
% echo aaa | LANG=C gawk 'BEGIN { IGNORECASE=1 } /[[:upper:]]+/ { print }'
aaa
# correct, a matches [:upper:] when IGNORECASE=1
% echo aaa | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE=1 } /[[:upper:]]+/ { print }'
%
# wrong, a doesn't match [:upper:] when IGNORECASE=1
If GAWK_NO_DFA=1 is set, it works correctly, just as it does with LANG=C.
When I checked the source code, I found this chunk in
regcomp.c:build_charclass() (the same code is found in glibc):
  if ((syntax & RE_ICASE)
      && (strcmp (class_name, "upper") == 0 || strcmp (class_name, "lower") == 0))
    class_name = "alpha";
However, dfa.c does not do this remapping in its multibyte code path.
The following patch fixes the problem:
--- dfa.c.orig 2004-10-12 19:38:48.000000000 +0900
+++ dfa.c 2004-10-12 19:38:11.000000000 +0900
@@ -596,6 +596,9 @@
{
wctype_t wt;
/* Query the character class as wctype_t. */
+ if (case_fold && (strcmp(str, "upper") == 0 || strcmp(str, "lower") == 0)) {
+ strcpy(str, "alpha");
+ }
wt = wctype (str);
if (ch_classes_al == 0)
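
The effect of the remapping can be checked outside gawk with a small standalone sketch. This is not the dfa.c code: the helper names effective_class_name() and wc_in_class() are made up for illustration, and only wctype() and iswctype() are real library calls. Unlike the patch above, the sketch returns a pointer to a different name instead of overwriting the caller's buffer.

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not part of gawk or glibc): choose the class
   name to look up.  When case folding is on, "upper" and "lower" are
   both widened to "alpha", as regcomp.c does under RE_ICASE.  */
static const char *
effective_class_name (const char *name, int case_fold)
{
  if (case_fold
      && (strcmp (name, "upper") == 0 || strcmp (name, "lower") == 0))
    return "alpha";
  return name;
}

/* Hypothetical helper: return 1 if wide character WC belongs to the
   named class after the remapping above, 0 otherwise (including when
   the class name is unknown to wctype).  */
static int
wc_in_class (wint_t wc, const char *name, int case_fold)
{
  wctype_t wt = wctype (effective_class_name (name, case_fold));
  return wt != (wctype_t) 0 && iswctype (wc, wt) != 0;
}
```

With these definitions, wc_in_class (L'a', "upper", 1) is true while wc_in_class (L'a', "upper", 0) is false, which is the behavior the LANG=C example above shows for IGNORECASE=1.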
Regards,
Fumitoshi UKAI