Message-ID: <email address hidden>
Date: Wed, 13 Oct 2004 02:29:28 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: Re: range of characters doesn't match as expected if IGNORECASE is set and locale's
mb_cur_max > 1
tags 276206 + patch
thanks
At Wed, 13 Oct 2004 01:23:31 +0900,
Fumitoshi UKAI wrote:
> In all locales where mb_cur_max > 1, such as CJK or UTF-8 locales,
> [a-a] doesn't match A as expected when IGNORECASE is set.
>
> For example,
> % echo A | LANG=C gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
> A
>
> % echo A | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
> %
> # wrong, A should match [a-a] when IGNORECASE=1
>
> With GAWK_NO_DFA=1, it works correctly, just as it does with LANG=C.
> % echo A | GAWK_NO_DFA=1 LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
> A
> %
>
> Note that [a-z] does match A, but not because IGNORECASE works:
> the collation order in the UTF-8 locale is "a A b B .. z", so A falls
> inside the range. For the same reason, [a-z] won't match Z even if
> IGNORECASE=1.
>
> % echo Z | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-z]+/{print}'
> %
I think this patch fixes this problem:
--- dfa.c.orig 2004-10-13 02:27:29.000000000 +0900
+++ dfa.c 2004-10-13 02:27:54.000000000 +0900
@@ -682,6 +682,28 @@
               REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
                                    range_ends_al, work_mbc->nranges + 1);
               work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)wc2;
+              if (case_fold && (iswlower((wint_t)wc) || iswupper((wint_t)wc))
+                  && (iswlower((wint_t)wc2) || iswupper((wint_t)wc2))) {
+                wint_t altcase;
+                altcase = wc;
+                if (iswlower((wint_t)wc))
+                  altcase = towupper((wint_t)wc);
+                else
+                  altcase = towlower((wint_t)wc);
+                REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
+                                     range_sts_al, work_mbc->nranges + 1);
+                work_mbc->range_sts[work_mbc->nranges] = (wchar_t)altcase;
+
+                altcase = wc2;
+                if (iswlower((wint_t)wc2))
+                  altcase = towupper((wint_t)wc2);
+                else
+                  altcase = towlower((wint_t)wc2);
+                REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
+                                     range_ends_al, work_mbc->nranges + 1);
+                work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)altcase;
+
+              }
             }
           else if (wc != WEOF)
             /* build normal characters. */
Regards,
Fumitoshi UKAI