Message-ID: <email address hidden>
Date: Wed, 13 Oct 2004 01:23:31 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: range of characters doesn't match as expected if IGNORECASE is set and locale's mb_cur_max > 1
Package: gawk
Version: 1:3.1.4-1
In all locales where MB_CUR_MAX > 1, such as CJK or UTF-8 locales,
[a-a] does not match A as expected when IGNORECASE is set.
For example,
% echo A | LANG=C gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
A
% echo A | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
%
# wrong, A should match [a-a] when IGNORECASE=1
With GAWK_NO_DFA=1 set, it works correctly, just as it does with LANG=C.
% echo A | GAWK_NO_DFA=1 LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
A
%
Note that [a-z] does match A, but not because IGNORECASE takes effect;
it matches because the collation order in the UTF-8 locale is "a A b B .. z".
That is, [a-z] does not match Z even if IGNORECASE=1.
% echo Z | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-z]+/{print}'
%
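The collation argument above can be illustrated with a small toy model in
plain sh. This is only a sketch: the order string and the helper names pos
and in_range are assumptions made for illustration, not anything taken from
gawk or from the real locale tables. It models a range [LO-HI] as "every
character that sorts between LO and HI" under an "a A b B .. z Z" ordering,
with no case folding applied.

```shell
#!/bin/sh
# Toy model of a collation order of the form "a A b B .. z Z".
# Assumption for illustration only; real locale collation data is more involved.
order="aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

# pos CHAR: print the zero-based index of CHAR within $order.
pos() {
    prefix=${order%%"$1"*}
    echo "${#prefix}"
}

# in_range CHAR LO HI: succeed if CHAR sorts between LO and HI (inclusive).
in_range() {
    c=$(pos "$1"); lo=$(pos "$2"); hi=$(pos "$3")
    [ "$c" -ge "$lo" ] && [ "$c" -le "$hi" ]
}

# Without any case folding, A falls inside [a-z] but Z does not.
for ch in A Z; do
    if in_range "$ch" a z; then
        echo "$ch is inside [a-z]"
    else
        echo "$ch is outside [a-z]"
    fi
done
```

Under this model A sorts between a and z while Z sorts after z, which is
consistent with the behaviour reported above. The same model also puts A
outside [a-a], so a match there can only come from IGNORECASE being applied.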
Regards,
Fumitoshi UKAI