Comment 4 for bug 9234

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Wed, 13 Oct 2004 01:23:31 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: range of characters doesn't match as expected if IGNORECASE is set and locale's mb_cur_max
 > 1

Package: gawk
Version: 1:3.1.4-1

In all locales where mb_cur_max > 1, such as CJK or UTF-8 locales,
[a-a] does not match A as expected when IGNORECASE is set.

For example,
 % echo A | LANG=C gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
 A

 % echo A | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
 %
 # wrong, A should match [a-a] when IGNORECASE=1

With GAWK_NO_DFA=1 set, it works correctly, just as it does with LANG=C.
 % echo A | GAWK_NO_DFA=1 LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
 A
 %

Note that [a-z] does match A, but that is not because IGNORECASE works;
it is because the collation order in the UTF-8 locale is "a A b B .. z",
so A falls inside the range. Consequently, [a-z] does not match Z even
if IGNORECASE=1.

 % echo Z | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-z]+/{print}'
 %
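
The collation argument above can be illustrated with a small sketch. It is
a simplified model only: the order string and the helper functions pos and
in_range are hypothetical names made up for this illustration, and nothing
here queries the real locale data or reflects how gawk evaluates ranges
internally. It assumes the interleaved order "a A b B .. z Z" and checks
whether a character lies between the two endpoints of a range.

```shell
#!/bin/sh
# Assumed model of the interleaved collation order "a A b B .. z Z".
# Illustration only; this does not read the real locale tables.
order="aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

# pos CHAR: print the 1-based position of CHAR in $order (0 if absent).
pos() {
    # Strip the longest suffix starting at CHAR; what is left precedes it.
    prefix=${order%%"$1"*}
    if [ "$prefix" = "$order" ]; then
        echo 0
    else
        echo $(( ${#prefix} + 1 ))
    fi
}

# in_range LO HI CHAR: succeed if CHAR collates between LO and HI inclusive.
in_range() {
    lo=$(pos "$1"); hi=$(pos "$2"); c=$(pos "$3")
    [ "$c" -gt 0 ] && [ "$c" -ge "$lo" ] && [ "$c" -le "$hi" ]
}

for ch in A Z; do
    if in_range a z "$ch"; then
        echo "$ch is inside [a-z]"
    else
        echo "$ch is outside [a-z]"
    fi
done
```

Under this assumed order, A sits directly after a and so lies inside [a-z],
while Z sorts after z and lies outside it, which is consistent with the
behaviour shown above without IGNORECASE having any effect.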

Regards,
Fumitoshi UKAI