range of characters doesn't match as expected if IGNORECASE is set and locale's mb_cur_max > 1
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gawk (Debian) | Fix Released | Unknown | |
gawk (Ubuntu) | Invalid | High | Unassigned |
Bug Description
Automatically imported from Debian bug report #276206 http://
In Debian Bug tracker #276206, Fumitoshi UKAI (ukai) wrote : critical bugs in multibyte locales(UTF-8, CJK, ..) regexp | #2 |
severity 249245 grave
severity 274352 grave
severity 226397 grave
severity 276209 grave
merge 249245 226397 238167
severity 277122 grave
severity 276206 grave
thanks
Bug#249245 can be fixed by a patch derived from gawk's dfa.c.
Bug#274352 can be fixed by a one-line patch.
Bug#277122 (in gawk dfa.c) is the same bug as Bug#274352 (in grep dfa.c).
Bug#276209 (in grep) and Bug#276206 (in gawk) are the same bug in dfa.c,
concerning case insensitivity of character ranges.
All of these bugs break behaviour in multibyte locales (UTF-8, CJK, ...).
Regards,
Fumitoshi UKAI
Debian Bug Importer (debzilla) wrote : | #4 |
Message-ID: <email address hidden>
Date: Wed, 13 Oct 2004 01:23:31 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: range of characters doesn't match as expected if IGNORECASE is set and locale's mb_cur_max > 1
Package: gawk
Version: 1:3.1.4-1
In all locales where mb_cur_max > 1, such as CJK or UTF-8 locales,
[a-a] does not match A as expected when IGNORECASE is set.
For example,
% echo A | LANG=C gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
A
% echo A | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
%
# wrong, A should match [a-a] when IGNORECASE=1
With GAWK_NO_DFA=1 set, it works correctly, just as it does under LANG=C.
% echo A | GAWK_NO_DFA=1 LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
A
%
Note that [a-z] does match A, but not because IGNORECASE works: the
collation order in the UTF-8 locale is "a A b B .. z", so A falls inside the range.
Consequently, [a-z] does not match Z even when IGNORECASE=1.
% echo Z | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-z]+/{print}'
%
Regards,
Fumitoshi UKAI
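The collation remark in the report above (that [a-z] matches A but not Z) can be illustrated with a toy model. The sketch below assumes the interleaved order "a A b B ... z Z" described in the report and assigns each ASCII letter a rank in that order. The names `toy_rank` and `in_collated_range` are hypothetical, and this is not how the C library actually computes collation.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Toy collation rank for ASCII letters in the order "a A b B ... z Z".
   Returns -1 for anything that is not an ASCII letter.
   Assumption: this only models the ordering described in the report. */
int toy_rank(char c)
{
    unsigned char uc = (unsigned char)c;
    if (!isalpha(uc))
        return -1;
    return 2 * (tolower(uc) - 'a') + (isupper(uc) ? 1 : 0);
}

/* True if c lies within [lo-hi] when the endpoints and c are compared
   by toy rank rather than by character code. */
bool in_collated_range(char lo, char hi, char c)
{
    int r = toy_rank(c);
    assert(toy_rank(lo) >= 0 && toy_rank(hi) >= 0);
    return r >= toy_rank(lo) && r <= toy_rank(hi);
}
```

Under this model, A ranks directly after a and so falls inside [a-z], while Z ranks after z and falls outside it. A likewise falls outside [a-a], which is consistent with the behaviour reported when case folding is not applied.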
Debian Bug Importer (debzilla) wrote : | #5 |
Message-ID: <email address hidden>
Date: Wed, 13 Oct 2004 02:29:28 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: Re: range of characters doesn't match as expected if IGNORECASE is set and locale's mb_cur_max > 1
tags 276206 + patch
thanks
At Wed, 13 Oct 2004 01:23:31 +0900,
Fumitoshi UKAI wrote:
> On all locales that mb_cur_max > 1, such as CJK or UTF-8 locales,
> [a-a] doesn't match with A as expected if IGNORECASE is set.
>
> For example,
> % echo A | LANG=C gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
> A
>
> % echo A | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
> %
> # wrong, A should match [a-a] when IGNORECASE=1
>
> If GAWK_NO_DFA=1, it works fine as well as LANG=C.
> % echo A | GAWK_NO_DFA=1 LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-a]+/{print}'
> A
> %
>
> Note that [a-z] will match with A, that is not because IGNORECASE works,
> but because collation order in UTF-8 is "a A b B .. z".
> That is, [a-z] won't match with Z even if IGNORECASE=1.
>
> % echo Z | LANG=en_US.UTF-8 gawk 'BEGIN { IGNORECASE = 1} /[a-z]+/{print}'
> %
I think this patch fixes this problem:
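--- dfa.c.orig 2004-10-13 02:27:29.000000000 +0900
+++ dfa.c 2004-10-13 02:27:54.000000000 +0900
@@ -682,6 +682,28 @@
       REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
                            range_ends_al, work_mbc->nranges + 1);
       work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)wc2;
+      if (case_fold && (iswlower((wint_t)wc) || iswupper((wint_t)wc))
+          && (iswlower((wint_t)wc2) || iswupper((wint_t)wc2))) {
+        wint_t altcase;
+        altcase = wc;
+        if (iswlower((wint_t)wc))
+          altcase = towupper((wint_t)wc);
+        else
+          altcase = towlower((wint_t)wc);
+        REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
+                             range_sts_al, work_mbc->nranges + 1);
+        work_mbc->range_sts[work_mbc->nranges] = (wchar_t)altcase;
+
+        altcase = wc2;
+        if (iswlower((wint_t)wc2))
+          altcase = towupper((wint_t)wc2);
+        else
+          altcase = towlower((wint_t)wc2);
+        REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
+                             range_ends_al, work_mbc->nranges + 1);
+        work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)altcase;
+
+      }
     }
   else if (wc != WEOF)
     /* build normal characters. */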
Regards,
Fumitoshi UKAI
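As far as it can be read from the diff, the idea behind the patch above is that when case folding is on and both range endpoints are letters, a second range with the opposite-case endpoints is recorded next to the original one. The following is a minimal sketch of that idea, not the dfa.c code itself. It assumes a hypothetical fixed-size `range_set` struct in place of dfa.c's dynamically grown arrays, and it compares plain character codes instead of using locale collation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for the range arrays of a multibyte bracket
   expression; the struct and its field names are illustrative only. */
#define MAX_RANGES 8

struct range_set {
    wchar_t starts[MAX_RANGES];
    wchar_t ends[MAX_RANGES];
    size_t  nranges;
};

/* Return the opposite-case form of wc, or wc itself if it has none. */
static wint_t alt_case(wint_t wc)
{
    if (iswlower(wc))
        return towupper(wc);
    if (iswupper(wc))
        return towlower(wc);
    return wc;
}

/* Record [lo-hi]; when case_fold is set and both endpoints are letters,
   also record the range formed by the opposite-case endpoints.
   Returns false if the set has no room for up to two more ranges. */
bool add_range(struct range_set *set, wchar_t lo, wchar_t hi, bool case_fold)
{
    assert(set != NULL);
    if (set->nranges + 2 > MAX_RANGES)
        return false;

    set->starts[set->nranges] = lo;
    set->ends[set->nranges] = hi;
    set->nranges++;

    if (case_fold
        && (iswlower((wint_t)lo) || iswupper((wint_t)lo))
        && (iswlower((wint_t)hi) || iswupper((wint_t)hi))) {
        set->starts[set->nranges] = (wchar_t)alt_case((wint_t)lo);
        set->ends[set->nranges] = (wchar_t)alt_case((wint_t)hi);
        set->nranges++;
    }
    return true;
}

/* True if wc lies in any recorded range (compared by character code,
   a simplification of what a collation-aware matcher would do). */
bool in_ranges(const struct range_set *set, wchar_t wc)
{
    assert(set != NULL);
    for (size_t i = 0; i < set->nranges; i++)
        if (set->starts[i] <= wc && wc <= set->ends[i])
            return true;
    return false;
}
```

With a zero-initialised `struct range_set`, calling `add_range(&set, L'a', L'a', true)` records both [a-a] and [A-A], so `in_ranges(&set, L'A')` succeeds. Without case folding only [a-a] is recorded and the same lookup fails, which corresponds to the behaviour the report describes.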
Debian Bug Importer (debzilla) wrote : | #6 |
Message-ID: <email address hidden>
Date: Tue, 19 Oct 2004 11:30:56 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: critical bugs in multibyte locales(UTF-8, CJK, ..) regexp
Martin Pitt (pitti) wrote : | #7 |
I checked all the commands in the bug report; Warty's version of gawk behaves correctly.
In Debian Bug tracker #276206, Fumitoshi UKAI (ukai) wrote : Fixed in NMU of gawk 1:3.1.4-1.2 | #8 |
tag 276206 + fixed
tag 277122 + fixed
quit
This message was generated automatically in response to a
non-maintainer upload. The .changes file follows.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Format: 1.7
Date: Wed, 20 Oct 2004 01:41:40 +0900
Source: gawk
Binary: gawk
Architecture: source i386
Version: 1:3.1.4-1.2
Distribution: unstable
Urgency: low
Maintainer: James Troup <email address hidden>
Changed-By: Fumitoshi UKAI <email address hidden>
Description:
gawk - GNU awk, a pattern scanning and processing language
Closes: 276206 277122
Changes:
gawk (1:3.1.4-1.2) unstable; urgency=low
.
* NMU to fix RC bugs
* 12_dfa.
to fix CASEIGNORE match on [a-z] or [A-Z] in multibyte locales (UTF-8, ...)
closes: Bug#276206
* 13_dfa.
to fix wrong match '[' against character class such as [[:space:]]
in multibyte locales (UTF-8, ...)
closes: Bug#277122
Files:
f9efc0ef141272
afc7be320bfb12
a70f8b4ad65c18
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
iD8DBQFBdUSv9D5
+kxtBcbg3aPT8pT
=vhfA
-----END PGP SIGNATURE-----
Debian Bug Importer (debzilla) wrote : | #9 |
Message-Id: <email address hidden>
Date: Tue, 19 Oct 2004 13:02:11 -0400
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Cc: Fumitoshi UKAI <email address hidden>, James Troup <email address hidden>
Subject: Fixed in NMU of gawk 1:3.1.4-1.2
In Debian Bug tracker #276206, Fumitoshi UKAI (ukai) wrote : rc bug for sarge | #10 |
# grep
tags 249245 - fixed
tags 249245 + sarge
tags 274352 - fixed
tags 274352 + sarge
tags 276202 - fixed
tags 276202 + sarge
tags 276209 - fixed
tags 276209 + sarge
# gawk
tags 266519 - fixed
tags 266519 + sarge
tags 276201 - fixed
tags 276201 + sarge
tags 276206 - fixed
tags 276206 + sarge
tags 277122 - fixed
tags 277122 + sarge
tags 264829 - fixed
tags 264829 + sarge
tags 266043 - fixed
tags 266043 + sarge
tags 271231 - fixed
tags 271231 + sarge
Debian Bug Importer (debzilla) wrote : | #11 |
Message-ID: <email address hidden>
Date: Thu, 28 Oct 2004 12:04:42 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: rc bug for sarge
In Debian Bug tracker #276206, James Troup (james-nocrew) wrote : Bug#276206: fixed in gawk 1:3.1.4-2 | #12 |
Source: gawk
Source-Version: 1:3.1.4-2
We believe that the bug you reported is fixed in the latest version of
gawk, which is due to be installed in the Debian FTP archive:
gawk_3.
to pool/main/
gawk_3.1.4-2.dsc
to pool/main/
gawk_3.
to pool/main/
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to <email address hidden>,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
James Troup <email address hidden> (supplier of updated gawk package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing <email address hidden>)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Format: 1.7
Date: Fri, 26 Nov 2004 18:30:42 +0000
Source: gawk
Binary: gawk
Architecture: source i386
Version: 1:3.1.4-2
Distribution: unstable
Urgency: low
Maintainer: James Troup <email address hidden>
Changed-By: James Troup <email address hidden>
Description:
gawk - GNU awk, a pattern scanning and processing language
Closes: 263964 266519 276201 276206 277122 278135
Changes:
gawk (1:3.1.4-2) unstable; urgency=low
.
* 14_io.c-
that wait() when a redirect hits EOF without checking whether or not
this is the kind of redirect which would have an orphan to wait() on.
Closes: #263964
.
* debian/control (Build-Depends): Add a versioned build-depends on a
fixed binutils for m68k. Closes: #278135
.
* Merge in NMU changes. Many thanks to Fumitoshi UKAI. Closes:
#276206, #277122, #266519, #276201
.
* 11_dfa.
13_
that it works for me.
.
* 10_dfa.
* 10_dfa.
for the same problem.
.
* 15_builtin.
wide-char to{lower,upper}() handling.
.
* 16_awkgram.
gawk reading past the end of the file for an awk script that is big
enough to fill more than a buffer's worth and does not end with a
newline.
.
* 17_fix-
improve handling of non-numeric constants so that numbers like 00.34
don't get confused as being octal.
Files:
492e13079781d1
a175a8e9572d74
262ea208b69d0f
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
iQIVAwUBQad8NNf
6YsNJPlGlnroFGn
Debian Bug Importer (debzilla) wrote : | #13 |
Message-Id: <email address hidden>
Date: Fri, 26 Nov 2004 14:02:14 -0500
From: James Troup <email address hidden>
To: <email address hidden>
Subject: Bug#276206: fixed in gawk 1:3.1.4-2
Changed in gawk:
status: Unknown → Fix Released