gawk: Odd regexp matching problem if locale's mb_cur_max > 1
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gawk (Debian) | Fix Released | Unknown | |
gawk (Ubuntu) | Invalid | High | Unassigned |
Bug Description
Automatically imported from Debian bug report #266519 http://
In Debian Bug tracker #266519, Tatsuya Kinoshita (tats) wrote : | #1 |
In Debian Bug tracker #266519, Fumitoshi UKAI (ukai) wrote : | #2 |
At Mon, 11 Oct 2004 23:29:15 +0900 (JST),
Tatsuya Kinoshita wrote:
> > Package: gawk
> > Version: 1:3.1.4-1
>
> > Executing the following line in a shell:
> >
> > echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
> >
> > yields not the expected two lines of output, but instead only the first one:
> >
> > --- orig/lisp/ChangeLog
> >
> >
> > If the LANG-setting portion is changed to use C, then it works as
> > expected (others such as "de" seem to work too):
> >
> > echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'
> >
> > yields:
> >
> > --- orig/lisp/ChangeLog
> > +++ mod/lisp/ChangeLog
> >
> >
> > I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
> > and ja_JP.eucjp all exhibit the same problem.
>
> ko_KR, zh_CN, and zh_TW exhibit the same problem. On CJK
> locales, this bug makes gawk scripts unusable.
>
> Downgrading gawk to version 1:3.1.3-3 prevents the problem.
>
> Could anyone fix this bug?
One possible workaround is to set GAWK_NO_DFA=1:
% echo -e '--- orig/lisp/
--- orig/lisp/ChangeLog
+++ mod/lisp/ChangeLog
I may have found the cause of this bug: the pattern string has
changed, but begin and end still point to the same addresses, so
mblen_buf and inputwcs are not updated.
For example, this patch fixes the problem, but it may slow matching down,
so I think a better fix should be made.
--- dfa.c~ 2004-07-26 23:11:41.000000000 +0900
+++ dfa.c 2004-10-12 01:05:14.000000000 +0900
@@ -2872,13 +2872,14 @@
{
int remain_bytes, i;
buf_begin -= buf_offset;
+#if 0
if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
buf_offset = (unsigned char const *)begin - buf_begin;
buf_begin = begin;
buf_end = end;
goto go_fast;
}
-
+#endif
buf_offset = 0;
buf_begin = begin;
buf_end = end;
Regards,
Fumitoshi UKAI <email address hidden> / <email address hidden>
Hewlett-Packard Laboratories Japan http://
In Debian Bug tracker #266519, Fumitoshi UKAI (ukai) wrote : | #3 |
severity 266519 grave
retitle 266519 gawk: Odd regexp matching problem if locale's mb_cur_max > 1
tags 266519 + patch
thanks
Not only CJK locales are affected, but every locale where mb_cur_max > 1.
This means all UTF-8 locales, such as en_US.UTF-8, exhibit the same problem.
So I think this bug should be considered as release critical.
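Whether a given locale falls into the affected class can be checked directly. The following is a minimal sketch, assuming a hosted C environment; `locale_is_multibyte` is a hypothetical helper name, not part of gawk, and the result for any locale other than "C" depends on which locales are installed.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper (not from gawk): returns 1 if the named locale has
   MB_CUR_MAX > 1, 0 if it is single-byte, and -1 if setlocale() rejects
   the name. It changes the process-wide LC_CTYPE and resets it to "C"
   before returning, so it is only suitable for small test programs. */
int locale_is_multibyte(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;
    int multibyte = MB_CUR_MAX > 1;   /* MB_CUR_MAX follows LC_CTYPE */
    setlocale(LC_CTYPE, "C");
    return multibyte;
}
```

Calling it with "C" reports a single-byte locale; names such as "en_US.UTF-8" or "ja_JP.eucJP" are expected to report multibyte where those locales exist.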
This patch solves this problem.
(Explanation:
begin and end delimit the input string, and this portion checks whether the
input string is the same as the previous one and, if so, skips updating the
multibyte-related buffers. However, gawk reuses one buffer for every input line,
so begin and end can point to the same addresses while the contents differ
from the previous ones.)
--- dfa.c~ 2004-07-26 23:11:41.000000000 +0900
+++ dfa.c 2004-10-12 01:05:14.000000000 +0900
@@ -2872,13 +2872,14 @@
{
int remain_bytes, i;
buf_begin -= buf_offset;
+#if 0
if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
buf_offset = (unsigned char const *)begin - buf_begin;
buf_begin = begin;
buf_end = end;
goto go_fast;
}
-
+#endif
buf_offset = 0;
buf_begin = begin;
buf_end = end;
Regards,
Fumitoshi UKAI
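The failure mode in the explanation above can be reproduced outside gawk. This is a minimal sketch, not code from dfa.c: `span_cache`, `lookup_by_address`, and `demo_stale_result` are hypothetical names, and counting '+' bytes merely stands in for the real multibyte bookkeeping (mblen_buf, inputwcs).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the cached multibyte state: it remembers the
   address range it was computed for, plus one derived value. */
struct span_cache {
    const char *begin;   /* start of the span the cache was built for */
    const char *end;     /* one past the end of that span */
    size_t derived;      /* toy derived data: number of '+' bytes */
    int valid;           /* 0 until the cache has been filled once */
};

static size_t count_plus(const char *begin, const char *end)
{
    size_t n = 0;
    for (const char *p = begin; p < end; p++)
        if (*p == '+')
            n++;
    return n;
}

/* Address-only check: the cached value is reused whenever the new span
   lies inside the remembered one, even if the bytes were overwritten. */
size_t lookup_by_address(struct span_cache *c, const char *begin, const char *end)
{
    assert(begin <= end);
    if (c->valid && c->begin <= begin && end <= c->end)
        return c->derived;            /* fast path, possibly stale */
    c->begin = begin;
    c->end = end;
    c->derived = count_plus(begin, end);
    c->valid = 1;
    return c->derived;
}

/* Returns 1 if the address-only cache returns a stale answer after the
   same buffer is refilled with a second record. */
int demo_stale_result(void)
{
    char line[32];
    struct span_cache c = {0};

    strcpy(line, "--- orig");
    size_t first = lookup_by_address(&c, line, line + strlen(line));

    strcpy(line, "+++ mod");          /* same address, new contents */
    size_t second = lookup_by_address(&c, line, line + strlen(line));
    size_t actual = count_plus(line, line + strlen(line));

    return first == 0 && second == 0 && actual == 3;
}
```

The second lookup returns the value computed for the first record because only the addresses are compared, which mirrors why the second input line fails to match in the report.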
Debian Bug Importer (debzilla) wrote : | #5 |
Message-Id: <20040818055735
Date: Wed, 18 Aug 2004 14:57:35 +0900
From: Miles Bader <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: gawk: Odd regexp matching problem if LANG=ja_JP
Package: gawk
Version: 1:3.1.4-1
Severity: normal
Executing the following line in a shell:
echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
yields not the expected two lines of output, but instead only the first one:
--- orig/lisp/ChangeLog
If the LANG-setting portion is changed to use C, then it works as
expected (others such as "de" seem to work too):
echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'
yields:
--- orig/lisp/ChangeLog
+++ mod/lisp/ChangeLog
I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
and ja_JP.eucjp all exhibit the same problem.
Thanks,
-Miles
-- System Information:
Debian Release: 3.1
APT prefers unstable
APT policy: (500, 'unstable'), (101, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.8.1
Locale: LANG=ja_JP.UTF-8, LC_CTYPE=
Versions of packages gawk depends on:
ii libc6 2.3.2.ds1-16 GNU C Library: Shared libraries an
-- no debconf information
Debian Bug Importer (debzilla) wrote : | #6 |
Message-Id: <20041011.
Date: Mon, 11 Oct 2004 23:29:15 +0900 (JST)
From: Tatsuya Kinoshita <email address hidden>
To: <email address hidden>, <email address hidden>
Cc: <email address hidden>, <email address hidden>
Subject: Re: gawk: Odd regexp matching problem if LANG=ja_JP
Debian Bug Importer (debzilla) wrote : | #7 |
Message-ID: <email address hidden>
Date: Tue, 12 Oct 2004 01:16:54 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Cc: <email address hidden>, Tatsuya Kinoshita <email address hidden>,
<email address hidden>, <email address hidden>
Subject: Re: gawk: Odd regexp matching problem if LANG=ja_JP
Debian Bug Importer (debzilla) wrote : | #8 |
Message-ID: <email address hidden>
Date: Wed, 13 Oct 2004 00:40:39 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: Re: gawk: Odd regexp matching problem if LANG=ja_JP
Martin Pitt (pitti) wrote : | #9 |
(In reply to comment #2)
> Downgrading gawk to version 1:3.1.3-3 prevents the problem.
This is exactly the version that Warty ships. I also checked it:
$ echo -e '--- orig/lisp/
'/[Cc]hangeLog/ { print }'
--- orig/lisp/ChangeLog
+++ mod/lisp/ChangeLog
Closing as NOTWARTY. The Debian version already has a patch and certainly will
be fixed soon, too.
In Debian Bug tracker #266519, Fumitoshi UKAI (ukai) wrote : Fixed in NMU of gawk 1:3.1.4-1.1 | #10 |
tag 266519 + fixed
tag 276201 + fixed
quit
This message was generated automatically in response to a
non-maintainer upload. The .changes file follows.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Format: 1.7
Date: Tue, 19 Oct 2004 01:16:27 +0900
Source: gawk
Binary: gawk
Architecture: source i386
Version: 1:3.1.4-1.1
Distribution: unstable
Urgency: low
Maintainer: James Troup <email address hidden>
Changed-By: Fumitoshi UKAI <email address hidden>
Description:
gawk - GNU awk, a pattern scanning and processing language
Closes: 266519 276201
Changes:
gawk (1:3.1.4-1.1) unstable; urgency=low
.
* NMU to fix RC bugs
* 10_dfa.
to fix odd regexp matching in multibyte locales (UTF-8, CJK, ..)
closes: Bug#266519
* 11_dfa.
to fix CASEIGNORE match on [:upper:] and [:lower:] in
multibyte locales (UTF-8, CJK, ...)
closes: Bug#276201
Files:
47cdd14a4532a0
0e16583a1390c7
a1a43961a3154a
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
iD8DBQFBc+
B8rEeS5lv/
=urId
-----END PGP SIGNATURE-----
Debian Bug Importer (debzilla) wrote : | #11 |
Message-Id: <email address hidden>
Date: Mon, 18 Oct 2004 12:47:03 -0400
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Cc: Fumitoshi UKAI <email address hidden>, James Troup <email address hidden>
Subject: Fixed in NMU of gawk 1:3.1.4-1.1
In Debian Bug tracker #266519, Fumitoshi UKAI (ukai) wrote : rc bug for sarge | #12 |
# grep
tags 249245 - fixed
tags 249245 + sarge
tags 274352 - fixed
tags 274352 + sarge
tags 276202 - fixed
tags 276202 + sarge
tags 276209 - fixed
tags 276209 + sarge
# gawk
tags 266519 - fixed
tags 266519 + sarge
tags 276201 - fixed
tags 276201 + sarge
tags 276206 - fixed
tags 276206 + sarge
tags 277122 - fixed
tags 277122 + sarge
tags 264829 - fixed
tags 264829 + sarge
tags 266043 - fixed
tags 266043 + sarge
tags 271231 - fixed
tags 271231 + sarge
In Debian Bug tracker #266519, Oded Shimon (ods15) wrote : Patch: Odd regexp matching problem if locale's mb_cur_max > 1 | #13 |
Package: gawk
Version: 1:3.1.4-1
Followup-For: Bug #266519
I have a patch for this bug which does not involve removing go_fast, but
it does involve adding a check loop. I believe this is still faster than
the previous patch, and it was the best I could do with my programming
knowledge.
I checked, this patch actually compiles and fixes the bug. :)
Regards,
- ods15
diff -u dfa.c ~/sources/
--- dfa.c 2004-10-29 11:58:47.000000000 +0200
+++ /home/ods15/
@@ -2895,6 +2895,10 @@
register unsigned char eol = eolbyte; /* Likewise for eolbyte. */
static int sbit[NOTCHAR]; /* Table for anding with d->success. */
static int sbit_init;
+ static unsigned char * sameas; /* a simple check that the content
+ between begin and end are indeed
+ what they used to be */
+ static int sizesameas;
if (! sbit_init)
{
@@ -2918,14 +2922,31 @@
if (MB_CUR_MAX > 1)
{
int remain_bytes, i;
+
+ if (!sameas) {
+ MALLOC(sameas, unsigned char, end - begin + 2);
+ memset(sameas, 0, sizeof(unsigned char) * (end - begin + 1));
+ sizesameas = end - begin + 1;
+ }
+
buf_begin -= buf_offset;
if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
+ int yesgood = sizesameas == end - begin + 1;
+ for (i = 0; i < sizesameas && yesgood; i++) {
+ if (sameas[i] != begin[i]) yesgood = 0;
+ }
+ if (yesgood) {
buf_offset = (unsigned char const *)begin - buf_begin;
buf_begin = begin;
buf_end = end;
goto go_fast;
+ }
}
+ REALLOC(sameas, unsigned char, end - begin + 2);
+ for (i = 0; i < end - begin + 1; i++) sameas[i] = begin[i];
+ sizesameas = end - begin + 1;
+
buf_offset = 0;
buf_begin = begin;
buf_end = end;
-- System Information:
Debian Release: 3.1
APT prefers unstable
APT policy: (500, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.6.6
Locale: LANG=C, LC_CTYPE=C
Versions of packages gawk depends on:
ii libc6 2.3.2.ds1-18 GNU C Library: Shared libraries an
-- no debconf information
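The idea behind this patch, keeping the fast path but verifying the contents first, can be sketched on its own. This is an illustration under stated assumptions, not the patch itself: `checked_cache`, `lookup_checked`, and `demo_checked_result` are hypothetical names, counting '+' bytes stands in for the multibyte buffers, and plain `realloc` and `memcmp` replace the patch's REALLOC macro and hand-written comparison loop.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache that keeps a private snapshot of the bytes it was
   built from, so reuse can be validated by content, not by address. */
struct checked_cache {
    unsigned char *copy;  /* snapshot of the last span, or NULL */
    size_t len;           /* length of the snapshot in bytes */
    size_t derived;       /* toy derived data: number of '+' bytes */
    unsigned rebuilds;    /* how many times the slow path ran */
};

static size_t count_plus(const unsigned char *p, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (p[i] == '+')
            n++;
    return n;
}

/* Returns the derived value, rebuilding it unless the span is identical,
   byte for byte, to the snapshot. Returns (size_t)-1 if allocation fails. */
size_t lookup_checked(struct checked_cache *c, const char *begin, const char *end)
{
    assert(begin <= end);
    size_t len = (size_t)(end - begin);

    if (c->copy != NULL && c->len == len && memcmp(c->copy, begin, len) == 0)
        return c->derived;                 /* contents verified: fast path */

    unsigned char *fresh = realloc(c->copy, len ? len : 1);
    if (fresh == NULL)
        return (size_t)-1;                 /* old snapshot is left intact */
    memcpy(fresh, begin, len);
    c->copy = fresh;
    c->len = len;
    c->derived = count_plus(fresh, len);
    c->rebuilds++;
    return c->derived;
}

/* Returns 1 if an unchanged buffer is reused and a refilled one is rebuilt. */
int demo_checked_result(void)
{
    char line[32];
    struct checked_cache c = {0};

    strcpy(line, "--- orig");
    size_t a = lookup_checked(&c, line, line + strlen(line));
    size_t b = lookup_checked(&c, line, line + strlen(line));  /* reused */

    strcpy(line, "+++ mod");               /* same address, new contents */
    size_t d = lookup_checked(&c, line, line + strlen(line));  /* rebuilt */

    int ok = (a == 0 && b == 0 && d == 3 && c.rebuilds == 2);
    free(c.copy);
    return ok;
}
```

The comparison costs one pass over the span on every call, so the saving relative to always rebuilding depends on how expensive the rebuilt state is; that trade-off is the open question in this thread.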
Debian Bug Importer (debzilla) wrote : | #14 |
Message-ID: <email address hidden>
Date: Thu, 28 Oct 2004 12:04:42 +0900
From: Fumitoshi UKAI <email address hidden>
To: <email address hidden>
Subject: rc bug for sarge
Debian Bug Importer (debzilla) wrote : | #15 |
Message-Id: <E1CNTmv-
Date: Fri, 29 Oct 2004 12:15:17 +0200
From: Oded Shimon <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: Patch: Odd regexp matching problem if locale's mb_cur_max > 1
In Debian Bug tracker #266519, Tatsuya Kinoshita (tats) wrote : Re: Bug#266519: Patch: Odd regexp matching problem if locale's mb_cur_max > 1 | #16 |
Hi, Fumitoshi,
Thanks for the NMU.
BTW, how about the following patch?
On October 29, 2004 at 12:15PM +0200,
ods15 (at ods15.dyndns.org) wrote:
> Package: gawk
> Version: 1:3.1.4-1
> Followup-For: Bug #266519
> I have a patch for this bug which does not involve removing go_fast, but
> it does involve adding a check loop. I believe this is still faster than
> the previous patch, and it was the best I could do with my programming
> knowledge.
> I checked, this patch actually compiles and fixes the bug. :)
>
> Regards,
> - ods15
>
>
> diff -u dfa.c ~/sources/
>
> --- dfa.c 2004-10-29 11:58:47.000000000 +0200
> +++ /home/ods15/
> @@ -2895,6 +2895,10 @@
> register unsigned char eol = eolbyte; /* Likewise for eolbyte. */
> static int sbit[NOTCHAR]; /* Table for anding with d->success. */
> static int sbit_init;
> + static unsigned char * sameas; /* a simple check that the content
> + between begin and end are indeed
> + what they used to be */
> + static int sizesameas;
>
> if (! sbit_init)
> {
> @@ -2918,14 +2922,31 @@
> if (MB_CUR_MAX > 1)
> {
> int remain_bytes, i;
> +
> + if (!sameas) {
> + MALLOC(sameas, unsigned char, end - begin + 2);
> + memset(sameas, 0, sizeof(unsigned char) * (end - begin + 1));
> + sizesameas = end - begin + 1;
> + }
> +
> buf_begin -= buf_offset;
> if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
> + int yesgood = sizesameas == end - begin + 1;
> + for (i = 0; i < sizesameas && yesgood; i++) {
> + if (sameas[i] != begin[i]) yesgood = 0;
> + }
> + if (yesgood) {
> buf_offset = (unsigned char const *)begin - buf_begin;
> buf_begin = begin;
> buf_end = end;
> goto go_fast;
> + }
> }
>
> + REALLOC(sameas, unsigned char, end - begin + 2);
> + for (i = 0; i < end - begin + 1; i++) sameas[i] = begin[i];
> + sizesameas = end - begin + 1;
> +
> buf_offset = 0;
> buf_begin = begin;
> buf_end = end;
--
Tatsuya Kinoshita
Debian Bug Importer (debzilla) wrote : | #17 |
Message-Id: <20041103.
Date: Wed, 03 Nov 2004 20:44:55 +0900 (JST)
From: Tatsuya Kinoshita <email address hidden>
To: Fumitoshi UKAI <email address hidden>
Cc: Oded Shimon <email address hidden>, <email address hidden>
Subject: Re: Bug#266519: Patch: Odd regexp matching problem if locale's
mb_cur_max > 1
In Debian Bug tracker #266519, James Troup (james-nocrew) wrote : Bug#266519: fixed in gawk 1:3.1.4-2 | #18 |
Source: gawk
Source-Version: 1:3.1.4-2
We believe that the bug you reported is fixed in the latest version of
gawk, which is due to be installed in the Debian FTP archive:
gawk_3.
to pool/main/
gawk_3.1.4-2.dsc
to pool/main/
gawk_3.
to pool/main/
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to <email address hidden>,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
James Troup <email address hidden> (supplier of updated gawk package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing <email address hidden>)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Format: 1.7
Date: Fri, 26 Nov 2004 18:30:42 +0000
Source: gawk
Binary: gawk
Architecture: source i386
Version: 1:3.1.4-2
Distribution: unstable
Urgency: low
Maintainer: James Troup <email address hidden>
Changed-By: James Troup <email address hidden>
Description:
gawk - GNU awk, a pattern scanning and processing language
Closes: 263964 266519 276201 276206 277122 278135
Changes:
gawk (1:3.1.4-2) unstable; urgency=low
.
* 14_io.c-
that wait() when a redirect hits EOF without checking whether or not
this is the kind of redirect which would have an orphan to wait() on.
Closes: #263964
.
* debian/control (Build-Depends): Add a versioned build-depends on a
fixed binutils for m68k. Closes: #278135
.
* Merge in NMU changes. Many thanks to Fumitoshi UKAI. Closes:
#276206, #277122, #266519, #276201
.
* 11_dfa.
13_
that it works for me.
.
* 10_dfa.
* 10_dfa.
for the same problem.
.
* 15_builtin.
wide-char to{lower,upper}() handling.
.
* 16_awkgram.
gawk reading past the end of the file for an awk script that is big
enough to fill more than a buffer's worth and does not end with a
newline.
.
* 17_fix-
improve handling of non-numeric constants so that numbers like 00.34
don't get confused as being octal.
Files:
492e13079781d1
a175a8e9572d74
262ea208b69d0f
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
iQIVAwUBQad8NNf
6YsNJPlGlnroFGn
Debian Bug Importer (debzilla) wrote : | #19 |
Message-Id: <email address hidden>
Date: Fri, 26 Nov 2004 14:02:14 -0500
From: James Troup <email address hidden>
To: <email address hidden>
Subject: Bug#266519: fixed in gawk 1:3.1.4-2
Changed in gawk: | |
status: | Unknown → Fix Released |
On August 18, 2004 at 2:57PM +0900,
miles (at lsi.nec.co.jp) wrote:
> Package: gawk
> Version: 1:3.1.4-1
> Executing the following line in a shell:
>
> echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
>
> yields not the expected two lines of output, but instead only the first one:
>
> --- orig/lisp/ChangeLog
>
>
> If the LANG-setting portion is changed to use C, then it works as
> expected (others such as "de" seem to work too):
>
> echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'
>
> yields:
>
> --- orig/lisp/ChangeLog
> +++ mod/lisp/ChangeLog
>
>
> I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
> and ja_JP.eucjp all exhibit the same problem.
ko_KR, zh_CN, and zh_TW exhibit the same problem. On CJK
locales, this bug makes gawk scripts unusable.
Downgrading gawk to version 1:3.1.3-3 prevents the problem.
Could anyone fix this bug?
Thanks,
--
Tatsuya Kinoshita