At Mon, 11 Oct 2004 23:29:15 +0900 (JST),
Tatsuya Kinoshita wrote:
> > Package: gawk
> > Version: 1:3.1.4-1
>
> > Executing the following line in a shell:
> >
> > echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
> >
> > yields not the expected two lines of output, but instead only the first one:
> >
> > --- orig/lisp/ChangeLog
> >
> >
> > If the LANG-setting portion is changed to use C, then it works as
> > expected (others such as "de" seem to work too):
> >
> > echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'
> >
> > yields:
> >
> > --- orig/lisp/ChangeLog
> > +++ mod/lisp/ChangeLog
> >
> >
> > I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
> > and ja_JP.eucjp all exhibit the same problem.
>
> ko_KR, zh_CN, and zh_TW exhibit the same problem. On CJK
> locales, this bug makes gawk scripts unusable.
>
> Downgrading gawk to version 1:3.1.3-3 prevents the problem.
>
> Could anyone fix this bug?
One possible workaround is to set GAWK_NO_DFA=1:
% echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP.eucJP GAWK_NO_DFA=1 gawk '/[Cc]hangeLog/ { print }'
--- orig/lisp/ChangeLog
+++ mod/lisp/ChangeLog
I may have found the cause of this bug. The pattern string has been
changed, but begin and end still point to the same addresses, so
mblen_buf and inputwcs are not updated.
For example, the following patch fixes the problem, but it may slow
matching down, so I think a better fix should be made.
--- dfa.c~ 2004-07-26 23:11:41.000000000 +0900
+++ dfa.c 2004-10-12 01:05:14.000000000 +0900
@@ -2872,13 +2872,14 @@
 {
   int remain_bytes, i;
   buf_begin -= buf_offset;
+#if 0
   if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
     buf_offset = (unsigned char const *)begin - buf_begin;
     buf_begin = begin;
     buf_end = end;
     goto go_fast;
   }
-
+#endif
   buf_offset = 0;
   buf_begin = begin;
   buf_end = end;
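
To illustrate the kind of bug described above, here is a minimal, self-contained sketch. It is not the dfa.c code: struct scan_cache, count_address_keyed and count_always_rescan are hypothetical names, the "analysis" is just a count of '+' bytes standing in for mblen_buf/inputwcs, and the containment test of the real fast path is simplified to an exact address match. It only shows why a result cached by buffer address goes stale when the buffer is refilled in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-buffer data that dfa.c keeps in
   mblen_buf/inputwcs.  The "analysis" here is only a count of '+'
   bytes, so a stale result is easy to spot. */
struct scan_cache {
    const char *begin;   /* start of the range last analysed */
    const char *end;     /* end of the range last analysed   */
    size_t plus_count;   /* cached analysis result           */
    int valid;           /* nonzero once the cache is filled */
};

/* Scan the range [begin, end) and count '+' bytes. */
static size_t analyse(const char *begin, const char *end)
{
    size_t n = 0;
    for (const char *p = begin; p < end; p++)
        if (*p == '+')
            n++;
    return n;
}

/* Fast path keyed on addresses only: if the caller passes the same
   begin/end as last time, the cached result is returned even when the
   bytes in between have been overwritten. */
size_t count_address_keyed(struct scan_cache *c,
                           const char *begin, const char *end)
{
    assert(begin <= end);
    if (c->valid && c->begin == begin && c->end == end)
        return c->plus_count;           /* possibly stale */
    c->begin = begin;
    c->end = end;
    c->plus_count = analyse(begin, end);
    c->valid = 1;
    return c->plus_count;
}

/* Same interface with the fast path removed, in the spirit of the
   "#if 0" patch: always rescan, correct but slower. */
size_t count_always_rescan(struct scan_cache *c,
                           const char *begin, const char *end)
{
    assert(begin <= end);
    c->begin = begin;
    c->end = end;
    c->plus_count = analyse(begin, end);
    c->valid = 1;
    return c->plus_count;
}
```

If a buffer holding "--- a" is analysed and then overwritten in place with "+++ b", count_address_keyed still reports 0 for the second call, while count_always_rescan reports 3. A better fix would presumably invalidate the cached data whenever the caller refills the buffer, instead of rescanning on every call.
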
Regards,
Fumitoshi UKAI <email address hidden> / <email address hidden>
Hewlett-Packard Laboratories Japan http://ecardfile.com/id/ukai