Comment 2 for bug 9026

In , Fumitoshi UKAI (ukai) wrote : Re: gawk: Odd regexp matching problem if LANG=ja_JP

At Mon, 11 Oct 2004 23:29:15 +0900 (JST),
Tatsuya Kinoshita wrote:

> > Package: gawk
> > Version: 1:3.1.4-1
>
> > Executing the following line in a shell:
> >
> > echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
> >
> > yields not the expected two lines of output, but instead only the first one:
> >
> > --- orig/lisp/ChangeLog
> >
> >
> > If the LANG-setting portion is changed to use C, then it works as
> > expected (others such as "de" seem to work too):
> >
> > echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'
> >
> > yields:
> >
> > --- orig/lisp/ChangeLog
> > +++ mod/lisp/ChangeLog
> >
> >
> > I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
> > and ja_JP.eucjp all exhibit the same problem.
>
> ko_KR, zh_CN, and zh_TW exhibit the same problem. In CJK
> locales, this bug makes gawk scripts unusable.
>
> Downgrading gawk to version 1:3.1.3-3 prevents the problem.
>
> Could anyone fix this bug?

One possible workaround is to set GAWK_NO_DFA=1 in the environment:

 % echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP.eucJP GAWK_NO_DFA=1 gawk '/[Cc]hangeLog/ { print }'
 --- orig/lisp/ChangeLog
 +++ mod/lisp/ChangeLog

I may have found the cause of this bug. The string being matched has
changed, but begin and end still point to the same addresses, so
mblen_buf and inputwcs are not updated.
For example, the following patch fixes the problem, but it may slow
matching down, so I think a better fix should be made.

--- dfa.c~ 2004-07-26 23:11:41.000000000 +0900
+++ dfa.c 2004-10-12 01:05:14.000000000 +0900
@@ -2872,13 +2872,14 @@
     {
       int remain_bytes, i;
       buf_begin -= buf_offset;
+#if 0
       if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
  buf_offset = (unsigned char const *)begin - buf_begin;
  buf_begin = begin;
  buf_end = end;
  goto go_fast;
       }
-
+#endif
       buf_offset = 0;
       buf_begin = begin;
       buf_end = end;
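
To illustrate the kind of failure I mean, here is a minimal standalone
sketch. It is not code from dfa.c: struct cache, scan() and the other
names are hypothetical, and a simple count of 'C' bytes stands in for
the data that mblen_buf and inputwcs hold. It also simplifies the fast
path to an exact address match. The point is only that a cache keyed on
begin/end addresses returns stale data once the bytes at those
addresses are overwritten.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for state derived from the input buffer. */
struct cache {
    const char *begin;  /* start of the range the cache was built for */
    const char *end;    /* end of that range */
    size_t count;       /* derived data: number of 'C' bytes seen */
};

/* Rebuild the derived data by reading every byte in [begin, end). */
static size_t scan(const char *begin, const char *end)
{
    size_t n = 0;
    assert(begin <= end);
    for (const char *p = begin; p < end; p++)
        if (*p == 'C')
            n++;
    return n;
}

/* Fast path keyed on addresses only: if begin and end are unchanged,
   reuse the cached result without looking at the bytes. */
static size_t count_with_pointer_cache(struct cache *c,
                                       const char *begin, const char *end)
{
    if (c->begin == begin && c->end == end)
        return c->count;            /* stale if the bytes changed */
    c->begin = begin;
    c->end = end;
    c->count = scan(begin, end);
    return c->count;
}

/* Roughly what disabling the fast path amounts to: always rebuild. */
static size_t count_always_rescan(struct cache *c,
                                  const char *begin, const char *end)
{
    c->begin = begin;
    c->end = end;
    c->count = scan(begin, end);
    return c->count;
}

/* Returns 1 if the pointer-keyed cache gave a stale answer for the
   second record while the always-rescan variant gave the right one. */
static int stale_cache_demo(void)
{
    char buf[16];
    struct cache c = { NULL, NULL, 0 };
    size_t len = 13;

    strcpy(buf, "--- ChangeLog");   /* first record: one 'C' */
    size_t first = count_with_pointer_cache(&c, buf, buf + len);

    strcpy(buf, "+++ CCangeLog");   /* same buffer reused: two 'C's */
    size_t cached = count_with_pointer_cache(&c, buf, buf + len);
    size_t fresh = count_always_rescan(&c, buf, buf + len);

    return first == 1 && cached == 1 && fresh == 2;
}
```

Calling stale_cache_demo() from a small main() shows the address-keyed
path reporting the first record's result for the second record, which
is the same shape as gawk matching only the first input line.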

Regards,
Fumitoshi UKAI <email address hidden> / <email address hidden>
Hewlett-Packard Laboratories Japan http://ecardfile.com/id/ukai