Message-id: <email address hidden>
Date: Tue, 29 Jun 2004 23:15:22 +1000
From: Peter Moulder <email address hidden>
To: <email address hidden>
Subject: perl not fair comparison: perl gets "wrong" answer for utf-8 text
Suppose UTF-8 LC_CTYPE.
$ (echo rôle; echo role) | grep 'r.le'
rôle
role
$ (echo rôle; echo role) | perl -ne '/r.le/ and print'
role
$ (echo rôle; echo role) | grep 'r..le'
$ (echo rôle; echo role) | perl -ne '/r..le/ and print'
rôle
(This is with perl_5.8.3-3, grep_2.5.1.ds1-2.)
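The difference in the transcript above can be reproduced without either tool. The following is only a sketch, assuming Python's `re` module as a stand-in: a `str` pattern plays the role of grep's character regexps, and a `bytes` pattern plays the role of perl's octet regexps. The variable names are illustrative.

```python
import re

text = "r\u00f4le"              # "role" with o-circumflex: 4 characters
data = text.encode("utf-8")     # b"r\xc3\xb4le": the accented letter takes 2 octets

# Character regexps: "." consumes one character.
char_one = bool(re.fullmatch(r"r.le", text))
char_two = bool(re.fullmatch(r"r..le", text))

# Octet regexps: "." consumes one octet.
byte_one = bool(re.fullmatch(rb"r.le", data))
byte_two = bool(re.fullmatch(rb"r..le", data))

print(char_one, char_two)   # True False
print(byte_one, byte_two)   # False True
```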
Perl is using octet/byte regexps, whereas grep is using character
regexps. Although arguable, I believe users would prefer grep's
behaviour (other than its speed).
I believe a better solution would be for grep to convert the character
regexp to an octet regexp. E.g. the character regexp "." (which I'll assume
for simplicity matches any character) might be translated to
(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*).
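To make the translation idea concrete, here is a minimal sketch, again assuming Python's `re` module rather than grep's own matcher. `translate` and `matches` are hypothetical helpers that handle only literal characters and "."; the octet pattern for "." is the one proposed above, kept as loose as it is written there (it does not check how many continuation octets follow a lead octet).

```python
import re

# Octet-level stand-in for the character regexp ".", as proposed above.
ANY_CHAR = rb"(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*)"

def translate(char_regexp: str) -> bytes:
    """Rewrite a character regexp made of literals and "." into an octet regexp."""
    parts = []
    for ch in char_regexp:
        if ch == ".":
            parts.append(ANY_CHAR)
        else:
            # Literal characters become their (escaped) UTF-8 octets.
            parts.append(re.escape(ch.encode("utf-8")))
    return b"".join(parts)

def matches(char_regexp: str, text: str) -> bool:
    """Match the translated octet regexp against the UTF-8 encoding of text."""
    return re.fullmatch(translate(char_regexp), text.encode("utf-8")) is not None

print(matches("r.le", "r\u00f4le"))    # True
print(matches("r.le", "role"))         # True
print(matches("r..le", "r\u00f4le"))   # False
```

With this translation the octet matcher gives the same answers as grep in the transcript: 'r.le' matches both lines and 'r..le' matches neither.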
That translation assumes that an accented character formed by
composition is to be considered distinct from a single unicode character
(H. S. Teoh's example above). I'm not familiar with the unicode spec.
Maybe it's reasonable to consider them different. Otherwise, I believe
the translate-the-regexp approach is still applicable but requires
longer translations.
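The composition issue can be seen in a small sketch, assuming Python's standard `unicodedata` module. The precomposed and decomposed spellings of the same accented letter are different code point sequences, so a one-character "." matches one and not the other. Normalising the input to NFC first is shown only as one possible way to treat them alike; it is an assumption here, not something the message proposes.

```python
import re
import unicodedata

precomposed = "r\u00f4le"                                # accented letter as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # "o" followed by U+0302

lengths = (len(precomposed), len(decomposed))
match_precomposed = bool(re.fullmatch(r"r.le", precomposed))
match_decomposed = bool(re.fullmatch(r"r.le", decomposed))

# Recomposing before matching makes both spellings behave the same.
recomposed = unicodedata.normalize("NFC", decomposed)
match_recomposed = bool(re.fullmatch(r"r.le", recomposed))

print(lengths)                                # (4, 5)
print(match_precomposed, match_decomposed)    # True False
print(match_recomposed)                       # True
```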
However, I wonder if the problem is just that the conversion of the
input stream to wchars is inefficient. Off hand, I don't see why it
should make things so much slower.
pjrm.