Comment 17 for bug 7906

Debian Bug Importer (debzilla) wrote :

Message-id: <email address hidden>
Date: Tue, 29 Jun 2004 23:15:22 +1000
From: Peter Moulder <email address hidden>
To: <email address hidden>
Subject: perl not fair comparison: perl gets "wrong" answer for utf-8 text

Suppose LC_CTYPE is set to a UTF-8 locale.

  $ (echo rôle; echo role) | grep 'r.le'
  rôle
  role
  $ (echo rôle; echo role) | perl -ne '/r.le/ and print'
  role
  $ (echo rôle; echo role) | grep 'r..le'
  $ (echo rôle; echo role) | perl -ne '/r..le/ and print'
  rôle

(This is with perl_5.8.3-3, grep_2.5.1.ds1-2.)

Perl is using octet/byte regexps, whereas grep is using character
regexps. Although arguable, I believe users would prefer grep's
behaviour (other than its speed).
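
The difference between the two readings can be reproduced outside grep
and perl. The following is only a minimal sketch in Python, assuming the
accented input word is "rôle" encoded as UTF-8; the helper name
matching() is made up for illustration. Matching against decoded text
gives character semantics, matching against the raw bytes gives octet
semantics.

```python
import re

# The two input lines from the transcript, with the accented word as "rôle".
lines = ["rôle", "role"]

def matching(pattern, items):
    """Return the items in which the pattern matches somewhere."""
    return [item for item in items if re.search(pattern, item)]

# Character semantics: "." consumes one whole character.
char_one = matching(r"r.le", lines)
char_two = matching(r"r..le", lines)

# Octet semantics: "." consumes one byte, and "ô" is two bytes in UTF-8.
encoded = [s.encode("utf-8") for s in lines]
byte_one = [b.decode("utf-8") for b in matching(rb"r.le", encoded)]
byte_two = [b.decode("utf-8") for b in matching(rb"r..le", encoded)]

print(char_one, char_two)   # character results, as grep reported above
print(byte_one, byte_two)   # octet results, as perl reported above
```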

I believe a better solution would be for grep to convert the character
regexp to an octet regexp. E.g. the character regexp "." (which I'll assume
for simplicity matches any character) might be translated to
(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*).
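
To make the idea concrete, here is a rough sketch in Python, not grep's
actual code. The helper names to_octet_regexp() and grep_octets() are
hypothetical, and the substitution is deliberately naive: it rewrites
every "." and ignores escapes and bracket expressions. It only shows
that an octet matcher using the expression above gives the
character-level answers for the two test words.

```python
import re

# The octet-level stand-in for "." proposed in the message.
UTF8_DOT = rb"(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*)"

def to_octet_regexp(char_regexp: bytes) -> bytes:
    # Hypothetical helper: naive textual substitution of every ".".
    # A real translator would have to parse escapes and bracket expressions.
    return char_regexp.replace(b".", UTF8_DOT)

def grep_octets(char_regexp: bytes, lines):
    # Match the translated regexp against raw bytes; decode only for display.
    pattern = re.compile(to_octet_regexp(char_regexp))
    return [line.decode("utf-8") for line in lines if pattern.search(line)]

lines = ["rôle".encode("utf-8"), b"role"]
one_dot = grep_octets(b"r.le", lines)
two_dots = grep_octets(b"r..le", lines)

print(one_dot)
print(two_dots)
```

With this translation both words match 'r.le' and neither matches
'r..le', which is what grep printed in the transcript.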

That translation assumes that an accented character formed by
composition is to be considered distinct from a single Unicode character
(H. S. Teoh's example above). I'm not familiar with the Unicode spec.
Maybe it's reasonable to consider them different. Otherwise, I believe
the translate-the-regexp approach is still applicable but requires
longer translations.
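
As an illustration of what a longer translation could look like, the
sketch below (Python, under loud assumptions) lets the translated "."
also swallow any combining marks that follow the base character. It
treats only U+0300..U+036F (the Combining Diacritical Marks block) as
combining characters, which is an assumption made for brevity and not a
complete treatment; the names COMBINING, LONG_DOT and matches() are
hypothetical.

```python
import re
import unicodedata

precomposed = "r\u00f4le"                               # ô as the single code point U+00F4
decomposed = unicodedata.normalize("NFD", precomposed)  # "o" followed by U+0302

# The short translation of "." from the message.
UTF8_DOT = rb"(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*)"

# Assumption: only U+0300..U+036F count as combining characters.
# In UTF-8 that range is CC 80..CC BF followed by CD 80..CD AF.
COMBINING = rb"(?:\xcc[\x80-\xbf]|\xcd[\x80-\xaf])"

# Longer translation: one base character plus any trailing combining marks.
LONG_DOT = UTF8_DOT + COMBINING + rb"*"

def matches(dot: bytes, text: str) -> bool:
    # Build the octet regexp for the character regexp 'r.le' and search the bytes.
    return re.search(rb"r" + dot + rb"le", text.encode("utf-8")) is not None

short_pre = matches(UTF8_DOT, precomposed)
short_dec = matches(UTF8_DOT, decomposed)
long_pre = matches(LONG_DOT, precomposed)
long_dec = matches(LONG_DOT, decomposed)

print(short_pre, short_dec, long_pre, long_dec)
```

The short translation treats the composed form as two characters, so
'r.le' misses it; the longer one accepts both spellings.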

However, I wonder if the problem is just that the conversion of the
input stream to wchars is inefficient. Offhand, I don't see why it
should make things so much slower.

pjrm.