Comment 6 for bug 7906

In , Peter Moulder (peter-moulder) wrote : perl not fair comparison: perl gets "wrong" answer for utf-8 text

Suppose LC_CTYPE is set to a UTF-8 locale.

  $ (echo rôle; echo role) | grep 'r.le'
  rôle
  role
  $ (echo rôle; echo role) | perl -ne '/r.le/ and print'
  role
  $ (echo rôle; echo role) | grep 'r..le'
  $ (echo rôle; echo role) | perl -ne '/r..le/ and print'
  rôle

(This is with perl_5.8.3-3, grep_2.5.1.ds1-2.)

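Perl is using octet/byte regexps, whereas grep is using character
regexps: the ô in rôle is one character but two octets in UTF-8, so
"." matches it for grep but not for perl. It's arguable, but I believe
users would prefer grep's behaviour (apart from its speed).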
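
The difference can be reproduced outside grep and perl. The following is
a minimal sketch in Python, used here only for illustration; the helper
names char_matches and byte_matches are made up for this example and are
not part of either tool.

```python
import re

words = ["rôle", "role"]

def char_matches(pattern, items):
    # Match against decoded text: "." consumes one character.
    return [w for w in items if re.search(pattern, w)]

def byte_matches(pattern, items):
    # Match against the UTF-8 encoding: "." consumes one octet.
    bpat = pattern.encode("ascii")
    return [w for w in items if re.search(bpat, w.encode("utf-8"))]

print(char_matches("r.le", words))   # ['rôle', 'role']  (grep-like)
print(byte_matches("r.le", words))   # ['role']          (perl-like)
print(char_matches("r..le", words))  # []
print(byte_matches("r..le", words))  # ['rôle']
```

The byte-level results mirror the perl output above because "ô" is
encoded as the two octets 0xC3 0xB4.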

I believe a better solution would be for grep to convert the character
regexp to an octet regexp. E.g. the character regexp "." (which I'll assume
for simplicity matches any character) might be translated to
(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*).
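
As a rough sketch of that translation (again in Python, purely for
illustration): the hypothetical translate() below handles only literal
characters and ".", and uses the octet pattern given above unchanged. A
real implementation would also have to translate bracket expressions,
and could tighten the pattern so that each lead byte requires the right
number of continuation bytes.

```python
import re

# Octet-level stand-in for the character regexp ".", as proposed above.
ANY_CHAR = rb"(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*)"

def translate(pattern: str) -> bytes:
    """Translate a character regexp to an octet regexp.

    Assumption: the pattern contains only literal characters and ".".
    """
    out = []
    for ch in pattern:
        if ch == ".":
            out.append(ANY_CHAR)
        else:
            # Literal character: emit its UTF-8 octets, escaped.
            out.append(re.escape(ch.encode("utf-8")))
    return b"".join(out)

def octet_grep(pattern, items):
    # Search the raw UTF-8 octets with the translated pattern.
    rx = re.compile(translate(pattern))
    return [w for w in items if rx.search(w.encode("utf-8"))]

print(octet_grep("r.le", ["rôle", "role"]))   # ['rôle', 'role']
print(octet_grep("r..le", ["rôle", "role"]))  # []
```

With the translated pattern, matching on octets gives the same answers
as grep's character matching for these inputs.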

That translation assumes that an accented character formed by
composition (a base letter followed by a combining accent) is to be
considered distinct from the single precomposed Unicode character
(H. S. Teoh's example above). I'm not familiar with the Unicode spec.
Maybe it's reasonable to consider them different. Otherwise, I believe
the translate-the-regexp approach is still applicable but requires
longer translations.
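
To make the composition issue concrete, here is a small Python sketch
(illustrative only; describe() is a made-up helper). It compares the
precomposed form of "rôle" with the decomposed form, in which the accent
is a separate combining code point.

```python
import re
import unicodedata

# NFC: "ô" is the single code point U+00F4.
precomposed = unicodedata.normalize("NFC", "r\u00f4le")
# NFD: "ô" becomes "o" followed by U+0302 (combining circumflex).
decomposed = unicodedata.normalize("NFD", "r\u00f4le")

def describe(s):
    # (code points, UTF-8 octets, does the whole string match r.le?)
    return len(s), len(s.encode("utf-8")), bool(re.fullmatch("r.le", s))

print(describe(precomposed))  # (4, 5, True)
print(describe(decomposed))   # (5, 6, False)
```

If the two forms are to be treated as the same character, the
translation of "." would also have to accept a base character followed
by any combining marks, which is where the longer translations come in.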

However, I wonder if the problem is just that the conversion of the
input stream to wchars is inefficient. Offhand, I don't see why it
should make things so much slower.

pjrm.