Comment 18 for bug 7906

Debian Bug Importer (debzilla) wrote:

Message-id: <email address hidden>
Date: Wed, 30 Jun 2004 16:46:42 +1000
From: Peter Moulder <email address hidden>
To: <email address hidden>
Subject: gprof; combining diacritical marks; octet regexp conversion

According to gprof on grep compiled with -pg (without installing
libc6-prof), nearly all the time is spent in check_multibyte_string.

In the case of UTF-8, we don't need the return value of
check_multibyte_string: bytes other than the initial byte of UTF-8
characters (ignoring the composition case) have (c & 0xc0) == 0x80
(i.e. bit7 = 1, bit6 = 0). To handle the composition case, we add a
test that the character is a combining diacritical: "|| (c == 0xcc)
|| ((c == 0xcd) && (nextchar <= 0xaf))".

Combining diacritical marks are a hassle for grep. According to
http://en.wikipedia.org/wiki/Combining_diacritical_mark, a character
can be followed by more than one combining diacritical mark character.
Presumably, order doesn't matter, so grep 'a<string of n combining
diacritical mark characters>' can match n factorial different strings
in the haystack text, without counting use of precomposed characters.
The simplest way of handling this would be to convert to a canonical
form, say decomposed form with combining diacritical marks in sorted
order.

Note that I haven't checked the Unicode standards on this point:
possibly order is to be considered significant, in which case the only
possible matches are decomposed vs use of precomposed character. This
would make the convert-to-octet-regexp approach practical (see below).

Another issue with decomposable characters is that we must use
negative lookahead tests: if searching for 'o' then we must check that
the matched 'o' isn't followed by a combining diacritical mark
character. The alternative of canonicalizing to precomposed form
instead of decomposed form has its own expense: if there are 112
possible diacritical mark characters, and characters can be followed
by an arbitrary selection of those, then we need an extra 112 bits per
canonical character to represent those. And even that assumes that
only presence or absence (rather than number or order) of combining
diacritical marks is significant, i.e. that e<macron><acute><acute> is
to be considered equivalent to e<acute><macron>. If number and order
are significant then no finite number of bits suffices.

Note that grep doesn't currently handle combining diacritical marks:

  $ printf 'e\xcc\x80\n'
  è
  $ printf 'e\xcc\x80\n' | grep 'è'
  <nothing>
  $ printf 'e\xcc\x80\n' | grep '^.$'
  <nothing>

More remarks on the idea of converting character regexps to byte
regexps (see previous message).

First, note that it works for UTF-8, but not e.g. GBK, precisely
because in UTF-8 one can't mistake the middle of a character for the
beginning of one. E.g. in GBK, the string for 我我 contains the string
for 椅; there is no way to tell where the character beginnings are
short of something like what grep is already doing.

Is it worth adding special code for UTF-8 (probably sharable with
other
UTF encodings) if we still need something like the current code to
handle GBK and other multi-byte encodings? Well, UTF-8 is likely to
become the primary encoding on Debian systems.

The example translation given for "." may be off-putting in its
complexity. However, we should note that many, perhaps most, common
regexps don't need any translation at all (ignoring character
composition issues). E.g. '[abc]a.*b($|\>)' doesn't need any change.
'rô*l' needs a minor change of wrapping the multi-byte ô: 'r(?:ô)*l'.
Similarly for character ranges that include multi-byte characters:
'r[ioôé]le' becomes 'r(?:[io]|ô|é)le'. The fragment '.+' (extended
regexp) may or may not need change, depending on whether the preceding
and following fragments only match whole characters. E.g. 'a.+b'
doesn't need change, but 'a.+.+b' does. Whether '[a-z]' needs changing
depends on LC_COLLATE, i.e. on whether that range includes ô and so
on. However, in most places where a regexp uses [a-z], one would want
to exclude ô and so on, so it would be a bug not to specify
LC_COLLATE=C.

I've now checked the Single Unix Spec for the regexp '.': it matches
all characters other than NUL (and in particular it matches \n, except
that grep should already have split its input into lines excluding
\n).

pjrm.