According to gprof on grep compiled with -pg (without installing libc6-prof), ~all the time is spent in check_multibyte_string.

In the case of UTF-8, we don't need the return value of check_multibyte_string: bytes other than the initial byte of a UTF-8 character (ignoring the composition case) have (c & 0xc0) == 0x80 (i.e. bit7=1, bit6=0). To handle the composition case, we add a test for the start of a combining diacritical mark: "|| (c == 0xcc) || ((c == 0xcd) && (nextchar <= 0xaf))". (A rough sketch of this test appears further down.)

Combining diacritical marks are a hassle for grep. According to http://en.wikipedia.org/wiki/Combining_diacritical_mark, a character can be followed by more than one combining diacritical mark character. Presumably order doesn't matter, so a pattern character followed by n combining marks can match n! (n factorial) different strings in the haystack text, without even counting the use of precomposed characters. The simplest way of handling this would be to convert to a canonical form, say decomposed form with combining diacritical marks in sorted order. Note that I haven't checked the Unicode standards on this point: possibly order is to be considered significant, in which case the only possible matches are the decomposed form vs. the use of a precomposed character. That would make the convert-to-octet-regexp approach practical (see below).

Another issue with decomposable characters is that we must use negative lookahead tests: if searching for 'o' then we must check that the matched 'o' isn't followed by a combining diacritical mark character.

The alternative of canonicalizing to precomposed form instead of decomposed form has its own expense: if there are 112 possible combining diacritical mark characters, and a character can be followed by an arbitrary selection of those, then we need an extra 112 bits per canonical character to represent them. And even that assumes that only the presence or absence (rather than the number or order) of combining diacritical marks is significant, i.e. that 'e' followed by a given set of marks is to be considered equivalent no matter how many times or in what order the marks appear. If number and order are significant then no finite number of bits suffices.

Note that grep doesn't currently handle combining diacritical marks:

  $ printf 'e\xcc\x80\n'
  è
  $ printf 'e\xcc\x80\n' | grep 'è'
  $ printf 'e\xcc\x80\n' | grep '^.$'

(The second and third commands produce no output, i.e. no match.)

More remarks on the idea of converting character regexps to byte regexps (see previous message).

First, note that it works for UTF-8 but not e.g. GBK, precisely because in UTF-8 one can't mistake the middle of a character for the beginning of one. E.g. in GBK, the string for 我我 contains the string for 椅 (我 is 0xce 0xd2, so 我我 is ce d2 ce d2, and the middle two bytes d2 ce encode 椅); there is no way to tell where the character beginnings are, short of something like what grep is already doing. Is it worth adding special code for UTF-8 (probably sharable with other UTF encodings) if we still need something like the current code to handle GBK and other multi-byte encodings? Well, UTF-8 is likely to become the primary encoding on Debian systems.

The example translation given for '.' may put people off by its complexity. However, note that many, perhaps most, common regexps don't need any translation at all (ignoring character composition issues). E.g. '[abc]a.*b($|\>)' doesn't need any change. 'rô*l' needs only the minor change of wrapping the multi-byte ô: 'r(?:ô)*l'. Similarly for character ranges that include multi-byte characters: 'r[ioôé]le' becomes 'r(?:[io]|ô|é)le'. The fragment '.+' (extended regexp) may or may not need a change, depending on whether the preceding and following fragments can only match whole characters. E.g. 'a.+b' doesn't need a change, but 'a.+.+b' does.
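Returning to the byte test mentioned at the top of this message: below is a rough sketch, in C, of the check I have in mind. The function name and exact factoring are mine, not grep's; the constants are just the UTF-8 facts used above, namely that continuation bytes have their top two bits equal to 10, and that the combining diacritical marks U+0300..U+036F encode as 0xcc 0x80 through 0xcd 0xaf.

  /* Sketch only, not grep's actual code: may a match boundary fall just
     before the byte c?  nextchar is the byte that follows c in the buffer.  */
  static int
  utf8_boundary_ok (unsigned char c, unsigned char nextchar)
  {
    if ((c & 0xc0) == 0x80)
      return 0;   /* continuation byte: middle of a multibyte character */
    if (c == 0xcc || (c == 0xcd && nextchar <= 0xaf))
      return 0;   /* start of a combining diacritical mark: still part of
                     the preceding (composed) character */
    return 1;
  }

With a test of this form the per-byte scan in check_multibyte_string could be skipped entirely for UTF-8; encodings like GBK would still need the current treatment.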
Whether '[a-z]' needs changing depends on LC_COLLATE, i.e. on whether that range includes ô and so on. However, in most places where a regexp uses [a-z] one would want to exclude ô and so on, so it would be a bug not to specify LC_COLLATE=C.

I've now checked the Single Unix Specification on the regexp '.': it matches any character other than NUL (and in particular it matches \n, except that grep should already have split its input into lines, excluding the \n).

pjrm.