huge performance hit for -i with UTF-8 locales

Bug #75695 reported by Peter Cordes
Affects         Status      Importance  Assigned to  Milestone
grep            Unknown     Unknown
grep (Ubuntu)   Incomplete  Undecided   Unassigned

Bug Description

On a source tree with 28MB of .c and .h files (Mesa), grep is slow with -i and fast without it under the default Ubuntu locale settings (LANG=en_US.UTF-8, no LC_ variables set). Actually, even some [Vv]-style patterns are much faster with LANG=C, so this looks even more like
https://bugs.launchpad.net/distros/ubuntu/+source/grep/+bug/47634

My box is a Core 2 Duo (2.4GHz), which makes a beast like GNOME feel almost as snappy as fluxbox :) Everything is in the disk cache, so I/O isn't a factor. Neither is memory bandwidth. The machine was otherwise idle. I'm running AMD64 Edgy.

peter@tesla:/usr/local/src/g965/mesa$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
... (all the same)

(Times are measured on the second consecutive run, so the CPU core it runs on is at full clock speed the whole time.)
time find -name '*.[ch]' | xargs grep -i 'volatile_s3tc'
 real 0m3.498s; user 0m3.483s; sys 0m0.023s

time find -name '*.[ch]' | xargs grep 'volatile.*s3tc'
 real 0m0.076s; user 0m0.050s; sys 0m0.023s

In non-UTF-8 locales, grep -i is just as fast as grep without -i:
time find -name '*.[ch]' | LANG=C xargs grep -i 'volatile.*s3tc'
 real 0m0.083s; user 0m0.067s; sys 0m0.020s

time find -name '*.[ch]' | LANG=en_CA xargs grep -i 'volatile.*s3tc'
 real 0m0.079s; user 0m0.050s; sys 0m0.027s

Writing the case-insensitive pattern out by hand takes more time, but is not really slow. However, it probably doesn't match everything that grep -i would on input that isn't all 7-bit ASCII:
 time find -name '*.[ch]' | xargs grep '[Vv][Oo][Ll][Aa][Tt][Ii][Ll][Ee].*[Ss]3[Tt][Cc]'
real 0m0.340s; user 0m0.313s; sys 0m0.027s

The hand-written pattern is affected by locale settings, too.
time find -name '*.[ch]' | LANG=C xargs grep '[Vv][Oo][Ll][Aa][Tt][Ii][Ll][Ee].*[Ss]3[Tt][Cc]'
real 0m0.096s; user 0m0.080s; sys 0m0.027s
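
For anyone repeating the test, the bracket pattern above can be generated instead of typed. This is only a minimal sketch: ci_pattern is a hypothetical helper name, not something grep provides, and it naively rewrites every ASCII letter, so it would mangle a pattern that already contains bracket expressions or backslash escapes. It only knows about ASCII letters, which is the same limitation the hand-written pattern has compared with grep -i.

```sh
# ci_pattern PATTERN
# Hypothetical helper: expand each ASCII letter in PATTERN into a
# two-character bracket expression, e.g. "v" -> "[Vv]".
# Non-letters (., *, digits, ...) are copied through unchanged.
ci_pattern() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c ~ /[A-Za-z]/)
                out = out "[" toupper(c) tolower(c) "]"
            else
                out = out c
        }
        print out
    }'
}

# Usage, mirroring the timing commands in this report:
#   time find -name '*.[ch]' | xargs grep "$(ci_pattern 'volatile.*s3tc')"
#   time find -name '*.[ch]' | LANG=C xargs grep "$(ci_pattern 'volatile.*s3tc')"
```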

ahendry (andrew-hendry) wrote :

I was going to log a new bug, but this one looks very similar. Here are the details I found.

Peter, try setting LANG to a name that matches a directory under /usr/lib/locale (probably en_CA.utf8) and rerun your test, to see whether we have the same bug.

----------------------------------------------------------------------------

I've noticed a small performance hit arising from locale settings. Ubuntu (gnome-language-selector) sets the locale variables in a format like:

LANG=en_AU.UTF-8
LANGUAGE=en_AU:en
LC_CTYPE="en_AU.UTF-8"
LC_NUMERIC="en_AU.UTF-8"
LC_TIME="en_AU.UTF-8"
LC_COLLATE="en_AU.UTF-8"
LC_MONETARY="en_AU.UTF-8"
LC_MESSAGES="en_AU.UTF-8"
LC_PAPER="en_AU.UTF-8"
LC_NAME="en_AU.UTF-8"
LC_ADDRESS="en_AU.UTF-8"
LC_TELEPHONE="en_AU.UTF-8"
LC_MEASUREMENT="en_AU.UTF-8"
LC_IDENTIFICATION="en_AU.UTF-8"

When binaries are executed, the LANG is looked up in /usr/lib/locale, which on my system looks like:

xxx@xxx:/usr/lib/locale$ ls
en_AU.utf8 en_DK.utf8 en_IE.utf8 en_PH.utf8 en_ZA.utf8
en_BW.utf8 en_GB.utf8 en_IN en_SG.utf8 en_ZW.utf8
en_CA.utf8 en_HK.utf8 en_NZ.utf8 en_US.utf8

They don't match up: en_AU.utf8 vs. en_AU.UTF-8.

With the default LANG, running strace on various binaries ('ls', for example) shows many failed opens such as:
open("/usr/lib/locale/en_AU.UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
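
As background on the mismatch: as far as I know, glibc first tries the locale name exactly as given and, when that fails, retries with a normalized codeset (lowercased, with non-alphanumeric characters removed), which is how en_AU.UTF-8 ends up finding en_AU.utf8. The sketch below imitates that normalization so the directory name can be predicted. normalize_locale is a hypothetical helper, the rule is my reading of glibc's behaviour rather than a documented interface, and modifiers such as @euro are not handled.

```sh
# normalize_locale NAME
# Hypothetical helper: print NAME with its codeset part (after the first ".")
# lowercased and stripped of non-alphanumeric characters.
# Assumption: this approximates the fallback name glibc looks up.
# Names without a codeset are printed unchanged; "@modifier" is not handled.
normalize_locale() {
    case $1 in
        *.*)
            lang=${1%%.*}
            codeset=${1#*.}
            ;;
        *)
            printf '%s\n' "$1"
            return 0
            ;;
    esac
    codeset=$(printf '%s' "$codeset" |
        LC_ALL=C tr -cd '[:alnum:]' |
        LC_ALL=C tr '[:upper:]' '[:lower:]')
    printf '%s.%s\n' "$lang" "$codeset"
}

# Usage: check whether the directory for the current LANG exists.
#   ls -d "/usr/lib/locale/$(normalize_locale "$LANG")"
```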

With LANG=en_AU.UTF-8
strace -c ls
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000062           4        14           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0       105        77 open
  0.00    0.000000           0        29           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0        12        12 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2         2 ioctl
  0.00    0.000000           0         6           munmap
  0.00    0.000000           0         1           mprotect
  0.00    0.000000           0         1           _sysctl
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0        42           mmap2
  0.00    0.000000           0        29           fstat64
  0.00    0.000000           0         2           getdents64
  0.00    0.000000           0         1           fcntl64
  0.00    0.000000           0         2           futex
  0.00    0.000000           0         1           set_thread_area
  0.00    0.000000           0         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.000062                   257        91 total

With LANG=en_AU.utf8, to match the directory in /usr/lib/locale
strace -c ls
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
   nan    0.000000           0        14           read
   nan    0.000000           0         3           write
   nan    0.000000 ...


Jean-Baptiste Lallement (jibel) wrote :

Thanks for your report. This is known upstream and a patch has been proposed.

Changed in grep:
status: New → Confirmed
Ma Hsiao-chun (mahsiaochun) wrote :

Upstream has already marked their bug as fixed. Is this still an issue?

Changed in grep (Ubuntu):
status: Confirmed → Incomplete
Peter Cordes (peter-cordes) wrote :

The performance hit of -i hasn't changed with 12.04 LTS. Will have to check with a newer grep, I guess. Seeing e.g. 25 secs to grep -i on the .c/.h files in a Linux source tree, 0.5 secs to grep without -i. 1.3 secs for a LANG=C grep -i. No disk I/O, files are cached.

So en_CA.utf8 is about 20 times slower than the POSIX locale for case-insensitive grepping.

Ubuntu 12.04 does set LANG=en_CA.utf8, and /usr/lib/locale now just contains locale-archive. So I'm not seeing any system calls trying to open nonexistent files like ahendry was.

Again, I haven't yet tried with the most recent Ubuntu. This should be trivially easy for most people to test, as it doesn't require grep to actually match anything. (I still used the volatile s3tc pattern from my original report when searching the Linux tree.) You just need a new version of grep, and locale support for a UTF-8 English locale (e.g. en_US.utf8).

Just run these 3 commands:
time find -name '*.[ch]' | xargs grep -i 'volatile.*s3tc'
time find -name '*.[ch]' | xargs grep 'volatile.*s3tc'
time find -name '*.[ch]' | LANG=C xargs grep -i 'volatile.*s3tc'

 If the LANG=C version isn't much faster than the grep -i with your default locale (and/or LANG=en_US.utf8 if your default for some reason isn't slow), then the problem is fixed and grep has fast case-insensitive utf8 matching.
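
One caveat when repeating this test: LANG=C only takes effect if LC_ALL and LC_CTYPE are unset, because both take precedence over LANG. The wrapper below is a small sketch that sets LC_ALL instead, so the locale under test is unambiguous. grep_tree is a hypothetical name, and the switch to -print0 / xargs -0 (so file names containing whitespace survive) is my addition, not part of the commands above.

```sh
# grep_tree LOCALE DIR GREP_ARGS...
# Hypothetical wrapper: run grep with GREP_ARGS over every *.c / *.h file
# under DIR, forcing the locale via LC_ALL (which overrides LANG and LC_*).
grep_tree() {
    loc=$1
    dir=$2
    shift 2
    find "$dir" -name '*.[ch]' -print0 |
        LC_ALL="$loc" xargs -0 grep "$@"
}

# The three timing runs, rewritten with the wrapper (bash's "time" keyword
# can time a shell function; other shells may need a script file instead):
#   time grep_tree en_US.utf8 . -i 'volatile.*s3tc'
#   time grep_tree en_US.utf8 .    'volatile.*s3tc'
#   time grep_tree C          . -i 'volatile.*s3tc'
```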

Peter Cordes (peter-cordes) wrote :

With 14.04:

time find -name '*.[ch]' | xargs grep -i 'volatile.*s3tc' # en_CA.utf8
real 0m44.320s
user 0m43.777s
sys 0m0.459s

time find -name '*.[ch]' | LANG=C xargs grep -i 'volatile.*s3tc'
real 0m2.078s
user 0m1.795s
sys 0m0.381s

time find -name '*.[ch]' | xargs grep 'volatile.*s3tc' # en_CA.utf8
real 0m1.876s
user 0m0.414s
sys 0m0.523s

The slowdown is still a factor of about 20.
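
For reference, the factors quoted in this report can be recomputed from the "real" times. The snippet is just the arithmetic; slowdown is a hypothetical helper name, and the inputs are the timings posted above.

```sh
# slowdown SLOW_SECONDS FAST_SECONDS
# Hypothetical helper: print SLOW/FAST rounded to one decimal place.
slowdown() {
    LC_ALL=C awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", slow / fast }'
}

slowdown 44.320 2.078   # 14.04: UTF-8 locale with -i vs. LANG=C with -i
slowdown 25 1.3         # 12.04: UTF-8 locale with -i vs. LANG=C with -i
slowdown 3.498 0.083    # original report: en_US.UTF-8 -i vs. LANG=C -i
```

These come out at roughly 21, 19 and 42 respectively, so "about 20" holds for the 12.04 and 14.04 runs on the Linux tree, while the original Mesa measurement was closer to 40.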
