grep is extremely slow with UTF-8

Bug #7906 reported by Debian Bug Importer on 2004-09-11
Affects          Status         Importance   Assigned to   Milestone
grep (Debian)    Fix Released   Unknown
grep (Ubuntu)                   Medium       Ian Jackson

Bug Description

Automatically imported from Debian bug report #181378 http://bugs.debian.org/181378

Package: grep
Version: 2.5.1-5
Followup-For: Bug #181378

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I experienced the same problem, but I just noticed that grep is fast
when the LC_ALL environment variable is set to C. Here's the output of
locale on my system:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Seems like grep handles character encoding conversions inefficiently or
something?

- -- System Information:
Debian Release: testing/unstable
Architecture: powerpc
Kernel: Linux thor 2.4.20-ben8-xfs-lolat #18 Wed Aug 6 10:56:56 CEST 2003 ppc
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8

Versions of packages grep depends on:
ii libc6 2.3.1-16 GNU C Library: Shared libraries an

- -- no debconf information

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/N62aWoGvjmrbsgARArZVAJ4iVqUDptDeldcvNgA2DlWmoOuXnwCffIQ5
n9TZeES03gReAsL5IkS9ock=
=KnnW
-----END PGP SIGNATURE-----
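
For anyone who wants to check the locale dependence on their own machine, here is a minimal timing sketch. It is not taken from the report: the helper names `locale_env` and `time_grep` and the sample input are made up for illustration, and it assumes a `grep` binary on the PATH. If the UTF-8 locale has not been generated, grep will most likely fall back to the C locale and both timings will look alike.

```python
import os
import shutil
import subprocess
import time


def locale_env(lc_all):
    """Return a copy of the current environment with LC_ALL overridden."""
    env = dict(os.environ)
    env["LC_ALL"] = lc_all
    return env


def time_grep(pattern, data, lc_all):
    """Run grep over `data` (bytes fed on stdin) and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(
        ["grep", "-c", pattern],
        input=data,
        env=locale_env(lc_all),
        stdout=subprocess.DEVNULL,
        check=False,  # grep exits with 1 when nothing matches; not an error here
    )
    return time.perf_counter() - start


if __name__ == "__main__" and shutil.which("grep"):
    sample = b"just a test line\n" * 10000  # hypothetical input, not the reporter's file
    for name in ("C", "en_US.UTF-8"):
        print(name, round(time_grep("test", sample, name), 3), "s")
```

Setting `LC_ALL=C` for a single invocation this way is only a workaround; it also changes what grep treats as a character.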

merge 181378 206470
thanks

These bugs appear to be the same (see latest messages in #181378).

As for the bugs themselves, could it be that the problem is caused by grep
localizing every input character, as opposed to localizing the regex and
then matching the resulting bytes? I haven't looked at the code to be
sure, but this is what immediately came to mind when I read about the
LC_CTYPE=C speed difference.

Translating every input character would, indeed, slow things down a lot. A
better alternative would be to localize the regex, match on a byte-by-byte
basis, and then localize the output only if it matches. However, this may
have pathological problems if multiple representations of the same
character are possible (e.g. Unicode combining diacritics vs. precomposed
characters). I'm not sure what the solution would be in this case.
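
To make the "multiple representations" concern concrete, here is a small sketch using Python's standard `unicodedata` module. It only illustrates the problem and one possible remedy (normalizing both sides before comparing); it does not describe what grep does, and `same_text` is a name invented for this example.

```python
import unicodedata

precomposed = "\u00e8"   # e with grave accent as a single code point
decomposed = "e\u0300"   # plain e followed by U+0300 COMBINING GRAVE ACCENT


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two spellings render identically but differ byte for byte in UTF-8,
# so a purely byte-wise matcher treats them as different text.
print(precomposed.encode("utf-8"))
print(decomposed.encode("utf-8"))
print(precomposed == decomposed)
print(same_text(precomposed, decomposed))
```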

T

--
If you look at a thing nine hundred and ninety-nine times, you are perfectly
safe; if you look at it the thousandth time, you are in frightful danger of
seeing it for the first time. -- G. K. Chesterton

severity 206470 important

This patch seems to help; I extracted it from the src rpm at

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/1/SRPMS/

and tweaked one hunk for it to apply.

--
Earthling Michel Dänzer | Debian (powerpc), X and DRI developer
Software libre enthusiast | http://svcs.affero.net/rm.php?r=daenzer

Package: grep
Version: 2.5.1.ds1-2
Severity: normal
Followup-For: Bug #181378

I found the magic frontier of grep: DFAs with 1024 states.

Please make grep a little quicker or replace it completely with pcre or
another fast implementation, as far as POSIX allows it.

$ time egrep .\{1024,\} debug | wc
    109 17987 169422

real 1m47.522s
user 1m28.960s
sys 0m3.680s

$ time egrep .\{1023,\} debug | wc
    109 17987 169422

real 0m1.074s
user 0m0.940s
sys 0m0.100s

$ time perl -ne '/.{1024,}/ and print' debug | wc
    109 17987 169422

real 0m0.077s
user 0m0.070s
sys 0m0.000s

-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux wwid 2.4.22-1-k7 #5 Sat Oct 4 14:11:12 EST 2003 i686
Locale: LANG=de_DE.ISO-8859-15@euro, LC_CTYPE=C (ignored: LC_ALL set to de_DE@euro)

Versions of packages grep depends on:
ii libc6 2.3.2.ds1-11 GNU C Library: Shared libraries an

-- no debconf information

Suppose UTF-8 LC_CTYPE.

  $ (echo rôle; echo role) | grep 'r.le'
  rôle
  role
  $ (echo rôle; echo role) | perl -ne '/r.le/ and print'
  role
  $ (echo rôle; echo role) | grep 'r..le'
  $ (echo rôle; echo role) | perl -ne '/r..le/ and print'
  rôle

(This is with perl_5.8.3-3, grep_2.5.1.ds1-2.)

Perl is using octet/byte regexps, whereas grep is using character
regexps. Although arguable, I believe users would prefer grep's
behaviour (other than its speed).

I believe a better solution would be for grep to convert the character
regexp to an octet regexp. E.g. the character regexp "." (which I'll assume
for simplicity matches any character) might be translated to
(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*).

That translation assumes that an accented character formed by
composition is to be considered distinct from a single unicode character
(H. S. Teoh's example above). I'm not familiar with the unicode spec.
Maybe it's reasonable to consider them different. Otherwise, I believe
the translate-the-regexp approach is still applicable but requires
longer translations.
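
The proposed translation can be tried out with Python's byte-oriented `re` module. This is only a toy sketch of the idea: `char_dot_to_octets` is a made-up helper that handles patterns in which "." is the only metacharacter, and it says nothing about how grep's matcher is actually built.

```python
import re

# Octet-level stand-in for the character regexp ".", as proposed above:
# either one ASCII byte, or a lead byte followed by continuation bytes.
ANY_CHAR = rb"(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*)"


def char_dot_to_octets(pattern):
    """Toy translation: replace every '.' in a bytes pattern by ANY_CHAR."""
    return pattern.replace(b".", ANY_CHAR)


role_accented = "r\u00f4le".encode("utf-8")  # "role" with a circumflex on the o
role_plain = b"role"

byte_re = re.compile(b"r.le")                       # '.' means one byte
char_re = re.compile(char_dot_to_octets(b"r.le"))   # '.' means one character

for line in (role_accented, role_plain):
    print(line, bool(byte_re.search(line)), bool(char_re.search(line)))
```

With the translated pattern, both lines match `r.le` and neither matches `r..le`, which is the character-wise behaviour shown for grep in the transcript above.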

However, I wonder if the problem is just that the conversion of the
input stream to wchars is inefficient. Off hand, I don't see why it
should make things so much slower.

pjrm.

Download full text (4.0 KiB)

According to gprof on grep compiled with -pg (without installing libc6-prof),
~all the time is spent in check_multibyte_string.

In the case of utf-8, we don't need the return value of
check_multibyte_string: bytes other than the initial byte of utf-8
characters (ignoring the composition case) have (c & 0xc0) == 0x80 (i.e. bit7=1, bit6=0).
To handle the composition case, we add a test that the wide character
is a combining diacritical: "|| (c == 0xcc) || ((c == 0xcd) && (nextchar
<= 0xaf))".
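
The two byte tests quoted above can be written out as follows. This is a sketch in Python rather than the C used by grep; the function names are invented, and the input is assumed to be valid UTF-8.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx, i.e. not the first byte of a character."""
    return (byte & 0xC0) == 0x80


def starts_combining_mark(data, i):
    """True if data[i] begins the UTF-8 encoding of U+0300..U+036F.

    U+0300..U+033F encode as 0xCC 0x80..0xBF, and U+0340..U+036F encode
    as 0xCD 0x80..0xAF, which is where the 0xcc / 0xcd / 0xaf test comes from.
    """
    c = data[i]
    if c == 0xCC:
        return True
    return c == 0xCD and i + 1 < len(data) and data[i + 1] <= 0xAF


def char_starts(data):
    """Offsets at which a character begins, found without decoding anything."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]


print(char_starts("r\u00f4le".encode("utf-8")))
print(starts_combining_mark(b"e\xcc\x80", 1))
```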

Combining diacritical marks are a hassle for grep. According to
http://en.wikipedia.org/wiki/Combining_diacritical_mark, a character can
be followed by more than one combining diacritical mark character.
Presumably, order doesn't matter, so grep 'a<string of n combining
diacritical mark characters>' can match n factorial different strings in
the haystack text, without counting use of precomposed characters.
The simplest way of handling this would be to convert to a canonical
form, say decomposed form with combining diacritical marks in sorted
order.

Note that I haven't checked the unicode standards on this point:
possibly order is to be considered significant, in which case the only
possible matches are decomposed vs use of precomposed character. This
would make the convert-to-octet-regexp approach practical (see below).
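
The ordering question can be probed with Python's `unicodedata` module, which implements the Unicode normalization forms. The sketch below suggests a middle ground: under NFD, marks with different combining classes are sorted into one canonical order, while marks of the same class keep their relative order, so order is significant in that case. The name `canonical` is made up for this example.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
MACRON = "\u0304"     # COMBINING MACRON, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def canonical(text):
    """Decomposed form with combining marks in canonical order (NFD)."""
    return unicodedata.normalize("NFD", text)


# Different combining classes: both input orders normalize to the same string.
print(canonical("a" + ACUTE + DOT_BELOW) == canonical("a" + DOT_BELOW + ACUTE))

# Same combining class: the input order survives normalization.
print(canonical("e" + MACRON + ACUTE) == canonical("e" + ACUTE + MACRON))
```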

Another issue with decomposable characters is that we must use negative
lookahead tests: if searching for `o' then we must check that the
matched 'o' isn't followed by a combining diacritical mark character.
The alternative of canonicalizing to precomposed form instead of
decomposed form has its own expense: if there are 112 possible
diacritical mark characters, and characters can be followed by an
arbitrary selection of those, then we need an extra 112 bits per
canonical character to represent those. And even that assumes that only
presence or absence (rather than number or order) is significant for
combining diacritical marks, i.e. that e<macron><acute><acute> is to be
considered equivalent to e<acute><macron>. If number and order are
significant then no finite number of bits suffices.

Note that grep doesn't currently handle combining diacritical marks:

  $ printf 'e\xcc\x80\n'
  è
  $ printf 'e\xcc\x80\n' | grep 'è'
  <nothing>
  $ printf 'e\xcc\x80\n' | grep '^.$'
  <nothing>

More remarks on the idea of converting character regexps to byte regexps
(see previous message).

First, note that it works for UTF-8, but not e.g. GBK, precisely because
in UTF-8 one can't mistake the middle of a character for the beginning
of one. E.g. in GBK, the string for 我我 contains the string for 椅;
there is no way to tell where the character beginnings are short of
something like what grep is already doing.
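
The GBK example can be checked with Python's standard codecs. This is a small sketch; `contains_encoded` is a name invented here, and the two characters are written as escapes so the snippet does not depend on the file's own encoding.

```python
WO = "\u6211"  # the character repeated twice in the example above
YI = "\u6905"  # the character that appears "in the middle" under GBK


def contains_encoded(haystack, needle, encoding):
    """Byte-wise substring test after encoding both strings."""
    return needle.encode(encoding) in haystack.encode(encoding)


# In GBK the second byte of one character plus the first byte of the next
# can form another valid character, so a byte-wise search finds a false match.
print(contains_encoded(WO + WO, YI, "gbk"))

# In UTF-8 continuation bytes never look like lead bytes, so this cannot happen.
print(contains_encoded(WO + WO, YI, "utf-8"))
```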

Is it worth adding special code for UTF-8 (probably sharable with other
UTF encodings) if we still need something like the current code to
handle GBK and other multi-byte encodings? Well, UTF-8 is likely to
become the primary encoding on Debian systems.

The example translation given for "." may be discouraging because of its
complexity. However, we should note that many, perhaps most, common
regexps don't need any translation at all (ignoring chara...

Read more...

reassign 224993 grep
severity 224993 grave
severity 206470 grave
merge 206470 224993
thanks
Somebody on gnu-utils-bug please fix the grep LC_CTYPE bug!
See it http://bugs.debian.org/206470
Often greps that would take just a second have to be killed as the CPU
meter goes to max and minutes go by!
You guys in the western world don't notice it because you don't use
(certain) LC_CTYPEs.

severity 181378 important
thanks

--
Colin Watson [<email address hidden>]

Debian Bug Importer (debzilla) wrote :

Automatically imported from Debian bug report #181378 http://bugs.debian.org/181378

Debian Bug Importer (debzilla) wrote :

Message-ID: <20030217151028.GA15075@westpeak>
Date: Mon, 17 Feb 2003 23:10:28 +0800
From: Max Zou <email address hidden>
To: <email address hidden>
Subject: grep is extremely slow

--GvXjxJ+pjyke8COw
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Package: grep
Version: 2.5.1-1

When I try to use the latest "grep" to search a pattern in a 100-KB file,
it is considerably slower than the previous version of grep.

Here is a time comparison with grep v2.4.2-3 on the same
machine.

# using grep v2.4.2-3
$ time ./grep-old test file |wc
2232 6696 44261

real 0m0.058s
user 0m0.060s
sys 0m0.000s

# using grep v2.5.1-1
$ time grep test srcfp/a |wc
2232 6696 44261

real 0m10.497s
user 0m10.430s
sys 0m0.010s

Is there any problem with the algorithm used in the latest grep?

I am using Debian sid, kernel-2.4.18 and libc6 2.3.1-11 on
a PIII 700MHz machine with 384MB RAM.

Thanks!

--
regards
ZM

--GvXjxJ+pjyke8COw
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+UPtkmLGEZ2dSLSIRAnRJAKDQQ0pj+nuDThlktOEmrOn1FLXNrwCfXpqi
9vEEfvukAdVO/ZcY/u5Z8SE=
=IWjO
-----END PGP SIGNATURE-----

--GvXjxJ+pjyke8COw--

Debian Bug Importer (debzilla) wrote :

Message-Id: <email address hidden>
Date: Mon, 11 Aug 2003 16:52:10 +0200
From: Michel Daenzer <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: Depends on locale

Package: grep
Version: 2.5.1-5
Followup-For: Bug #181378

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I experienced the same problem, but I just noticed that grep is fast
when the LC_ALL environment variable is set to C. Here's the output of
locale on my system:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Seems like grep handles character encoding conversions inefficiently or
something?

- -- System Information:
Debian Release: testing/unstable
Architecture: powerpc
Kernel: Linux thor 2.4.20-ben8-xfs-lolat #18 Wed Aug 6 10:56:56 CEST 2003 ppc
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8

Versions of packages grep depends on:
ii libc6 2.3.1-16 GNU C Library: Shared libraries an

- -- no debconf information

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/N62aWoGvjmrbsgARArZVAJ4iVqUDptDeldcvNgA2DlWmoOuXnwCffIQ5
n9TZeES03gReAsL5IkS9ock=
=KnnW
-----END PGP SIGNATURE-----

Debian Bug Importer (debzilla) wrote :

Message-ID: <20030904190238.GB6837@crystal>
Date: Thu, 4 Sep 2003 15:02:38 -0400
From: "H. S. Teoh" <email address hidden>
To: <email address hidden>, <email address hidden>, <email address hidden>
Subject: Identical bugs

merge 181378 206470
thanks

These bugs appear to be the same (see latest messages in #181378).

As for the bugs themselves, could it be that the problem is caused by grep
localizing every input character, as opposed to localizing the regex and
then matching the resulting bytes? I haven't looked at the code to be
sure, but this is what immediately came to mind when I read about the
LC_CTYPE=C speed difference.

Translating every input character would, indeed, slow things down a lot. A
better alternative would be to localize the regex, match on a byte-by-byte
basis, and then localize the output only if it matches. However, this may
have pathological problems if multiple representations of the same
character are possible (e.g. Unicode combining diacritics vs. precomposed
characters). I'm not sure what the solution would be in this case.

T

--
If you look at a thing nine hundred and ninety-nine times, you are perfectly
safe; if you look at it the thousandth time, you are in frightful danger of
seeing it for the first time. -- G. K. Chesterton

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Tue, 07 Oct 2003 07:33:06 +0800
From: Dan Jacobson <email address hidden>
To: <email address hidden>

severity 206470 important

Debian Bug Importer (debzilla) wrote :
Download full text (24.7 KiB)

Message-Id: <email address hidden>
Date: Sat, 13 Dec 2003 16:19:43 +0100
From: Michel Dänzer <email address hidden>
To: <email address hidden>
Subject: patch

--=-R5LE1TcFVajMuwhQdHD2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

This patch seems to help; I extracted it from the src rpm at

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/1/SRPMS/

and tweaked one hunk for it to apply.

--
Earthling Michel Dänzer | Debian (powerpc), X and DRI developer
Software libre enthusiast | http://svcs.affero.net/rm.php?r=daenzer

--=-R5LE1TcFVajMuwhQdHD2
Content-Disposition: attachment; filename=56-grep-2.5.1-gofast.patch
Content-Type: text/x-patch; name=56-grep-2.5.1-gofast.patch; charset=UTF-8
Content-Transfer-Encoding: base64

LS0tIHNyYy9ncmVwLmMuZ29mYXN0CTIwMDMtMTItMTAgMTA6NTA6NTUuMzY0Mzc2OTQ5ICswMDAw
DQorKysgc3JjL2dyZXAuYwkyMDAzLTEyLTEwIDEyOjAwOjIxLjE1NDc0OTMzMCArMDAwMA0KQEAg
LTE4Niw3ICsxODYsOCBAQA0KIA0KIC8qIEZ1bmN0aW9ucyB3ZSdsbCB1c2UgdG8gc2VhcmNoLiAq
Lw0KIHN0YXRpYyB2b2lkICgqY29tcGlsZSkgUEFSQU1TICgoY2hhciBjb25zdCAqLCBzaXplX3Qp
KTsNCi1zdGF0aWMgc2l6ZV90ICgqZXhlY3V0ZSkgUEFSQU1TICgoY2hhciBjb25zdCAqLCBzaXpl
X3QsIHNpemVfdCAqLCBpbnQpKTsNCitzdGF0aWMgc2l6ZV90ICgqZXhlY3V0ZSkgUEFSQU1TICgo
Y2hhciBjb25zdCAqLCBzaXplX3QsIHN0cnVjdCBtYl9jYWNoZSAqLA0KKwkJCQkgIHNpemVfdCAq
LCBpbnQpKTsNCiANCiAvKiBMaWtlIGVycm9yLCBidXQgc3VwcHJlc3MgdGhlIGRpYWdub3N0aWMg
aWYgcmVxdWVzdGVkLiAgKi8NCiBzdGF0aWMgdm9pZA0KQEAgLTUxNiw3ICs1MTcsNyBAQA0KIH0N
CiANCiBzdGF0aWMgdm9pZA0KLXBybGluZSAoY2hhciBjb25zdCAqYmVnLCBjaGFyIGNvbnN0ICps
aW0sIGludCBzZXApDQorcHJsaW5lIChjaGFyIGNvbnN0ICpiZWcsIGNoYXIgY29uc3QgKmxpbSwg
aW50IHNlcCwgc3RydWN0IG1iX2NhY2hlICptYl9jYWNoZSkNCiB7DQogICBpZiAob3V0X2ZpbGUp
DQogICAgIHByaW50ZiAoIiVzJWMiLCBmaWxlbmFtZSwgc2VwICYgZmlsZW5hbWVfbWFzayk7DQpA
QCAtNTM5LDcgKzU0MCw4IEBADQogICAgIHsNCiAgICAgICBzaXplX3QgbWF0Y2hfc2l6ZTsNCiAg
ICAgICBzaXplX3QgbWF0Y2hfb2Zmc2V0Ow0KLSAgICAgIHdoaWxlICgobWF0Y2hfb2Zmc2V0ID0g
KCpleGVjdXRlKSAoYmVnLCBsaW0gLSBiZWcsICZtYXRjaF9zaXplLCAxKSkNCisgICAgICB3aGls
ZSAoKG1hdGNoX29mZnNldCA9ICgqZXhlY3V0ZSkgKGJlZywgbGltIC0gYmVnLCBtYl9jYWNoZSwN
CisJCQkJCSAmbWF0Y2hfc2l6ZSwgMSkpDQogCSAgIT0gKHNpemVfdCkgLTEpDQogICAgICAgICB7
DQogCSAgY2hhciBjb25zdCAqYiA9IGJlZyArIG1hdGNoX29mZnNldDsNCkBAIC01NzMsNyArNTc1
LDggQEANCiAJICBpbnQgaTsNCiAJICBmb3IgKGkgPSAwOyBpIDwgbGltIC0gYmVnOyBpKyspDQog
CSAgICBpYmVnW2ldID0gdG9sb3dlciAoYmVnW2ldKTsNCi0JICB3aGlsZSAoKG1hdGNoX29mZnNl
dCA9ICgqZXhlY3V0ZSkgKGliZWcsIGlsaW0taWJlZywgJm1hdGNoX3NpemUsIDEpKQ0KKwkgIHdo
aWxlICgobWF0Y2hfb2Zmc2V0ID0gKCpleGVjdXRlKSAoaWJlZywgaWxpbS1pYmVnLCBtYl9jYWNo
ZSwNCisJCQkJCSAgICAgJm1hdGNoX3NpemUsIDEpKQ0KIAkJICE9IChzaXplX3QpIC0xKQ0KIAkg
ICAgew0KIAkgICAgICBjaGFyIGNvbnN0ICpiID0gYmVnICsgbWF0Y2hfb2Zmc2V0Ow0KQEAgLTU5
MSw3ICs1OTQsOCBAQA0KIAkgIGxhc3RvdXQgPSBsaW07DQogCSAgcmV0dXJuOw0KIAl9DQotICAg
ICAgd2hpbGUgKGxpbS1iZWcgJiYgKG1hdGNoX29mZnNldCA9ICgqZXhlY3V0ZSkgKGJlZywgbGlt
IC0gYmVnLCAmbWF0Y2hfc2l6ZSwgMSkpDQorICAgICAgd2hpbGUgKGxpbS1iZWcgJiYgKG1hdGNo
X29mZnNldCA9ICgqZXhlY3V0ZSkgKGJlZywgbGltIC0gYmVnLCBtYl9jYWNoZSwNCisJCQkJCQkg
ICAgJm1hdGNoX3NpemUsIDEpKQ0KIAkgICAgICE9IChzaXplX3QpIC0xKQ0KIAl7DQogCSAgY2hh
ciBjb25zdCAqYiA9IGJlZyArIG1h...

Debian Bug Importer (debzilla) wrote :

Message-Id: <email address hidden>
Date: Wed, 18 Feb 2004 04:44:26 +0100
From: Roland Illig <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: grep: ... and Perl is a thousand times faster ...

Package: grep
Version: 2.5.1.ds1-2
Severity: normal
Followup-For: Bug #181378

I found the magic frontier of grep: DFAs with 1024 states.

Please make grep a little quicker or replace it completely with pcre or
another fast implementation, as far as POSIX allows it.

$ time egrep .\{1024,\} debug | wc
    109 17987 169422

real 1m47.522s
user 1m28.960s
sys 0m3.680s

$ time egrep .\{1023,\} debug | wc
    109 17987 169422

real 0m1.074s
user 0m0.940s
sys 0m0.100s

$ time perl -ne '/.{1024,}/ and print' debug | wc
    109 17987 169422

real 0m0.077s
user 0m0.070s
sys 0m0.000s

-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux wwid 2.4.22-1-k7 #5 Sat Oct 4 14:11:12 EST 2003 i686
Locale: LANG=de_DE.ISO-8859-15@euro, LC_CTYPE=C (ignored: LC_ALL set to de_DE@euro)

Versions of packages grep depends on:
ii libc6 2.3.2.ds1-11 GNU C Library: Shared libraries an

-- no debconf information

Debian Bug Importer (debzilla) wrote :

Message-id: <email address hidden>
Date: Tue, 29 Jun 2004 23:15:22 +1000
From: Peter Moulder <email address hidden>
To: <email address hidden>
Subject: perl not fair comparison: perl gets "wrong" answer for utf-8 text

Suppose UTF-8 LC_CTYPE.

  $ (echo rôle; echo role) | grep 'r.le'
  rôle
  role
  $ (echo rôle; echo role) | perl -ne '/r.le/ and print'
  role
  $ (echo rôle; echo role) | grep 'r..le'
  $ (echo rôle; echo role) | perl -ne '/r..le/ and print'
  rôle

(This is with perl_5.8.3-3, grep_2.5.1.ds1-2.)

Perl is using octet/byte regexps, whereas grep is using character
regexps. Although arguable, I believe users would prefer grep's
behaviour (other than its speed).

I believe a better solution would be for grep to convert the character
regexp to an octet regexp. E.g. the character regexp "." (which I'll assume
for simplicity matches any character) might be translated to
(?:[\x00-\x7f]|[\xc0-\xf7][\x80-\xbf]*).

That translation assumes that an accented character formed by
composition is to be considered distinct from a single unicode character
(H. S. Teoh's example above). I'm not familiar with the unicode spec.
Maybe it's reasonable to consider them different. Otherwise, I believe
the translate-the-regexp approach is still applicable but requires
longer translations.

However, I wonder if the problem is just that the conversion of the
input stream to wchars is inefficient. Off hand, I don't see why it
should make things so much slower.

pjrm.

Debian Bug Importer (debzilla) wrote :
Download full text (4.4 KiB)

Message-id: <email address hidden>
Date: Wed, 30 Jun 2004 16:46:42 +1000
From: Peter Moulder <email address hidden>
To: <email address hidden>
Subject: gprof; combining diacritical marks; octet regexp conversion

According to gprof on grep compiled with -pg (without installing libc6-prof),
~all the time is spent in check_multibyte_string.

In the case of utf-8, we don't need the return value of
check_multibyte_string: bytes other than the initial byte of utf-8
characters (ignoring the composition case) have (c & 0xc0) == 0x80 (i.e. bit7=1, bit6=0).
To handle the composition case, we add a test that the wide character
is a combining diacritical: "|| (c == 0xcc) || ((c == 0xcd) && (nextchar
<= 0xaf))".

Combining diacritical marks are a hassle for grep. According to
http://en.wikipedia.org/wiki/Combining_diacritical_mark, a character can
be followed by more than one combining diacritical mark character.
Presumably, order doesn't matter, so grep 'a<string of n combining
diacritical mark characters>' can match n factorial different strings in
the haystack text, without counting use of precomposed characters.
The simplest way of handling this would be to convert to a canonical
form, say decomposed form with combining diacritical marks in sorted
order.

Note that I haven't checked the unicode standards on this point:
possibly order is to be considered significant, in which case the only
possible matches are decomposed vs use of precomposed character. This
would make the convert-to-octet-regexp approach practical (see below).

Another issue with decomposable characters is that we must use negative
lookahead tests: if searching for `o' then we must check that the
matched 'o' isn't followed by a combining diacritical mark character.
The alternative of canonicalizing to precomposed form instead of
decomposed form has its own expense: if there are 112 possible
diacritical mark characters, and characters can be followed by an
arbitrary selection of those, then we need an extra 112 bits per
canonical character to represent those. And even that assumes that only
presence or absence (rather than number or order) is significant for
combining diacritical marks, i.e. that e<macron><acute><acute> is to be
considered equivalent to e<acute><macron>. If number and order are
significant then no finite number of bits suffices.

Note that grep doesn't currently handle combining diacritical marks:

  $ printf 'e\xcc\x80\n'
  è
  $ printf 'e\xcc\x80\n' | grep 'è'
  <nothing>
  $ printf 'e\xcc\x80\n' | grep '^.$'
  <nothing>

More remarks on the idea of converting character regexps to byte regexps
(see previous message).

First, note that it works for UTF-8, but not e.g. GBK, precisely because
in UTF-8 one can't mistake the middle of a character for the beginning
of one. E.g. in GBK, the string for 我我 contains the string for 椅;
there is no way to tell where the character beginnings are short of
something like what grep is already doing.

Is it worth adding special code for UTF-8 (probably sharable with other
UTF encodings) if we still need ...


Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Sat, 11 Sep 2004 04:33:28 +0800
From: Dan Jacobson <email address hidden>
To: <email address hidden>
Cc: <email address hidden>, <email address hidden>
Subject: LC_CTYPE makes grep 800 times slower!

reassign 224993 grep
severity 224993 grave
severity 206470 grave
merge 206470 224993
thanks
Somebody on gnu-utils-bug please fix the grep LC_CTYPE bug!
See it http://bugs.debian.org/206470
Often greps that would take just a second have to be killed as the CPU
meter goes to max and minutes go by!
You guys in the western world don't notice it because you don't use
(certain) LC_CTYPEs.

Thom May (thombot) wrote :

This is an absolutely absurd inflation of severity.

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Sat, 11 Sep 2004 01:14:59 +0100
From: Colin Watson <email address hidden>
To: <email address hidden>
Subject: not release-critical

severity 181378 important
thanks

--
Colin Watson [<email address hidden>]

Debian Bug Importer (debzilla) wrote :

Message-ID: <20041206152037.GA23899@andromeda>
Date: Mon, 6 Dec 2004 10:20:37 -0500
From: Justin Pryzby <email address hidden>
To: <email address hidden>
Subject: profile

I don't know that I can add useful info here, but this just bit me
too.

  $ wc -l /tmp/setuid;
  50 /tmp/setuid

  $ time ltrace grep -v /dev/ /tmp/setuid 2>&1 |LANG=C grep '^mbrtowc(' |wc -l
  77989

  real 0m14.802s
  user 0m6.118s
  sys 0m8.071s

  $ calc 50/14.802
          ~3.37792190244561545737

Justin

Debian Bug Importer (debzilla) wrote :
Download full text (19.7 KiB)

Message-ID: <email address hidden>
Date: Wed, 8 Dec 2004 16:29:56 -0500
From: Simon Law <email address hidden>
To: <email address hidden>
Subject: Red Hat's UTF-8 speedup patch

--mP3DRpeJDSE+ciuQ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

tags 181378 +patch
thanks

Here is Fedora Core 3's patch to grep that makes it work quickly in
UTF-8 environments.

I haven't tested if it applies cleanly to Debian's grep. But if you
make me a co-maintainer, I'll happily spend a couple of hours merging
this and other useful Red Hat patches into our grep.

Simon

--mP3DRpeJDSE+ciuQ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="grep-2.5.1-gofast.patch"

This patch is written by Tim Waugh <email address hidden> and is ripped
from Fedora Core 3's grep 2.5.1-31 package.

It is meant to cache results from mbrtowc(), which significantly speeds
up execution speed in UTF-8 locales.

 -- Simon Law <email address hidden> Wed, 8 Dec 2004 16:28:03 -0500

--- grep-2.5.1/src/grep.c.gofast 2004-02-26 13:17:39.000000000 +0000
+++ grep-2.5.1/src/grep.c 2004-02-26 13:17:39.000000000 +0000
@@ -186,7 +186,8 @@

 /* Functions we'll use to search. */
 static void (*compile) PARAMS ((char const *, size_t));
-static size_t (*execute) PARAMS ((char const *, size_t, size_t *, int));
+static size_t (*execute) PARAMS ((char const *, size_t, struct mb_cache *,
+ size_t *, int));

 /* Like error, but suppress the diagnostic if requested. */
 static void
@@ -516,7 +517,7 @@
 }

 static void
-prline (char const *beg, char const *lim, int sep)
+prline (char const *beg, char const *lim, int sep, struct mb_cache *mb_cache)
 {
   if (out_file)
     printf ("%s%c", filename, sep & filename_mask);
@@ -539,7 +540,8 @@
     {
       size_t match_size;
       size_t match_offset;
- while ((match_offset = (*execute) (beg, lim - beg, &match_size, 1))
+ while ((match_offset = (*execute) (beg, lim - beg, mb_cache,
+ &match_size, 1))
    != (size_t) -1)
         {
    char const *b = beg + match_offset;
@@ -573,7 +575,8 @@
    int i;
    for (i = 0; i < lim - beg; i++)
      ibeg[i] = tolower (beg[i]);
- while ((match_offset = (*execute) (ibeg, ilim-ibeg, &match_size, 1))
+ while ((match_offset = (*execute) (ibeg, ilim-ibeg, mb_cache,
+ &match_size, 1))
    != (size_t) -1)
      {
        char const *b = beg + match_offset;
@@ -591,7 +594,8 @@
    lastout = lim;
    return;
  }
- while (lim-beg && (match_offset = (*execute) (beg, lim - beg, &match_size, 1))
+ while (lim-beg && (match_offset = (*execute) (beg, lim - beg, mb_cache,
+ &match_size, 1))
       != (size_t) -1)
  {
    char const *b = beg + match_offset;
@@ -619,7 +623,7 @@
 /* Print pending lines of trailing context prior to LIM. Trailing context ends
    at the next matching line when OUTLEFT is 0. */
 static void
-prpending (char const *lim)
+prpending (char const *lim, struct mb_cache *mb_cache)
 {
   if (!lastout)
     lastout = bufbeg;
@@ -629,9 +633,10 @@
       size_t match_size;
       --pending;
       if (outleft
- || (((*execute) (lastout, nl - lastout, &match_size, 0) == (size_t) -1)
+ || (((*ex...

Matt Zimmerman (mdz) wrote :

*** Bug 15162 has been marked as a duplicate of this bug. ***

Matt Zimmerman (mdz) wrote :

<dooglus> the fedora guys fixed it ages ago (see patch, here:
http://cvs.fedora.redhat.com/viewcvs/devel/grep/grep-2.5.1-egf-speedup.patch )

I haven't looked at the above patch myself, but I'm archiving it here

Chris Moore (dooglus) wrote :

I have mentioned this bug to a few people, and the response I usually get is
"well, UTF handling is bound to be a bit slower, don't worry about it".

That's missing the point. This bug doesn't make grep 'a bit slower', it changes
it from using linear time to using quadratic time in some cases.

Here's an example, grepping through a million short lines of text:

    yes | head -999999 | LC_ALL=C time grep . > /dev/null

it runs in 0.27 seconds on my laptop.

Here's the same example, doing the same work, but in a UTF-8 locale:

    yes | head -999999 | LC_ALL=en_US.UTF-8 time grep . > /dev/null

it runs in about *40 hours* on the same laptop. It is around 500,000 times
slower than using the 'C' locale.

That's 40 hours to grep through a 2 megabyte file.

Some people have been unable to reproduce this severe slow-down. I suspect it
may be that they don't have the en_US.UTF-8 locale generated. You can find a
list of UTF-8 locales that have been generated in /etc/locale.gen.

I hope this bug can be fixed, because it really is quite a problem, especially
since the default locale in a ubuntu install is a UTF-8 one!

Note that the fedora fix referred to in the previous comment works around this
problem in the dfa code by disabling the dfa code when using a UTF-8 locale,
rather than by addressing the real root of the problem (which is that dfaexec()
in src/dfa.c does a complete scan of the remaining input buffer each time it is
called).
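
To see why "scan the remaining buffer on every call" turns linear work into quadratic work, here is a back-of-the-envelope model. It is only a sketch of the behaviour described above, not a measurement of grep: the two functions are invented for illustration and simply count bytes looked at.

```python
def bytes_scanned_rescan(line_lengths):
    """Model: each per-line call scans everything from that line to the buffer end."""
    total = 0
    remaining = sum(line_lengths)
    for n in line_lengths:
        total += remaining
        remaining -= n
    return total


def bytes_scanned_once(line_lengths):
    """Model: every byte of the buffer is looked at exactly once."""
    return sum(line_lengths)


# `yes | head -999999` produces 999999 lines of "y\n", i.e. 2 bytes each.
lines = [2] * 999_999
ratio = bytes_scanned_rescan(lines) // bytes_scanned_once(lines)
print(ratio)
```

Under this simplified model the ratio for n equal lines is (n + 1) / 2, which for the example above is of the same order as the reported slowdown. The model treats the whole input as a single buffer, which real grep does not necessarily do.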

Matt Zimmerman (mdz) wrote :

You said on IRC that you'd fixed the bug; perhaps you'd care to share your patch?

Chris Moore (dooglus) wrote :

I was mistaken. My changes certainly sped things up, but broke more than they
fixed, sorry.

Package: grep
Version: 2.5.1.ds1-5
Followup-For: Bug #181378

Hello,

I tried the gofast patch and did not find a real improvement.

However, Fedora is now using a different patch, which dramatically improves
grep performance in a UTF-8 environment.

Please find attached the following patches:
  * I put the original Fedora patches in the orig directory. The other
    patches are updated for the Debian package.
  * 64-egf-speedup.patch
    It does most of the work. Here is the explanation, according to:
    http://savannah.gnu.org/patch/?func=detailitem&item_id=3803
> The full story behind this patch is that grep-2.5.1a does not handle
> UTF-8 gracefully at all. The basic plan with handling UTF-8 in 2.5.1a
> is:
> * whenever a buffer is parsed, go through the entire buffer deciding
> how many bytes make up each character
> * use this information when necessary
>
> This patch changes that to:
> * when information about how many bytes make up a character is needed,
> work it out on demand
>
> On the face of it, this is a small obvious improvement. In fact it is
> much better than that, because the original scheme would calculate
> character lengths several times for each buffer: in fact, one full
> pass for every single potential match!

  * 65-dfa-optional.patch
    I'm not sure this one is really needed.
    I've read that the DFA algorithm is slow for UTF-8; this patch disables
    it in that case (and it can be force-enabled by setting an environment
    variable).
  * grep-2.5.1-tests.patch
    Fedora also added a test for UTF-8.
  * 66-match_icase.patch
  * 67-w.patch
    After running the new UTF-8 tests, these too seem to be needed.
    (They are not really related to grep's speed, but these patches may
    be of interest.)
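
The difference between the two plans quoted from the Savannah entry (one full pass over the buffer versus working out a character's length on demand) can be sketched like this. It is an illustration in Python, not the patch itself: all three function names are made up, and the input is assumed to be valid UTF-8.

```python
def utf8_len_from_lead(byte):
    """Length in bytes of the UTF-8 character whose first byte is `byte`."""
    if byte < 0x80:
        return 1
    if 0xC0 <= byte < 0xE0:
        return 2
    if 0xE0 <= byte < 0xF0:
        return 3
    if 0xF0 <= byte < 0xF8:
        return 4
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % byte)


def all_char_lengths(data):
    """Eager plan: walk the entire buffer and record every character length."""
    lengths, i = [], 0
    while i < len(data):
        n = utf8_len_from_lead(data[i])
        lengths.append(n)
        i += n
    return lengths


def char_span_at(data, offset):
    """On-demand plan: find the character covering data[offset] by stepping
    back over at most a few continuation bytes to its lead byte."""
    start = offset
    while start > 0 and (data[start] & 0xC0) == 0x80:
        start -= 1
    return start, start + utf8_len_from_lead(data[start])


data = "r\u00f4le".encode("utf-8")
print(all_char_lengths(data))
print(char_span_at(data, 2))
```

The on-demand version touches only a handful of bytes around the offset of interest, so calling it once per potential match costs far less than repeating the full pass each time.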

I tried a grep package with all these patches, and for the following
command:
    grep '^' /var/lib/dpkg/available > /dev/null
grep is more than 1500 times faster in a UTF-8 environment.
(On my machine, it takes less than 3/4 s instead of more than 10 minutes!)

Also, I did not notice any regression, and grep is not dramatically
slower on the C locale.

These patches may be important for Etch since the transition to UTF-8 is
mentioned on the (unofficial) Etch TODO list:
http://wiki.debian.net/?EtchTODOList

(And the French team is considering using UTF-8 for the default French
locale)

Thanks in advance,
--
Nekral

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Wed, 7 Sep 2005 11:52:34 +0200
From: Nicolas François <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: grep is extremely slow with UTF-8

--EVF5PPMfhYS0aIcm
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Package: grep
Version: 2.5.1.ds1-5
Followup-For: Bug #181378

Hello,

I tried the gofast patch, and did not find a real improvement.

However, Fedora is now using a different patch, which dramatically improves
grep performance in a UTF-8 environment.

Please find attached the following patches:
  * I put the original Fedora patches in the orig directory. The other
    patches are updated for the Debian package.
  * 64-egf-speedup.patch
    It does most of the work. Here is the explanation, according to:
    http://savannah.gnu.org/patch/?func=detailitem&item_id=3803
> The full story behind this patch is that grep-2.5.1a does not handle
> UTF-8 gracefully at all. The basic plan with handling UTF-8 in 2.5.1a
> is:
> * whenever a buffer is parsed, go through the entire buffer deciding
> how many bytes make up each character
> * use this information when necessary
>
> This patch changes that to:
> * when information about how many bytes make up a character is needed,
> work it out on demand
>
> On the face of it, this is a small obvious improvement. In fact it is
> much better than that, because the original scheme would calculate
> character lengths several times for each buffer: in fact, one full
> pass for every single potential match!

  * 65-dfa-optional.patch
    I'm not sure this one is really needed.
    I've read that the DFA algorithm is slow for UTF-8; this patch disables
    it in that case (it can be forced back on by setting an environment
    variable).
  * grep-2.5.1-tests.patch
    Fedora also added a test for UTF-8.
  * 66-match_icase.patch
  * 67-w.patch
    After running the new UTF-8 tests, these two also seem to be needed.
    (They are not really related to grep's speed, but these patches may
    be interesting.)

I tried a grep package with all these patches, and for the following
command:
    grep '^' /var/lib/dpkg/available > /dev/null
grep is more than 1500 times faster in a UTF-8 environment.
(On my machine, it takes less than 3/4 s instead of more than 10 minutes!)

Also, I did not notice any regression, and grep is not dramatically
slower on the C locale.

These patches may be important for Etch since the transition to UTF-8 is
mentioned on the (unofficial) Etch TODO list:
http://wiki.debian.net/?EtchTODOList

(And the French team is considering using UTF-8 for the default French
locale)

Thanks in advance,
--
Nekral

--EVF5PPMfhYS0aIcm
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="patches.tar.bz2"
Content-Transfer-Encoding: base64

Debian Bug Importer (debzilla) wrote :

Message-Id: <email address hidden>
Date: Sat, 10 Sep 2005 22:47:05 -0700
From: Santiago Ruano Rincon <email address hidden>
To: <email address hidden>
Subject: Bug#181378: fixed in grep 2.5.1.ds1-6

Source: grep
Source-Version: 2.5.1.ds1-6

We believe that the bug you reported is fixed in the latest version of
grep, which is due to be installed in the Debian FTP archive:

grep_2.5.1.ds1-6.diff.gz
  to pool/main/g/grep/grep_2.5.1.ds1-6.diff.gz
grep_2.5.1.ds1-6.dsc
  to pool/main/g/grep/grep_2.5.1.ds1-6.dsc
grep_2.5.1.ds1-6_i386.deb
  to pool/main/g/grep/grep_2.5.1.ds1-6_i386.deb

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to <email address hidden>,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Santiago Ruano Rincon <email address hidden> (supplier of updated grep package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing <email address hidden>)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.7
Date: Sat, 10 Sep 2005 01:52:04 -0500
Source: grep
Binary: grep
Architecture: source i386
Version: 2.5.1.ds1-6
Distribution: unstable
Urgency: low
Maintainer: Anibal Monsalve Salazar <email address hidden>
Changed-By: Santiago Ruano Rincon <email address hidden>
Description:
 grep - GNU grep, egrep and fgrep
Closes: 181378 206470 224993
Changes:
 grep (2.5.1.ds1-6) unstable; urgency=low
 .
   * 64-egf-speedup.patch, 65-dfa-optional.patch, 66-match_icase.patch,
     67-w.patch speed up grep. Thanks to Nicolas François
     <email address hidden> (Closes: #181378, #206470, #224993)
   * Deleted the CVS directories
Files:
 7797de5e94d5c6b930a29e0a7fc5e205 669 base required grep_2.5.1.ds1-6.dsc
 caea29b0505d0401fb03d0f3c5b0de75 29266 base required grep_2.5.1.ds1-6.diff.gz
 38ead74511b3423ee277778ae85c2077 172330 base required grep_2.5.1.ds1-6_i386.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFDI8JagY5NIXPNpFURAq2QAKC4AK77tJr5vlyg5sSVasgEMr49RQCgrZYI
wsu93RoCSrY292GZAvXAoTM=
=XhKe
-----END PGP SIGNATURE-----

Debian Bug Importer (debzilla) wrote :

Message-Id: <email address hidden>
Date: Sat, 10 Sep 2005 22:47:05 -0700
From: Santiago Ruano Rincon <email address hidden>
To: <email address hidden>
Subject: Bug#206470: fixed in grep 2.5.1.ds1-6

Debian Bug Importer (debzilla) wrote :

Message-Id: <email address hidden>
Date: Sat, 10 Sep 2005 22:47:05 -0700
From: Santiago Ruano Rincon <email address hidden>
To: <email address hidden>
Subject: Bug#224993: fixed in grep 2.5.1.ds1-6

Matt Zimmerman (mdz) wrote :

It would be awfully nice to have this fixed for Breezy final, but I'm wary of
touching such a core package for the sake of a performance issue. Please look
over the patches and see if you can establish the amount of risk.

Chris Moore (dooglus) wrote :

This bug really does need fixing. grep isn't just slower, it's unusable in some cases. I show a case in comment 16 above
where grep takes 40 hours to grep for a very simple pattern in a 2 megabyte file.

If you make a 2 gigabyte file instead of a 2 megabyte file (1000 times bigger), the execution time doesn't multiply by
1000, but by a million, so it will take over 4,500 years to run. Just to grep through 2 gigabytes?
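
The arithmetic behind that estimate can be checked with a few lines of
Python. The sketch assumes, as the paragraph above implies, that running
time grows with the square of the file size; the 40-hour baseline is the
figure reported for the 2 megabyte file, and the function name is made up
for this example.

```python
HOURS_PER_YEAR = 24 * 365.25


def extrapolate_hours(base_hours, size_factor, exponent):
    """Scale a measured running time by size_factor ** exponent."""
    return base_hours * size_factor ** exponent


linear_hours = extrapolate_hours(40, 1000, 1)      # if time grew linearly
quadratic_hours = extrapolate_hours(40, 1000, 2)   # if time grows with size squared
quadratic_years = quadratic_hours / HOURS_PER_YEAR

print(f"linear:    {linear_hours:,.0f} hours")
print(f"quadratic: {quadratic_hours:,.0f} hours, about {quadratic_years:,.0f} years")
```

Forty million hours is roughly 4,563 years, which matches the "over 4,500
years" figure; linear scaling would have given 40,000 hours, or about four
and a half years.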

Note that (1) UTF locales are used by default (I think) and (2) grep is used by default crontabs. This means that the
problem will be exhibited in clean installs.

I first noticed the problem when my laptop started crashing at random. It turned out that there's something wrong with it
which causes it to power down if it gets hot. It was this grep bug which was causing it to overheat - grep is CPU bound
due to this bug.

When evaluating the risk of fixing the bug, please also look at the risk of leaving grep broken.

Matt Zimmerman (mdz) wrote :

(In reply to comment #24)
> This bug really does need fixing. grep isn't just slower, it's unusable in
> some cases. I show a case in comment 16 above where grep takes 40 hours to
> grep for a very simple pattern in a 2 megabyte file.

It is a genuine bug and should be fixed. That's why we have an open bug report
about it; no need to plead that case.

> Note that (1) UTF locales are used by default (I think) and (2) grep is used
> by default crontabs. This means that the problem will be exhibited in clean
> installs.

All of this has been true in Ubuntu for nearly a year now, and I have seen no
reports of catastrophic failures, so while it certainly ought to be fixed if
possible, it does not seem like a crisis. When it comes to non-interactive
jobs, it is certainly better to have grep complete 10x more slowly than to have
it return incorrect results, so this change requires careful testing and review
even to be considered for inclusion in Breezy at such a late stage.

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Mon, 26 Sep 2005 20:04:24 +0200
From: Adeodato Simó <email address hidden>
To: <email address hidden>,
 Anibal Monsalve Salazar <email address hidden>
Cc: <email address hidden>, <email address hidden>, <email address hidden>
Subject: Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc)

found 181378 2.5.1.ds2-1
thanks

* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:

> * Removed 64-egf-speedup.patch, 65-dfa-optional.patch,
> 66-match_icase.patch and 67-w.patch from debian/patches,
> closes: #329876.

  Those patches fixed a bug (and two merged) that had been open for two
  and a half years. I think it'd be useful if you tried to contact the
  authors of the patches, and try to fix them instead of removing them?

> * Removed grep.texi from upstream tarball, 50-rgrep-info.patch and
> 51-dircategory-info.patch from debian/patches, the GNU Free
> Documentation License from debian/copyright and debian/fdl.txt,
> closes: #281647.

  Still, grep.1 remains, which (a) contains verbatim paragraphs from
  grep.texi yet (b) comes in the upstream tarball with a license notice.
  Does this mean that grep.1 is?:

    - under the GFDL, so should be removed
    - under the GPL (the general license of the tarball), despite
      sharing contents with grep.texi
    - undistributable, because it has no license attached

  Cheers,

--
Adeodato Simó
    EM: asp16 [ykwim] alu.ua.es | PK: DA6AE621

Man is certainly stark mad; he cannot make a flea, yet he makes gods by the
dozens.
                -- Michel de Montaigne

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Tue, 27 Sep 2005 09:54:17 +1000
From: Aníbal Monsalve Salazar <email address hidden>
To: <email address hidden>
Subject: grep reopen #181378, #206470, #224993

reopen 181378
reopen 206470
reopen 224993
thanks

Anibal Monsalve Salazar
--
 .''`. Debian GNU/Linux
: :' : Free Operating System
`. `' http://debian.org/
  `- http://v7w.com/anibal

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Tue, 27 Sep 2005 11:12:07 +1000
From: Aníbal Monsalve Salazar <email address hidden>
To: <email address hidden>, <email address hidden>
Subject: Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc)

On Mon, Sep 26, 2005 at 08:04:24PM +0200, Adeodato Simó wrote:
>found 181378 2.5.1.ds2-1
>thanks
>
>* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
>
>> * Removed 64-egf-speedup.patch, 65-dfa-optional.patch,
>> 66-match_icase.patch and 67-w.patch from debian/patches,
>> closes: #329876.
>
> Those patches fixed a bug (and two merged) that had been opened for 2
> and a half years. I think it'd be useful if you tried to contact the
> authors of the patches, and try to fix them instead of removing them?

Sure, the grep maintainers decided to pull them out and will go
through the patches again.

I have bcc-ed #181378.

>> * Removed grep.texi from upstream tarball, 50-rgrep-info.patch and
>> 51-dircategory-info.patch from debian/patches, the GNU Free
>> Documentation License from debian/copyright and debian/fdl.txt,
>> closes: #281647.
>
> Still, grep.1 remains, which (a) contains verbatim paragraphs from
> grep.texi yet (b) comes in the upstream tarball with a license notice.
> Does this mean that grep.1 is?:
>
> - under the GFDL, so should be removed

grep.texi is the only documentation file under the GFDL whereas
grep.1 is not.

> - under the GPL (the general license of the tarball), despite
> sharing contents with grep.texi

grep.1 is covered by the license of the tarball which is the GPL.

> - undistributable, because it has no license attached

I don't think so. If grep.1 is undistributable, so are many other
files.

grep.1 is not the only file without an explicit license. Other
files without an explicit license are:

lib/alloca.c
lib/closeout.h
lib/hard-locale.h
lib/regex.h
lib/savedir.h
lib/xstrtol.h
po/cat-id-tbl.c
src/dosbuf.c
src/getpagesize.h
src/grepmat.c
src/vms_fab.c
src/vms_fab.h
vms/config_vms.h
config.h

> Cheers,
>
>--
>Adeodato Simó
> EM: asp16 [ykwim] alu.ua.es | PK: DA6AE621
>
>Man is certainly stark mad; he cannot make a flea, yet he makes gods by the
>dozens.
> -- Michel de Montaigne

Aníbal Monsalve Salazar
--
 .''`. Debian GNU/Linux
: :' : Free Operating System
`. `' http://debian.org/
  `- http://v7w.com/anibal

Hello,

On Tue, Sep 27, 2005 at 11:12:07AM +1000, Aníbal Monsalve Salazar wrote:
> On Mon, Sep 26, 2005 at 08:04:24PM +0200, Adeodato Simó wrote:
> >found 181378 2.5.1.ds2-1
> >thanks
> >
> >* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
> >
> >> * Removed 64-egf-speedup.patch, 65-dfa-optional.patch,
> >> 66-match_icase.patch and 67-w.patch from debian/patches,
> >> closes: #329876.
> >
> > Those patches fixed a bug (and two merged) that had been opened for 2
> > and a half years. I think it'd be useful if you tried to contact the
> > authors of the patches, and try to fix them instead of removing them?
>
> Sure, the grep maintainers decided to pull out them and will go
> trough the patches again.

I wondered if I introduced this issue while porting the Fedora patches to
Debian, so I tried Fedora's grep...which has the same issue.

You can reproduce it with this simple command:
echo foobar | grep -Fw ""

This was introduced by the patch I named '64-egf-speedup.patch'

You can fix it by changing the 'while (1)' to 'while (len)' (or by
embedding this while loop in an 'if (len){...}'; I don't know whether
there is a real difference, or which way is best).
Tim Waugh, who wrote the original patches, may have a better understanding
of grep's code.

The testsuite still passes with this patch.
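
To show why a length guard matters for an empty pattern, here is a toy
whole-word search in Python. It is not grep's code and makes no claim about
what grep prints for the command above: the function, its `guard_len`
switch (standing in for 'while (len)' versus 'while (1)') and the
iteration cap are all hypothetical, written only to show a retry loop that
stops advancing when the match length is zero.

```python
def is_word_char(ch):
    """Letters, digits and underscore count as word characters."""
    return ch.isalnum() or ch == "_"


def find_whole_word(line, pattern, guard_len=True, max_iterations=10_000):
    """Toy whole-word fixed-string search; returns the match offset or -1.

    guard_len=True plays the role of `while (len)`,
    guard_len=False plays the role of `while (1)`.
    """
    length = len(pattern)
    start = line.find(pattern)
    iterations = 0
    while start != -1 and (length > 0 or not guard_len):
        iterations += 1
        if iterations > max_iterations:
            raise RuntimeError("loop is not making progress")
        end = start + length
        before_ok = start == 0 or not is_word_char(line[start - 1])
        after_ok = end == len(line) or not is_word_char(line[end])
        if before_ok and after_ok:
            return start
        # Not a whole word here: retry after this candidate. With an empty
        # pattern the candidate has zero length, so `start` never advances.
        start = line.find(pattern, start + length)
    return -1
```

In this toy, `find_whole_word("foobar", "")` returns -1 immediately, while
the same call with `guard_len=False` keeps retrying at offset 0 until the
iteration cap raises an error.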

BTW, I don't know if you received a mail I sent to <email address hidden>,
which indicated that the additional patches (which I submitted because
they helped pass the testsuite) were fixing: #209194 #218873 #226397
#238167

If you plan to re-introduce these patches, please tell me. While checking
for this issue (#329876), I've seen that there was one issue fixed in a
Fedora update, related to this patch:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=161700
I can update 64-egf-speedup.patch if you want.

Kind Regards,
--
Nekral

On Tue, Sep 27, 2005 at 11:53:41PM +0200, Nicolas François wrote:
>On Tue, Sep 27, 2005 at 11:12:07AM +1000, Aníbal Monsalve Salazar wrote:
>>On Mon, Sep 26, 2005 at 08:04:24PM +0200, Adeodato Simó wrote:
>>>* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
>>>
>>>> * Removed 64-egf-speedup.patch, 65-dfa-optional.patch,
>>>> 66-match_icase.patch and 67-w.patch from debian/patches,
>>>> closes: #329876.
>>>
>>> Those patches fixed a bug (and two merged) that had been opened for 2
>>> and a half years. I think it'd be useful if you tried to contact the
>>> authors of the patches, and try to fix them instead of removing them?
>>
>>Sure, the grep maintainers decided to pull them out and will go
>>trough the patches again.
>
>I wondered if I introduced this issue while porting the Fedora patches to
>Debian, so I tried Fedora's grep...which has the same issue.
>
>You can reproduce it with this simple command:
>echo foobar | grep -Fw ""
>
>This was introduced by the patch I named '64-egf-speedup.patch'
>
>You can fix it by changing the 'while (1)' by 'while (len)' (or by
>embedding this while loop in a 'if (len){...}', I don't know if there is a
>real difference, and what is the best way).
>Tim Waugh, who wrote the original patches, may have a better understanding
>of the grep's code.
>
>The testsuite still pass with this patch.
>
>BTW, I don't know if you received a mail I sent to <email address hidden>,
>which indicated that the additional patches (which I submitted because
>they helped passing the testsuite) were fixing: #209194 #218873 #226397
>#238167

I received it, thanks. I'll close the bugs.

>If you plan to re-introduce these patches, please tell me. While checking
>for this issue (#329876), I've seen that there was one issue fixed in a
>Fedora update, related to this patch:
>https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=161700
>I can update 64-egf-speedup.patch if you want.

Yes, please. I would like to reapply 64-egf-speedup.patch
(and 6[567]-*.patch) and an updated version will be very much
appreciated.

>Kind Regards,
>--
>Nekral

Regards,

Aníbal Monsalve Salazar
--
 .''`. Debian GNU/Linux
: :' : Free Operating System
`. `' http://debian.org/
  `- http://v7w.com/anibal

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Tue, 27 Sep 2005 23:53:41 +0200
From: Nicolas François <email address hidden>
To: Aníbal Monsalve Salazar <email address hidden>
Cc: <email address hidden>, <email address hidden>
Subject: Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc)

Hello,

On Tue, Sep 27, 2005 at 11:12:07AM +1000, Aníbal Monsalve Salazar wrote:
> On Mon, Sep 26, 2005 at 08:04:24PM +0200, Adeodato Simó wrote:
> >found 181378 2.5.1.ds2-1
> >thanks
> >
> >* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
> >
> >> * Removed 64-egf-speedup.patch, 65-dfa-optional.patch,
> >> 66-match_icase.patch and 67-w.patch from debian/patches,
> >> closes: #329876.
> >
> > Those patches fixed a bug (and two merged) that had been opened for 2
> > and a half years. I think it'd be useful if you tried to contact the
> > authors of the patches, and try to fix them instead of removing them?
>
> Sure, the grep maintainers decided to pull them out and will go
> through the patches again.

I wondered if I introduced this issue while porting the Fedora patches to
Debian, so I tried Fedora's grep...which has the same issue.

You can reproduce it with this simple command:
echo foobar | grep -Fw ""

This was introduced by the patch I named '64-egf-speedup.patch'

You can fix it by changing the 'while (1)' to 'while (len)' (or by
embedding this while loop in an 'if (len){...}'; I don't know if there is a
real difference, or which is the better way).
Tim Waugh, who wrote the original patches, may have a better understanding
of grep's code.

The test suite still passes with this patch.

BTW, I don't know if you received a mail I sent to <email address hidden>,
which indicated that the additional patches (which I submitted because
they helped pass the test suite) fix: #209194 #218873 #226397
#238167

If you plan to re-introduce these patches, please tell me. While checking
for this issue (#329876), I've seen that there was one issue fixed in a
Fedora update, related to this patch:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=161700
I can update 64-egf-speedup.patch if you want.

Kind Regards,
--
Nekral

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Wed, 28 Sep 2005 09:04:03 +1000
From: Aníbal Monsalve Salazar <email address hidden>
To: Nicolas François <email address hidden>
Cc: Santiago Ruano Rincon <email address hidden>,
 <email address hidden>, <email address hidden>
Subject: Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc)

--Affreb919SiI8I8E
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Sep 27, 2005 at 11:53:41PM +0200, Nicolas François wrote:
>On Tue, Sep 27, 2005 at 11:12:07AM +1000, Aníbal Monsalve Salazar wrote:
>>On Mon, Sep 26, 2005 at 08:04:24PM +0200, Adeodato Simó wrote:
>>>* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
>>>
>>>> * Removed 64-egf-speedup.patch, 65-dfa-optional.patch,
>>>> 66-match_icase.patch and 67-w.patch from debian/patches,
>>>> closes: #329876.
>>>
>>> Those patches fixed a bug (and two merged) that had been opened for 2
>>> and a half years. I think it'd be useful if you tried to contact the
>>> authors of the patches, and try to fix them instead of removing them?
>>
>>Sure, the grep maintainers decided to pull them out and will go
>>through the patches again.
>
>I wondered if I introduced this issue while porting the Fedora patches to
>Debian, so I tried Fedora's grep...which has the same issue.
>
>You can reproduce it with this simple command:
>echo foobar | grep -Fw ""
>
>This was introduced by the patch I named '64-egf-speedup.patch'
>
>You can fix it by changing the 'while (1)' to 'while (len)' (or by
>embedding this while loop in an 'if (len){...}'; I don't know if there is a
>real difference, or which is the better way).
>Tim Waugh, who wrote the original patches, may have a better understanding
>of grep's code.
>
>The test suite still passes with this patch.
>
>BTW, I don't know if you received a mail I sent to <email address hidden>,
>which indicated that the additional patches (which I submitted because
>they helped pass the test suite) fix: #209194 #218873 #226397
>#238167

I received it, thanks. I'll close the bugs.

>If you plan to re-introduce these patches, please tell me. While checking
>for this issue (#329876), I've seen that there was one issue fixed in a
>Fedora update, related to this patch:
>https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=161700
>I can update 64-egf-speedup.patch if you want.

Yes, please. I would like to reapply 64-egf-speedup.patch
(and 6[567]-*.patch) and an updated version will be very much
appreciated.

>Kind Regards,
>--
>Nekral

Regards,

Aníbal Monsalve Salazar
--
 .''`. Debian GNU/Linux
: :' : Free Operating System
`. `' http://debian.org/
  `- http://v7w.com/anibal

--Affreb919SiI8I8E
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFDOc/jgY5NIXPNpFURAmsvAJ93T7y8IQhwI3ftrTx4fhv1UraLggCbBrPS
iySHQbDc9S878E+6DpSj9nE=
=8BP7
-----END PGP SIGNATURE-----

--Affreb919SiI8I8E--

Hello,

Please find attached an update for the 64-egf-speedup.patch patch.
The other patches did not need to be updated and can be found in the
#181378 log.

This update is intended to fix:
echo foobar | grep -Fw ""
(which was hanging with the previous version)

and:
echo test | LC_ALL=C grep -Fw test
echo x test x | LC_ALL=C grep -Fw test

which were not working and were fixed by Tim Waugh (original author of the
patches).

I intend to mail Tim Waugh about the first issue, to check whether my fix is
correct/optimal. grep being slow on UTF-8 is not that critical, so it may be
better to wait for his answer before releasing it. I will CC the BTS.

Kind Regards,
--
Nekral

Hello,

Sorry for contacting you directly.

I'm trying to port your patch (grep-2.5.1-egf-speedup.patch) to Debian.
This patch triggered an issue when an empty pattern is used with the -Fw
options.
(see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=329876)

I tried the Fedora grep-2.5.1-48.2 binary, which suffers from the same
issue (on a Debian system, with a Debian libc and libpcre):
   echo foobar | grep -Fw ""
hangs (this could also appear with the -Fwf options when the patterns file
contains an empty line).

Changing the 'while (1)' loop to a 'while (len)' loop in search.c fixes this
issue. However, I don't know if this is correct or optimal (I don't know
what should happen if we enter the loop with len>0 and len is then
decreased to 0; maybe this should also be caught earlier).

Does it seem correct to you?

Sorry I could not check if a Red Hat system suffers from this (that's the
reason why I do not use the BTS) and thanks a lot for the impressive
speed-up of grep in a UTF-8 environment,
--
Nekral
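
The 'while (1)' versus 'while (len)' question can be illustrated with a small
C sketch. This is not the code from search.c: find_word and is_word_char are
hypothetical names, and the sketch assumes a single pattern made only of word
characters. It only shows how a loop whose progress depends on len can spin
forever when len is 0, and how guarding on len avoids that.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: "word" characters are letters, digits and '_'. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical, simplified whole-word fixed-string search (not grep's
   search.c).  Returns the offset of the first whole-word occurrence of
   pat in buf, or -1 if there is none.  Assumes pat contains only word
   characters, so a rejected candidate can be skipped entirely. */
static long find_word(const char *buf, size_t size, const char *pat, size_t len)
{
    size_t pos = 0;

    assert(buf != NULL && pat != NULL);

    /* With 'while (1)', an empty pattern (len == 0) "matches" at pos 0,
       is rejected when a word character follows, and 'pos += len' then
       never advances, so the loop never ends.  'while (len)' skips the
       loop altogether in that case. */
    while (len) {
        if (pos + len > size)
            break;
        if (memcmp(buf + pos, pat, len) == 0) {
            int left_ok = pos == 0
                          || !is_word_char((unsigned char) buf[pos - 1]);
            int right_ok = pos + len == size
                           || !is_word_char((unsigned char) buf[pos + len]);
            if (left_ok && right_ok)
                return (long) pos;
            pos += len;     /* makes no progress when len == 0 */
        } else {
            pos += 1;
        }
    }
    return -1;
}
```

In this sketch an empty pattern simply reports no match; whether that is the
right answer for grep itself is a separate question from avoiding the hang.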

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Wed, 28 Sep 2005 12:58:50 +0200
From: Nicolas François <email address hidden>
To: <email address hidden>
Cc: Santiago Ruano Rincon <email address hidden>
Subject: update for 64-egf-speedup.patch

--SUOF0GtieIMvvwua
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

--SUOF0GtieIMvvwua
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="64-egf-speedup.patch"

--- src/search.c.orig 2005-09-06 20:53:35.000000000 +0200
+++ src/search.c 2005-09-06 22:12:36.000000000 +0200
@@ -18,9 +18,13 @@

 /* Written August 1992 by Mike Haertel. */

+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE 1
+#endif
 #ifdef HAVE_CONFIG_H
 # include <config.h>
 #endif
+#include <assert.h>
 #include <sys/types.h>
 #if defined HAVE_WCTYPE_H && defined HAVE_WCHAR_H && defined HAVE_MBRTOWC
 /* We can handle multibyte string. */
@@ -39,6 +43,9 @@
 #ifdef HAVE_LIBPCRE
 # include <pcre.h>
 #endif
+#ifdef HAVE_LANGINFO_CODESET
+# include <langinfo.h>
+#endif

 #define NCHAR (UCHAR_MAX + 1)

@@ -70,9 +77,10 @@
    call the regexp matcher at all. */
 static int kwset_exact_matches;

-#if defined(MBS_SUPPORT)
-static char* check_multibyte_string PARAMS ((char const *buf, size_t size));
-#endif
+/* UTF-8 encoding allows some optimizations that we can't otherwise
+ assume in a multibyte encoding. */
+static int using_utf8;
+
 static void kwsinit PARAMS ((void));
 static void kwsmusts PARAMS ((void));
 static void Gcompile PARAMS ((char const *, size_t));
@@ -84,6 +92,15 @@
 static size_t Pexecute PARAMS ((char const *, size_t, size_t *, int));

 void
+check_utf8 (void)
+{
+#ifdef HAVE_LANGINFO_CODESET
+ if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
+ using_utf8 = 1;
+#endif
+}
+
+void
 dfaerror (char const *mesg)
 {
   error (2, 0, mesg);
@@ -141,47 +158,6 @@
     }
 }

-#ifdef MBS_SUPPORT
-/* This function allocate the array which correspond to "buf".
- Then this check multibyte string and mark on the positions which
- are not singlebyte character nor the first byte of a multibyte
- character. Caller must free the array. */
-static char*
-check_multibyte_string(char const *buf, size_t size)
-{
- char *mb_properties = malloc(size);
- mbstate_t cur_state;
- wchar_t wc;
- int i;
- memset(&cur_state, 0, sizeof(mbstate_t));
- memset(mb_properties, 0, sizeof(char)*size);
- for (i = 0; i < size ;)
- {
- size_t mbclen;
- mbclen = mbrtowc(&wc, buf + i, size...
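
The hunks above replace a per-buffer mbrtowc scan with a using_utf8 flag set
from nl_langinfo (CODESET). The rest of the patch is truncated here, so the
following C sketch is only a guess at the kind of shortcut the "UTF-8 encoding
allows some optimizations" comment refers to. codeset_is_utf8, locale_is_utf8
and utf8_is_char_start are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the same comparison as the check_utf8 hunk, but on
   a codeset name passed in, so it can be tried without changing locale. */
static int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
}

/* Queries the current locale through nl_langinfo (CODESET), as the patch
   does.  The caller is assumed to have called setlocale (LC_ALL, "")
   earlier; without that the program runs in the "C" locale. */
static int locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}

/* In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so
   whether offset i starts a character can be read from buf[i] alone,
   without decoding the buffer from its beginning. */
static int utf8_is_char_start(const char *buf, size_t size, size_t i)
{
    assert(buf != NULL && i < size);
    return ((unsigned char) buf[i] & 0xC0) != 0x80;
}
```

A general multibyte encoding gives no such guarantee, which is why the removed
check_multibyte_string had to walk the buffer with mbrtowc to find character
boundaries.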

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Wed, 28 Sep 2005 13:26:27 +0200
From: Nicolas François <email address hidden>
To: <email address hidden>
Cc: <email address hidden>
Subject: grep hanging with -Fw and an empty pattern

On Wed, Sep 28, 2005 at 01:26:27PM +0200, Nicolas François wrote:

> Changing the 'while (1)' loop to a 'while (len)' loop in search.c fixes this
> issue. However, I don't know if this is correct or optimal (I don't know
> what should happen if we enter the loop with len>0 and len is then
> decreased to 0; maybe this should also be caught earlier).
>
> Does it seem correct to you?

Yes, looks correct to me. Thanks.

Tim.
*/

Debian Bug Importer (debzilla) wrote :

Message-ID: <email address hidden>
Date: Thu, 29 Sep 2005 13:22:39 +0100
From: Tim Waugh <email address hidden>
To: Nicolas François <email address hidden>
Cc: <email address hidden>
Subject: Re: grep hanging with -Fw and an empty pattern

--5sJP8czQtkNNPwfl
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

--5sJP8czQtkNNPwfl
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iD8DBQFDO9yPLF+LYaF94FERAifjAKCnEEfsQ5Va46R1UZGe2sSp4xbxiwCfSI1X
9RQwtdl2S4kiAbGsTTMX7ks=
=yWGN
-----END PGP SIGNATURE-----

--5sJP8czQtkNNPwfl--

Source: grep
Source-Version: 2.5.1.ds2-2

We believe that the bug you reported is fixed in the latest version of
grep, which is due to be installed in the Debian FTP archive:

grep_2.5.1.ds2-2.diff.gz
  to pool/main/g/grep/grep_2.5.1.ds2-2.diff.gz
grep_2.5.1.ds2-2.dsc
  to pool/main/g/grep/grep_2.5.1.ds2-2.dsc
grep_2.5.1.ds2-2_alpha.deb
  to pool/main/g/grep/grep_2.5.1.ds2-2_alpha.deb
grep_2.5.1.ds2-2_i386.deb
  to pool/main/g/grep/grep_2.5.1.ds2-2_i386.deb
grep_2.5.1.ds2-2_sparc.deb
  to pool/main/g/grep/grep_2.5.1.ds2-2_sparc.deb

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to <email address hidden>,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Anibal Monsalve Salazar <email address hidden> (supplier of updated grep package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing <email address hidden>)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.7
Date: Wed, 26 Oct 2005 19:14:35 +1000
Source: grep
Binary: grep
Architecture: source i386 alpha sparc
Version: 2.5.1.ds2-2
Distribution: unstable
Urgency: low
Maintainer: Anibal Monsalve Salazar <email address hidden>
Changed-By: Anibal Monsalve Salazar <email address hidden>
Description:
 grep - GNU grep, egrep and fgrep
Closes: 181378 206470 224993 240239 257900 267718 284676
Changes:
 grep (2.5.1.ds2-2) unstable; urgency=low
 .
   * Patched 64-egf-speedup.patch with patch from Nicolas François
     <email address hidden>. Put 64-egf-speedup.patch,
     65-dfa-optional.patch, 66-match_icase.patch and 67-w.patch back
     in, closes: #181378, #206470, #224993.
   * Fixed "minor documentation syntax error", closes: #240239,
     #257900. Patches by Allard Hoeve <email address hidden> and Derrick
     'dman' Hudson <email address hidden>.
   * Fixed "info page not in main info menu", closes: #284676,
     #267718. Patches by Rui Tiago Cação Matos
     <email address hidden> and Paul Brook <email address hidden>.
Files:
 88b2af4b3578729420158583be03731f 660 utils required grep_2.5.1.ds2-2.dsc
 14e96467e8623210c797ec104ed9e3b2 21354 utils required grep_2.5.1.ds2-2.diff.gz
 e69a3fbbab86633594273203f7f2207e 139112 utils required grep_2.5.1.ds2-2_i386.deb
 76128b684a7deac71454c5f6b5697345 140514 utils required grep_2.5.1.ds2-2_sparc.deb
 01da865bef322c130f6f46abad86d1f9 147868 utils required grep_2.5.1.ds2-2_alpha.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iD8DBQFDX1MXipBneRiAKDwRAkE4AKCuQ7V6POyqk3uqYL4c5ifTHLtu6ACdHk7e
Kowqh+yG6VdaC2w+ve8bhyc=
=sBND
-----END PGP SIGNATURE-----

Debian Bug Importer (debzilla) wrote :

Message-Id: <email address hidden>
Date: Wed, 26 Oct 2005 03:02:09 -0700
From: Anibal Monsalve Salazar <email address hidden>
To: <email address hidden>
Subject: Bug#181378: fixed in grep 2.5.1.ds2-2

Debian Bug Importer (debzilla) wrote :

Message-Id: <email address hidden>
Date: Wed, 26 Oct 2005 03:02:09 -0700
From: Anibal Monsalve Salazar <email address hidden>
To: <email address hidden>
Subject: Bug#206470: fixed in grep 2.5.1.ds2-2

Debian Bug Importer (debzilla) wrote :

Message-Id: <email address hidden>
Date: Wed, 26 Oct 2005 03:02:09 -0700
From: Anibal Monsalve Salazar <email address hidden>
To: <email address hidden>
Subject: Bug#224993: fixed in grep 2.5.1.ds2-2

Matt Zimmerman (mdz) wrote :

*** Bug 24902 has been marked as a duplicate of this bug. ***

Daniel Robitaille (robitaille) wrote :

According to the Debian bug report, this has been fixed since October. Dapper seems to contain a version newer than the fixed one in Debian. I also did some tests in both Breezy and Dapper, and grep seems normally fast: 0m0.051s to search a 270k text document, for example.

Chris Moore (dooglus) wrote :

Daniel, I'm not sure whether your comment is saying that grep is normally fast in both dapper and breezy, but it sounds like it might.

The bug does seem to have been fixed in dapper - grep with a UTF-8 locale in dapper now runs about 3 or 4 times slower than without a UTF-8 locale, whatever the size of the input file. This is relatively OK - grepping through 10 million lines takes 9.5 seconds instead of 2.1 seconds:

    (dapper) $ yes | head -9999999 | LC_ALL=en_UR.UTF-8 time grep . > /dev/null
    2.12user 0.01system 0:02.73elapsed [...]

    (dapper) $ yes | head -9999999 | LC_ALL=en_US.UTF-8 time grep . > /dev/null
    9.54user 0.03system 0:11.27elapsed [...]

On breezy however the bug remains. 'grep' isn't a fixed 3 or 4 times slower, but is slower by a factor that is proportional to the size of the input file. This means that grep on breezy in UTF-8 locale still runs in quadratic time, making it impossible to run grep on large files in some cases. This is not OK - grepping the same 10 million lines takes 20 weeks instead of 2.2 seconds:

    (breezy) $ yes | head -9999999 | LC_ALL=en_UR.UTF-8 time grep . > /dev/null
    2.21user 0.01system 0:02.80elapsed [...]

    (breezy) $ yes | head -9999999 | LC_ALL=en_US.UTF-8 time grep . > /dev/null
    this didn't finish yet, but will take something in the region of 20 WEEKS to run

In summary: fixed in dapper, still broken in breezy.
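
For illustration, here is a minimal sketch of how this kind of quadratic behaviour can arise. It assumes (the timings above do not confirm this) that the slow version re-examines the rest of the buffer for every line, while the fixed version looks at each line once. The function names are hypothetical and this is not grep's actual code; it only counts how many bytes each strategy would touch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model, not grep's code: bytes touched when every line
   triggers a rescan from the start of that line to the end of the buffer. */
unsigned long long
bytes_touched_rescan (const char *buf, size_t size)
{
  unsigned long long work = 0;
  const char *p = buf;
  const char *end = buf + size;

  assert (buf != NULL);
  while (p < end)
    {
      const char *nl = memchr (p, '\n', (size_t) (end - p));
      work += (unsigned long long) (end - p);   /* whole remainder */
      p = nl ? nl + 1 : end;
    }
  return work;
}

/* Bytes touched when each line is examined exactly once. */
unsigned long long
bytes_touched_once (const char *buf, size_t size)
{
  unsigned long long work = 0;
  const char *p = buf;
  const char *end = buf + size;

  assert (buf != NULL);
  while (p < end)
    {
      const char *nl = memchr (p, '\n', (size_t) (end - p));
      const char *next = nl ? nl + 1 : end;
      work += (unsigned long long) (next - p);  /* this line only */
      p = next;
    }
  return work;
}
```

For n two-byte lines such as the output of yes, the first function returns n(n+1) and the second returns 2n, so the slowdown factor grows linearly with the input size, which is the pattern seen on breezy.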

Daniel Robitaille (robitaille) wrote :

If it appears that it is fixed in Dapper, maybe it will be time to finally close this old bug.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

 affects /distros/ubuntu
 status fixreleased
 done

(Hmm, let's try that again, shall we?)

Daniel Robitaille writes ("[Bug 7906] grep is extremely slow with UTF-8"):
> If it appears that it is fixed in Dapper, maybe it will be time to
> finally close this old bug.

Indeed, thanks.

Ian.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.6 <http://mailcrypt.sourceforge.net/>

iD8DBQFD1kX+8jyP9GfyNQARAn4yAJ475vwjGaHDuiX9jgGDNjN13MOJKACdEaPI
XB5tsC4/uRHWn6owfK9WQ7c=
=X8SA
-----END PGP SIGNATURE-----

Changed in grep:
status: Unconfirmed → Fix Released
Daniel Robitaille (robitaille) wrote :

Fixed in Debian

Changed in grep:
status: Unconfirmed → Fix Released

unarchive 181378
reopen 181378
found 181378 2.5.3~dfsg-2
thanks

Aníbal Monsalve Salazar
--
http://v7w.com/anibal

Changed in grep:
status: Fix Released → New

# Automatically generated email from bts, devscripts version 2.10.7
tags 181378 - patch

It seems that etch version 2.5.1.ds2-6 is not slow, but 2.5.3~dfsg-2 is
very slow with UTF-8 (I'm on i386)

# Automatically generated email from bts, devscripts version 2.10.7
fixed 181378 2.5.1.ds2-6

# Automatically generated email from bts, devscripts version 2.10.7
found 181378 2.5.3~dfsg-1

tags 181378 patch
forcemerge 181378 442882
thanks

Hello,

Please find attached updated patches for grep-2.5.3~dfsg:
 * 64-egf-speedup.patch
   This provides the speedup when the DFA algorithm is not used.
   However, the DFA algorithm is used for most grep executions, so there
   is no speed improvement unless 65-dfa-optional.patch is also applied.
 * 65-dfa-optional.patch
   This disables the DFA algorithm, which can be very slow in UTF-8
   environments. The DFA algorithm can be enabled with an environment
   variable.
   (This patch is not valid if 64-egf-speedup.patch is not applied)

These two patches are tightly coupled and must be applied together.

There also used to be two other patches in 2.5.1, which improve the
results of the grep testsuite:
 * 66-match_icase.patch
   This patch fixes some usage of the -i option.
   It could probably be applied without the previous patches.

 * 67-w.patch
   This patch fixes the -w option.
   This probably fixes issues introduced by the first two patches.

I tried to add a few comments in the header of the patches.

With the 4 patches applied, 3 tests fail in the grep testsuite, but the
results are better than an unpatched upstream.

It could be nice to have a patch to split the testsuite into two categories:
known working test cases and known broken test cases (e.g. in the spencer1
testsuite, I don't expect the handling of case insensitive matches for non
latin characters to be fixed in the near future).
This would allow running the testsuite at build time and detecting
regressions in later uploads.
Too many test cases/sub cases currently fail for the testsuite to be
useful at build time.
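
A minimal sketch of how such a split could be evaluated, assuming each test carries a "known broken" flag. The type and function names are hypothetical and nothing like this exists in the grep testsuite as far as this message is concerned; the PASS/FAIL/XFAIL/XPASS naming is simply the usual convention for expected failures.

```c
#include <assert.h>
#include <stddef.h>

enum outcome { PASS, FAIL, XFAIL, XPASS };

struct test_result
{
  const char *name;   /* hypothetical test identifier */
  int known_broken;   /* nonzero if listed as a known broken test */
  int passed;         /* nonzero if the test passed in this run */
};

/* Map one result to its category. */
enum outcome
classify (const struct test_result *t)
{
  assert (t != NULL);
  if (t->known_broken)
    return t->passed ? XPASS : XFAIL;
  return t->passed ? PASS : FAIL;
}

/* Only failures of known working tests count as regressions. */
size_t
count_regressions (const struct test_result *results, size_t n)
{
  size_t i, regressions = 0;

  for (i = 0; i < n; i++)
    if (classify (&results[i]) == FAIL)
      regressions++;
  return regressions;
}
```

A build could then fail only when count_regressions is nonzero, while known broken tests are still run and reported.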

I'm also concerned about the maintainability of these patches.
I will try to reduce their size and comment them, but do not wait for this
before an upload (I won't have time in the next two weeks).

With these 4 patches applied, there are probably a few bugs in the BTS
which can be closed (obviously the "grep too slow" bugs, but you should
also check if the locale dependent bugs (or the bugs which involve the -i
or -w options) are still reproducible)

I will subscribe to the PTS for grep, but do not hesitate to ping me if
these patches broke grep.

Kind Regards,
--
Nekral


tags 181378 + pending
tags 442882 + pending
thanks

Hi,

The following is the diff for the grep 2.5.3~dfsg-2.1 NMU with
the patches by Nicolas François. As per Santiago's mail to d-devel[1]
and the apparently only inadvertently lowered severity, I'm NMUing this
as an RC bug. According to my tests, the new version resolves the
regressions reported against grep 2.5.3~dfsg-1.

Kind regards

T.

1. http://lists.debian.org/debian-devel/2007/09/msg00946.html

--
Thomas Viehmann, <email address hidden>

diff -u grep-2.5.3~dfsg/debian/changelog grep-2.5.3~dfsg/debian/changelog
--- grep-2.5.3~dfsg/debian/changelog
+++ grep-2.5.3~dfsg/debian/changelog
@@ -1,3 +1,11 @@
+grep (2.5.3~dfsg-2.1) unstable; urgency=high
+
+ * Non-maintainer upload.
+ * Reinstate patches by Nicolas François <email address hidden>
+ Closes: #181378, #442882
+
+ -- Thomas Viehmann <email address hidden> Tue, 02 Oct 2007 23:02:35 +0200
+
 grep (2.5.3~dfsg-2) unstable; urgency=low

   * Removed 65-dfa-optional.patch. (Closes: #439827, #440195, #440342)
only in patch2:
unchanged:
--- grep-2.5.3~dfsg.orig/debian/patches/64-egf-speedup.patch
+++ grep-2.5.3~dfsg/debian/patches/64-egf-speedup.patch
@@ -0,0 +1,792 @@
+--- src/search.c.orig
++++ src/search.c
+@@ -18,10 +18,15 @@
+
+ /* Written August 1992 by Mike Haertel. */
+
++#ifndef _GNU_SOURCE
++# define _GNU_SOURCE 1
++#endif
+ #ifdef HAVE_CONFIG_H
+ # include <config.h>
+ #endif
+
++#include <assert.h>
++
+ #include <sys/types.h>
+
+ #include "mbsupport.h"
+@@ -43,6 +48,9 @@
+ #ifdef HAVE_LIBPCRE
+ # include <pcre.h>
+ #endif
++#ifdef HAVE_LANGINFO_CODESET
++# include <langinfo.h>
++#endif
+
+ #define NCHAR (UCHAR_MAX + 1)
+
+@@ -68,6 +76,19 @@
+ error (2, 0, _("memory exhausted"));
+ }
+
++/* UTF-8 encoding allows some optimizations that we can't otherwise
++ assume in a multibyte encoding. */
++static int using_utf8;
++
++void
++check_utf8 (void)
++{
++#ifdef HAVE_LANGINFO_CODESET
++ if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
++ using_utf8 = 1;
++#endif
++}
++
+ #ifndef FGREP_PROGRAM
+ /* DFA compiled regexp. */
+ static struct dfa dfa;
+@@ -134,49 +155,6 @@
+ }
+ #endif /* !FGREP_PROGRAM */
+
+-#ifdef MBS_SUPPORT
+-/* This function allocate the array which correspond to "buf".
+- Then this check multibyte string and mark on the positions which
+- are not single byte character nor the first byte of a multibyte
+- character. Caller must free the array. */
+-static char*
+-check_multibyte_string(char const *buf, size_t size)
+-{
+- char *mb_properties = xmalloc(size);
+- mbstate_t cur_state;
+- wchar_t wc;
+- int i;
+-
+- memset(&cur_state, 0, sizeof(mbstate_t));
+- memset(mb_properties, 0, sizeof(char)*size);
+-
+- for (i = 0; i < size ;)
+- {
+- size_t mbclen;
+- mbclen = mbrtowc(&wc, buf + i, size - i, &cur_state);
+-
+- if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
+- {
+- /* An invalid sequence, or a truncated multibyte character.
+- We treat it as a single byte character. */
+- mbclen = 1;
+- }
+- else if (match_icase)
+- {
+- if (iswupper((wint_t)wc))
+- {
+- wc = towlower((wint_t)wc);
+- wcrtomb(buf + i, wc,...
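
The check_utf8 helper added in the diff above compares nl_langinfo (CODESET) with "UTF-8". Below is a standalone sketch of the same check, plus one UTF-8 property of the kind the patch comment alludes to: a character start can be recognised from a single byte. The function names are hypothetical, and the truncated diff does not show which UTF-8 properties the patch actually relies on, so the last helper is only an assumption about the sort of shortcut involved.

```c
#include <assert.h>
#include <langinfo.h>
#include <string.h>

/* Pure helper: does this codeset name denote UTF-8? */
int
codeset_is_utf8 (const char *codeset)
{
  assert (codeset != NULL);
  return strcmp (codeset, "UTF-8") == 0;
}

/* Query the codeset of the current locale (POSIX nl_langinfo).
   The caller is expected to have called setlocale beforehand. */
int
current_locale_is_utf8 (void)
{
  return codeset_is_utf8 (nl_langinfo (CODESET));
}

/* In UTF-8, continuation bytes always have the form 10xxxxxx, so any
   other byte begins a character.  No scan from the start of the buffer
   is needed to answer this for a given position. */
int
utf8_is_char_start (unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}
```

In other multibyte encodings a byte value alone does not generally tell whether it starts a character, which is why a full pass like the removed check_multibyte_string is needed there.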

Changed in grep:
status: New → Fix Committed
Changed in grep:
status: Fix Committed → Fix Released