grep is extremely slow with UTF-8
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
grep (Debian) | Fix Released | Unknown | |
grep (Ubuntu) | Fix Released | Medium | Ian Jackson |
Bug Description
Automatically imported from Debian bug report #181378 http://
In Debian Bug tracker #181378, Michel Daenzer (daenzer) wrote : Depends on locale | #1 |
In Debian Bug tracker #181378, H. S. Teoh (hsteoh-quickfur) wrote : Identical bugs | #2 |
merge 181378 206470
thanks
These bugs appear to be the same (see latest messages in #181378).
As for the bugs themselves, could it be that the problem is caused by grep
localizing every input character, as opposed to localizing the regex and
then matching the resulting bytes? I haven't looked at the code to be
sure, but this is what immediately came to mind when I read about the
LC_CTYPE=C speed difference.
Translating every input character would, indeed, slow things down a lot. A
better alternative would be to localize the regex, match on a byte-by-byte
basis, and then localize the output only if it matches. However, this may
have pathological problems if multiple representations of the same
character are possible (e.g. Unicode combining diacritics vs. precomposed
characters). I'm not sure what the solution would be in this case.
T
--
If you look at a thing nine hundred and ninety-nine times, you are perfectly
safe; if you look at it the thousandth time, you are in frightful danger of
seeing it for the first time. -- G. K. Chesterton
In Debian Bug tracker #181378, jidanni (dan-jacobson) wrote : | #3 |
severity 206470 important
In Debian Bug tracker #181378, Michel Daenzer (daenzer) wrote : patch | #4 |
This patch seems to help; I extracted it from the src rpm at
http://
and tweaked one hunk for it to apply.
--
Earthling Michel Dänzer | Debian (powerpc), X and DRI developer
Software libre enthusiast | http://
In Debian Bug tracker #181378, Roland Illig (roland-illig) wrote : grep: ... and Perl is a thousand times faster ... | #5 |
Package: grep
Version: 2.5.1.ds1-2
Severity: normal
Followup-For: Bug #181378
I found the magic frontier of grep: DFAs with 1024 states.
Please make grep a little quicker or replace it completely with pcre or
another fast implementation, as far as POSIX allows it.
$ time egrep .\{1024,\} debug | wc
109 17987 169422
real 1m47.522s
user 1m28.960s
sys 0m3.680s
$ time egrep .\{1023,\} debug | wc
109 17987 169422
real 0m1.074s
user 0m0.940s
sys 0m0.100s
$ time perl -ne '/.{1024,}/ and print' debug | wc
109 17987 169422
real 0m0.077s
user 0m0.070s
sys 0m0.000s
-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux wwid 2.4.22-1-k7 #5 Sat Oct 4 14:11:12 EST 2003 i686
Locale: LANG=de_
Versions of packages grep depends on:
ii libc6 2.3.2.ds1-11 GNU C Library: Shared libraries an
-- no debconf information
In Debian Bug tracker #181378, Peter Moulder (peter-moulder) wrote : perl not fair comparison: perl gets "wrong" answer for utf-8 text | #6 |
Suppose UTF-8 LC_CTYPE.
$ (echo rôle; echo role) | grep 'r.le'
rôle
role
$ (echo rôle; echo role) | perl -ne '/r.le/ and print'
role
$ (echo rôle; echo role) | grep 'r..le'
$ (echo rôle; echo role) | perl -ne '/r..le/ and print'
rôle
(This is with perl_5.8.3-3, grep_2.5.1.ds1-2.)
Perl is using octet/byte regexps, whereas grep is using character
regexps. Although arguable, I believe users would prefer grep's
behaviour (other than its speed).
I believe a better solution would be for grep to convert the character
regexp to an octet regexp. E.g. the character regexp "." (which I'll assume
for simplicity matches any character) might be translated to
(?:[\x00-
That translation assumes that an accented character formed by
composition is to be considered distinct from a single unicode character
(H. S. Teoh's example above). I'm not familiar with the unicode spec.
Maybe it's reasonable to consider them different. Otherwise, I believe
the translate-
longer translations.
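A minimal Python sketch of the character-to-octet translation idea follows. The byte-level expansion of "." used here is one standard pattern for a well-formed UTF-8 character, assumed for illustration; it is not necessarily the expression that is cut off above. The names `UTF8_CHAR`, `char_regex_to_bytes` and `matches` are hypothetical and are not part of grep or Perl. Only literal characters and "." are handled.

```python
import re

# One well-formed UTF-8 encoded character as a byte-level pattern.
# Assumption: an illustrative expansion, not the one from the original post.
# Note that, unlike grep's ".", the ASCII branch also accepts a newline.
UTF8_CHAR = (
    rb"(?:[\x00-\x7f]"                       # 1 byte: ASCII
    rb"|[\xc2-\xdf][\x80-\xbf]"              # 2 bytes
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"          # 3 bytes, excluding overlong forms
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"   # 3 bytes
    rb"|\xed[\x80-\x9f][\x80-\xbf]"          # 3 bytes, excluding surrogates
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"       # 4 bytes, excluding overlong forms
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"           # 4 bytes
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2})"      # 4 bytes, up to U+10FFFF
)


def char_regex_to_bytes(pattern: str) -> bytes:
    """Translate a toy character regexp into a byte regexp.

    Only literal characters and "." are supported in this sketch.
    """
    parts = []
    for ch in pattern:
        if ch == ".":
            parts.append(UTF8_CHAR)
        else:
            parts.append(re.escape(ch.encode("utf-8")))
    return b"".join(parts)


def matches(pattern: str, text: str) -> bool:
    """Search the UTF-8 bytes of text with the translated byte regexp."""
    byte_pattern = char_regex_to_bytes(pattern)
    return re.search(byte_pattern, text.encode("utf-8")) is not None


if __name__ == "__main__":
    for pat in ("r.le", "r..le"):
        for word in ("r\u00f4le", "role"):
            print(pat, word, matches(pat, word))
```

With this translation, "r.le" matches both words and "r..le" matches neither, which is the character-oriented behaviour shown for grep above, while the matching itself runs purely on bytes.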
However, I wonder if the problem is just that the conversion of the
input stream to wchars is inefficient. Off hand, I don't see why it
should make things so much slower.
pjrm.
In Debian Bug tracker #181378, Peter Moulder (peter-moulder) wrote : gprof; combining diacritical marks; octet regexp conversion | #7 |
According to gprof on grep compiled with -pg (without installing libc6-prof),
~all the time is spent in check_multibyte
In the case of utf-8, we don't need the return value of
check_multibyte
characters (ignoring the composition case) have (c & 0xc0) == 0x80 (i.e. bit7=1, bit6=0).
To handle the composition case, we add a test that the wide character
is a combining diacritical: "|| (c == 0xcc) || ((c == 0xcd) && (nextchar
<= 0xaf))".
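The two byte tests described above can be sketched in Python as follows. This is only an illustration of the bit tests quoted in the message, not grep's code; the function names are hypothetical. The combining-mark test relies on the fact that U+0300..U+036F encode in UTF-8 as 0xCC 0x80..0xBF and 0xCD 0x80..0xAF.

```python
def is_continuation_byte(b: int) -> bool:
    """True for UTF-8 continuation bytes: bit 7 set, bit 6 clear."""
    return (b & 0xC0) == 0x80


def starts_combining_diacritical(buf: bytes, i: int) -> bool:
    """True if buf[i] begins a character in U+0300..U+036F."""
    c = buf[i]
    if c == 0xCC:
        # 0xCC 0x80..0xBF covers U+0300..U+033F.
        return True
    # 0xCD 0x80..0xAF covers U+0340..U+036F.
    return c == 0xCD and i + 1 < len(buf) and buf[i + 1] <= 0xAF


def is_match_boundary(buf: bytes, i: int) -> bool:
    """Hypothetical helper: may a match start or end at byte offset i?"""
    if i >= len(buf):
        return True  # end of buffer is always a boundary
    if is_continuation_byte(buf[i]):
        return False  # middle of a multi-byte character
    return not starts_combining_diacritical(buf, i)
```

For the decomposed sequence `b"e\xcc\x80x"`, offsets 0 and 3 are boundaries, while offsets 1 (start of the combining grave accent) and 2 (its continuation byte) are not.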
Combining diacritical marks are a hassle for grep. According to
http://
be followed by more than one combining diacritical mark character.
Presumably, order doesn't matter, so grep 'a<string of n combining
diacritical mark characters>' can match n factorial different strings in
the haystack text, without counting use of precomposed characters.
The simplest way of handling this would be to convert to a canonical
form, say decomposed form with combining diacritical marks in sorted
order.
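A small Python sketch of the canonical-form idea, using the standard library's `unicodedata.normalize` with the NFD form (canonical decomposition followed by canonical ordering of combining marks). This is an illustration of the approach, not something grep does; `canonical` and `contains` are hypothetical helper names.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return the canonically decomposed (NFD) form of text."""
    return unicodedata.normalize("NFD", text)


def contains(needle: str, haystack: str) -> bool:
    """Substring search after bringing both sides to the same canonical form."""
    return canonical(needle) in canonical(haystack)


if __name__ == "__main__":
    # Precomposed U+00E8 and "e" + U+0300 compare equal after normalization.
    print(contains("\u00e8", "e\u0300\n"))
    # Marks of different combining classes (acute above, dot below)
    # are put into one fixed order.
    print(canonical("a\u0301\u0323") == canonical("a\u0323\u0301"))
    # Marks of the same class (acute, grave) keep their relative order.
    print(canonical("a\u0301\u0300") == canonical("a\u0300\u0301"))
```

In Unicode's canonical ordering, only marks with different combining classes are reordered; two marks of the same class stay in their original order and remain distinct sequences. A plain substring search on the normalized text would also still find a bare `o` inside a decomposed `ô`, so this sketch does not address that case.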
Note that I haven't checked the unicode standards on this point:
possibly order is to be considered significant, in which case the only
possible matches are decomposed vs use of precomposed character. This
would make the convert-
Another issue with decomposable characters is that we must use negative
lookahead tests: if searching for `o' then we must check that the
matched 'o' isn't followed by a combining diacritical mark character.
The alternative of canonicalizing to precomposed form instead of
decomposed form has its own expense: if there are 112 possible
diacritical mark characters, and characters can be followed by an
arbitrary selection of those, then we need an extra 112 bits per
canonical character to represent those. And even that only presence or
absence (rather than number or order) is significant for combining
diacritical marks, i.e. it assumes that e<macron>
considered equivalent to e<acute><macron>. If number and order are
significant then no finite number of bits suffices.
Note that grep doesn't currently handle combining diacritical marks:
$ printf 'e\xcc\x80\n'
è
$ printf 'e\xcc\x80\n' | grep 'è'
<nothing>
$ printf 'e\xcc\x80\n' | grep '^.$'
<nothing>
More remarks on the idea of converting character regexps to byte regexps
(see previous message).
First, note that it works for UTF-8, but not e.g. GBK, precisely because
in UTF-8 one can't mistake the middle of a character for the beginning
of one. E.g. in GBK, the string for 我我 contains the string for 椅;
there is no way to tell where the character beginnings are short of
something like what grep is already doing.
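The GBK example can be checked with a few lines of Python, assuming the interpreter's built-in `gbk` codec; `contains_encoded` is a hypothetical helper name.

```python
def contains_encoded(needle: str, haystack: str, encoding: str) -> bool:
    """Byte-level substring test on the encoded forms of both strings."""
    return needle.encode(encoding) in haystack.encode(encoding)


if __name__ == "__main__":
    wo_wo = "\u6211\u6211"  # the two-character string from the example
    yi = "\u6905"           # the single character found "inside" it
    # In GBK the second byte of one character plus the first byte of the
    # next one spell a different character.
    print(contains_encoded(yi, wo_wo, "gbk"))
    # In UTF-8 a lead byte can never be mistaken for a continuation byte.
    print(contains_encoded(yi, wo_wo, "utf-8"))
```

The first test finds a false byte-level match across a character boundary; the second does not, which is why a byte-oriented search is safe for UTF-8 but not for GBK.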
Is it worth adding special code for UTF-8 (probably sharable with other
UTF encodings) if we still need something like the current code to
handle GBK and other multi-byte encodings? Well, UTF-8 is likely to
become the primary encoding on Debian systems.
The example translation given for "." may be discouraging in its
complexity. However, we should note that many, perhaps most, common
regexps don't need any translation at all (ignoring chara...
In Debian Bug tracker #181378, jidanni (dan-jacobson) wrote : LC_CTYPE makes grep 800 times slower! | #8 |
reassign 224993 grep
severity 224993 grave
severity 206470 grave
merge 206470 224993
thanks
Somebody on gnu-utils-bug please fix the grep LC_CTYPE bug!
See it http://
Often greps that would take just a second have to be killed as the CPU
meter goes to max and minutes go by!
You guys in the western world don't notice it because you don't use
(certain) LC_CTYPEs.
In Debian Bug tracker #181378, Colin Watson (cjwatson) wrote : not release-critical | #9 |
severity 181378 important
thanks
--
Colin Watson [<email address hidden>]
Debian Bug Importer (debzilla) wrote : | #10 |
Automatically imported from Debian bug report #181378 http://
Debian Bug Importer (debzilla) wrote : | #11 |
Message-ID: <20030217151028
Date: Mon, 17 Feb 2003 23:10:28 +0800
From: Max Zou <email address hidden>
To: <email address hidden>
Subject: grep is extremely slow
Package: grep
Version: 2.5.1-1
When I try to use the latest "grep" to search for a pattern in a 100 KB file,
it is considerably slower than the previous version of grep.
Here is a time comparison with grep v2.4.2-3 on the same
machine.
# using grep v2.4.2-3
$ time ./grep-old test file |wc
2232 6696 44261
real 0m0.058s
user 0m0.060s
sys 0m0.000s
# using grep v2.5.1-1
$ time grep test srcfp/a |wc
2232 6696 44261
real 0m10.497s
user 0m10.430s
sys 0m0.010s
Is there any problem with the algorithm used in the latest grep?
I am using Debian sid, kernel-2.4.18 and libc6 2.3.1-11 on
a PIII 700MHz machine with 384MB RAM.
Thanks!
--
regards
ZM
Debian Bug Importer (debzilla) wrote : | #12 |
Message-Id: <email address hidden>
Date: Mon, 11 Aug 2003 16:52:10 +0200
From: Michel Daenzer <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: Depends on locale
Package: grep
Version: 2.5.1-5
Followup-For: Bug #181378
I experienced the same problem, but I just noticed that grep is fast
when the LC_ALL environment variable is set to C. Here's the output of
locale on my system:
LANG=en_US.UTF-8
LC_CTYPE=
LC_NUMERIC=
LC_TIME=
LC_COLLATE=C
LC_MONETARY=
LC_MESSAGES=
LC_PAPER=
LC_NAME=
LC_ADDRESS=
LC_TELEPHONE=
LC_MEASUREMENT=
LC_IDENTIFICATI
LC_ALL=
Seems like grep handles character encoding conversions inefficiently or
something?
-- System Information:
Debian Release: testing/unstable
Architecture: powerpc
Kernel: Linux thor 2.4.20-
Locale: LANG=en_US.UTF-8, LC_CTYPE=
Versions of packages grep depends on:
ii libc6 2.3.1-16 GNU C Library: Shared libraries an
-- no debconf information
Debian Bug Importer (debzilla) wrote : | #13 |
Message-ID: <20030904190238
Date: Thu, 4 Sep 2003 15:02:38 -0400
From: "H. S. Teoh" <email address hidden>
To: <email address hidden>, <email address hidden>, <email address hidden>
Subject: Identical bugs
Debian Bug Importer (debzilla) wrote : | #14 |
Message-ID: <email address hidden>
Date: Tue, 07 Oct 2003 07:33:06 +0800
From: Dan Jacobson <email address hidden>
To: <email address hidden>
severity 206470 important
Debian Bug Importer (debzilla) wrote : | #15 |
Message-Id: <email address hidden>
Date: Sat, 13 Dec 2003 16:19:43 +0100
From: Michel =?ISO-8859-
To: <email address hidden>
Subject: patch
Debian Bug Importer (debzilla) wrote : | #16 |
Message-Id: <email address hidden>
Date: Wed, 18 Feb 2004 04:44:26 +0100
From: Roland Illig <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: grep: ... and Perl is a thousand times faster ...
Debian Bug Importer (debzilla) wrote : | #17 |
Message-id: <email address hidden>
Date: Tue, 29 Jun 2004 23:15:22 +1000
From: Peter Moulder <email address hidden>
To: <email address hidden>
Subject: perl not fair comparison: perl gets "wrong" answer for utf-8 text
Debian Bug Importer (debzilla) wrote : | #18 |
Message-id: <email address hidden>
Date: Wed, 30 Jun 2004 16:46:42 +1000
From: Peter Moulder <email address hidden>
To: <email address hidden>
Subject: gprof; combining diacritical marks; octet regexp conversion
Debian Bug Importer (debzilla) wrote : | #19 |
Message-ID: <email address hidden>
Date: Sat, 11 Sep 2004 04:33:28 +0800
From: Dan Jacobson <email address hidden>
To: <email address hidden>
Cc: <email address hidden>, <email address hidden>
Subject: LC_CTYPE makes grep 800 times slower!
Thom May (thombot) wrote : | #20 |
This is an absolutely absurd inflation of severity.
Debian Bug Importer (debzilla) wrote : | #21 |
Message-ID: <email address hidden>
Date: Sat, 11 Sep 2004 01:14:59 +0100
From: Colin Watson <email address hidden>
To: <email address hidden>
Subject: not release-critical
In Debian Bug tracker #181378, Justin Pryzby (justinpryzby-users) wrote : profile | #22 |
I don't know that I can add useful info here, but this just bit me
too.
$ wc -l /tmp/setuid;
50 /tmp/setuid
$ time ltrace grep -v /dev/ /tmp/setuid 2>&1 |LANG=C grep '^mbrtowc(' |wc -l
77989
real 0m14.802s
user 0m6.118s
sys 0m8.071s
$ calc 50/14.802
Justin
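The `calc` output is not shown above. The figures can be reproduced with a short Python calculation; the variable names are only for illustration, and the timing includes the overhead of running grep under ltrace.

```python
lines = 50             # from `wc -l /tmp/setuid`
mbrtowc_calls = 77989  # lines of ltrace output starting with "mbrtowc("
wall_seconds = 14.802  # "real" time reported by `time`

lines_per_second = lines / wall_seconds
calls_per_line = mbrtowc_calls / lines

print(f"{lines_per_second:.3f} lines per second")
print(f"{calls_per_line:.0f} mbrtowc() calls per input line")
```

That is roughly 3.4 lines per second and about 1560 calls to mbrtowc() for each input line.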
Debian Bug Importer (debzilla) wrote : | #23 |
Message-ID: <20041206152037
Date: Mon, 6 Dec 2004 10:20:37 -0500
From: Justin Pryzby <email address hidden>
To: <email address hidden>
Subject: profile
In Debian Bug tracker #181378, Simon Law (sfllaw) wrote : Red Hat's UTF-8 speedup patch | #24 |
tags 181378 +patch
thanks
Here is Fedora Core 3's patch to grep that makes it work quickly in
UTF-8 environments.
I haven't tested if it applies cleanly to Debian's grep. But if you
make me a co-maintainer, I'll happily spend a couple of hours merging
this and other useful Red Hat patches into our grep.
Simon
Debian Bug Importer (debzilla) wrote : | #25 |
Message-ID: <email address hidden>
Date: Wed, 8 Dec 2004 16:29:56 -0500
From: Simon Law <email address hidden>
To: <email address hidden>
Subject: Red Hat's UTF-8 speedup patch
This patch was written by Tim Waugh <email address hidden> and is taken
from Fedora Core 3's grep 2.5.1-31 package.
It is meant to cache results from mbrtowc(), which significantly speeds
up execution in UTF-8 locales.
-- Simon Law <email address hidden> Wed, 8 Dec 2004 16:28:03 -0500
--- grep-2.
+++ grep-2.
@@ -186,7 +186,8 @@
/* Functions we'll use to search. */
static void (*compile) PARAMS ((char const *, size_t));
-static size_t (*execute) PARAMS ((char const *, size_t, size_t *, int));
+static size_t (*execute) PARAMS ((char const *, size_t, struct mb_cache *,
+ size_t *, int));
/* Like error, but suppress the diagnostic if requested. */
static void
@@ -516,7 +517,7 @@
}
static void
-prline (char const *beg, char const *lim, int sep)
+prline (char const *beg, char const *lim, int sep, struct mb_cache *mb_cache)
{
if (out_file)
printf ("%s%c", filename, sep & filename_mask);
@@ -539,7 +540,8 @@
{
size_t match_size;
size_t match_offset;
- while ((match_offset = (*execute) (beg, lim - beg, &match_size, 1))
+ while ((match_offset = (*execute) (beg, lim - beg, mb_cache,
+ &match_size, 1))
!= (size_t) -1)
{
char const *b = beg + match_offset;
@@ -573,7 +575,8 @@
int i;
for (i = 0; i < lim - beg; i++)
ibeg[i] = tolower (beg[i]);
- while ((match_offset = (*execute) (ibeg, ilim-ibeg, &match_size, 1))
+ while ((match_offset = (*execute) (ibeg, ilim-ibeg, mb_cache,
+ &match_size, 1))
!= (size_t) -1)
{
char const *b = beg + match_offset;
@@ -591,7 +594,8 @@
lastout = lim;
return;
}
- while (lim-beg && (match_offset = (*execute) (beg, lim - beg, &match_size, 1))
+ while (lim-beg && (match_offset = (*execute) (beg, lim - beg, mb_cache,
+ &match_size, 1))
!= (size_t) -1)
{
char const *b = beg + match_offset;
@@ -619,7 +623,7 @@
/* Print pending lines of trailing context prior to LIM. Trailing context ends
at the next matching line when OUTLEFT is 0. */
static void
-prpending (char const *lim)
+prpending (char const *lim, struct mb_cache *mb_cache)
{
if (!lastout)
lastout = bufbeg;
@@ -629,9 +633,10 @@
size_t match_size;
--pending;
if (outleft
- || (((*execute) (lastout, nl - lastout, &match_size, 0) == (size_t) -1)
+ || (((*ex...
Matt Zimmerman (mdz) wrote : | #26 |
*** Bug 15162 has been marked as a duplicate of this bug. ***
Matt Zimmerman (mdz) wrote : | #27 |
<dooglus> the fedora guys fixed it ages ago (see patch, here:
http://
I haven't looked at the above patch myself, but I'm archiving it here
Chris Moore (dooglus) wrote : | #28 |
I have mentioned this bug to a few people, and the response I usually get is
"well, UTF handling is bound to be a bit slower, don't worry about it".
That's missing the point. This bug doesn't make grep 'a bit slower', it changes
it from using linear time to using quadratic time in some cases.
Here's an example, grepping through a million short lines of text:
yes | head -999999 | LC_ALL=C time grep . > /dev/null
it runs in 0.27 seconds on my laptop.
Here's the same example, doing the same work, but in a UTF-8 locale:
yes | head -999999 | LC_ALL=en_US.UTF-8 time grep . > /dev/null
it runs in about *40 hours* on the same laptop. It is around 500,000 times
slower than using the 'C' locale.
That's 40 hours to grep through a 2 megabyte file.
Some people have been unable to reproduce this severe slow-down. I suspect it
may be that they don't have the en_US.UTF-8 locale generated. You can find a
list of UTF-8 locales that have been generated in /etc/locale.gen.
I hope this bug can be fixed, because it really is quite a problem, especially
since the default locale in an Ubuntu install is a UTF-8 one!
Note that the fedora fix referred to in the previous comment works around this
problem in the dfa code by disabling the dfa code when using a UTF-8 locale,
rather than by addressing the real root of the problem (which is that dfaexec()
in src/dfa.c does a complete scan of the remaining input buffer each time it is
called).
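The linear versus quadratic difference can be put into numbers with a back-of-the-envelope Python calculation. It rests on a deliberately crude assumption: the whole input is treated as one buffer, and each per-line call rescans every byte that remains. Real buffering will change the constants, so this is a toy model, not a measurement of dfaexec().

```python
lines = 999_999       # output of `yes | head -999999`
line_bytes = 2        # each line is "y" plus a newline
total_bytes = lines * line_bytes

# Linear model: every byte is examined once.
linear_steps = total_bytes

# Quadratic toy model: the call for line k rescans all lines from k to the end.
quadratic_steps = line_bytes * lines * (lines + 1) // 2

model_ratio = quadratic_steps // linear_steps

# Slowdown reported above: about 40 hours versus 0.27 seconds.
measured_ratio = (40 * 3600) / 0.27

print(total_bytes, linear_steps, quadratic_steps)
print(model_ratio, round(measured_ratio))
```

The input is about 2 MB, the toy model predicts roughly 10^12 byte inspections instead of 2 * 10^6, and both the modelled and the reported slowdown come out in the region of 500,000 times.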
Matt Zimmerman (mdz) wrote : | #29 |
You said on IRC that you'd fixed the bug; perhaps you'd care to share your patch?
Chris Moore (dooglus) wrote : | #30 |
I was mistaken. My changes certainly sped things up, but broke more than they
fixed, sorry.
In Debian Bug tracker #181378, Nicolas François (nicolas-francois) wrote : | #31 |
Package: grep
Version: 2.5.1.ds1-5
Followup-For: Bug #181378
Hello,
I tried the gofast patch and did not find a real improvement.
However, Fedora is now using a different patch, which dramatically improves
grep performance in a UTF-8 environment.
Please find attached the following patches:
* I put the original Fedora patches in the orig directory. The other
patches are updated for the Debian package.
* 64-egf-
It does most of the work. Here is the explanation, according to:
http://
> The full story behind this patch is that grep-2.5.1a does not handle
> UTF-8 gracefully at all. The basic plan with handling UTF-8 in 2.5.1a
> is:
> * whenever a buffer is parsed, go through the entire buffer deciding
> how many bytes make up each character
> * use this information when necessary
>
> This patch changes that to:
> * when information about how many bytes make up a character is needed,
> work it out on demand
>
> On the face of it, this is a small obvious improvement. In fact it is
> much better than that, because the original scheme would calculate
> character lengths several times for each buffer: in fact, one full
> pass for every single potential match!
* 65-dfa-
I'm not sure this one is really needed.
I've read that the DFA algorithm is slow for UTF-8, and this patch disables
it in that case (it can be force-enabled by setting an environment
variable).
* grep-2.
Fedora also added a test for UTF-8.
* 66-match_
* 67-w.patch
After testing the new UTF-8 tests, these too seem to be needed.
(They are not really related to grep's speed, but these patches may
be interesting.)
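The difference described in the quoted explanation of the first patch can be illustrated with a toy cost model in Python. This is not the Fedora patch: the "on demand" variant here simply inspects the byte at the candidate offset, which is an assumption made for the sketch, and all function names are hypothetical.

```python
from typing import Sequence


def char_len(lead: int) -> int:
    """Length in bytes of the UTF-8 character that starts with byte `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def eager_inspections(buf: bytes, candidates: Sequence[int]) -> int:
    """Lead bytes inspected if every candidate match triggers a full pass."""
    inspected = 0
    for _ in candidates:
        i = 0
        while i < len(buf):
            i += char_len(buf[i])
            inspected += 1
    return inspected


def on_demand_inspections(buf: bytes, candidates: Sequence[int]) -> int:
    """Bytes inspected if each candidate offset is checked on its own."""
    inspected = 0
    for offset in candidates:
        # A byte that is not a continuation byte starts a character.
        _is_char_start = (buf[offset] & 0xC0) != 0x80
        inspected += 1
    return inspected


if __name__ == "__main__":
    buf = ("r\u00f4le\n" * 100).encode("utf-8")
    candidates = [i for i, b in enumerate(buf) if b == ord("l")]
    print(eager_inspections(buf, candidates))
    print(on_demand_inspections(buf, candidates))
```

With 500 characters and 100 candidate offsets, the eager scheme inspects 50,000 lead bytes and the on-demand scheme 100, so the eager cost grows with the product of buffer size and number of candidates.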
I tried a grep package with all these patches, and for the following
command:
grep '^' /var/lib/
grep is more than 1500 times faster in a UTF-8 environment.
(On my machine, it takes less than 3/4 s instead of more than 10 minutes!)
Also, I did not notice any regression, and grep is not dramatically
slower on the C locale.
These patches may be important for Etch since the transition to UTF-8 is
mentioned on the (unofficial) Etch TODO list:
http://
(And the French team is considering using UTF-8 for the default French
locale)
Thanks in advance,
--
Nekral
Debian Bug Importer (debzilla) wrote : | #32 |
Message-ID: <email address hidden>
Date: Wed, 7 Sep 2005 11:52:34 +0200
From: Nicolas =?iso-8859-
To: Debian Bug Tracking System <email address hidden>
Subject: grep is extremely slow with UTF-8
In Debian Bug tracker #181378, Santiago Ruano Rincón (santiago) wrote : Bug#181378: fixed in grep 2.5.1.ds1-6 | #33 |
Source: grep
Source-Version: 2.5.1.ds1-6
We believe that the bug you reported is fixed in the latest version of
grep, which is due to be installed in the Debian FTP archive:
grep_2.
to pool/main/
grep_2.
to pool/main/
grep_2.
to pool/main/
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to <email address hidden>,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
Santiago Ruano Rincon <email address hidden> (supplier of updated grep package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing <email address hidden>)
Format: 1.7
Date: Sat, 10 Sep 2005 01:52:04 -0500
Source: grep
Binary: grep
Architecture: source i386
Version: 2.5.1.ds1-6
Distribution: unstable
Urgency: low
Maintainer: Anibal Monsalve Salazar <email address hidden>
Changed-By: Santiago Ruano Rincon <email address hidden>
Description:
grep - GNU grep, egrep and fgrep
Closes: 181378 206470 224993
Changes:
grep (2.5.1.ds1-6) unstable; urgency=low
.
* 64-egf-
67-w.patch speed up grep. Thanks to Nicolas François
<email address hidden> (Closes: #181378, #206470, #224993)
* Deleted the CVS directories
Files:
7797de5e94d5c6
caea29b0505d04
38ead74511b342
In Debian Bug tracker #181378, Santiago Ruano Rincón (santiago) wrote : Bug#206470: fixed in grep 2.5.1.ds1-6 | #34 |
In Debian Bug tracker #181378, Santiago Ruano Rincón (santiago) wrote : Bug#224993: fixed in grep 2.5.1.ds1-6 | #35 |
Debian Bug Importer (debzilla) wrote : | #36 |
Message-Id: <email address hidden>
Date: Sat, 10 Sep 2005 22:47:05 -0700
From: Santiago Ruano Rincon <email address hidden>
To: <email address hidden>
Subject: Bug#181378: fixed in grep 2.5.1.ds1-6
Debian Bug Importer (debzilla) wrote : | #37 |
Message-Id: <email address hidden>
Date: Sat, 10 Sep 2005 22:47:05 -0700
From: Santiago Ruano Rincon <email address hidden>
To: <email address hidden>
Subject: Bug#206470: fixed in grep 2.5.1.ds1-6
Debian Bug Importer (debzilla) wrote : | #38 |
Message-Id: <email address hidden>
Date: Sat, 10 Sep 2005 22:47:05 -0700
From: Santiago Ruano Rincon <email address hidden>
To: <email address hidden>
Subject: Bug#224993: fixed in grep 2.5.1.ds1-6
Matt Zimmerman (mdz) wrote : | #39 |
It would be awfully nice to have this fixed for Breezy final, but I'm wary of
touching such a core package for the sake of a performance issue. Please look
over the patches and see if you can establish the amount of risk.
Chris Moore (dooglus) wrote : | #40 |
This bug really does need fixing. grep isn't just slower, it's unusable in some cases. I show a case in comment 16 above
where grep takes 40 hours to grep for a very simple pattern in a 2 megabyte file.
If you make a 2 gigabyte file instead of a 2 megabyte file (1000 times bigger), the execution time doesn't multiply by
1000, but by a million, so it will take over 4,500 years to run. Just to grep through 2 gigabytes?
Note that (1) UTF locales are used by default (I think) and (2) grep is used by default crontabs. This means that the
problem will be exhibited in clean installs.
I first noticed the problem when my laptop started crashing at random. It turned out that there's something wrong with it
which causes it to power down if it gets hot. It was this grep bug which was causing it to overheat - grep is CPU bound
due to this bug.
When evaluating the risk of fixing the bug, please also look at the risk of leaving grep broken.
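The estimate in this comment follows from assuming that the run time grows with the square of the input size. A minimal sketch of that arithmetic is below; the function names are illustrative, and the quadratic model is taken from the comment rather than measured independently.

```c
#include <assert.h>

/* Estimated run time after the input grows by size_factor,
   assuming (as the comment does) that cost grows as size^2. */
double scaled_hours(double base_hours, double size_factor)
{
    assert(size_factor >= 0.0);
    return base_hours * size_factor * size_factor;
}

/* Convert hours to years, using 365.25 days per year. */
double hours_to_years(double hours)
{
    return hours / (24.0 * 365.25);
}
```

With base_hours = 40 and size_factor = 1000, scaled_hours gives 4e7 hours, which hours_to_years turns into a little over 4,500 years, in line with the figure quoted.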
Matt Zimmerman (mdz) wrote : | #41 |
(In reply to comment #24)
> This bug really does need fixing. grep isn't just slower, it's unusable in
> some cases. I show a case in comment 16 above
> where grep takes 40 hours to grep for a very simple pattern in a 2 megabyte file.
It is a genuine bug and should be fixed. That's why we have an open bug report
about it; no need to plead that case.
> Note that (1) UTF locales are used by default (I think) and (2) grep is used
> by default crontabs. This means that the
> problem will be exhibited in clean installs.
All of this has been true in Ubuntu for nearly a year now, and I have seen no
reports of catastrophic failures, so while it certainly ought to be fixed if
possible, it does not seem like a crisis. When it comes to non-interactive
jobs, it is certainly better to have grep complete 10x more slowly than to have
it return incorrect results, so this change requires careful testing and review
even to be considered for inclusion in Breezy at such a late stage.
In Debian Bug tracker #181378, Dato Simó (dato) wrote : Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc) | #42 |
found 181378 2.5.1.ds2-1
thanks
* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
> * Removed 64-egf-
> 66-match_
> closes: #329876.
Those patches fixed a bug (and two merged) that had been opened for 2
and a half years. I think it'd be useful if you tried to contact the
authors of the patches, and try to fix them instead of removing them?
> * Removed grep.texi from upstream tarball, 50-rgrep-info.patch and
> 51-dircategory-
> Documentation License from debian/copyright and debian/fdl.txt,
> closes: #281647.
Still, grep.1 remains, which (a) contains verbatim paragraphs from
grep.texi yet (b) comes in the upstream tarball with a license notice.
Does this mean that grep.1 is?:
- under the GFDL, so should be removed
- under the GPL (the general license of the tarball), despite
sharing contents with grep.texi
- undistributable, because it has no license attached
Cheers,
--
Adeodato Simó
EM: asp16 [ykwim] alu.ua.es | PK: DA6AE621
Man is certainly stark mad; he cannot make a flea, yet he makes gods by the
dozens.
-- Michel de Montaigne
Debian Bug Importer (debzilla) wrote : | #43 |
Message-ID: <email address hidden>
Date: Mon, 26 Sep 2005 20:04:24 +0200
From: Adeodato =?utf-8?
To: <email address hidden>,
Anibal Monsalve Salazar <email address hidden>
Cc: <email address hidden>, <email address hidden>, <email address hidden>
Subject: Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc)
In Debian Bug tracker #181378, Anibal Monsalve Salazar (anibal) wrote : grep reopen #181378, #206470, #224993 | #44 |
reopen 181378
reopen 206470
reopen 224993
thanks
Anibal Monsalve Salazar
--
.''`. Debian GNU/Linux
: :' : Free Operating System
`. `' http://
`- http://
In Debian Bug tracker #181378, Anibal Monsalve Salazar (anibal) wrote : Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc) | #45 |
On Mon, Sep 26, 2005 at 08:04:24PM +0200, Adeodato Simó wrote:
>found 181378 2.5.1.ds2-1
>thanks
>
>* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
>
>> * Removed 64-egf-
>> 66-match_
>> closes: #329876.
>
> Those patches fixed a bug (and two merged) that had been opened for 2
> and a half years. I think it'd be useful if you tried to contact the
> authors of the patches, and try to fix them instead of removing them?
Sure, the grep maintainers decided to pull them out and will go
through the patches again.
I have bcc-ed #181378.
>> * Removed grep.texi from upstream tarball, 50-rgrep-info.patch and
>> 51-dircategory-
>> Documentation License from debian/copyright and debian/fdl.txt,
>> closes: #281647.
>
> Still, grep.1 remains, which (a) contains verbatim paragraphs from
> grep.texi yet (b) comes in the upstream tarball with a license notice.
> Does this mean that grep.1 is?:
>
> - under the GFDL, so should be removed
grep.texi is the only documentation file under the GFDL whereas
grep.1 is not.
> - under the GPL (the general license of the tarball), despite
> sharing contents with grep.texi
grep.1 is covered by the license of the tarball which is the GPL.
> - undistributable, because it has no license attached
I don't think so. If grep.1 is undistributable, so are many other
files.
grep.1 is not the only file without an explicit license. Other
files without an explicit license are:
lib/alloca.c
lib/closeout.h
lib/hard-locale.h
lib/regex.h
lib/savedir.h
lib/xstrtol.h
po/cat-id-tbl.c
src/dosbuf.c
src/getpagesize.h
src/grepmat.c
src/vms_fab.c
src/vms_fab.h
vms/config_vms.h
config.h
> Cheers,
>
>--
>Adeodato Simó
> EM: asp16 [ykwim] alu.ua.es | PK: DA6AE621
>
>Man is certainly stark mad; he cannot make a flea, yet he makes gods by the
>dozens.
> -- Michel de Montaigne
Aníbal Monsalve Salazar
--
.''`. Debian GNU/Linux
: :' : Free Operating System
`. `' http://
`- http://
Debian Bug Importer (debzilla) wrote : | #46 |
Message-ID: <email address hidden>
Date: Tue, 27 Sep 2005 09:54:17 +1000
From: =?iso-8859-
To: <email address hidden>
Subject: grep reopen #181378, #206470, #224993
Debian Bug Importer (debzilla) wrote : | #47 |
Message-ID: <email address hidden>
Date: Tue, 27 Sep 2005 11:12:07 +1000
From: =?iso-8859-
To: <email address hidden>, <email address hidden>
Subject: Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc)
In Debian Bug tracker #181378, Nicolas François (nicolas-francois) wrote : | #48 |
Hello,
On Tue, Sep 27, 2005 at 11:12:07AM +1000, Aníbal Monsalve Salazar wrote:
> On Mon, Sep 26, 2005 at 08:04:24PM +0200, Adeodato Simó wrote:
> >found 181378 2.5.1.ds2-1
> >thanks
> >
> >* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
> >
> >> * Removed 64-egf-
> >> 66-match_
> >> closes: #329876.
> >
> > Those patches fixed a bug (and two merged) that had been opened for 2
> > and a half years. I think it'd be useful if you tried to contact the
> > authors of the patches, and try to fix them instead of removing them?
>
> Sure, the grep maintainers decided to pull them out and will go
> through the patches again.
I wondered if I introduced this issue while porting the Fedora patches to
Debian, so I tried Fedora's grep...which has the same issue.
You can reproduce it with this simple command:
echo foobar | grep -Fw ""
This was introduced by the patch I named '64-egf-
You can fix it by changing the 'while (1)' to 'while (len)' (or by
embedding this while loop in an 'if (len){...}'; I don't know if there is a
real difference, or which is the best way).
Tim Waugh, who wrote the original patches, may have a better understanding
of grep's code.
The testsuite still passes with this patch.
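To illustrate why the loop condition matters, here is a simplified sketch. It is not the code from search.c: the function word_match_len and the helper is_word_char are hypothetical, and the retry-with-a-shorter-length structure is only an assumption about the general shape of the loop. The point it shows is that a retry loop which shrinks len only while len > 0 makes no progress once len is 0, so 'while (1)' can spin forever on an empty pattern, whereas 'while (len)' exits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: word constituents in the sense of grep -w. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Look for a whole-word span starting at text[pos], trying the
   candidate length len first and then shorter ones.
   Returns the accepted length, or 0 if no span qualifies. */
size_t word_match_len(const char *text, size_t size, size_t pos, size_t len)
{
    assert(pos + len <= size);
    while (len)   /* with 'while (1)', len == 0 could loop with no state change */
    {
        int start_ok = pos == 0
                       || !is_word_char((unsigned char) text[pos - 1]);
        int end_ok = pos + len == size
                     || !is_word_char((unsigned char) text[pos + len]);
        if (start_ok && end_ok)
            return len;
        if (len > 0)   /* redundant here; mirrors a guard that never fires at 0 */
            --len;     /* retry with a shorter candidate at the same position */
    }
    return 0;
}
```

Called with len == 0 on "foobar", the function returns immediately. With 'while (1)' in place of 'while (len)', the same call would keep re-testing an unchanged candidate.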
BTW, I don't know if you received a mail I sent to <email address hidden>,
which indicated that the additional patches (which I submitted because
they helped pass the testsuite) were fixing: #209194 #218873 #226397
#238167
If you plan to re-introduce these patches, please tell me. While checking
for this issue (#329876), I've seen that there was one issue fixed in a
Fedora update, related to this patch:
https:/
I can update 64-egf-
Kind Regards,
--
Nekral
In Debian Bug tracker #181378, Anibal Monsalve Salazar (anibal) wrote : | #49 |
On Tue, Sep 27, 2005 at 11:53:41PM +0200, Nicolas François wrote:
>On Tue, Sep 27, 2005 at 11:12:07AM +1000, Aníbal Monsalve Salazar wrote:
>>On Mon, Sep 26, 2005 at 08:04:24PM +0200, Adeodato Simó wrote:
>>>* Anibal Monsalve Salazar [Mon, 26 Sep 2005 05:47:06 -0700]:
>>>
>>>> * Removed 64-egf-
>>>> 66-match_
>>>> closes: #329876.
>>>
>>> Those patches fixed a bug (and two merged) that had been opened for 2
>>> and a half years. I think it'd be useful if you tried to contact the
>>> authors of the patches, and try to fix them instead of removing them?
>>
>>Sure, the grep maintainers decided to pull them out and will go
>>through the patches again.
>
>I wondered if I introduced this issue while porting the Fedora patches to
>Debian, so I tried Fedora's grep...which has the same issue.
>
>You can reproduce it with this simple command:
>echo foobar | grep -Fw ""
>
>This was introduced by the patch I named '64-egf-
>
>You can fix it by changing the 'while (1)' to 'while (len)' (or by
>embedding this while loop in an 'if (len){...}'; I don't know if there is a
>real difference, or which is the best way).
>Tim Waugh, who wrote the original patches, may have a better understanding
>of grep's code.
>
>The testsuite still passes with this patch.
>
>BTW, I don't know if you received a mail I sent to <email address hidden>,
>which indicated that the additional patches (which I submitted because
>they helped pass the testsuite) were fixing: #209194 #218873 #226397
>#238167
I received it, thanks. I'll close the bugs.
>If you plan to re-introduce these patches, please tell me. While checking
>for this issue (#329876), I've seen that there was one issue fixed in a
>Fedora update, related to this patch:
>https:/
>I can update 64-egf-
Yes, please. I would like to reapply 64-egf-
(and 6[567]-*.patch) and an updated version will be very much
appreciated.
>Kind Regards,
>--
>Nekral
Regards,
Aníbal Monsalve Salazar
--
.''`. Debian GNU/Linux
: :' : Free Operating System
`. `' http://
`- http://
Debian Bug Importer (debzilla) wrote : | #50 |
Message-ID: <email address hidden>
Date: Tue, 27 Sep 2005 23:53:41 +0200
From: Nicolas =?iso-8859-
To: =?iso-8859-
Cc: <email address hidden>, <email address hidden>
Subject: Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc)
Debian Bug Importer (debzilla) wrote : | #51 |
Message-ID: <email address hidden>
Date: Wed, 28 Sep 2005 09:04:03 +1000
From: =?iso-8859-
To: Nicolas =?iso-8859-
Cc: Santiago Ruano Rincon <email address hidden>,
<email address hidden>, <email address hidden>
Subject: Re: Accepted grep 2.5.1.ds2-1 (source i386 sparc)
In Debian Bug tracker #181378, Nicolas François (nicolas-francois) wrote : update for 64-egf-speedup.patch | #52 |
Hello,
Please find attached an update for the 64-egf-
The other patches did not need to be updated and can be found in the
#181378 log.
This update intends to fix:
echo foobar | grep -Fw ""
(which was hanging with the previous version)
echo test | LC_ALL=C grep -Fw test
echo x test x | LC_ALL=C grep -Fw test
which were not working and were fixed by Tim Waugh (original author of the
patches).
I intend to mail Tim Waugh about the first issue, to check if my fix is
correct/optimal. grep being slow on UTF-8 is not that critical, so it may be
better to wait for his answer before releasing it. I will CC the BTS.
Kind Regards,
--
Nekral
In Debian Bug tracker #181378, Nicolas François (nicolas-francois) wrote : grep hanging with -Fw and an empty pattern | #53 |
Hello,
Sorry for contacting you directly.
I'm trying to port your patch (grep-2.
This patch triggered an issue when an empty pattern is used with the -Fw
options.
(see http://
I tried the Fedora grep-2.5.1-48.2 binary, which suffers from the same
issue (on a Debian system, with a Debian libc and libpcre):
echo foobar | grep -Fw ""
hangs (this could appear with the -Fwf options when the patterns file
contains an empty line).
Changing the 'while (1)' loop to a 'while (len)' loop in search.c fixes this
issue. However, I don't know if this is correct or optimal (I don't know
what should happen if we enter the loop with len>0 and len is then
decreased to 0; maybe this should also be caught earlier).
Does it seem correct to you?
Sorry I could not check whether a Red Hat system suffers from this (that's the
reason why I do not use the BTS), and thanks a lot for the impressive
speed-up of grep in a UTF-8 environment,
--
Nekral
Debian Bug Importer (debzilla) wrote : | #54 |
Message-ID: <email address hidden>
Date: Wed, 28 Sep 2005 12:58:50 +0200
From: Nicolas =?iso-8859-
To: <email address hidden>
Cc: Santiago Ruano Rincon <email address hidden>
Subject: update for 64-egf-
--- src/search.c.orig 2005-09-06 20:53:35.000000000 +0200
+++ src/search.c 2005-09-06 22:12:36.000000000 +0200
@@ -18,9 +18,13 @@
/* Written August 1992 by Mike Haertel. */
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE 1
+#endif
#ifdef HAVE_CONFIG_H
# include <config.h>
#endif
+#include <assert.h>
#include <sys/types.h>
#if defined HAVE_WCTYPE_H && defined HAVE_WCHAR_H && defined HAVE_MBRTOWC
/* We can handle multibyte string. */
@@ -39,6 +43,9 @@
#ifdef HAVE_LIBPCRE
# include <pcre.h>
#endif
+#ifdef HAVE_LANGINFO_
+# include <langinfo.h>
+#endif
#define NCHAR (UCHAR_MAX + 1)
@@ -70,9 +77,10 @@
call the regexp matcher at all. */
static int kwset_exact_
-#if defined(
-static char* check_multibyte
-#endif
+/* UTF-8 encoding allows some optimizations that we can't otherwise
+ assume in a multibyte encoding. */
+static int using_utf8;
+
static void kwsinit PARAMS ((void));
static void kwsmusts PARAMS ((void));
static void Gcompile PARAMS ((char const *, size_t));
@@ -84,6 +92,15 @@
static size_t Pexecute PARAMS ((char const *, size_t, size_t *, int));
void
+check_utf8 (void)
+{
+#ifdef HAVE_LANGINFO_
+ if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
+ using_utf8 = 1;
+#endif
+}
+
+void
dfaerror (char const *mesg)
{
error (2, 0, mesg);
@@ -141,47 +158,6 @@
}
}
-#ifdef MBS_SUPPORT
-/* This function allocate the array which correspond to "buf".
- Then this check multibyte string and mark on the positions which
- are not singlebyte character nor the first byte of a multibyte
- character. Caller must free the array. */
-static char*
-check_
-{
- char *mb_properties = malloc(size);
- mbstate_t cur_state;
- wchar_t wc;
- int i;
- memset(&cur_state, 0, sizeof(mbstate_t));
- memset(
- for (i = 0; i < size ;)
- {
- size_t mbclen;
- mbclen = mbrtowc(&wc, buf + i, size...
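The check_utf8 hunk above asks the C library which character set the current locale uses. A minimal standalone sketch of that check follows; the function name locale_is_utf8 is hypothetical, and the sketch assumes a POSIX system that provides nl_langinfo and CODESET. The program must call setlocale first, otherwise it stays in the "C" locale and the check reports 0.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: returns 1 when the
   codeset of the current LC_CTYPE locale is named "UTF-8", else 0.
   Call setlocale() before using it. */
static int
locale_is_utf8 (void)
{
  const char *codeset = nl_langinfo (CODESET);
  assert (codeset != NULL);
  return strcmp (codeset, "UTF-8") == 0;
}
```

A caller would typically run setlocale (LC_ALL, "") once at startup and cache the result, much as the patch stores it in using_utf8.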
Debian Bug Importer (debzilla) wrote : | #55 |
Message-ID: <email address hidden>
Date: Wed, 28 Sep 2005 13:26:27 +0200
From: Nicolas =?iso-8859-
To: <email address hidden>
Cc: <email address hidden>
Subject: grep hanging with -Fw and an empty pattern
Hello,
Sorry for contacting you directly.
I'm trying to port your patch (grep-2.
This patch triggered an issue when an empty pattern is used with the -Fw
options.
(see http://
I tried the Fedora grep-2.5.1-48.2 binary, which suffers from the same
issue (on a Debian system, with a Debian libc and libpcre):
echo foobar | grep -Fw ""
hangs (this could appear with the -Fwf options when the patterns file
contains an empty line).
Changing the 'while (1)' loop to a 'while (len)' loop in search.c fixes this
issue. However, I don't know if this is correct or optimal (I don't know
what should happen if we enter the loop with len>0 and len is then
decreased to 0; maybe this should also be caught earlier).
Does it seem correct to you?
Sorry I could not check whether a Red Hat system suffers from this (that's the
reason why I do not use the BTS), and thanks a lot for the impressive
speed-up of grep in a UTF-8 environment,
--
Nekral
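The shape of the fix described above can be sketched in isolation. This is a hypothetical illustration, not the loop from search.c: the names is_word_char and word_bounded_len are invented here, and the shrinking strategy is only assumed to resemble what -w does. Written with 'while (1)', this body would decrement len past zero for an empty candidate; testing len in the loop condition makes the empty case fall straight through.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical: word characters are alphanumerics and underscore. */
static int
is_word_char (char c)
{
  return isalnum ((unsigned char) c) || c == '_';
}

/* Hypothetical: given a candidate match of LEN bytes at BUF + START,
   shrink it from the right until it ends on a word boundary.
   Returns the accepted length, or 0 when nothing word-bounded is left. */
static size_t
word_bounded_len (const char *buf, size_t size, size_t start, size_t len)
{
  assert (start <= size && len <= size - start);

  /* Shrinking from the right cannot repair a bad left boundary. */
  if (start > 0 && is_word_char (buf[start - 1]))
    return 0;

  while (len)                   /* an empty candidate never enters */
    {
      if (start + len == size || !is_word_char (buf[start + len]))
        return len;
      len--;
    }
  return 0;
}
```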
In Debian Bug tracker #181378, Tim Waugh (twaugh) wrote : | #56 |
On Wed, Sep 28, 2005 at 01:26:27PM +0200, Nicolas François wrote:
> Changing the 'while (1)' loop to a 'while (len)' loop in search.c fixes this
> issue. However, I don't know if this is correct or optimal (I don't know
> what should happen if we enter the loop with len>0 and len is then
> decreased to 0; maybe this should also be caught earlier).
>
> Does it seem correct to you?
Yes, looks correct to me. Thanks.
Tim.
*/
Debian Bug Importer (debzilla) wrote : | #57 |
Message-ID: <email address hidden>
Date: Thu, 29 Sep 2005 13:22:39 +0100
From: Tim Waugh <email address hidden>
To: Nicolas =?iso-8859-
Cc: <email address hidden>
Subject: Re: grep hanging with -Fw and an empty pattern
On Wed, Sep 28, 2005 at 01:26:27PM +0200, Nicolas François wrote:
> Changing the 'while (1)' loop to a 'while (len)' loop in search.c fixes this
> issue. However, I don't know if this is correct or optimal (I don't know
> what should happen if we enter the loop with len>0 and len is then
> decreased to 0; maybe this should also be caught earlier).
>
> Does it seem correct to you?
Yes, looks correct to me. Thanks.
Tim.
*/
In Debian Bug tracker #181378, Anibal Monsalve Salazar (anibal) wrote : Bug#181378: fixed in grep 2.5.1.ds2-2 | #58 |
Source: grep
Source-Version: 2.5.1.ds2-2
We believe that the bug you reported is fixed in the latest version of
grep, which is due to be installed in the Debian FTP archive:
grep_2.
to pool/main/
grep_2.
to pool/main/
grep_2.
to pool/main/
grep_2.
to pool/main/
grep_2.
to pool/main/
A summary of the changes between this version and the previous one is
attached.
Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to <email address hidden>,
and the maintainer will reopen the bug report if appropriate.
Debian distribution maintenance software
pp.
Anibal Monsalve Salazar <email address hidden> (supplier of updated grep package)
(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing <email address hidden>)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Format: 1.7
Date: Wed, 26 Oct 2005 19:14:35 +1000
Source: grep
Binary: grep
Architecture: source i386 alpha sparc
Version: 2.5.1.ds2-2
Distribution: unstable
Urgency: low
Maintainer: Anibal Monsalve Salazar <email address hidden>
Changed-By: Anibal Monsalve Salazar <email address hidden>
Description:
grep - GNU grep, egrep and fgrep
Closes: 181378 206470 224993 240239 257900 267718 284676
Changes:
grep (2.5.1.ds2-2) unstable; urgency=low
.
* Patched 64-egf-
<email address hidden>. Put 64-egf-
65-
in, closes: #181378, #206470, #224993.
* Fixed "minor documentation syntax error", closes: #240239,
#257900. Patches by Allard Hoeve <email address hidden> and Derrick
'dman' Hudson <email address hidden>.
* Fixed "info page not in main info menu", closes: #284676,
#267718. Patches by Rui Tiago Cação Matos
<email address hidden> and Paul Brook <email address hidden>.
Files:
88b2af4b357872
14e96467e86232
e69a3fbbab8663
76128b684a7dea
01da865bef322c
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
iD8DBQFDX1MXipB
Kowqh+yG6VdaC2w
=sBND
-----END PGP SIGNATURE-----
In Debian Bug tracker #181378, Anibal Monsalve Salazar (anibal) wrote : Bug#206470: fixed in grep 2.5.1.ds2-2 | #59 |
In Debian Bug tracker #181378, Anibal Monsalve Salazar (anibal) wrote : Bug#224993: fixed in grep 2.5.1.ds2-2 | #60 |
Debian Bug Importer (debzilla) wrote : | #61 |
Message-Id: <email address hidden>
Date: Wed, 26 Oct 2005 03:02:09 -0700
From: Anibal Monsalve Salazar <email address hidden>
To: <email address hidden>
Subject: Bug#181378: fixed in grep 2.5.1.ds2-2
Debian Bug Importer (debzilla) wrote : | #62 |
Message-Id: <email address hidden>
Date: Wed, 26 Oct 2005 03:02:09 -0700
From: Anibal Monsalve Salazar <email address hidden>
To: <email address hidden>
Subject: Bug#206470: fixed in grep 2.5.1.ds2-2
Debian Bug Importer (debzilla) wrote : | #63 |
Message-Id: <email address hidden>
Date: Wed, 26 Oct 2005 03:02:09 -0700
From: Anibal Monsalve Salazar <email address hidden>
To: <email address hidden>
Subject: Bug#224993: fixed in grep 2.5.1.ds2-2
Matt Zimmerman (mdz) wrote : | #64 |
*** Bug 24902 has been marked as a duplicate of this bug. ***
Daniel Robitaille (robitaille) wrote : | #65 |
According to the Debian bug report, this has been fixed since October. Dapper seems to contain a version newer than the fixed one in Debian. I did some tests in both Breezy and Dapper, and grep seems normally fast: 0m0.051s to search a 270k text document, for example.
Chris Moore (dooglus) wrote : | #66 |
Daniel, I'm not sure whether your comment is saying that grep is normally fast in both dapper and breezy, but it sounds like it might.
The bug does seem to have been fixed in dapper - grep with a UTF-8 locale in dapper now runs about 3 or 4 times slower than without a UTF-8 locale, whatever the size of the input file. This is relatively OK - grepping through 10 million lines takes 9.5 seconds instead of 2.1 seconds:
(dapper) $ yes | head -9999999 | LC_ALL=en_UR.UTF-8 time grep . > /dev/null
2.12user 0.01system 0:02.73elapsed [...]
(dapper) $ yes | head -9999999 | LC_ALL=en_US.UTF-8 time grep . > /dev/null
9.54user 0.03system 0:11.27elapsed [...]
On breezy however the bug remains. 'grep' isn't a fixed 3 or 4 times slower, but is slower by a factor that is proportional to the size of the input file. This means that grep on breezy in UTF-8 locale still runs in quadratic time, making it impossible to run grep on large files in some cases. This is not OK - grepping the same 10 million lines takes 20 weeks instead of 2.2 seconds:
(breezy) $ yes | head -9999999 | LC_ALL=en_UR.UTF-8 time grep . > /dev/null
2.21user 0.01system 0:02.80elapsed [...]
(breezy) $ yes | head -9999999 | LC_ALL=en_US.UTF-8 time grep . > /dev/null
this didn't finish yet, but will take something in the region of 20 WEEKS to run
In summary: fixed in dapper, still broken in breezy.
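To make the linear versus quadratic difference concrete, here is a small cost model. It is only an assumption about how quadratic behaviour of this kind can arise (re-examining the buffer from its start for every line), not a diagnosis of the breezy binary; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes examined if handling line i means rescanning lines 1..i
   from the start of the buffer: line_len * (1 + 2 + ... + nlines). */
static unsigned long long
bytes_scanned_rescanning (size_t nlines, size_t line_len)
{
  unsigned long long total = 0;

  assert (line_len > 0);
  for (size_t i = 1; i <= nlines; i++)
    total += (unsigned long long) i * line_len;
  return total;
}

/* Bytes examined if every byte is looked at exactly once. */
static unsigned long long
bytes_scanned_single_pass (size_t nlines, size_t line_len)
{
  assert (line_len > 0);
  return (unsigned long long) nlines * line_len;
}
```

For the 9999999 two-byte lines produced by 'yes | head -9999999', the first model examines about 10^14 bytes and the second about 2 x 10^7, which is the kind of gap that turns seconds into weeks.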
Daniel Robitaille (robitaille) wrote : | #67 |
If it appears that it is fixed in Dapper, maybe it will be time to finally close this old bug.
Ian Jackson (ijackson) wrote : Re: [Bug 7906] grep is extremely slow with UTF-8 | #68 |
affects /distros/ubuntu
status fixreleased
done
(Hmm, let's try that again, shall we?)
Daniel Robitaille writes ("[Bug 7906] grep is extremely slow with UTF-8"):
> If it appears that it is fixed in Dapper, maybe it will be time to
> finally close this old bug.
Indeed, thanks.
Ian.
Changed in grep: | |
status: | Unconfirmed → Fix Released |
Daniel Robitaille (robitaille) wrote : | #69 |
Fixed in Debian
Changed in grep: | |
status: | Unconfirmed → Fix Released |
In Debian Bug tracker #181378, Anibal Monsalve Salazar (anibal) wrote : Re: Bug#439827: Patch 65-dfa-optional.patch causes grep regressions | #70 |
unarchive 181378
reopen 181378
found 181378 2.5.3~dfsg-2
thanks
Aníbal Monsalve Salazar
--
http://
Changed in grep: | |
status: | Fix Released → New |
In Debian Bug tracker #181378, Touko Korpela (tkorpela) wrote : tagging 181378 | #71 |
# Automatically generated email from bts, devscripts version 2.10.7
tags 181378 - patch
In Debian Bug tracker #181378, Touko Korpela (tkorpela) wrote : sid+lenny slow, etch is OK | #72 |
It seems that etch version 2.5.1.ds2-6 is not slow, but 2.5.3~dfsg-2 is
very slow with UTF-8 (I'm on i386)
In Debian Bug tracker #181378, Touko Korpela (tkorpela) wrote : fixed 181378 in 2.5.1.ds2-6 | #73 |
# Automatically generated email from bts, devscripts version 2.10.7
fixed 181378 2.5.1.ds2-6
In Debian Bug tracker #181378, Touko Korpela (tkorpela) wrote : found 181378 in 2.5.3~dfsg-1 | #74 |
# Automatically generated email from bts, devscripts version 2.10.7
found 181378 2.5.3~dfsg-1
In Debian Bug tracker #181378, Nicolas François (nicolas-francois) wrote : Re: Bug#181378: update for 64-egf-speedup.patch | #75 |
tags 181378 patch
forcemerge 181378 442882
thanks
Hello,
Please find attached updated patches for grep-2.5.3~dfsg:
* 64-egf-
This provides the speedup when the DFA algorithm is not used.
But the DFA algorithm is used for most grep executions.
(So there are no speed improvements if 65-dfa-
applied)
* 65-dfa-
This disables the DFA algorithm, which can be very slow in UTF-8
environments. The DFA algorithm can be enabled with an environment
variable.
(This patch is not valid if 64-egf-
These two patches are tightly coupled and must be applied together.
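Gating a code path on an environment variable, as the second patch is described as doing, might look like the sketch below. The variable name passed by the caller, the helper names flag_is_on and dfa_enabled, and the rule that an unset, empty or "0" value means "off" are all assumptions for illustration; the message above does not say which variable the patch reads or how it interprets the value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical policy: a flag is on when it is set, non-empty,
   and not the string "0". */
static int
flag_is_on (const char *value)
{
  return value != NULL && value[0] != '\0' && strcmp (value, "0") != 0;
}

/* Hypothetical: decide whether to use the DFA matcher by looking up
   VAR_NAME in the environment. */
static int
dfa_enabled (const char *var_name)
{
  assert (var_name != NULL);
  return flag_is_on (getenv (var_name));
}
```

Reading the variable once at startup and caching the answer keeps the lookup out of the per-line matching path.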
There also used to be two other patches in 2.5.1, which improve the
results of the grep testsuite:
* 66-match_
This patch fixes some usage of the -i option.
It could probably be applied without the previous patches.
* 67-w.patch
This patch fixes the -w option.
This probably fixes issues introduced by the first two patches.
I tried to add a few comments in the header of the patches.
With the 4 patches applied, 3 tests fail in the grep testsuite, but the
results are better than an unpatched upstream.
It would be nice to have a patch to split the testsuite into two categories:
known working test cases and known broken test cases (e.g. in the spencer1
testsuite, I don't expect the handling of case insensitive matches for non
latin characters to be fixed in the near future).
This would allow running the testsuite at build time and detecting regressions
in later uploads.
There are currently too many test cases/sub cases that fail to consider
the testsuite as useful at build time.
I'm also concerned about the maintainability of these patches.
I will try to reduce their size and comment them, but do not wait for this
for an upload (I won't have time in the next two weeks).
With these 4 patches applied, there are probably a few bugs in the BTS
which can be closed (obviously the "grep too slow" bugs, but you should
also check if the locale dependent bugs (or the bugs which involve the -i
or -w options) are still reproducible)
I will subscribe to the PTS for grep, but do not hesitate to ping me if
these patches break grep.
Kind Regards,
--
Nekral
In Debian Bug tracker #181378, Thomas Viehmann (tv-beamnet) wrote : grep: diff for NMU version 2.5.3~dfsg-2.1 | #76 |
tags 181378 + pending
tags 442882 + pending
thanks
Hi,
The following is the diff for the grep 2.5.3~dfsg-2.1 NMU with
the patches by Nicolas François. As per Santiago's mail to d-devel[1]
and the apparently only inadvertently lowered severity, I'm NMUing this
as an RC bug. According to my tests, the new version resolves the
regressions reported against grep 2.5.3~dfsg-1.
Kind regards
T.
1. http://
--
Thomas Viehmann, <email address hidden>
diff -u grep-2.
--- grep-2.
+++ grep-2.
@@ -1,3 +1,11 @@
+grep (2.5.3~dfsg-2.1) unstable; urgency=high
+
+ * Non-maintainer upload.
+ * Reinstate patches by Nicolas François <email address hidden>
+ Closes: #181378, #442882
+
+ -- Thomas Viehmann <email address hidden> Tue, 02 Oct 2007 23:02:35 +0200
+
grep (2.5.3~dfsg-2) unstable; urgency=low
* Removed 65-dfa-
only in patch2:
unchanged:
--- grep-2.
+++ grep-2.
@@ -0,0 +1,792 @@
+--- src/search.c.orig
++++ src/search.c
+@@ -18,10 +18,15 @@
+
+ /* Written August 1992 by Mike Haertel. */
+
++#ifndef _GNU_SOURCE
++# define _GNU_SOURCE 1
++#endif
+ #ifdef HAVE_CONFIG_H
+ # include <config.h>
+ #endif
+
++#include <assert.h>
++
+ #include <sys/types.h>
+
+ #include "mbsupport.h"
+@@ -43,6 +48,9 @@
+ #ifdef HAVE_LIBPCRE
+ # include <pcre.h>
+ #endif
++#ifdef HAVE_LANGINFO_
++# include <langinfo.h>
++#endif
+
+ #define NCHAR (UCHAR_MAX + 1)
+
+@@ -68,6 +76,19 @@
+ error (2, 0, _("memory exhausted"));
+ }
+
++/* UTF-8 encoding allows some optimizations that we can't otherwise
++ assume in a multibyte encoding. */
++static int using_utf8;
++
++void
++check_utf8 (void)
++{
++#ifdef HAVE_LANGINFO_
++ if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
++ using_utf8 = 1;
++#endif
++}
++
+ #ifndef FGREP_PROGRAM
+ /* DFA compiled regexp. */
+ static struct dfa dfa;
+@@ -134,49 +155,6 @@
+ }
+ #endif /* !FGREP_PROGRAM */
+
+-#ifdef MBS_SUPPORT
+-/* This function allocate the array which correspond to "buf".
+- Then this check multibyte string and mark on the positions which
+- are not single byte character nor the first byte of a multibyte
+- character. Caller must free the array. */
+-static char*
+-check_
+-{
+- char *mb_properties = xmalloc(size);
+- mbstate_t cur_state;
+- wchar_t wc;
+- int i;
+-
+- memset(&cur_state, 0, sizeof(mbstate_t));
+- memset(
+-
+- for (i = 0; i < size ;)
+- {
+- size_t mbclen;
+- mbclen = mbrtowc(&wc, buf + i, size - i, &cur_state);
+-
+- if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
+- {
+- /* An invalid sequence, or a truncated multibyte character.
+- We treat it as a single byte character. */
+- mbclen = 1;
+- }
+- else if (match_icase)
+- {
+- if (iswupper(
+- {
+- wc = towlower(
+- wcrtomb(buf + i, wc,...
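The function removed above walks the buffer with mbrtowc to mark bytes that do not start a character. The patch comment says UTF-8 allows optimizations that other multibyte encodings do not. One such property, shown here as a sketch, is that a UTF-8 continuation byte can be recognised from its top two bits alone. Whether the patch uses exactly this test is not visible in the truncated diff; the function names are hypothetical, and unlike the removed code this sketch does not treat invalid sequences specially.

```c
#include <assert.h>
#include <stddef.h>

/* In UTF-8, continuation bytes have the bit pattern 10xxxxxx. */
static int
utf8_is_continuation (unsigned char byte)
{
  return (byte & 0xC0) == 0x80;
}

/* Hypothetical: set props[i] to 1 where buf[i] is a continuation byte
   (not the start of a character) and to 0 elsewhere.  No decoding
   state is carried from one byte to the next. */
static void
mark_continuation_bytes (const char *buf, size_t size, char *props)
{
  assert (buf != NULL && props != NULL);
  for (size_t i = 0; i < size; i++)
    props[i] = utf8_is_continuation ((unsigned char) buf[i]);
}
```

Because each byte is classified on its own, a position in the middle of the buffer can be checked without scanning everything before it.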
Changed in grep: | |
status: | New → Fix Committed |
Changed in grep: | |
status: | Fix Committed → Fix Released |
Package: grep
Version: 2.5.1-5
Followup-For: Bug #181378
I experienced the same problem, but I just noticed that grep is fast
when the LC_ALL environment variable is set to C. Here's the output of
locale on my system:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Seems like grep handles character encoding conversions inefficiently or
something?
-- System Information:
Debian Release: testing/unstable
Architecture: powerpc
Kernel: Linux thor 2.4.20-ben8-xfs-lolat #18 Wed Aug 6 10:56:56 CEST 2003 ppc
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8
Versions of packages grep depends on:
ii libc6 2.3.1-16 GNU C Library: Shared libraries an
-- no debconf information