egrep: U+D56D (항) breaks ^/$ matching
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
grep (Ubuntu) | New | Undecided | Unassigned |
Bug Description
In theory, the regular expression ^.*$ should match every string, including the empty string. However, one specific Korean character, U+D56D (항), which one of my scripts was unlucky enough to come across, breaks the expected behavior in egrep:
$ echo '' | egrep '^.*$'; echo $?
0
$ echo 'foo' | egrep '^.*$'; echo $?
foo
0
$ echo 'bar' | egrep '^.*$'; echo $?
bar
0
$ echo 'の名' | egrep '^.*$'; echo $?
の名
0
$ echo '항' | egrep '^.*$'; echo $?
1
Have I lost my mind...or should I go buy a lottery ticket? Here are some rambling one-liners to illustrate the behavior further.
# An attempt to match the pattern ^.*$ (beginning of string, anything, end of string) against this Korean character fails:
$ echo '항' | egrep '^.*$'; echo $?
1
# As you can see here, the match works when the $ is dropped from the pattern:
$ echo '항' | egrep '^.*'; echo $?
항
0
# Also, using grep's -P flag instead of -E correctly matches the original pattern:
$ echo '항' | grep -P '^.*$'; echo $?
항
0
# Sending a different Korean character (U+C720) to the same original pattern works as expected as well:
$ echo '유' | egrep '^.*$'; echo $?
유
0
# Combining the two leads to the original failure mentioned:
$ echo '항유' | egrep '^.*$'; echo $?
1
# And reversing the order of the combination does not affect the outcome:
$ echo '유항' | egrep '^.*$'; echo $?
1
# But dropping the $ from the pattern gives the expected match:
$ echo '유항' | egrep '^.*'; echo $?
유항
0
# Dropping the ^ from the pattern also gives the expected match:
$ echo '유항' | egrep '.*$'; echo $?
유항
0
# Surrounding U+D56D with U+C720 does not alter the behavior:
$ echo '유항유' | egrep '^.*$'; echo $?
1
# But again dropping U+D56D (항) from the input string returns egrep to the expected behavior:
$ echo '유유' | egrep '^.*$'; echo $?
유유
0
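The one-liners above can be collected into a single matrix. This is only a sketch: the helper name status_of is hypothetical, and the inputs are written as octal byte escapes on the assumption that the strings are UTF-8 encoded. It prints grep's exit status for each pattern and input; whether the failing rows reproduce will depend on the grep version and locale in use.

```shell
#!/bin/sh
# Sketch: run each pattern against each input and print grep's exit status.
# Inputs are octal byte escapes so the script does not depend on the
# encoding of the file it is saved in (assumes the strings are UTF-8).
YU='\354\234\240'      # U+C720 as UTF-8 bytes ec 9c a0
HANG='\355\225\255'    # U+D56D as UTF-8 bytes ed 95 ad

# Hypothetical helper, not part of the original report.
status_of() {
    # $1 = ERE pattern, $2 = printf format that produces the input line
    printf "$2\n" | grep -E -- "$1" >/dev/null
    echo $?
}

for pattern in '^.*$' '^.*' '.*$'; do
    for input in 'foo' "$YU" "$HANG" "$YU$HANG" "$HANG$YU" "$YU$HANG$YU"; do
        # Columns: pattern, input (shown as escapes), grep exit status
        printf '%-6s %-40s %s\n' "$pattern" "$input" "$(status_of "$pattern" "$input")"
    done
done
```

A status of 0 means the line matched and 1 means it did not, the same as the $? values in the transcript.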
# And to make it very clear what the input is, here I'm using python to give a raw dump of the input:
$ echo '유항유' | python -c 'import sys; print(repr(
b'\\uc720\
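As an alternative way to show the raw input, here is a minimal sketch using od rather than python. It assumes the three characters are UTF-8 encoded, and the helper name dump_bytes is hypothetical. The octal escapes are the UTF-8 encodings of U+C720, U+D56D and U+C720, followed by the newline that echo would add.

```shell
#!/bin/sh
# Sketch: print the raw bytes of the test input as hex.
# Hypothetical helper, not part of the original report.
dump_bytes() {
    # -An suppresses the offset column, -tx1 prints one hex byte per unit
    od -An -tx1
}

# U+C720 = ec 9c a0, U+D56D = ed 95 ad (UTF-8), then a trailing newline
printf '\354\234\240\355\225\255\354\234\240\n' | dump_bytes
```

If the terminal really is sending UTF-8, piping echo '유항유' into dump_bytes should show the same byte sequence.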
# My grep/egrep version:
$ grep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https:/
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https:/
$ egrep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https:/
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https:/
# My bash version:
$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
=======
If somebody could explain this behavior, I would appreciate it. If it could be fixed, even better. In the meantime, I think I will prefer 'grep -P' over 'egrep' when I expect strings to contain Korean text. In this contrived example the '^' and '$' don't serve much purpose, but I thought it would be best to provide the simplest possible reproduction case rather than spell out my full use case.
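The 'grep -P' workaround mentioned above could be wrapped in a small function. This is a sketch only: the name matches_pcre is hypothetical, and it assumes a grep build with PCRE support (-P), which not every system provides.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): whole-line match via grep -P.
matches_pcre() {
    # $1 = PCRE pattern, $2 = input string; the exit status is grep's.
    printf '%s\n' "$2" | grep -q -P -- "$1"
}

if matches_pcre '^.*$' 'foo'; then
    echo "matched"
else
    echo "no match (or grep -P unavailable)"
fi
```

A script could then call matches_pcre wherever it previously piped a string into egrep, keeping the pattern unchanged as long as it is valid in both syntaxes.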
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: grep 3.4-1
ProcVersionSign
Uname: Linux 5.4.0-65-generic x86_64
NonfreeKernelMo
ApportVersion: 2.20.11-
Architecture: amd64
CasperMD5CheckR
Date: Mon Feb 15 17:10:42 2021
InstallationDate: Installed on 2020-01-22 (389 days ago)
InstallationMedia: Ubuntu 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
SourcePackage: grep
UpgradeStatus: Upgraded to focal on 2021-02-01 (13 days ago)