egrep: U+D56D (항) breaks ^/$ matching

Bug #1915738 reported by Cyle Riggs
This bug affects 1 person
Affects        Status  Importance  Assigned to  Milestone
grep (Ubuntu)  New     Undecided   Unassigned

Bug Description

In theory the regular expression ^.*$ should match every line, including an empty one. However, the Korean character U+D56D (항), which one of my scripts was unlucky enough to come across, breaks this expected behavior in egrep:

$ echo '' | egrep '^.*$'; echo $?

0
$ echo 'foo' | egrep '^.*$'; echo $?
foo
0
$ echo 'bar' | egrep '^.*$'; echo $?
bar
0
$ echo 'の名' | egrep '^.*$'; echo $?
の名
0
$ echo '항' | egrep '^.*$'; echo $?
1
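
As a cross-check of the expectation stated above, here is a minimal sketch that runs the same pattern over the same five inputs with Python's re module. Python's engine is unrelated to grep's, so this only illustrates what the pattern is expected to do; it says nothing about why egrep fails. The names samples and results are made up for this sketch.

```python
import re

# The five inputs from the transcript above.
samples = ["", "foo", "bar", "の名", "항"]

pattern = re.compile(r"^.*$")

# echo appends a newline and grep matches line by line, so each sample is
# tested as one line without its trailing newline.
results = {s: pattern.search(s) is not None for s in samples}

for s, matched in results.items():
    # ascii() escapes non-ASCII characters, so the output is safe on any terminal.
    print(ascii(s), "match" if matched else "NO MATCH")
```

Every input, including the empty line and U+D56D, is reported as a match here, which is the behavior expected from egrep as well.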

Have I lost my mind...or should I go buy a lottery ticket? Here are some rambling one-liners to illustrate the behavior further.

# An attempt to match the pattern ^.*$ (beginning of string, anything, end of string) against this Korean character fails:
$ echo '항' | egrep '^.*$'; echo $?
1

# As you can see here a match works when the $ is dropped from the pattern:
$ echo '항' | egrep '^.*'; echo $?
항
0

# Also using the -P flag from grep instead of -E correctly matches the original pattern:
$ echo '항' | grep -P '^.*$'; echo $?
항
0

# Sending a different Korean character (U+C720) to the same original pattern works as expected as well:
$ echo '유' | egrep '^.*$'; echo $?
유
0

# Combining the two leads to the original failure mentioned:
$ echo '항유' | egrep '^.*$'; echo $?
1

# And reversing the order of the combination does not affect the outcome:
$ echo '유항' | egrep '^.*$'; echo $?
1

# But dropping the $ from the pattern gives the expected match:
$ echo '유항' | egrep '^.*'; echo $?
유항
0

# Dropping the ^ from the pattern also gives the expected match:
$ echo '유항' | egrep '.*$'; echo $?
유항
0

# Surrounding U+D56D with U+C720 does not alter the behavior:
$ echo '유항유' | egrep '^.*$'; echo $?
1

# But again dropping U+D56D (항) from the input string returns egrep to the expected behavior:
$ echo '유유' | egrep '^.*$'; echo $?
유유
0

# And to make it very clear what the input is, here I'm using python to give a raw dump of the input:
$ echo '유항유' | python -c 'import sys; print(repr(sys.stdin.read().encode("unicode-escape")))'
b'\\uc720\\ud56d\\uc720\\n'
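
Since grep works on bytes rather than code points, it may also help to see the UTF-8 encoding of each character. The following is a small sketch that prints both; describe is a hypothetical helper name, and the sketch makes no claim about the cause of the failure.

```python
def describe(text):
    """Return (code point, UTF-8 bytes as hex) for each character of text."""
    return [
        (f"U+{ord(ch):04X}", " ".join(f"{b:02x}" for b in ch.encode("utf-8")))
        for ch in text
    ]


for code_point, utf8_hex in describe("유항유"):
    print(code_point, utf8_hex)

# Prints:
# U+C720 ec 9c a0
# U+D56D ed 95 ad
# U+C720 ec 9c a0
```

The failing character is the only one of the three whose encoding starts with the byte 0xED. Whether that is relevant to the bug is not established by this report.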

# My grep/egrep version:
$ grep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
$ egrep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

# My bash version
$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

===========================

If somebody could explain this behavior I would appreciate it. If it could be fixed, even better. In the meantime I think I will prefer 'grep -P' over 'egrep' when I expect strings to contain Korean text. In this contrived example the '^' and '$' didn't make a lot of sense, but I thought it would be best to provide the simplest possible reproduction case rather than spell out my full use case.
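
For anyone who wants to re-run the whole set of one-liners on another grep version, here is a rough harness that compares the exit status of grep -E and grep -P for each pattern and input from this report. It assumes grep is on PATH, was built with -P support, and runs in a UTF-8 locale taken from the environment. grep_status and build_cases are hypothetical helper names, not part of any tool.

```python
import shutil
import subprocess

# Inputs and patterns taken from the one-liners in this report.
INPUTS = ["", "foo", "유", "유유", "항", "항유", "유항", "유항유"]
PATTERNS = ["^.*$", "^.*", ".*$"]


def grep_status(flag, pattern, line):
    """Return grep's exit status: 0 = match, 1 = no match, 2 = error."""
    result = subprocess.run(
        ["grep", flag, "-e", pattern],
        input=(line + "\n").encode("utf-8"),  # same bytes that echo would send
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def build_cases():
    """Pair every pattern with every input line."""
    return [(p, s) for p in PATTERNS for s in INPUTS]


# Only run the comparison when a grep binary is actually available.
if shutil.which("grep"):
    for pattern, line in build_cases():
        e = grep_status("-E", pattern, line)
        p = grep_status("-P", pattern, line)
        note = "" if e == p else "  <-- -E and -P disagree"
        # ascii() keeps the output printable regardless of terminal encoding.
        print(f"{pattern!r:8} {ascii(line):22} -E={e} -P={p}{note}")
```

On an affected grep, the rows where -E and -P disagree should correspond to the failing one-liners above; on other versions the table may show no disagreement at all.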

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: grep 3.4-1
ProcVersionSignature: Ubuntu 5.4.0-65.73-generic 5.4.78
Uname: Linux 5.4.0-65-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair wl
ApportVersion: 2.20.11-0ubuntu27.16
Architecture: amd64
CasperMD5CheckResult: skip
Date: Mon Feb 15 17:10:42 2021
InstallationDate: Installed on 2020-01-22 (389 days ago)
InstallationMedia: Ubuntu 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
SourcePackage: grep
UpgradeStatus: Upgraded to focal on 2021-02-01 (13 days ago)

Cyle Riggs (beardedfoo) wrote :