regexec/regcomp fails on regular expression containing UTF-8 multi-byte characters
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
eglibc (Ubuntu) | New | Undecided | Unassigned |
Bug Description
I want to match UTF-8 encoded strings against a regular expression.
A simple example is matching a string that consists of 1 or 2 uppercase characters, including Ä, Ë, Ï, Ö and Ü.
The extended regular expression I use is:
'^[A-ZÄ-Ü]{1,2}$'
Expected behaviour:
Input Expect
------------------
Ä Match
ÄB Match
ABC Fail
The test works as expected with grep:
$ echo Ä |grep -E '^[A-ZÄ-Ü]{1,2}$'
Ä
$ echo ÄB |grep -E '^[A-ZÄ-Ü]{1,2}$'
ÄB
$ echo ABC |grep -E '^[A-ZÄ-Ü]{1,2}$'
The same test with a simple test program that uses regcomp/regexec:
$ ./regex Ä '^[A-ZÄ-Ü]{1,2}$'
MATCH (Ä) (^[A-ZÄ-Ü]{1,2}$)
$ ./regex ÄB '^[A-ZÄ-Ü]{1,2}$'
MISS (ÄB) (^[A-ZÄ-Ü]{1,2}$)
$ ./regex ABC '^[A-ZÄ-Ü]{1,2}$'
MISS (ABC) (^[A-ZÄ-Ü]{1,2}$)
It seems that the single character Ä is counted as two characters here, because the following also matches:
$ ./regex Ä '^[A-ZÄ-Ü]{2}$'
MATCH (Ä) (^[A-ZÄ-Ü]{2}$)
Additional information:
$ lsb_release -rd
Description: Ubuntu 14.04.2 LTS
Release: 14.04
libc6:amd64 version2.
Locale: en_US.UTF-8.
ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: libc6 2.19-0ubuntu6.5
ProcVersionSign
Uname: Linux 3.13.0-35-gatso x86_64
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Mar 4 11:51:24 2015
Dependencies:
gcc-4.9-base 4.9.1-0ubuntu1
libc6 2.19-0ubuntu6.5
libgcc1 1:4.9.1-0ubuntu1
multiarch-support 2.19-0ubuntu6.5
InstallationDate: Installed on 2014-09-26 (158 days ago)
InstallationMedia: Ubuntu-Server 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.3)
SourcePackage: eglibc
UpgradeStatus: No upgrade log present (probably fresh install)