mawk counts characters incorrectly

Bug #1462737 reported by Jarno Suni on 2015-06-07
This bug affects 1 person
Affects: mawk (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

$ echo ä | mawk '{print length($0)}'
outputs 2. I expect 1.

$ echo äo | mawk '{print match($0,"o")}'
outputs 3. I expect 2.

This is probably because mawk counts bytes rather than characters: in UTF-8, ä is encoded as two bytes, so length() reports 2 and every position after it is shifted by one. gawk behaves the same way when the -b option is used.
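
The byte and character counts above can be checked outside of awk. The following is only an illustration in Python, not anything mawk or gawk does internally; the variable names are made up for this sketch, and ä is written as the escape \u00e4 so the result does not depend on how this text is stored.

```python
# Illustration only: compare character counts with UTF-8 byte counts
# for the two strings used in the report.
a_umlaut = "\u00e4"                       # the character "ä"

char_len = len(a_umlaut)                  # number of characters
byte_len = len(a_umlaut.encode("utf-8"))  # number of UTF-8 bytes

text = a_umlaut + "o"                     # the string "äo"
# awk's match() returns a 1-based position, so add 1 to Python's 0-based index.
char_pos = text.index("o") + 1
byte_pos = text.encode("utf-8").index(b"o") + 1

print("length in characters:", char_len)
print("length in bytes:", byte_len)
print("position of o in characters:", char_pos)
print("position of o in bytes:", byte_pos)
```

The byte-based values are the ones mawk printed, and the character-based values are the ones I expected.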

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: mawk 1.3.3-17ubuntu2
ProcVersionSignature: Ubuntu 3.13.0-53.89-lowlatency 3.13.11-ckt19
Uname: Linux 3.13.0-53-lowlatency x86_64
ApportVersion: 2.14.1-0ubuntu3.11
Architecture: amd64
CurrentDesktop: XFCE
Date: Sun Jun 7 15:52:26 2015
Dependencies:
 gcc-4.9-base 4.9.1-0ubuntu1
 libc6 2.19-0ubuntu6.6
 libgcc1 1:4.9.1-0ubuntu1
 multiarch-support 2.19-0ubuntu6.6
EcryptfsInUse: Yes
InstallationDate: Installed on 2014-09-21 (259 days ago)
InstallationMedia: Ubuntu-Studio 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.1)
SourcePackage: mawk
UpgradeStatus: No upgrade log present (probably fresh install)

Jarno Suni (jarnos) wrote :

I guess it is by design. I think some operations are faster if you count bytes instead of characters. There could be an option to make mawk count characters, though.

Jarno Suni (jarnos) wrote :

Or better, it should work the same way as gawk, i.e. treat all input data as single-byte characters only if the -b or --characters-as-bytes option is used.
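
For what it is worth, counting characters in UTF-8 data does not have to be expensive. A minimal sketch of the idea in Python, assuming the input is valid UTF-8: every character has exactly one byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx), so counting those bytes gives the character count. The function name utf8_char_count is hypothetical, and this is not how mawk or gawk is implemented.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters in valid UTF-8 data.

    Hypothetical helper for illustration: it counts every byte that is
    not a continuation byte (bit pattern 10xxxxxx). Invalid UTF-8 input
    is not detected and would give a meaningless count.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# The two strings from the report, as UTF-8 bytes.
print(utf8_char_count("\u00e4".encode("utf-8")))
print(utf8_char_count("\u00e4o".encode("utf-8")))
```

This is a single pass over the bytes, so the extra cost compared with plain byte counting should be small, but that is a guess and not a measurement.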
