cut gets confused with UTF-8 characters

Bug #91175 reported by Noah Slater on 2007-03-10
This bug report is a duplicate of: Bug #875713: cut fails to handle correctly utf-8.
This bug affects 3 people
Affects: coreutils (Ubuntu)
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Binary package hint: coreutils

GNU cut gets confused about character boundaries with UTF-8 encoded files.

An example, as they (almost) say, is worth a thousand words:

nslater@hinata: ~ $ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
nslater@hinata: ~ $ cat foo.txt
She said “I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 10-
“I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 11-
��I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 12-
�I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 13-
I think I found a bug.”
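
The output above is consistent with cut counting bytes rather than characters: the opening curly quote (U+201C) occupies three bytes in UTF-8, so starting at position 11 or 12 begins in the middle of that character. The following is only a rough sketch of that behaviour in Python, not GNU cut's actual code; the helper name cut_bytes_from is hypothetical, and the U+FFFD replacement characters stand in for what the terminal showed for the orphaned bytes.

```python
# Sketch (assumption: `cut --characters N-` is selecting bytes, not characters).
# cut_bytes_from is a hypothetical helper, not part of coreutils.

LINE = "She said \u201cI think I found a bug.\u201d"


def cut_bytes_from(text: str, start: int) -> str:
    """Keep everything from byte `start` (1-based) onward.

    Bytes that no longer form a valid UTF-8 sequence are shown as U+FFFD,
    roughly as a terminal would render them.
    """
    raw = text.encode("utf-8")
    return raw[start - 1:].decode("utf-8", errors="replace")


if __name__ == "__main__":
    for n in (10, 11, 12, 13):
        print(n, cut_bytes_from(LINE, n))
```

Positions 11 and 12 land on the second and third bytes of the quote character, which is why one or two broken characters precede the text there, and why position 13 starts cleanly at "I".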

Micah Cowan (micahcowan) on 2007-05-17
Changed in coreutils:
importance: Undecided → Medium
status: Unconfirmed → Confirmed
C de-Avillez (hggdh2) wrote :

Marking as wishlist, per upstream comments. Anyone interested can re-work the patches to make them acceptable upstream.

Changed in coreutils:
importance: Medium → Wishlist
status: Confirmed → Triaged
Kamil Dudka (kdudka) wrote :

Note: this is already fixed in Fedora's coreutils.

ar barzh paour (j-p-b-p) wrote :

The command
echo "tañva"|cut -c1-4
gives
tañ
instead of tañv

LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER="fr_FR.UTF-8"
LC_NAME="fr_FR.UTF-8"
LC_ADDRESS="fr_FR.UTF-8"
LC_TELEPHONE="fr_FR.UTF-8"
LC_MEASUREMENT="fr_FR.UTF-8"
LC_IDENTIFICATION="fr_FR.UTF-8"
LC_ALL=
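
For comparison, here is a minimal Python sketch of the difference between selecting positions 1-4 by byte and by character for this word. It assumes the ñ is the precomposed code point U+00F1 (two bytes in UTF-8); the function cut_range and its by_bytes parameter are hypothetical names used only for illustration, not anything from coreutils.

```python
# Sketch: byte-based versus character-based selection of positions first..last.
# cut_range and by_bytes are hypothetical names for illustration only.

WORD = "ta\u00f1va"  # assumes a precomposed ñ (U+00F1), two bytes in UTF-8


def cut_range(text: str, first: int, last: int, by_bytes: bool) -> str:
    """Return positions first..last (1-based, inclusive) of text."""
    if by_bytes:
        # Slice the encoded bytes; a cut through the middle of a character
        # would show up as U+FFFD after decoding.
        raw = text.encode("utf-8")[first - 1:last]
        return raw.decode("utf-8", errors="replace")
    # Slice by Unicode code points instead.
    return text[first - 1:last]


if __name__ == "__main__":
    print(cut_range(WORD, 1, 4, by_bytes=True))   # byte-based selection
    print(cut_range(WORD, 1, 4, by_bytes=False))  # character-based selection
```

The byte-based call yields three visible characters because ñ alone uses up two of the four positions, which matches the reported result; the character-based call yields the four characters the reporter expected.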
