cut gets confused with UTF-8 characters

Bug #91175 reported by Noah Slater
This bug report is a duplicate of: Bug #875713: cut fails to handle correctly utf-8.
This bug affects 3 people
Affects: coreutils (Ubuntu)
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Binary package hint: coreutils

GNU cut gets confused about character boundaries with UTF-8 encoded files.

An example, as they (almost) say, is worth a thousand words:

nslater@hinata: ~ $ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
nslater@hinata: ~ $ cat foo.txt
She said “I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 10-
“I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 11-
��I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 12-
�I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 13-
I think I found a bug.”
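
The transcript above is consistent with cut counting bytes rather than characters: the opening quote “ (U+201C) takes three bytes in UTF-8 (e2 80 9c), so positions 10, 11 and 12 all fall inside that one character. The following is a minimal sketch that only counts the bytes involved; the variable names are illustrative, and it assumes the script itself is saved as UTF-8.

```shell
#!/bin/sh
# Sketch: count the UTF-8 bytes behind the characters in the report.
# Assumes this script is saved as UTF-8. wc -c counts bytes in any locale.
prefix='She said '
quote='“'

# tr strips the padding some wc implementations add around the number.
prefix_bytes=$(printf '%s' "$prefix" | wc -c | tr -d ' ')
quote_bytes=$(printf '%s' "$quote" | wc -c | tr -d ' ')

echo "prefix: $prefix_bytes bytes"
echo "quote:  $quote_bytes bytes"

# Hex dump of the quote character, one byte per column.
printf '%s' "$quote" | od -An -tx1

# First byte position that no longer lands inside the quote character.
echo "first byte after the quote: $((prefix_bytes + quote_bytes + 1))"
```

With a 9-byte prefix and a 3-byte quote, the first clean position is 13, which matches the point at which the output in the report stops showing replacement characters.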

Noah Slater (nslater) wrote :
Micah Cowan (micahcowan)
Changed in coreutils:
importance: Undecided → Medium
status: Unconfirmed → Confirmed
C de-Avillez (hggdh2) wrote :

Marking as wishlist, per upstream comments. Anyone interested can re-work the patches to make them acceptable upstream.

Changed in coreutils:
importance: Medium → Wishlist
status: Confirmed → Triaged
Kamil Dudka (kdudka) wrote :

note: already fixed in Fedora coreutils

ar barzh paour (j-p-b-p) wrote :

The command
echo "tañva"|cut -c1-4
gives
tañ
instead of tañv

LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER="fr_FR.UTF-8"
LC_NAME="fr_FR.UTF-8"
LC_ADDRESS="fr_FR.UTF-8"
LC_TELEPHONE="fr_FR.UTF-8"
LC_MEASUREMENT="fr_FR.UTF-8"
LC_IDENTIFICATION="fr_FR.UTF-8"
LC_ALL=
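
The "tañva" result also fits byte counting: ñ is two bytes in UTF-8 (c3 b1), so the first four bytes of the word end right after ñ. As a rough sketch, the same output can be reproduced with cut -b, which selects bytes by definition; the variable names are illustrative, and the script is assumed to be saved as UTF-8.

```shell
#!/bin/sh
# Sketch: reproduce the reported output with explicit byte ranges.
# Assumes this script is saved as UTF-8. cut -b always selects bytes.
word='tañva'

# Bytes 1-4 are: t, a, and the two bytes of ñ.
first4=$(printf '%s\n' "$word" | cut -b1-4)

# Bytes 1-5 add the v, giving the four characters the reporter expected.
first5=$(printf '%s\n' "$word" | cut -b1-5)

echo "$first4"
echo "$first5"
```

Widening the byte range by hand only works when the width of every character is known in advance, so it illustrates the problem rather than fixing it.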
