cut gets confused with UTF-8 characters

Bug #91175 reported by Noah Slater on 2007-03-10

This bug report is a duplicate of: Bug #875713: cut fails to handle correctly utf-8.

This bug affects 3 people

Affects             Status   Importance  Assigned to  Milestone
coreutils (Ubuntu)  Triaged  Wishlist    Unassigned

Bug Description

Binary package hint: coreutils

GNU cut gets confused about character boundaries with UTF-8 encoded files.

An example, as they (almost) say, is worth a thousand words:

nslater@hinata: ~ $ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
nslater@hinata: ~ $ cat foo.txt
She said “I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 10-
“I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 11-
��I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 12-
�I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 13-
I think I found a bug.”
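
The output above is consistent with cut treating the --characters positions as byte offsets: the opening quote “ (U+201C) occupies three bytes in UTF-8, so starting at position 11 or 12 lands in the middle of it. A minimal Python sketch of that reading follows. The helper names cut_bytes and cut_chars are made up for illustration, and this is an emulation of the observed behaviour, not the coreutils code.

```python
# Same text as foo.txt, with the curly quotes written as escapes.
line = "She said \u201cI think I found a bug.\u201d"
raw = line.encode("utf-8")


def cut_bytes(data: bytes, start: int) -> str:
    """Emulate the observed behaviour: START is a 1-based byte offset.

    Bytes that no longer form a valid UTF-8 sequence are shown as U+FFFD,
    much as a terminal would display them.
    """
    return data[start - 1:].decode("utf-8", errors="replace")


def cut_chars(text: str, start: int) -> str:
    """What the reporter expected: START is a 1-based character offset."""
    return text[start - 1:]


if __name__ == "__main__":
    for n in (10, 11, 12, 13):
        print(n, "bytes:", cut_bytes(raw, n), "| chars:", cut_chars(line, n))
```

With character-aware slicing, position 11 already yields the text starting at "I think", whereas the byte-based version only gets there at position 13.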

Noah Slater (nslater) wrote on 2007-03-10:

Looks relevant:

http://linuxfromscratch.org/pipermail/lfs-dev/2006-August/058090.html

Micah Cowan (micahcowan) on 2007-05-17

Changed in coreutils:
importance:	Undecided → Medium
status:	Unconfirmed → Confirmed

C de-Avillez (hggdh2) wrote on 2008-09-26:

Marking as wishlist, per upstream comments. Anyone interested can re-work the patches to make them acceptable upstream.

Changed in coreutils:
importance:	Medium → Wishlist
status:	Confirmed → Triaged

Kamil Dudka (kdudka) wrote on 2008-10-14:

note: already fixed in Fedora coreutils

ar barzh paour (j-p-b-p) wrote on 2012-11-29:

The command
echo "tañva"|cut -c1-4
gives
tañ
instead of tañv

LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER="fr_FR.UTF-8"
LC_NAME="fr_FR.UTF-8"
LC_ADDRESS="fr_FR.UTF-8"
LC_TELEPHONE="fr_FR.UTF-8"
LC_MEASUREMENT="fr_FR.UTF-8"
LC_IDENTIFICATION="fr_FR.UTF-8"
LC_ALL=
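
The "tañva" case fits the same byte-offset explanation: ñ is two bytes in UTF-8, so a range of 1-4 counted in bytes stops after ñ. A rough Python sketch, assuming the ñ is the precomposed U+00F1 and using a made-up helper cut_range to model both interpretations (it is not how coreutils is implemented):

```python
def cut_range(text: str, first: int, last: int, by_bytes: bool) -> str:
    """Return positions FIRST..LAST (1-based, inclusive) of TEXT.

    by_bytes=True models the behaviour reported here (positions are bytes);
    by_bytes=False models the expected behaviour (positions are characters).
    """
    if by_bytes:
        data = text.encode("utf-8")[first - 1:last]
        # A range that splits a multi-byte character shows up as U+FFFD.
        return data.decode("utf-8", errors="replace")
    return text[first - 1:last]


word = "ta\u00f1va"  # "tañva", with ñ as a single code point (assumption)

if __name__ == "__main__":
    print(cut_range(word, 1, 4, by_bytes=True))   # what cut -c1-4 printed
    print(cut_range(word, 1, 4, by_bytes=False))  # what was expected
```

A range ending at 3 in byte mode would split the ñ and print a replacement character instead.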