cut gets confused with UTF-8 characters
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
coreutils (Ubuntu) | | Wishlist | Unassigned |
Bug Description
Binary package hint: coreutils
GNU cut gets confused about character boundaries with UTF-8 encoded files.
An example, as they (almost) say, is worth a thousand words:
nslater@hinata: ~ $ locale
LANG=en_US.UTF-8
LC_CTYPE=
LC_NUMERIC=
LC_TIME=
LC_COLLATE=
LC_MONETARY=
LC_MESSAGES=
LC_PAPER=
LC_NAME=
LC_ADDRESS=
LC_TELEPHONE=
LC_MEASUREMENT=
LC_IDENTIFICATI
LC_ALL=
nslater@hinata: ~ $ cat foo.txt
She said “I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 10-
“I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 11-
��I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 12-
�I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 13-
I think I found a bug.”
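For anyone wondering why positions 11 and 12 produce garbage: the opening quote “ (U+201C) is encoded in UTF-8 as three bytes, and it occupies bytes 10 to 12 of the line. The output above is consistent with cut treating the character position as a byte offset. The following is only a rough Python sketch of that assumption, not the coreutils code; the helper names `cut_bytes` and `cut_chars` are made up for illustration.

```python
# Sketch only: reproduces the transcript above under the assumption that
# the position given to `cut --characters` is applied as a byte offset.
# The curly quotes are written as escapes: U+201C (opening), U+201D (closing).
line = "She said \u201cI think I found a bug.\u201d"
raw = line.encode("utf-8")


def cut_bytes(data: bytes, start: int) -> str:
    """Hypothetical helper: keep everything from 1-based BYTE position start."""
    # Bytes that no longer form a valid UTF-8 sequence decode to U+FFFD,
    # which is what a terminal typically shows as a replacement glyph.
    return data[start - 1:].decode("utf-8", errors="replace")


def cut_chars(text: str, start: int) -> str:
    """Hypothetical helper: keep everything from 1-based CHARACTER position start."""
    return text[start - 1:]


if __name__ == "__main__":
    for start in (10, 11, 12, 13):
        print(start, repr(cut_bytes(raw, start)), repr(cut_chars(line, start)))
```

With character-aware slicing, position 11 already starts at "I think", which is presumably what the reporter expected from `cut --characters 11-`.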
Noah Slater (nslater) wrote : | #1 |
Changed in coreutils: | |
importance: | Undecided → Medium |
status: | Unconfirmed → Confirmed |
C de-Avillez (hggdh2) wrote : | #2 |
Marking as wishlist, per upstream comments. Anyone interested can re-work the patches to make them acceptable upstream.
Changed in coreutils: | |
importance: | Medium → Wishlist |
status: | Confirmed → Triaged |
Kamil Dudka (kdudka) wrote : | #3 |
note: already fixed in Fedora coreutils
ar barzh paour (j-p-b-p) wrote : | #4 |
The command
echo "tañva"|cut -c1-4
outputs
tañ
instead of tañv
LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE=
LC_NUMERIC=
LC_TIME=
LC_COLLATE=
LC_MONETARY=
LC_MESSAGES=
LC_PAPER=
LC_NAME=
LC_ADDRESS=
LC_TELEPHONE=
LC_MEASUREMENT=
LC_IDENTIFICATI
LC_ALL=
Looks relevant:
http://linuxfromscratch.org/pipermail/lfs-dev/2006-August/058090.html
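The `tañva` case in comment #4 fits the same pattern: ñ (U+00F1, assuming the precomposed form) takes two bytes in UTF-8, so the first four bytes of the word only cover "tañ". Until a fixed cut is available, a character-aware slice can be done outside of cut. Below is a minimal Python sketch of such a workaround; `cut_characters` is a hypothetical helper, and it counts code points, so it would still split combining sequences.

```python
from typing import Optional


def cut_characters(line: str, first: int, last: Optional[int] = None) -> str:
    """Hypothetical helper: return characters first..last (1-based, inclusive).

    Counts Unicode code points rather than bytes. Combining sequences
    (e.g. "n" followed by U+0303) are NOT kept together by this sketch.
    """
    if first < 1:
        raise ValueError("positions are 1-based")
    return line[first - 1:last]


# "tañva" with a precomposed ñ (U+00F1), written as an escape for clarity.
word = "ta\u00f1va"

# What the report shows: the first four BYTES only cover three characters.
byte_result = word.encode("utf-8")[:4].decode("utf-8", errors="replace")

# What the reporter expected: the first four CHARACTERS.
char_result = cut_characters(word, 1, 4)

if __name__ == "__main__":
    print(byte_result)
    print(char_result)
```

Here `byte_result` matches the output in the report and `char_result` is the expected "tañv".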