cut gets confused with UTF-8 characters
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---
coreutils (Ubuntu) | Triaged | Wishlist | Unassigned | |
Bug Description
Binary package hint: coreutils
GNU cut gets confused about character boundaries with UTF-8 encoded files.
An example, as they (almost) say, is worth a thousand words:
nslater@hinata: ~ $ locale
LANG=en_US.UTF-8
LC_CTYPE=
LC_NUMERIC=
LC_TIME=
LC_COLLATE=
LC_MONETARY=
LC_MESSAGES=
LC_PAPER=
LC_NAME=
LC_ADDRESS=
LC_TELEPHONE=
LC_MEASUREMENT=
LC_IDENTIFICATION=
LC_ALL=
nslater@hinata: ~ $ cat foo.txt
She said “I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 10-
“I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 11-
��I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 12-
�I think I found a bug.”
nslater@hinata: ~ $ cat foo.txt | cut --characters 13-
I think I found a bug.”
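The offsets above are consistent with cut counting bytes rather than characters: the opening quote “ (U+201C) takes three bytes in UTF-8, so positions 11 and 12 fall inside it. A minimal sketch that checks the arithmetic with explicit byte selection follows; it assumes the script itself is saved as UTF-8, and `byte_len` is a hypothetical helper, not something coreutils provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils): print the byte length of a string.
byte_len() {
    # wc -c counts bytes; tr strips the padding some wc builds add.
    printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '
}

line='She said “I think I found a bug.”'

byte_len 'She said '   # prints 9: the ASCII prefix occupies bytes 1-9
byte_len '“'           # prints 3: U+201C is encoded as E2 80 9C (bytes 10-12)

# cut -b always selects bytes, so starting at byte 13 lands cleanly
# after the quote, while 11 or 12 would start in the middle of it.
printf '%s\n' "$line" | LC_ALL=C cut -b 13-
```

The last command prints `I think I found a bug.”`, matching the `--characters 13-` output above, which is what one would expect if `--characters` is being treated the same as `--bytes`.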
Changed in coreutils:
importance: Undecided → Medium
status: Unconfirmed → Confirmed
Looks relevant:
http://linuxfromscratch.org/pipermail/lfs-dev/2006-August/058090.html