
grep does not work for UTF-16 files

Bug #374807 reported by Jeremy Hooks on 2009-05-11

This bug affects 3 people

Affects         Status      Importance   Assigned to   Milestone
grep (Ubuntu)   Confirmed   Undecided    Unassigned

Bug Description

Binary package hint: grep

Release:
Description: Ubuntu 9.04
Release: 9.04

Package:
    grep:
      Installed: 2.5.3~dfsg-6ubuntu1
      Candidate: 2.5.3~dfsg-6ubuntu1
      Version table:
     *** 2.5.3~dfsg-6ubuntu1 0
            500 http://gb.archive.ubuntu.com jaunty/main Packages
            100 /var/lib/dpkg/status

When grep-ing a UTF-16 file, I expected results for the search pattern I was using. However, no matches were found (using grep without options and 'grep -hi').

I am not sure what program initially created the file, as I received it via email from a Windows user. 'file' reports the file type as 'Little-endian UTF-16 Unicode character data, with CRLF, CR line terminators'. I have attached an extract of the file for testing (just the first ten lines returned by 'head'), gzipped to reduce the risk of the browser mangling it.

Other text utilities such as cat, less, head, tail and vim have no problem dealing with the file. So far as I have found, only grep cannot handle the file.

Jeremy Hooks (jeremy-hooks) wrote on 2009-05-11:

#1

Example file for testing (432 bytes, text/plain)

João Cruz Jr (joaoacj-gmail) wrote on 2010-07-08:

#2

Affects 10.04 release too.

richud (richud.com) wrote on 2010-09-14:

#3

It's a well-known problem with grep that really should be fixed.
I cannot search UTF-16 XML files, which is really annoying.

Launchpad Janitor (janitor) wrote on 2013-01-28:

#4

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grep (Ubuntu):
status:	New → Confirmed

Dmitry (rusdmitry) wrote on 2015-08-25:

#5

This is not surprising, since 'grep' is a standard POSIX utility and uses POSIX locales (http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html#tag_20_55_08). A careful reading of the POSIX standard shows that UTF-16 and UTF-32 cannot be supported in POSIX locales: these encoding forms use 2-byte and 4-byte code units respectively, which makes the encoding of '/' and '.' nonconforming.
Quoting http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html:

"Conforming implementations shall support one or more coded character sets. Each supported locale shall include the portable character set, which is the set of symbolic names for characters in Portable Character Set.

...
POSIX.1-2008 places only the following requirements on the encoded values of the characters in the portable character set:

...

The encoded values associated with <slash> and <period> shall be invariant across all locales supported by the implementation.

The encoded values associated with the members of the portable character set are each represented in a single byte. Moreover, if the value is stored in an object of C-language type char, it is guaranteed to be positive (except the NUL, which is always zero)."

Another issue is that sizeof(wchar_t) is implementation-defined. My tests on Ubuntu show that sizeof(wchar_t) is 4 bytes, so you would need some other data type to store UTF-16 code units in a portable way.
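
To make the single-byte requirement concrete, here is a minimal Python sketch (not anything grep itself does). It encodes '/' and '.' in UTF-8 and in little-endian UTF-16, then runs a plain byte search over a UTF-16 line. The sample string and variable names are made up for illustration.

```python
# Encode the two characters POSIX singles out, in UTF-8 and in UTF-16LE.
slash_utf8 = "/".encode("utf-8")        # one byte
slash_utf16 = "/".encode("utf-16-le")   # one 2-byte code unit
period_utf16 = ".".encode("utf-16-le")

print(slash_utf8, slash_utf16, period_utf16)

# Hypothetical sample line, encoded both ways.
sample = "an error occurred"
line_utf8 = sample.encode("utf-8")
line_utf16 = sample.encode("utf-16-le")

# A byte-oriented search for an ASCII pattern finds the UTF-8 text but
# misses the UTF-16 text, because every ASCII character in UTF-16LE is
# followed by a zero byte.
pattern = b"error"
found_in_utf8 = pattern in line_utf8
found_in_utf16 = pattern in line_utf16

print(found_in_utf8, found_in_utf16)
```

The interleaved zero bytes are presumably also why a byte-oriented tool sees no match in such a file, although the exact behaviour depends on the grep version and locale.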

I would say that this should not be fixed: you should use iconv in a pipeline to do the appropriate grepping with UTF-8 (though this might be resource-intensive for large XML files).
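
On the command line, the pipeline suggested above would look roughly like `iconv -f UTF-16 -t UTF-8 file | grep pattern`. The same idea can be sketched in Python: decode the file from UTF-16 first, then match on the decoded text. The function name `grep_utf16` and its parameters are hypothetical, and the default `"utf-16"` codec assumes the file starts with a byte-order mark; for a file without one, pass `"utf-16-le"` or `"utf-16-be"` explicitly.

```python
import re


def grep_utf16(path, pattern, encoding="utf-16", flags=0):
    """Return the lines of a UTF-16 file that match a regular expression.

    Hypothetical helper: decodes first, then searches, which mirrors
    converting with iconv before piping into grep.
    """
    regex = re.compile(pattern, flags)
    matches = []
    # newline=None enables universal newlines, so CRLF and bare CR line
    # terminators are both translated to "\n" while reading.
    with open(path, encoding=encoding, newline=None) as handle:
        for line in handle:
            text = line.rstrip("\n")
            if regex.search(text):
                matches.append(text)
    return matches
```

For a case-insensitive search comparable to 'grep -i', call it as `grep_utf16("sample.txt", "pattern", flags=re.IGNORECASE)`, where `sample.txt` stands in for the real file. Reading line by line keeps memory use modest, but decoding a large XML file still costs time, as noted above.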
