grep switches into binary mode while processing a text file under the C locale

Bug #1547466 reported by Stefan Bader on 2016-02-19
This bug affects 2 people
Affects          Status         Importance   Assigned to    Milestone
grep (Debian)    Fix Released   Unknown
grep (Ubuntu)                   High         Martin Pitt
  Xenial                        High         Martin Pitt
  Yakkety                       High         Martin Pitt

Bug Description

I noticed this starting to happen in Xenial about two days ago. When running sbuild (or now the buildd, too), the build breaks when trying to compile a generated file. I traced the problem down to grep suddenly acting weird. When no language is set (or a non-UTF-8 one is), it will start printing some lines of a source file and then suddenly end by printing "Binary file ... matches".

With the attached file, the difference can be observed (running Xenial):

LANG=C grep -v xxx grant_table.h

and

LANG=C.UTF-8 grep -v xxx grant_table.h

SRU INFORMATION
===============
Upstream fixes:
 - http://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=b7bc3c1a4 (but depends on previous patches and is not sufficient by itself)
 - http://git.savannah.gnu.org/cgit/grep.git/commit/?id=d8a366218 (tests+doc)

Test case:

Call grep on a file or a string with non-ASCII characters in the C locale:
    $ echo 'héll☺ ≥x' | LC_ALL=C grep .
In xenial this just shows "Binary file (standard input) matches"; with the fix it should show the actual input string (with some garbled output, of course, as the UTF-8 characters cannot be displayed in the C locale).
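
The test case above can be wrapped in a small check. This is only a sketch, assuming a POSIX shell and a grep on the PATH; the function name c_locale_grep_is_text is made up for illustration. It compares grep's standard output with the input rather than looking for the "Binary file" message, so it does not rely on the exact wording of that message.

```shell
# Hypothetical helper: succeeds if grep in the C locale passes a
# non-ASCII line through unchanged instead of switching to binary mode.
c_locale_grep_is_text() {
    # The test string from the description, with its non-ASCII
    # characters written as UTF-8 octal escapes.
    input=$(printf 'h\303\251ll\342\230\272 \342\211\245x')
    output=$(printf '%s\n' "$input" | LC_ALL=C grep . 2>/dev/null)
    [ "$output" = "$input" ]
}

if c_locale_grep_is_text; then
    echo "grep treats non-ASCII input as text in the C locale"
else
    echo "grep did not return the input line (binary mode?)"
fi
```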

Regression potential: grep is used in a great many places; during the xenial cycle we had to put a "use grep -a" workaround into a lot of packages to fix the fallout from grep 2.23, which introduced this. That said, since a "Binary file matches" result does not give any more information than the actual string match, and scripts which get along with this answer most likely just check the exit code anyway (which does not change), the risk is bearable.

We will soon do a test rebuild in yakkety with gcc-6 and grep 2.25, and will sift through the results to identify new FTBFS that are due to grep 2.25. This SRU should not be released until this happens.

Stefan Bader (smb) wrote :
Changed in grep (Ubuntu):
importance: Undecided → High
Brian Murray (brian-murray) wrote :

I was unable to recreate the problem given the test case provided and the following version of grep:

 $ apt-cache policy grep
grep:
  Installed: 2.23-1
  Candidate: 2.23-1
  Version table:
 *** 2.23-1 500
        500 http://mirrors.cat.pdx.edu/ubuntu xenial/main amd64 Packages
        100 /var/lib/dpkg/status

Changed in grep (Ubuntu):
status: New → Incomplete
Stefan Bader (smb) wrote :

Brian, did you set a non-UTF language?

apt-cache policy grep
grep:
  Installed: 2.23-1
  Candidate: 2.23-1
  Version table:
 *** 2.23-1 500
        500 http://de.archive.ubuntu.com/ubuntu xenial/main amd64 Packages
        100 /var/lib/dpkg/status

LANG=C grep -v xxx grant_table.h
...
 * table are identified by grant references. A grant reference is an
 * integer, which indexes into the grant table. It acts as a
 * capability which the grantee can use to perform operations on the
Binary file grant_table.h matches

Changed in grep (Ubuntu):
status: Incomplete → New
Brian Murray (brian-murray) wrote :

Yes, I followed the test case you gave in the description. Perhaps this is related to bug 1535458 and uploading the file to LP changed the file type?

Stefan Bader (smb) wrote :

I was wondering the same when you reported that it does not happen for you. So I used wget to pull it to another host from LP. The result was what I posted in comment #3. That system was a fresh install (not an upgrade) using the default system language (though with a German keyboard and timezone). Buildds and sbuild in Xenial chroots would fail to build the related package (Xen), so that environment saw the same. I worked around that for now by forcing the build to be done in C.UTF-8.

I still marked the bug high because if that happens to any other package (should they use grep) it might cause unexplainable FTBFS or even go unnoticed if the result is not checked. The failure surfaced in my case only because the build uses grep as part of a chain that creates dynamic C files.

Stefan Bader (smb) wrote :

Should add, when this first happened I asked someone else on the team to check the package build. That was someone in the US, so there at least would be no special keyboard setting, only the system language which defaults to en_US.UTF-8.

Stefan Bader (smb) wrote :

The current version of grep was published on 2016-02-18, which could be around when sbuild broke for me. I did not notice immediately as I was working on something, so I used the chroots directly (which works, as I have a UTF-8 setting there which sbuild deliberately drops).
I downgraded grep in my chroot to 2.22-1ubuntu2 and that causes the issue to go away. So it is definitely the new version of grep in my environment.

Olivier Tilloy (osomon) wrote :

This seems to be the root cause of bug #1551145 too.

Stefan Bader (smb) wrote :

Here is some upstream discussion about it for reference: http://lists.gnu.org/archive/html/bug-grep/2016-02/msg00047.html

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grep (Ubuntu):
status: New → Confirmed
tags: added: wily
Colin Watson (cjwatson) wrote :

Just in case it confuses anyone, the fix for https://bugs.launchpad.net/launchpad-buildd/+bug/1552791 was rolled out at the end of last week, but you can no doubt still reproduce this in a local sbuild instance or similar.

Stefan Bader (smb) wrote :

While it is good to have the builders fixed, there is one worrying aspect to this: it may happen to anything using grep, and it modifies the output in an unexpected way while still returning a zero return code. For example:

echo -e "Hello\nWörld" | LANG=C grep -v xxx; echo $?

will no longer return both input lines (note the German umlaut ö) but stop at the second line (even if there were more), print a "binary ... matches" and return with 0. So when grep passes its output into a pipe, there is no way to tell that it is going wrong. Given that it is not uncommon to set LANG=C in scripts (as one then knows what language the text messages will be in) and that grep might be used to parse file lists (where file names may have special characters), it is hard to predict how much breakage this may cause.
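
For comparison, the "use grep -a" workaround mentioned in the bug description can be applied to the same input. This is a sketch only: -a tells grep to process its input as text, and the umlaut is written as UTF-8 octal escapes via printf so the example does not depend on the shell's echo -e support.

```shell
# Same two lines as in the example above: "Hello" and "Wörld"
# (ö written as the UTF-8 bytes \303\266).
# With -a grep treats the input as text, so both lines are printed.
printf 'Hello\nW\303\266rld\n' | LANG=C grep -a -v xxx
echo "exit status: $?"
```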

tags: added: rls-x-incoming
bizdelnick (bizdelnick) wrote :

This bug is likely to be fixed upstream: http://git.savannah.gnu.org/cgit/grep.git/commit/?id=d8a366218f0b44a52c0b212d65d9ebb04e46b3dc

Please apply the fix to Xenial package. This bug breaks innumerable scripts that were working correctly in C locale earlier.

bizdelnick (bizdelnick) wrote :

I'm sorry for the wrong commit link. The real fix is gnulib update: http://git.savannah.gnu.org/cgit/grep.git/commit/?id=9746ea261225558a5ad150936dbe822ede565304
The commit I mentioned above just documents changes and adds a test.

Changed in grep (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
Martin Pitt (pitti) wrote :

Ah, thanks for pointing that out. I had understood the changes in grep 2.23 to be deliberate, and during xenial this caused quite a lot of fallout which we fixed.

Reverting the behavior for C makes sense. Thus now we still need to find the actual fix in gnulib. Applying the doc and test to grep itself is still worthwhile though and should absolutely be part of the SRU.

Martin Pitt (pitti) wrote :

This is fixed in grep 2.25, which is in yakkety.

Changed in grep (Ubuntu Xenial):
assignee: nobody → Martin Pitt (pitti)
Changed in grep (Ubuntu Yakkety):
status: Confirmed → Fix Released
description: updated
Changed in grep (Ubuntu Xenial):
status: New → In Progress
Martin Pitt (pitti) on 2016-04-27
description: updated
Martin Pitt (pitti) wrote :

I tried to backport http://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=b7bc3c1a4 ; this by itself does not apply as this changes the files lib/hard-locale.[ch] (which aren't present at all in grep 2.24) and also has a lot of changes to files that aren't contained in grep. I got a backport now, but this still does not fix the bug.

I spent some half an hour making this work, but the longer I do this the less I have faith in the result. Therefore my recommendation is to drop this hackery-patchery and just upgrade xenial to grep 2.25 instead. The complete changelog is:

** Bug fixes

  In the C or POSIX locale, grep now treats all bytes as valid
  characters even if the C runtime library says otherwise. The
  revised behavior is more compatible with the original intent of
  POSIX, and the next release of POSIX will likely make this official.
  [bug introduced in grep-2.23]

  grep -Pz no longer mistakenly diagnoses patterns like [^a] that use
  negated character classes. [bug introduced in grep-2.24]

  grep -oz now uses null bytes, not newlines, to terminate output lines.
  [bug introduced in grep-2.5]

** Improvements

  grep now outputs details more consistently when reporting a write error.
  E.g., "grep: write error: No space left on device" rather than just
  "grep: write error".

(the first item is the fix for this bug). The other bug fixes are desirable for xenial as well, and the improvement seems harmless (and nice) enough to include it too.

Martin Pitt (pitti) on 2016-04-27
description: updated
Changed in grep (Debian):
status: Unknown → Fix Released
teo1978 (teo8976) wrote :

> Therefore my recommendation is to drop this hackery-patchery and just upgrade xenial to grep 2.25 instead.

And what about wily??
Wily hasn't reached EOL (plus xenial shouldn't have been released in the first place, bricking people's computers on upgrade or leaving them with a non working mouse just to name a couple of critical regressions I've read about) and this bug is huge.

teo1978 [2016-04-28 20:26 -0000]:
> And what about wily??

This bug does not affect wily at all. It was introduced in grep 2.23
and still present in 2.24, earlier/later versions are not affected.
Wily has 2.21.
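
A quick way to check an installed grep against the affected range named above (2.23 and 2.24) is sketched below. It assumes GNU grep's usual --version output, where the first line ends with the version number; the function name is_affected_grep_version is hypothetical.

```shell
# Hypothetical helper: reports whether a version string falls in the
# range this bug names as affected (2.23 and 2.24, including package
# revisions such as 2.23-1).
is_affected_grep_version() {
    case "$1" in
        2.23|2.23[.-]*|2.24|2.24[.-]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: the first line of "grep --version" ends with the version,
# as in "grep (GNU grep) 2.25".
installed=$(grep --version 2>/dev/null | head -n 1 | awk '{ print $NF }')
if is_affected_grep_version "$installed"; then
    echo "grep $installed is in the affected range"
else
    echo "grep $installed is not in the affected range"
fi
```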

Well, I have wily and I am observing this bug all the time (starting from a few months ago), or at least I was told that what I'm observing was a duplicate of this bug.

What I see is that when I grep text files, randomly (but the same files will consistently produce the same results) text files are processed like binary files, meaning that the output is
  Binary file xxxx.txt matches
rather than the line of text with the match.

Most of the times, text files that are incorrectly processed as binary are ISO-8859-whatever encoded while files that are matched like text as expected are utf8-encoded (my locale is utf8). However, I tried with two files I created from scratch, one utf-8 and the other iso-8859, and I couldn't reproduce the issue at will.

teo1978 (teo8976) wrote :

@Brian Murray, I resubscribed you because you marked this issue as duplicate of #1535458, I asked you if you could confirm because that seems doubtful and you didn't reply, and now at 1535458 they say it only affects xenial, while this one I am observing on wily.

teo1978 (teo8976) wrote :

Sorry, meant to post this on the other bug

teo1978 [2016-04-28 22:08 -0000]:
> Most of the times, text files that are incorrectly processed as binary
> are ISO-8859-whatever encoded while files that are matched like text as
> expected are utf8-encoded (my locale is utf8).

This has been the case for a long time. If you try to show non-UTF-8
data in an UTF-8 locale you'll just see garbage (or other encoding
mismatches).

This bug is about switching to binary mode in the 'C' locale only.

Hello Stefan, or anyone else affected,

Accepted grep into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grep/2.25-1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in grep (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed

> This has been the case for a long time.

Nope. Just a few months.

> If you try to show non-UTF-8
> data in an UTF-8 locale you'll just see garbage (or other encoding
> mismatches)

That doesn't mean that the file should be processed as binary. Also, previous to the regression, grep would work as expected. I guess it might fail to find matches of non-ascii characters encoded in a non-utf8 encoding (though I don't see why it couldn't decode each file according to its encoding and match the contents unicode-wise), but when grepping for "foo" it would find matches for the string "foo" both in utf8-encoded files and in iso8859-encoded files.

> This bug is about switching to binary mode in the 'C' locale only.

Then I wonder why somebody marked the one I reported as duplicate of this one

Changed in grep (Ubuntu Xenial):
importance: Undecided → High
tags: added: regression-release
Stefan Bader (smb) wrote :

Verified in Xenial.

grep-2.24-1:
#> echo Hello Wörld|LANG=C grep Hello
Binary file (standard input) matches

grep-2.25-1~16.04.1:
echo Hello Wörld|LANG=C grep Hello
Hello Wörld

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grep - 2.25-1~16.04.1

---------------
grep (2.25-1~16.04.1) xenial-proposed; urgency=medium

  * New upstream release.
    - Don't switch into binary mode when encountering non-ASCII characters
      and running in the C locale. (LP: #1547466)

 -- Martin Pitt <email address hidden> Thu, 28 Apr 2016 08:01:38 +0200

Changed in grep (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for grep has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Martin Pitt (pitti) on 2016-06-10
summary: - grep switches into binary mode while processing a text file
+ grep switches into binary mode while processing a text file under the C
+ locale
Moses Moore (moses-ubuntu) wrote :

I just tripped over something like this with grep v3.1-2 (Ubuntu 17.10 "artful"). The LC_ALL setting did not make a difference, and grep 3.1 passes the "test case" described in the bug description.

I have many text files, but one of them had a string of 49 \x00 chars. What confused me was that the switch into binary mode happens on line 1,128,254, but the string of nulls is on line 1,128,436.
My method to find the triggering data was to run `split -n 2 events.log` repeatedly: testing each half, discarding the half that passes, and repeating until I had a failing text file I could examine manually.
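
As an alternative to repeated halving, the lines that contain NUL bytes can be listed directly. A minimal sketch, assuming the file contains no \001 bytes of its own (they are used here as a stand-in for NUL so that grep can match them); find_nul_lines is a hypothetical name.

```shell
# Hypothetical helper: print the numbers of the lines in "$1" that
# contain a NUL byte. Assumes the file has no \001 bytes already.
find_nul_lines() {
    marker=$(printf '\001')
    # Map NUL to \001 so grep can match it as an ordinary character,
    # then keep only the line numbers of the matching lines.
    LC_ALL=C tr '\000' '\001' < "$1" | LC_ALL=C grep -a -n "$marker" | cut -d: -f1
}

# Small demonstration on a temporary file whose second line holds NULs.
tmp=$(mktemp)
printf 'one\ntwo\000\000\nthree\n' > "$tmp"
find_nul_lines "$tmp"    # prints 2
rm -f "$tmp"
```

On a real log this would be called as `find_nul_lines events.log`.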
