valgrind false positives on gcc-generated string routines

Bug #852760 reported by Chris Bainbridge on 2011-09-17
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Valgrind
New
Medium
valgrind (ALT Linux)
New
Medium
valgrind (Fedora)
Unknown
Unknown
valgrind (Ubuntu)
Undecided
Unassigned

Bug Description

#include <string.h>
#include <stdio.h>
#include <stdlib.h>

main()
{
    char *a = malloc(1);
    a[0] = '\0';
    printf("%lu\n", (unsigned long)strlen(a));
}

Compile with "gcc -O2" and run valgrind.

==5977== Invalid read of size 4
==5977== at 0x400494: main (x.c:9)
==5977== Address 0x51ce040 is 0 bytes inside a block of size 1 alloc'd
==5977== at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
==5977== by 0x40048D: main (x.c:7)

This bug report relates to two (closed invalid) bug reports in gcc bugzilla.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47522
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44183

PR47522 includes a runable example in the first comment.

the issue appears to be that vectorization can result in code that loads elements beyond the last element of an allocated array. However, these loads will only happen for unaligned data, where access to the last+1 element can't trigger a page fault or other side effects (according to my interpretation of comments by gcc developers) and are never used. As such, this is considered valid.

Since this kind of code will be produced increasingly by gcc, especially for numerical codes (whenever vectorization triggers, essentially) it would be great to have this somehow dealt with in valgrind.

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47522#c4
>
> I think valgrind should simply special-case these kind of out of bounds
> checks based on the instruction that was used.

Great. Why don't you tell me then how I am supposed to differentiate
between a vector load that is deliberately out of bounds vs one that is
out of bounds by accident, so I can emit an error for the latter but
not for the former?

(In reply to comment #1)

> Great. Why don't you tell me then how I am supposed to differentiate
> between a vector load that is deliberately out of bounds vs one that is
> out of bounds by accident, so I can emit an error for the latter but
> not for the former?

Hey.... I'm a user, you're the developer ;-)

I'm really not the right person to ask. I guess there are some signatures... it is a vector load, with at least one element that is still part of an allocated array. Additionally, based on alignment the 'offending load(s)' can not cross a page boundary. Finally, the loaded byte(s) propagate as uninitialized data, but never trigger the 'used uninitialized error'. I suppose that you might get more details in the gcc bugzilla.

Can you objdump -d the loop containing the complained-about load,
and post the results?

Download full text (22.8 KiB)

So the valgrind message I have is:

==12860== Invalid read of size 8
==12860== at 0x400A38: integrate_gf_npbc_ (in /data03/vondele/bugs/valgrind/a.out)
==12860== by 0x40245B: main (in /data03/vondele/bugs/valgrind/a.out)
==12860== Address 0x58e9e40 is 0 bytes after a block of size 272 alloc'd
==12860== at 0x4C26C3A: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==12860== by 0x402209: main (in /data03/vondele/bugs/valgrind/a.out)

The corresponding asm from objdump is:

0000000000400720 <integrate_gf_npbc_>:
  400720: 41 57 push %r15
  400722: 41 56 push %r14
  400724: 41 55 push %r13
  400726: 41 54 push %r12
  400728: 49 89 fc mov %rdi,%r12
  40072b: 31 ff xor %edi,%edi
  40072d: 55 push %rbp
  40072e: 53 push %rbx
  40072f: 48 83 ec 50 sub $0x50,%rsp
  400733: 49 63 18 movslq (%r8),%rbx
  400736: 45 8b 09 mov (%r9),%r9d
  400739: 48 89 54 24 20 mov %rdx,0x20(%rsp)
  40073e: 49 63 50 04 movslq 0x4(%r8),%rdx
  400742: 48 89 74 24 a0 mov %rsi,-0x60(%rsp)
  400747: 49 63 70 08 movslq 0x8(%r8),%rsi
  40074b: 48 8b 84 24 b0 00 00 mov 0xb0(%rsp),%rax
  400752: 00
  400753: 48 83 c2 01 add $0x1,%rdx
  400757: 48 29 da sub %rbx,%rdx
  40075a: 48 0f 48 d7 cmovs %rdi,%rdx
  40075e: 48 89 54 24 f0 mov %rdx,-0x10(%rsp)
  400763: 49 63 50 0c movslq 0xc(%r8),%rdx
  400767: 48 8b 6c 24 f0 mov -0x10(%rsp),%rbp
  40076c: 48 83 c2 01 add $0x1,%rdx
  400770: 48 29 f2 sub %rsi,%rdx
  400773: 48 0f af 54 24 f0 imul -0x10(%rsp),%rdx
  400779: 48 85 d2 test %rdx,%rdx
  40077c: 48 0f 49 fa cmovns %rdx,%rdi
  400780: 48 89 da mov %rbx,%rdx
  400783: 48 01 db add %rbx,%rbx
  400786: 48 0f af ee imul %rsi,%rbp
  40078a: 48 89 7c 24 c0 mov %rdi,-0x40(%rsp)
  40078f: 48 f7 da neg %rdx
  400792: 49 63 78 10 movslq 0x10(%r8),%rdi
  400796: 48 01 f6 add %rsi,%rsi
  400799: 48 f7 d3 not %rbx
  40079c: 48 f7 d6 not %rsi
  40079f: 48 89 5c 24 b0 mov %rbx,-0x50(%rsp)
  4007a4: 44 89 4c 24 cc mov %r9d,-0x34(%rsp)
  4007a9: 48 89 74 24 10 mov %rsi,0x10(%rsp)
  4007ae: 48 8b b4 24 88 00 00 mov 0x88(%rsp),%rsi
  4007b5: 00
  4007b6: 48 29 ea sub %rbp,%rdx
  4007b9: 48 8b 6c 24 c0 mov -0x40(%rsp),%rbp
  4007be: 48 8d 1c 3f lea (%rdi,%rdi,1),%rbx
  4007c2: 8b 36 mov (%rsi),%esi
  4007c4: 48 0f af ef im...

This seems to me like a bug in gcc. From the following analysis
(start reading at 0x400a38), the value loaded from memory is never
used -- xmm12 is completely overwritten by subsequent instructions,
either in the post-loop block, or in the first instruction of the
next iteration.

==12860== Invalid read of size 8
==12860== at 0x400A38: integrate_gf_npbc_

  # def xmm12 (low half loaded, high half zeroed)
  4009d8: f2 44 0f 10 24 16 movsd (%rsi,%rdx,1),%xmm12
  4009de: 41 83 c6 01 add $0x1,%r14d
  4009e2: f2 0f 10 31 movsd (%rcx),%xmm6
  4009e6: 66 44 0f 16 64 16 08 movhpd 0x8(%rsi,%rdx,1),%xmm12
  4009ed: f2 41 0f 10 04 17 movsd (%r15,%rdx,1),%xmm0
  4009f3: 66 0f 16 71 08 movhpd 0x8(%rcx),%xmm6
  4009f8: 66 41 0f 28 dc movapd %xmm12,%xmm3
  4009fd: f2 44 0f 10 61 10 movsd 0x10(%rcx),%xmm12
  400a03: 66 0f 28 ce movapd %xmm6,%xmm1
  400a07: 66 41 0f 16 44 17 08 movhpd 0x8(%r15,%rdx,1),%xmm0
  400a0e: 66 44 0f 16 61 18 movhpd 0x18(%rcx),%xmm12
  400a14: f2 0f 10 33 movsd (%rbx),%xmm6
  400a18: 66 0f 28 d0 movapd %xmm0,%xmm2
  400a1c: 48 83 c2 10 add $0x10,%rdx
  400a20: 66 41 0f 14 cc unpcklpd %xmm12,%xmm1
  400a25: 66 0f 16 73 08 movhpd 0x8(%rbx),%xmm6
  400a2a: f2 44 0f 10 63 10 movsd 0x10(%rbx),%xmm12
  400a30: 48 83 c1 20 add $0x20,%rcx
  400a34: 66 0f 28 c6 movapd %xmm6,%xmm0

  # load high half xmm12 (error reported here). low half unchanged.
  400a38: 66 44 0f 16 63 18 movhpd 0x18(%rbx),%xmm12
  400a3e: 66 0f 28 f1 movapd %xmm1,%xmm6
  400a42: 66 0f 59 ca mulpd %xmm2,%xmm1
  400a46: 48 83 c3 20 add $0x20,%rbx
  400a4a: 41 39 ee cmp %ebp,%r14d

  # reads low half xmm12 only
  400a4d: 66 41 0f 14 c4 unpcklpd %xmm12,%xmm0
  400a52: 66 0f 59 f3 mulpd %xmm3,%xmm6
  400a56: 66 0f 59 d8 mulpd %xmm0,%xmm3
  400a5a: 66 0f 58 f9 addpd %xmm1,%xmm7
  400a5e: 66 0f 59 c2 mulpd %xmm2,%xmm0
  400a62: 66 44 0f 58 de addpd %xmm6,%xmm11
  400a67: 66 0f 58 eb addpd %xmm3,%xmm5
  400a6b: 66 0f 58 e0 addpd %xmm0,%xmm4
  400a6f: 0f 82 63 ff ff ff jb 4009d8 # (loop head)

  400a75: 66 0f 28 c4 movapd %xmm4,%xmm0
  400a79: 8b 54 24 a8 mov -0x58(%rsp),%edx

  # def xmm12 (overwrite both halves)
  400a7d: 66 44 0f 28 e7 movapd %xmm7,%xmm12

Similar testcase is gcc's own libcpp/lex.c optimization, which also can access a few bytes after malloced area, as long as at least one byte in the value read is from within the malloced area. See search_line_* routines in lex.c, not just SSE4.2/SSE2, but also even the generic C version actually does this.
I guess valgrind could mark somehow the extra bytes as undefined content and propagate it through following arithmetic instructions, complain only if some conditional jump was made solely on the undefined bits or if the undefined bits were stored somewhere (or similar heuristics).

(In reply to comment #5)
> This seems to me like a bug in gcc.

Unfortunately, I'm an asm novice, so I can't tell. I see Jakub is on the CC as well, so maybe he can judge?

Alternatively, I can reopen
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47522
and refer here?

(In reply to comment #6)
> Similar testcase is gcc's own libcpp/lex.c optimization, which also can access
> a few bytes after malloced area, as long as at least one byte in the value read
> is from within the malloced area.

Those loops are (effectively) vectorised while loops, in which you use
standard carry-chain propagation tricks to ensure that the stopping
condition for the loop does not rely on the data from beyond the malloced
area. It is not possible to vectorise them without such over-reading.

By contrast, Joost's loop (and anything gcc can vectorise) are countable
loops: the trip count is known (at run time) before the loop begins. It
is always possible to vectorise such a loop without generating memory
over reads, by having a vector loop to do (trip_count / vector_width)
iterations, and a scalar fixup loop to do the final (trip_count % vector_width)
iterations.

> I guess valgrind could mark somehow the extra bytes as undefined content and
> propagate it through following arithmetic instructions, complain only if some
> conditional jump was made solely on the undefined bits or if the undefined bits
> were stored somewhere (or similar heuristics).

Well, maybe .. but Memcheck is too slow already. I don't want to junk it up
with expensive and complicated heuristics that are irrelevant for 99.9% of
the loads it will encounter.

If you can show me some way to identify just the loads that need special
treatment, then maybe. I don't see how to identify them, though.

Another simple testcase: https://bugzilla.redhat.com/show_bug.cgi?id=678518
I don't think 99% above is the right figure, at least with recent gcc generated code these false positives are just way too common. We disable a bunch of them in glibc through a suppression file or overloading the strops implementations,
but when gcc inlines those there is no way to get rid of the false positives.

Can't valgrind just start tracking in more details whether the bytes are actually used or not when memcheck sees a suspect read (in most cases just an aligned read where at least the first byte is still in the allocated region and perhaps some further ones aren't)? Force then retranslation of the bb it was used in or something similar?

(In reply to comment #9)
> I don't think 99% above is the right figure, at least with recent
> gcc generated

What version of gcc?

The
#include <stdlib.h>
#include <string.h>

__attribute__((noinline)) void
foo (void *p)
{
  memcpy (p, "0123456789abcd", 15);
}

int
main (void)
{
  void *p = malloc (15);
  foo (p);
  return strlen (p) - 14;
}
testcase where strlen does this is expanded that way with GCC 4.6 (currently used e.g. in Fedora 15) with default options, but e.g. 4.5 or even earlier versions expand this the same way with -O2 -minline-all-stringops.

I can see this problem isn't going to go away (alas); and we are
seeing similar things on icc generated code. I'll look into it,
but that won't happen for at least a couple of weeks.

tags: added: oneiric
Changed in valgrind (ALT Linux):
importance: Unknown → Medium
status: Unknown → New

Isn't this exactly the problem that "--partial-loads-ok" is meant to address? (cf. bug 294285)

http://valgrind.org/docs/manual/mc-manual.html#opt.partial-loads-ok

Alessandro Ghedini (ghedo) wrote :

I can reproduce this with GCC 4.6, but not with GCC 4.7 or Clang.

Could this bug be the same issue?: bug 301922

Changed in valgrind:
importance: Unknown → Medium
status: Unknown → New

Hi all,

I think I'm also seeing false positives because of vectorization, that unfortunately decreases the usefulness of valgrind. Below is a minimal working example that reproduces problems with std::string. The code is basically extracted from a library I was using (casacore 1.5) and in my software it generates a lot of incorrect "invalid read"s, although the library seems to be valid (although inherriting from string would not be my preferred solution). I hope this example is of use for evaluating the problem further.

#include <string>
#include <iostream>
#include <cstring>

#include <malloc.h>

class StringAlt : public std::string
{
public:
  StringAlt(const char *c) : std::string(c) { }
  void operator=(const char *c) { std::string::operator=(c); }
};

typedef StringAlt StringImp;
//typedef std::string StringImp; //<-- replacing prev with this also solves issue

int main(int argc, char *argv[])
{
  const char *a1 = "blaaa";
  char *a2 = strdup(a1);
  a2[2] = 0;
  StringImp s(a1);
  std::cout << "Assign A2\n";
  s = a2;
  std::cout << s << '\n';

  std::cout << "Assign A1\n";
  s = a1;
  std::cout << s << '\n';

  char *a3 = strdup(s.c_str());
  std::cout << "Assign A3\n";
  s = a3;
  std::cout << s << '\n';

  free(a2);
  free(a3);
}

Compiled with g++ Debian 4.7.1-2, with "-O2" or "-O3" results in the error below. With "-O0", it works fine. Changing the order of statements can also cause the error to disappear, which makes it very hard to debug. Output:

Assign A2
bl
Assign A1
blaaa
Assign A3
==20872== Invalid read of size 4
==20872== at 0x400C5C: main (in /home/anoko/projects/test/test)
==20872== Address 0x59550f4 is 4 bytes inside a block of size 6 alloc'd
==20872== at 0x4C28BED: malloc (vg_replace_malloc.c:263)
==20872== by 0x564D911: strdup (strdup.c:43)
==20872== by 0x400C46: main (in /home/anoko/projects/test/test)
==20872==
blaaa

(In reply to comment #15)
Try to rebuild the library with -fno-builtin-strdup, chances are it will make valgrind working again.

-fno-builtin-strdup does indeed get rid of the valgrind message.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.