simd woes

Bug #539632 reported by Nikodemus Siivola
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
SBCL
Won't Fix
Medium
Unassigned

Bug Description

Send by <email address hidden> to sbcl-bugs:

Hello!

Don't know whether it's SBCL's bug or mine, but still...

My first attempt with SIMD is a (silly?) library:

#include <emmintrin.h>
void himd_sqrt (double* x, int len)
{
  double __attribute__((aligned(16))) *xa=_mm_malloc(2*sizeof(double),16);
  int i, l1 = len%2 ? len - 1 : len;
  for (i=0; i<l1; i+=2) {
    xa[0]=x[i];
    xa[1]=x[i+1];
    _mm_store_pd(xa, _mm_sqrt_pd(_mm_load_pd(xa)));
    x[i]=xa[0];
    x[i+1]=xa[1];
  }
  if (len%2) {
    xa[0]=x[l1];
    _mm_store_pd(xa, _mm_sqrt_pd(_mm_load_pd(xa)));
    x[l1]=xa[0];
  }
  _mm_free(xa);
}

I compile it with
gcc -msse2 -shared himd.c -o libhimd.so

and then try to use from inside SBCL:

(require 'cffi)
(cffi:define-foreign-library himd
  (:unix (:or "libhimd.so"))
  (t (:default "libhimd")))

(cffi:use-foreign-library himd)

(cffi:defcfun "himd_sqrt" :void (p :pointer) (len :int))
(setf x (cffi:foreign-alloc :double :initial-contents '(1.0d0 2.0d0 3.0d0)))
(himd-sqrt x 3)

It dies with Unhandled memory fault at #x0.
According to printf's, it happens at _mm_load_pd
At the same time, C test

#include <stdio.h>

void himd_sqrt (double* x, int len);

int main (int argc, char* argv[])
{
  double x[]={1, 2, 3};
  himd_sqrt(x,3);
  printf("%lf %lf %lf\n",x[0],x[1],x[2]);
  return 0;
}

compiled with
gcc -lhimd test.c

says
1.000000 1.414214 1.732051

What happens?
My system is Ubuntu 8.10.

Thanks,
Dmitri

P.S. And if someone could be so kind as to advise me some way to
quickly sqrt each element of an array (with GSL, BLAS, etc.), it would
be great.

Revision history for this message
Nikodemus Siivola (nikodemus) wrote :

Is this a 32 or 64 bit SBCL? If the first, I'm guessing that the C-side SIMD code isn't happy with the stack alignment.

Revision history for this message
Dmitri Hrapof (hrapof) wrote : Re: [Bug 539632] Re: simd woes

Цитирую Nikodemus Siivola <email address hidden>:

> Is this a 32 or 64 bit SBCL? If the first, I'm guessing that the C-side
> SIMD code isn't happy with the stack alignment.

Sorry, I forget to mention it. 32 bit.
Is there anything I can do with the stack alignment?

Sincerely yours,
Dmitri

Revision history for this message
Nikodemus Siivola (nikodemus) wrote :

I can confirm the symptom on x86, but can't swear on my stack-alignment supposition yet. I can also confirm that this works on x86-64, though, so that's one datapoint.

If my guess is correct, then getting x86 to align stack to 16 bytes should not be too hard: IIRC we already have to do that on Darwin, so the code should exist already.

Changed in sbcl:
status: New → Confirmed
Revision history for this message
Nikodemus Siivola (nikodemus) wrote :

(Of course you can also align stack in an assembly trampoline as well.)

Revision history for this message
Nikodemus Siivola (nikodemus) wrote :

Yeah, it's the stack alignment.

In src/compiler/x86/macros.lisp replace the DEFMACRO ALIGN-STACK-POINTER with

(defmacro align-stack-pointer (tn)
  ;; 16 byte alignment needed for SIMD
  ;; Darwin ABI specifies this, and seems like
  ;; a sensible thing to do elsewhere too.
  `(inst and ,tn #xfffffff0))

and you should be good to go -- as soon as you rebuild SBCL. I'll check if any other places need touching for more principled 16-byte alignment support and commit this later this week.

Changed in sbcl:
status: Confirmed → In Progress
importance: Undecided → Medium
assignee: nobody → Nikodemus Siivola (nikodemus)
Revision history for this message
Nathan Froyd (froydnj) wrote :

That doesn't seem right. The stack is 4-byte aligned on non-Darwin x86. If the C code needs 16-byte alignment for temporaries or whatever, then the C code gets to ensure that, not some random caller of the C code.

Revision history for this message
Nathan Froyd (froydnj) wrote :

Also, if you're going to compile code for a shared library, you need to make sure it's compiled -fPIC. It looks like what's happening, given _mm_malloc:

static __inline void *
_mm_malloc (size_t size, size_t alignment)
{
  void *ptr;
  if (alignment == 1)
    return malloc (size);
  if (alignment == 2 || (sizeof (void *) == 8 && alignment == 4))
    alignment = sizeof (void *);
  if (posix_memalign (&ptr, alignment, size) == 0)
    return ptr;
  else
    return ((void *)0);
}

is that your call to posix_memalign either isn't going through the PLT and therefore you're getting a bogus return value and thus returning NULL from _mm_malloc or posix_memalign is failing in some other way and _mm_malloc is returning 0. I think the stack alignment is a red herring.

Revision history for this message
Nathan Froyd (froydnj) wrote :

...and compiling:

gcc -O2 -msse2 -shared -o libhimd.so himd.c

(notice the lack of -fPIC) and objdump'ing:

objdump -d libhimd.so

gives:

00000200 <himd_sqrt>:
 200: 55 push %ebp
 201: 89 e5 mov %esp,%ebp
 203: 57 push %edi
 204: 8d 45 e4 lea -0x1c(%ebp),%eax
 207: 56 push %esi
 208: 53 push %ebx
 209: 83 ec 2c sub $0x2c,%esp
 20c: 8b 75 0c mov 0xc(%ebp),%esi
 20f: 8b 5d 08 mov 0x8(%ebp),%ebx
 212: 89 f7 mov %esi,%edi
 214: c7 44 24 08 10 00 00 movl $0x10,0x8(%esp)
 21b: 00
 21c: c7 44 24 04 10 00 00 movl $0x10,0x4(%esp)
 223: 00
 224: 89 04 24 mov %eax,(%esp)
 227: e8 fc ff ff ff call 228 <himd_sqrt+0x28>
 22c: 31 d2 xor %edx,%edx
 22e: 85 c0 test %eax,%eax
 230: 8d 46 ff lea -0x1(%esi),%eax
 233: 0f 44 55 e4 cmove -0x1c(%ebp),%edx
 237: 83 e7 01 and $0x1,%edi
 23a: 0f 45 f0 cmovne %eax,%esi
 23d: 85 f6 test %esi,%esi
 23f: 7e 34 jle 275 <himd_sqrt+0x75>

Notice that the call at 227 is *not* calling posix_memalign, but some random address in the same function. So the stack at the point of the call looks like:

sp: %ebp-28
sp+4: 16
sp+8: 16

and %eax is a pointer into the stack. After the call, you have:

sp: <some return address>
sp+4: %ebp-28
sp+8: 16
sp+12: 16

and %eax (what would be the return value from posix_memalign if the call had actually happened) is non-zero. So _mm_malloc returns 0, which gets assigned to xa, which leads to the load from xa faulting.

What's not clear to me is how this ever worked in the first place. Probably something to do with compiling without optimization. I'd be willing to bet that if you compiled the shared library with -O2 (and not -fPIC), things would fall over C-side and Lisp-side.

I don't think this counts as a bug we get to fix, not even the stack alignment changes Nikodemus suggested (which might just paper over C-side problems).

Revision history for this message
Dmitri Hrapof (hrapof) wrote :

Nathan Froyd пишет:
> I don't think this counts as a bug we get to fix, not even the stack
> alignment changes Nikodemus suggested (which might just paper over
> C-side problems).
Thank you for pointing me to -fPIC option!
However it doesn't seem to be the only cause of the problem.
With -fPIC and without the macro change I still get the error.

And without _mm_malloc:

void himd_sqrt (double* x, int len)
{
  double __attribute__((aligned(16))) xa[2];
  int i, l1 = len%2 ? len - 1 : len;
  for (i=0; i<l1; i+=2) {
    xa[0]=x[i];
    xa[1]=x[i+1];
    _mm_store_pd(xa, _mm_sqrt_pd(_mm_load_pd(xa)));
    x[i]=xa[0];
    x[i+1]=xa[1];
  }
  if (len%2) {
    xa[0]=x[l1];
    _mm_store_pd(xa, _mm_sqrt_pd(_mm_load_pd(xa)));
    x[l1]=xa[0];
  }
}

Revision history for this message
Nathan Froyd (froydnj) wrote :

Lovely. I'm surprised it doesn't work with -fPIC and _mm_malloc; I don't think stack alignment should matter in that case. With the aligned local variable, it's obvious that GCC's generating code that assumes 16-byte alignment of the stack.

There are two options here:

1. Compile with -mpreferred-stack-boundary=2. This will force GCC to compile code that adheres to the ABI and therefore properly adjusts its own stack pointer.

2. Wait for Nikodemus's fix to go in.

Revision history for this message
Nikodemus Siivola (nikodemus) wrote :

> Lovely.  I'm surprised it doesn't work with -fPIC and _mm_malloc; I
> don't think stack alignment should matter in that case.  With the
> aligned local variable, it's obvious that GCC's generating code that
> assumes 16-byte alignment of the stack.

It seems this has been the case since GCC 2.95.1:

 http://lkml.indiana.edu/hypermail/linux/kernel/9909.2/0563.html

Revision history for this message
Dmitri Hrapof (hrapof) wrote :

Nathan Froyd пишет:
> 1. Compile with -mpreferred-stack-boundary=2.
That didn't help, either.
gcc -msse2 -shared -fPIC -mpreferred-stack-boundary=2 -O2
> 2. Wait for Nikodemus's fix to go in
No need to wait - I'm using CVS version anyway due to bug #523612
If the fix will break GSL, GTK, Cairo or OpenGL bindings, I'll let you know.
Thanks for the fix and advice! That's the power of the open source!

And by the way: I suspect I'm doing SSE totally wrong, but it's still
two times faster than typed CL or C without _mm_sqrt_pd (the last two
being on par).

Revision history for this message
Nikodemus Siivola (nikodemus) wrote :

Apropos, I suspect that on the long term we would like to use 16-byte alignment on x86 anyways:

* Since we have do it for foreign code on Darwin, doing it for all platforms will simplify code.

* In order to use SSE-stuff on the lisp side on x86 we need to align our own stack to 16 bytes as well. (And array data, etc.)

Revision history for this message
Nathan Froyd (froydnj) wrote :

Compiling with -mpreferred-stack-boundary=2 works for me here (Ubuntu 9.10, I think, gcc 4.4.1 + Ubuntu patches). Be sure to reload the shared library after recompiling. Inspect at the generated source; if it doesn't look something like:

himd_sqrt:
 pushl %ebp
 movl %esp, %ebp
 andl $-16, %esp
 pushl %esi
 pushl %ebx
 subl $24, %esp

(the important bit is the andl) then something's not right on your side.

I don't understand the bit about "two times faster than typed CL or C without _mm_sqrt_pd"; I'd expect it to be two times faster. Did you mean that typed CL and C without _mm_sqrt_pd are faster than the SIMD code? If you look at the generated assembly, it looks like there's a lot of extraneous memory shuffling going on. I'd recommend eliminating the temporary and using unaligned loads all the time or doing the necessary twiddling to use aligned loads in the inner loop with cleanup prologues/epilogues. Or use a __m128 temporary and load/store directly into/from the hi/lo parts.

Also -mfpmath=sse should be used to ensure use of SSE loads/stores--though it appears to be slightly broken in Ubuntu's GCC at the moment and you'll want to use -mfpmath=sse -mno-80387. (The option is fixed upstream, at least.)

I don't know that stack alignment has been consistently broken since GCC 2.95.3, but the default has apparently been 16 bytes for a while:

http://groups.google.com/group/ia32-abi/browse_frm/thread/4f9b3e5069943bf1
http://gcc.gnu.org/ml/gcc-patches/2006-09/msg00252.html

Revision history for this message
Dmitri Hrapof (hrapof) wrote :

Nathan Froyd wrote:
> Inspect at the generated source; if
> it doesn't look something like:
>
Will check in the evening.
> I don't understand the bit about "two times faster than typed CL or C
> without _mm_sqrt_pd";
I tried to say it's my first try and I feel I do "a lot of extraneous
memory shuffling".
> I'd expect it to be two times faster.
And in spite of all my clumsiness SSE is faster, so the "bug" is worth
fixing. :)
Sorry for not expressing myself more clearly.

Revision history for this message
Dmitri Hrapof (hrapof) wrote :
Download full text (3.2 KiB)

dmitri@dmitri-laptop:~$ gcc -msse2 -mpreferred-stack-boundary=2 -fPIC
-O2 -shared himd.c -o libhimd.so
dmitri@dmitri-laptop:~$ objdump -d libhimd.so

000003f0 <himd_sqrt>:
  3f0: 55 push %ebp
  3f1: 89 e5 mov %esp,%ebp
  3f3: 57 push %edi
  3f4: 56 push %esi
  3f5: 83 ec 10 sub $0x10,%esp
  3f8: 8b 45 0c mov 0xc(%ebp),%eax
  3fb: 8b 4d 08 mov 0x8(%ebp),%ecx
  3fe: 89 c7 mov %eax,%edi
  400: 83 e7 01 and $0x1,%edi
  403: 8d 50 ff lea -0x1(%eax),%edx
  406: 0f 44 d0 cmove %eax,%edx
  409: 85 d2 test %edx,%edx
  40b: 7e 34 jle 441 <himd_sqrt+0x51>
  40d: 8d 75 e8 lea -0x18(%ebp),%esi
  410: 31 c0 xor %eax,%eax
  412: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
  418: dd 04 c1 fldl (%ecx,%eax,8)
  41b: dd 5d e8 fstpl -0x18(%ebp)
  41e: dd 44 c1 08 fldl 0x8(%ecx,%eax,8)
  422: dd 5d f0 fstpl -0x10(%ebp)
  425: 66 0f 51 06 sqrtpd (%esi),%xmm0
  429: 66 0f 29 06 movapd %xmm0,(%esi)
  42d: dd 45 e8 fldl -0x18(%ebp)
  430: dd 1c c1 fstpl (%ecx,%eax,8)
  433: dd 45 f0 fldl -0x10(%ebp)
  436: dd 5c c1 08 fstpl 0x8(%ecx,%eax,8)
  43a: 83 c0 02 add $0x2,%eax
  43d: 39 c2 cmp %eax,%edx
  43f: 7f d7 jg 418 <himd_sqrt+0x28>
  441: 85 ff test %edi,%edi
  443: 74 17 je 45c <himd_sqrt+0x6c>
  445: 8d 04 d1 lea (%ecx,%edx,8),%eax
  448: dd 00 fldl (%eax)
  44a: dd 5d e8 fstpl -0x18(%ebp)
  44d: 66 0f 51 45 e8 sqrtpd -0x18(%ebp),%xmm0
  452: 66 0f 29 45 e8 movapd %xmm0,-0x18(%ebp)
  457: dd 45 e8 fldl -0x18(%ebp)
  45a: dd 18 fstpl (%eax)
  45c: 83 c4 10 add $0x10,%esp
  45f: 5e pop %esi
  460: 5f pop %edi
  461: 5d pop %ebp
  462: c3 ret

dmitri@dmitri-laptop:~$ sudo cp libhimd.so /usr/lib
dmitri@dmitri-laptop:~$ ldd a.out
 linux-gate.so.1 => (0xb7f77000)
 libhimd.so => /usr/lib/libhimd.so (0xb7f41000)
 libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7de3000)
 /lib/ld-linux.so.2 (0xb7f5d000)
dmitri@dmitri-laptop:~$ ./a.out
1.000000 1.414214 1.732051
dmitri@dmitri-laptop:~$ gcc -v
Используются внутренние спецификации.
Целевая архитектура: i486-linux-gnu
Параметры конфигурации: ../src/configure -v --with-pkgversion='Ubuntu
4.3.2-1ubuntu12'
--with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-targets=all --enable-checking=release
--build=i486-lin...

Read more...

Revision history for this message
James Y Knight (foom) wrote :

http://gcc.gnu.org/ml/gcc/2007-12/msg00503.html
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33721

Yeah, it's been broken forever. So long that instead of fixing it by default that they added an argument to fix it if you want. I think that basically means the de-facto ABI is that the stack is aligned to 16 bytes on x86, despite what the ABI actually says.

If you want GCC to follow the ABI, you can use this argument or per-function attribute:
       -mstackrealign
           Realign the stack at entry. On the Intel x86, the -mstackrealign
           option will generate an alternate prologue and epilogue that realigns
           the runtime stack if necessary. This supports mixing legacy codes that
           keep a 4-byte aligned stack with modern codes that keep a 16-byte stack
           for SSE compatibility. See also the attribute
           "force_align_arg_pointer", applicable to individual functions.

Revision history for this message
Nikodemus Siivola (nikodemus) wrote :

Notes to self:

Call path ALLOCATION -> ALLOCATION-TRAMP -> alloc_tramp -> alloc doesn't align the stack on x86-64.

%ALIEN-FUNCALL doesn't restore stack pointer properly on x86-64.

Changed in sbcl:
status: In Progress → Triaged
assignee: Nikodemus Siivola (nikodemus) → nobody
Revision history for this message
Douglas Katzman (dougk) wrote :

I decree that we stop worrying about 32-bit x86.

Changed in sbcl:
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.