Comment 14 for bug 539632

Nathan Froyd (froydnj) wrote :

Compiling with -mpreferred-stack-boundary=2 works for me here (Ubuntu 9.10, I think, with gcc 4.4.1 + Ubuntu patches). Be sure to reload the shared library after recompiling. Look at the generated assembly; if the function prologue doesn't look something like:

simd_sqrt:
 pushl %ebp
 movl %esp, %ebp
 andl $-16, %esp
 pushl %esi
 pushl %ebx
 subl $24, %esp

(the important bit is the andl, which rounds %esp down to a 16-byte boundary) then something's not right on your side.

I don't understand the bit about "two times faster than typed CL or C without _mm_sqrt_pd"; I'd expect the SIMD version to be two times faster. Did you mean that typed CL and C without _mm_sqrt_pd are faster than the SIMD code? If you look at the generated assembly, there's a lot of extraneous memory shuffling going on. I'd recommend either eliminating the temporary and using unaligned loads all the time, or doing the necessary twiddling to use aligned loads in the inner loop, with cleanup prologues/epilogues. Or use a __m128d temporary and load/store directly into/from the hi/lo parts.
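To make the first two suggestions concrete, here is a rough sketch of what I have in mind. The function names and signatures are made up for illustration (they are not the code attached to this bug), and I'm assuming SSE2 is available and that the arrays hold doubles at their natural 8-byte alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_sqrt_pd, loads and stores */

/* Hypothetical helper: take the square root of one double at p, in place. */
static void sqrt_one(double *p)
{
    __m128d v = _mm_load_sd(p);            /* low lane = *p, high lane = 0 */
    _mm_store_sd(p, _mm_sqrt_sd(v, v));    /* store sqrt of the low lane */
}

/* Option 1 (hypothetical name): no temporary, unaligned loads/stores
 * everywhere. src and dst may have any alignment. */
static void sqrt_array_unaligned(const double *src, double *dst, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);
        _mm_storeu_pd(dst + i, _mm_sqrt_pd(v));
    }
    if (i < n) {                           /* odd element left over */
        dst[i] = src[i];
        sqrt_one(dst + i);
    }
}

/* Option 2 (hypothetical name): aligned loads in the inner loop, with a
 * scalar prologue and epilogue. Works in place so that a single pointer
 * decides the alignment. */
static void sqrt_inplace_aligned(double *a, size_t n)
{
    size_t i = 0;

    /* Prologue: peel scalars until a + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) & 15) != 0)
        sqrt_one(a + i++);

    /* Inner loop: two doubles per iteration, aligned load and store. */
    for (; i + 2 <= n; i += 2)
        _mm_store_pd(a + i, _mm_sqrt_pd(_mm_load_pd(a + i)));

    /* Epilogue: whatever is left over. */
    for (; i < n; i++)
        sqrt_one(a + i);
}
```

Neither version needs a 16-byte-aligned stack slot for a temporary, so neither depends on the stack alignment at function entry. Whether option 2 actually beats option 1 depends on the CPU, so measure before committing to it.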

Also -mfpmath=sse should be used to ensure use of SSE loads/stores--though it appears to be slightly broken in Ubuntu's GCC at the moment and you'll want to use -mfpmath=sse -mno-80387. (The option is fixed upstream, at least.)

I don't know that stack alignment has been consistently broken since GCC 2.95.3, but the default has apparently been 16 bytes for a while:

http://groups.google.com/group/ia32-abi/browse_frm/thread/4f9b3e5069943bf1
http://gcc.gnu.org/ml/gcc-patches/2006-09/msg00252.html