Compiling with -mpreferred-stack-boundary=2 works for me here (Ubuntu 9.10, I think, gcc 4.4.1 + Ubuntu patches). Be sure to reload the shared library after recompiling. Inspect the generated assembly; if it doesn't look something like:
himd_sqrt:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
pushl %esi
pushl %ebx
subl $24, %esp
(the important bit is the andl) then something's not right on your side.
I don't understand the bit about "two times faster than typed CL or C without _mm_sqrt_pd"; I'd expect it to be two times faster. Did you mean that typed CL and C without _mm_sqrt_pd are faster than the SIMD code? If you look at the generated assembly, it looks like there's a lot of extraneous memory shuffling going on. I'd recommend eliminating the temporary and using unaligned loads all the time, or doing the necessary twiddling to use aligned loads in the inner loop with cleanup prologues/epilogues. Or use a __m128 temporary and load/store directly into/from the hi/lo parts.
Also -mfpmath=sse should be used to ensure use of SSE loads/stores, though it appears to be slightly broken in Ubuntu's GCC at the moment and you'll want to use -mfpmath=sse -mno-80387. (The option is fixed upstream, at least.)
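Putting the flags together, a compile line might look like this (source and output filenames are hypothetical; -shared/-fPIC assume you are building the shared library mentioned above):

```shell
gcc -O2 -msse2 -mfpmath=sse -mno-80387 -mpreferred-stack-boundary=2 \
    -shared -fPIC -o libsimd.so simd_sqrt.c
```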
I don't know that stack alignment has been consistently broken since GCC 2.95.3, but the default has apparently been 16 bytes for a while:
http://groups.google.com/group/ia32-abi/browse_frm/thread/4f9b3e5069943bf1
http://gcc.gnu.org/ml/gcc-patches/2006-09/msg00252.html