(In reply to comment #15) > (In reply to comment #14) > > The undeniable advantage of the C code is that the compiler will just use the > > right construct depending on the target architecture. No need to #ifdef > > __thumb__ or whatever. > > Oh, certainly. C code is always preferable to assembly for that very reason, > unless you need to do something that C code can't express efficiently (though > that isn't the case here). > > I'm still not entirely convinced about the SP-fiddling though. It seems unsafe > to me. For instance, how do you know that the stack_space array is at the > bottom of your frame? Because it is the only thing that is allocated on stack in the function. But we can probably use __builtin_alloca instead if you're uncomfortable with that, it generates the same code. > C doesn't guarantee anything there (even though you > declare it last). Yes, it probably works, but it feels unsafe. That's why the > original was in assembly, so it was clear what was happening. > > In the case of writing to SP in the asm block: Yes, the compiler knows you've > clobbered it, but C mandates that you have a stack and I'd be surprised if the > compiler copes well with you doing that. If it isn't using a frame pointer, for > example, it won't know how to wind back the stack pointer. Perhaps writing to > SP in this way forces it to use a frame pointer, but none of that behaviour is > documented (as far as I know) so it's likely to break between even minor > compiler revisions. I don't know if -fomit-frame-pointer is supposed to work on gcc arm, but i can't get gcc to stop using the frame pointer. The function is always marked with "frame_needed = 1", while other functions from the same source are marked as "frame_needed = 0", and as such don't use a frame pointer (even without -fomit-frame-pointer ; I guess it is included in -O2). Anyways, I can think of several other approaches using a bit more assembly, but still leaving most of the work to the C code, I'll check what kind of code get generated with that. Here is the assembly i get with the already attached C code (forcing -fno-inline), with an armv4t target: stmfd sp!, {r4, r5, r6, r7, r8, r9, fp, lr} mov r5, r0 add fp, sp, #28 mov r6, r1 mov r0, r2 mov r1, r3 mov r7, r2 mov r8, r3 bl _ZL18invoke_count_wordsjP13nsXPTCVariant(PLT) mov r3, r0, asl #2 add r3, r3, #18 bic r3, r3, #7 sub sp, sp, r3 mov r1, r7 mov r2, r8 add r0, sp, #4 bl _ZL20invoke_copy_to_stackPjjP13nsXPTCVariant(PLT) mov r4, sp add r3, sp, #16 #APP @ 200 "xptcinvoke_arm.cpp" 1 mov sp, r3 @ 0 "" 2 mov r0, r5 ldr r3, [r4, #12] ldr ip, [r5, #0] ldmib r4, {r1, r2} @ phole ldm ldr ip, [ip, r6, asl #2] mov lr, pc bx ip sub sp, fp, #28 ldmfd sp!, {r4, r5, r6, r7, r8, r9, fp, lr} bx lr Here it uses bl for the function calls and bx for the method invoke. With a armv5t target, the end changes to the following: ldmib r4, {r1, r2} @ phole ldm mov lr, pc ldr pc, [ip, r6, asl #2] sub sp, fp, #28 ldmfd sp!, {r4, r5, r6, r7, r8, r9, fp, pc} Without the assembly, this is how the first three 32-bits words are read from the stack: ldr r3, [sp, #12] ldr ip, [r5, #0] ldmib sp, {r1, r2} @ phole ldm In the end, sp would point on the third 32-bits word, instead of the fourth. Plus, the code is highly dependent on optimization level, while with the sp fiddling, the generated code is safe in all cases I tested.