(In reply to comment #15)
> (In reply to comment #14)
> > The undeniable advantage of the C code is that the compiler will just use the
> > right construct depending on the target architecture. No need to #ifdef
> > __thumb__ or whatever.
> 
> Oh, certainly. C code is always preferable to assembly for that very reason,
> unless you need to do something that C code can't express efficiently (though
> that isn't the case here).
> 
> I'm still not entirely convinced about the SP-fiddling though. It seems unsafe
> to me. For instance, how do you know that the stack_space array is at the
> bottom of your frame?

Because it is the only thing that is allocated on the stack in the function. But we can probably use __builtin_alloca instead if you're uncomfortable with that; it generates the same code.
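
For illustration, here is a minimal sketch of what the __builtin_alloca variant could look like. count_words() and copy_to_stack() are hypothetical stand-ins for invoke_count_words() and invoke_copy_to_stack() (whose real signatures take nsXPTCVariant arguments), and the summing loop is only there so the sketch has an observable result; it says nothing about whether the generated code is actually identical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for invoke_count_words(): one 32-bit word per parameter. */
static uint32_t count_words(uint32_t param_count)
{
    return param_count;
}

/* Hypothetical stand-in for invoke_copy_to_stack(): plain word copy. */
static void copy_to_stack(uint32_t *dst, const uint32_t *params, uint32_t words)
{
    memcpy(dst, params, words * sizeof(uint32_t));
}

/* Allocates the argument area with __builtin_alloca instead of a
 * variable-length array, fills it, then reads it back. */
uint32_t sum_via_alloca(const uint32_t *params, uint32_t param_count)
{
    uint32_t words = count_words(param_count);
    /* The allocation lives in this function's frame and is released on return. */
    uint32_t *stack_space =
        (uint32_t *)__builtin_alloca(words * sizeof(uint32_t));
    uint32_t sum = 0;
    uint32_t i;

    copy_to_stack(stack_space, params, words);
    for (i = 0; i < words; i++)
        sum += stack_space[i];
    return sum;
}
```

This only makes the allocation explicit; where the block ends up relative to sp is still the compiler's decision.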

> C doesn't guarantee anything there (even though you
> declare it last). Yes, it probably works, but it feels unsafe. That's why the
> original was in assembly, so it was clear what was happening.
> 
> In the case of writing to SP in the asm block: Yes, the compiler knows you've
> clobbered it, but C mandates that you have a stack and I'd be surprised if the
> compiler copes well with you doing that. If it isn't using a frame pointer, for
> example, it won't know how to wind back the stack pointer. Perhaps writing to
> SP in this way forces it to use a frame pointer, but none of that behaviour is
> documented (as far as I know) so it's likely to break between even minor
> compiler revisions.

I don't know whether -fomit-frame-pointer is supposed to work with gcc on ARM, but I can't get gcc to stop using the frame pointer in this function. The function is always marked with "frame_needed = 1", while other functions from the same source file are marked with "frame_needed = 0" and so don't use a frame pointer (even without -fomit-frame-pointer; I guess it is included in -O2).

Anyway, I can think of several other approaches that use a bit more assembly but still leave most of the work to the C code. I'll check what kind of code gets generated with those.

Here is the assembly I get with the already attached C code (forcing -fno-inline), with an armv4t target:
        stmfd   sp!, {r4, r5, r6, r7, r8, r9, fp, lr}
        mov     r5, r0
        add     fp, sp, #28
        mov     r6, r1
        mov     r0, r2
        mov     r1, r3
        mov     r7, r2
        mov     r8, r3
        bl      _ZL18invoke_count_wordsjP13nsXPTCVariant(PLT)
        mov     r3, r0, asl #2
        add     r3, r3, #18
        bic     r3, r3, #7
        sub     sp, sp, r3
        mov     r1, r7
        mov     r2, r8
        add     r0, sp, #4
        bl      _ZL20invoke_copy_to_stackPjjP13nsXPTCVariant(PLT)
        mov     r4, sp
        add     r3, sp, #16
#APP
@ 200 "xptcinvoke_arm.cpp" 1
        mov        sp, r3
@ 0 "" 2
        mov     r0, r5
        ldr     r3, [r4, #12]
        ldr     ip, [r5, #0]
        ldmib   r4, {r1, r2}    @ phole ldm
        ldr     ip, [ip, r6, asl #2]
        mov     lr, pc
        bx      ip
        sub     sp, fp, #28
        ldmfd   sp!, {r4, r5, r6, r7, r8, r9, fp, lr}
        bx      lr

Here it uses bl for the function calls and bx for the method invoke.
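
As a reading aid for the listing above, the mov/add/bic sequence just before "sub sp, sp, r3" can be written out in C. This is only a transcription of those three instructions; frame_bytes is a hypothetical name, and why gcc picks the constant 18 is not something I'm claiming to explain here.

```c
#include <stdint.h>

/* Transcription of:
 *     mov r3, r0, asl #2     r3 = word_count * 4
 *     add r3, r3, #18        r3 = r3 + 18
 *     bic r3, r3, #7         r3 = r3 & ~7 (clear the low three bits)
 * The result is the number of bytes subtracted from sp. */
uint32_t frame_bytes(uint32_t word_count)
{
    uint32_t bytes = word_count << 2;
    bytes += 18;
    bytes &= ~(uint32_t)7;
    return bytes;
}
```

Because of the final bic, the amount subtracted from sp is always a multiple of 8 and always covers at least the word_count words plus some padding.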

With an armv5t target, the end changes to the following:
        ldmib   r4, {r1, r2}    @ phole ldm
        mov     lr, pc
        ldr     pc, [ip, r6, asl #2]
        sub     sp, fp, #28
        ldmfd   sp!, {r4, r5, r6, r7, r8, r9, fp, pc}

Without the assembly, this is how the first three 32-bit words are read from the stack:
        ldr     r3, [sp, #12]
        ldr     ip, [r5, #0]
        ldmib   sp, {r1, r2}    @ phole ldm

In the end, sp would point to the third 32-bit word instead of the fourth. Plus, the generated code is highly dependent on the optimization level, while with the sp fiddling, the generated code is safe in all cases I tested.