Comment 2 for bug 745743

Andrew Stubbs (ams-codesourcery) wrote :

At -O3, I would expect the compiler to produce an inlined memset like this. It should be faster code, at the expense of space.

At -Os, I would expect this to produce a call to memset, but it doesn't, so there is a bug there.

The above code is probably the most speed-efficient way to do this on older ARM architectures that did not support unaligned stores.

However, as Michael says, more up-to-date architectures can do it like this:

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        pkhbt r3, r3, r3, LSL #16
        str r3, [r0, #12]
        str r3, [r0, #8]
        str r3, [r0, #4]
        str r3, [r0, #0]
        bx lr

or even like this (although this increases register allocation complexity):

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        pkhbt r3, r3, r3, LSL #16
        mov r4, r3
        strd r3, r4, [r0, #8]
        strd r3, r4, [r0, #0]
        bx lr

Where NEON is available, and not too expensive to use (i.e. probably not on Cortex-A8), this might be an option (again with increased register allocation complexity):

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        vdup.16 q0, r3
        vst1.64 {D0, D1}, [r0]

(I think I've got that syntax right, but if not, it should at least give the idea.)
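
For anyone reading the sequences above without the original test case to hand, here is a rough C model of what they compute, as far as I can tell from the assembly alone: load the first halfword, shift it left by 3, copy the low 16 bits into both halves of a word, and write that word over all 16 bytes. The function names are made up for illustration, and the memcpy calls stand in for the four word stores; this is only a sketch of the behaviour, not the source that produced the code.

```c
#include <stdint.h>
#include <string.h>

/* Model of "pkhbt r3, r3, r3, LSL #16": keep the bottom halfword
   and copy it into the top halfword as well. */
static uint32_t replicate_halfword(uint32_t x)
{
    uint32_t low = x & 0xFFFFu;
    return low | (low << 16);
}

/* Hypothetical name. Fills p[0..7] with (p[0] << 3) truncated to
   16 bits, using four 32-bit stores like the str sequence. */
static void fill8_shifted(int16_t *p)
{
    /* ldrsh + lsls #3: sign-extend, then shift as unsigned so the
       C stays well defined for negative inputs. */
    uint32_t v = (uint32_t)(int32_t)p[0] << 3;
    uint32_t word = replicate_halfword(v);

    /* Highest offset first, matching the order of the stores.
       memcpy avoids assuming p is 4-byte aligned. */
    for (int i = 3; i >= 0; --i)
        memcpy((unsigned char *)p + 4 * i, &word, sizeof word);
}
```

Because both halves of the word are identical, the result is the same on either endianness. Calling fill8_shifted on an array whose first element is 3 leaves all eight elements equal to 24.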