At -O3, I would expect the compiler to produce an inlined memset like this. It should be faster code, at the expense of space.
At -Os, I would expect this to produce a call to memset, but it doesn't, so there is a bug there.
The above code is probably the most speed-efficient way to do this on older ARM architectures that did not support unaligned stores.
However, as Michael says, more up-to-date architectures can do it like this:
ldrsh r3, [r0, #0]
lsls r3, r3, #3
pkhbt r3, r3, r3, LSL #16
str r3, [r0, #12]
str r3, [r0, #8]
str r3, [r0, #4]
str r3, [r0, #0]
bx lr
or even like this (although this increases register allocation complexity):
ldrsh r3, [r0, #0]
lsls r3, r3, #3
pkhbt r3, r3, r3, LSL #16
mov r4, r3
strd r3, r4, [r0, #8]
strd r3, r4, [r0, #0]
bx lr
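For anyone who finds the assembly hard to follow, here is a rough C model of what the two integer-register sequences above compute, as far as I can tell from the instructions: load the first 16-bit element, shift it left by 3, duplicate the result into both halves of a 32-bit word, and store that word four times. The names `fill8_shifted` and `buf` are made up for illustration and are not from the original source, and this is only a sketch of the behaviour, not what any compiler is guaranteed to emit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name; models the ldrsh / lsls / pkhbt / str sequence. */
void fill8_shifted(int16_t *buf)
{
    /* ldrsh + lsls #3: only the low 16 bits of the shifted value matter,
       so do the shift on the unsigned bit pattern. */
    uint16_t v = (uint16_t)((uint16_t)buf[0] << 3);

    /* pkhbt r3, r3, r3, LSL #16: same halfword in bottom and top halves. */
    uint32_t word = (uint32_t)v | ((uint32_t)v << 16);

    /* Four word stores at offsets 12, 8, 4 and 0 (or two strd pairs).
       memcpy keeps the C free of aliasing and alignment assumptions. */
    for (int i = 3; i >= 0; i--)
        memcpy((unsigned char *)buf + 4 * i, &word, sizeof word);
}
```

Calling it on an array whose first element is 5 should leave all eight elements equal to 40. Because both halves of the word are identical, the result does not depend on endianness.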
Where NEON is available and not too expensive to use (so probably not on Cortex-A8), this might be an option, again at the cost of more complex register allocation:
ldrsh r3, [r0, #0]
lsls r3, r3, #3
vdup.16 q0, r3
vst1.64 {d0, d1}, [r0]
bx lr
(I think I've got that syntax right, but if not, it should at least give the idea.)
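If you would rather not hand-write the NEON version, the same idea can be sketched in C with the intrinsics from `arm_neon.h`. `vdupq_n_u16` and `vst1q_u16` are the intrinsics that correspond to `vdup.16` and `vst1`; the function name `fill8_shifted_neon` is hypothetical, and the scalar fallback is only there so the sketch still builds on targets without NEON. Whether the compiler actually emits the sequence above depends on the target and flags.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical name; fills eight 16-bit elements with (buf[0] << 3). */
void fill8_shifted_neon(int16_t *buf)
{
    /* Work on the unsigned bit pattern; int16_t and uint16_t may alias. */
    uint16_t *p = (uint16_t *)buf;
    uint16_t v = (uint16_t)(p[0] << 3);   /* ldrsh + lsls #3, low 16 bits */

#if defined(__ARM_NEON)
    /* vdup.16 q0, r3 followed by one 128-bit store. */
    vst1q_u16(p, vdupq_n_u16(v));
#else
    /* Scalar fallback for targets without NEON. */
    for (int i = 0; i < 8; i++)
        p[i] = v;
#endif
}
```

Either path should give the same result, for example eight copies of 24 when the first element is 3.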