Linaro GCC

Avoid emitting multiple stores to consecutive addresses

Bug #745743 reported by Revital Eres on 2011-03-30

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Linaro GCC	Triaged	Low	Unassigned

Bug Description

The following code, inspired from hot code in DENbecnch mp2decoddata1/dec_idct.c, contains consecutive stores that might be avoided using memset.

The code was compiled with Linaro GCC 4.6-2011.03-dev1 with flags -mcpu=cortex-a9 -mtune=cortex-a9 -mthumb -S -c -O3

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        uxth r3, r3
        strh r3, [r0, #14] @ movhi
        strh r3, [r0, #12] @ movhi
        strh r3, [r0, #10] @ movhi
        strh r3, [r0, #8] @ movhi
        strh r3, [r0, #6] @ movhi
        strh r3, [r0, #4] @ movhi
        strh r3, [r0, #2] @ movhi
        strh r3, [r0, #0] @ movhi
        bx lr

the program:

void
foo (short *arr)
{
arr[0] = arr[1] = arr[2] = arr[3] = arr[4] = arr[5] = arr[6] = arr[7] =
(short) arr[0] << 3;

}

Tags:

Revital Eres (eres) on 2011-03-30

summary:	- Avoid emitting consecutive stores + Avoid emitting stores to consecutive addresses
summary:	- Avoid emitting stores to consecutive addresses + Avoid emitting multiply stores to consecutive addresses

Revision history for this message

Michael Hope (michaelh1) wrote on 2011-03-30: Re: Avoid emitting multiply stores to consecutive addresses

Note that ARMv7 supports unaligned access so a inline block set using 32 bit stores might be better than a call to memset().

Michael Hope (michaelh1) on 2011-03-31

tags:

added: size speed task

Andrew Stubbs (ams-codesourcery) on 2011-04-05

summary:

- Avoid emitting multiply stores to consecutive addresses
+ Avoid emitting multiple stores to consecutive addresses

Revision history for this message

Andrew Stubbs (ams-codesourcery) wrote on 2011-04-05:

At -O3, I would expect the compiler to produce an inlined memset like this. It should be faster code, at the expense of space.

At -Os, I would expect this to produce a call to memset, but it doesn't, so there is a bug there.

The above code is probably the most speed-efficient way to do this on older ARM architectures that did not support unaligned stores.

However, as Michael says, more up-to-date architectures can do it like this:

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        pkhbt r3, r3, r3, LSL #16
        str r3, [r0, #12]
        str r3, [r0, #8]
        str r3, [r0, #4]
        str r3, [r0, #0]
        bx lr

or even like this (although this increases register allocation complexity):

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        pkhbt r3, r3, r3, LSL #16
        mov r3, r4
        strd r3, r4, [r0, #8]
        strd r3, r4, [r0, #0]
        bx lr

Where NEON is available, and not too expensive (i.e. not A8, probably), this might be an option (again with increased register allocation complexity):

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        vdup.16 q0, r3
        vst1.64 {D0, D1}, [r0]

(I think I'm got that syntax right, but if not, it should at least give the idea.)

Revision history for this message

Michael Hope (michaelh1) wrote on 2011-04-12:

I had a quick play. The uxth goes away with the extension elimination pass. In a micro benchmark, storing in decreasing order takes 3.0 s and increasing order 2.5 s. I didn't try Andrew's suggestions.

Revision history for this message

Ira Rosen (irar) wrote on 2011-04-12: AUTO: Ira Rosen is out of the office. (returning 17/04/2011)

I am out of the office until 17/04/2011.

Note: This is an automated response to your message "[Bug 745743] Re:
Avoid emitting multiple stores to consecutive addresses" sent on 12/4/11
13:17:42.

This is the only notification you will receive while this person is away.

Revision history for this message

Revital Eres (eres) wrote on 2011-05-24:

load and store multiple instructions (LDM and STM) and load and store double-word (LDRD/STRD) must be aligned
to at least a word boundary. (according to Cortex™-A Series book) So the version with strd should not be used for unknown alignment.

Ulrich Weigand (uweigand) on 2012-05-31