Avoid emitting multiple stores to consecutive addresses

Bug #745743 reported by Revital Eres on 2011-03-30
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Linaro GCC
Triaged
Low
Unassigned

Bug Description

The following code, inspired from hot code in DENbecnch mp2decoddata1/dec_idct.c, contains consecutive stores that might be avoided using memset.

The code was compiled with Linaro GCC 4.6-2011.03-dev1 with flags -mcpu=cortex-a9 -mtune=cortex-a9 -mthumb -S -c -O3

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        uxth r3, r3
        strh r3, [r0, #14] @ movhi
        strh r3, [r0, #12] @ movhi
        strh r3, [r0, #10] @ movhi
        strh r3, [r0, #8] @ movhi
        strh r3, [r0, #6] @ movhi
        strh r3, [r0, #4] @ movhi
        strh r3, [r0, #2] @ movhi
        strh r3, [r0, #0] @ movhi
        bx lr

the program:

void
foo (short *arr)
{
  arr[0] = arr[1] = arr[2] = arr[3] = arr[4] = arr[5] = arr[6] = arr[7] =
    (short) arr[0] << 3;

}

Revital Eres (eres) on 2011-03-30
summary: - Avoid emitting consecutive stores
+ Avoid emitting stores to consecutive addresses
summary: - Avoid emitting stores to consecutive addresses
+ Avoid emitting multiply stores to consecutive addresses

Note that ARMv7 supports unaligned access so a inline block set using 32 bit stores might be better than a call to memset().

Michael Hope (michaelh1) on 2011-03-31
tags: added: size speed task
summary: - Avoid emitting multiply stores to consecutive addresses
+ Avoid emitting multiple stores to consecutive addresses

At -O3, I would expect the compiler to produce an inlined memset like this. It should be faster code, at the expense of space.

At -Os, I would expect this to produce a call to memset, but it doesn't, so there is a bug there.

The above code is probably the most speed-efficient way to do this on older ARM architectures that did not support unaligned stores.

However, as Michael says, more up-to-date architectures can do it like this:

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        pkhbt r3, r3, r3, LSL #16
        str r3, [r0, #12]
        str r3, [r0, #8]
        str r3, [r0, #4]
        str r3, [r0, #0]
        bx lr

or even like this (although this increases register allocation complexity):

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        pkhbt r3, r3, r3, LSL #16
        mov r3, r4
        strd r3, r4, [r0, #8]
        strd r3, r4, [r0, #0]
        bx lr

Where NEON is available, and not too expensive (i.e. not A8, probably), this might be an option (again with increased register allocation complexity):

        ldrsh r3, [r0, #0]
        lsls r3, r3, #3
        vdup.16 q0, r3
        vst1.64 {D0, D1}, [r0]

(I think I'm got that syntax right, but if not, it should at least give the idea.)

Michael Hope (michaelh1) wrote :

I had a quick play. The uxth goes away with the extension elimination pass. In a micro benchmark, storing in decreasing order takes 3.0 s and increasing order 2.5 s. I didn't try Andrew's suggestions.

I am out of the office until 17/04/2011.

Note: This is an automated response to your message "[Bug 745743] Re:
Avoid emitting multiple stores to consecutive addresses" sent on 12/4/11
13:17:42.

This is the only notification you will receive while this person is away.

Revital Eres (eres) wrote :

load and store multiple instructions (LDM and STM) and load and store double-word (LDRD/STRD) must be aligned
to at least a word boundary. (according to Cortex™-A Series book) So the version with strd should not be used for unknown alignment.

Changed in gcc-linaro:
status: New → Triaged
importance: Undecided → Low
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.