Poor code for VFP array stores

Bug #640518 reported by Andrew Stubbs on 2010-09-16
Affects: Linaro GCC
Status: Fix Released
Importance: Medium
Assigned to: Ramana Radhakrishnan

Bug Description

GCC generates extremely poor code for stores to arrays when targeting Cortex-A8. For example:

void foo(float *p, float * q)
{
  int i;

  for (i = 0; i < 100; i++)
    {
      p[i] = q[i] + 1.0f;
    }
}

Compiled with -mcpu=cortex-a8 -mfloat-abi=softfp gives:
        fmrs ip, s1
        str ip, [r0, r3] @ float
Going via core registers incurs a 20-cycle stall on the A8, so it is about the worst possible choice.
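
When experimenting with code generation for this loop, it helps to have a quick check that the loop still computes the right values. The following is a minimal sketch of such a check. Only `foo` comes from the report; `count_mismatches` and its input values are hypothetical and exist purely for illustration.

```c
#include <assert.h>   /* for the check in the usage note */

/* The loop from the report, unchanged. */
void foo(float *p, float *q)
{
  int i;

  for (i = 0; i < 100; i++)
    {
      p[i] = q[i] + 1.0f;
    }
}

/* Hypothetical helper: fill q with small integers, run foo, and count the
   elements of p that differ from q[i] + 1.0f.  Small integers are exactly
   representable as float, so the comparison is exact. */
int count_mismatches(void)
{
  float p[100], q[100];
  int i, bad = 0;

  for (i = 0; i < 100; i++)
    q[i] = (float) i;

  foo(p, q);

  for (i = 0; i < 100; i++)
    if (p[i] != q[i] + 1.0f)
      bad++;

  return bad;
}
```

A caller would simply `assert(count_mismatches() == 0);`. To look at the generated stores themselves, compile with `-S` and the flags quoted above using an ARM-targeting GCC.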

[CodeSourcery Tracker ID #9072]

Michael Hope (michaelh1) on 2010-09-28
tags: added: speed task
Ramana Radhakrishnan (ramana) wrote :

The problem here is that:

auto-inc-dec prefers the PRE_INC form for the memory reference in question
because ivopts ends up preferring the PRE_INC idiom for such loops rather
than the POST_INC idiom. This in turn happens because we allow the PRE_INC
form in legitimate_address_p for VFP stores.
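
For readers unfamiliar with the two idioms, here is a rough C model of the difference. It is only a sketch: the function names are made up, and an array index stands in for the address register so that the "one element before the array" starting point of the PRE_INC form stays well defined in C.

```c
#include <assert.h>   /* for the check in the usage note */
#include <stddef.h>

#define N 100

/* POST_INC idiom: use the address, then advance it. */
void store_post_inc(float *dst, const float *src)
{
  ptrdiff_t addr = 0;              /* starts at the first element */
  int i;

  for (i = 0; i < N; i++)
    {
      dst[addr] = src[addr] + 1.0f;
      addr++;                      /* increment after the access */
    }
}

/* PRE_INC idiom: advance the address, then use it. */
void store_pre_inc(float *dst, const float *src)
{
  ptrdiff_t addr = -1;             /* starts one element before the array */
  int i;

  for (i = 0; i < N; i++)
    {
      ++addr;                      /* increment before the access */
      dst[addr] = src[addr] + 1.0f;
    }
}

/* Hypothetical check: both idioms must write identical results. */
int idioms_agree(void)
{
  float src[N], a[N], b[N];
  int i;

  for (i = 0; i < N; i++)
    src[i] = (float) i;

  store_post_inc(a, src);
  store_pre_inc(b, src);

  for (i = 0; i < N; i++)
    if (a[i] != b[i])
      return 0;
  return 1;
}
```

Both functions compute the same result, e.g. `assert(idioms_agree());`. They differ only in when the address is updated relative to the access, and that ordering is the choice ivopts and auto-inc-dec are making here.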

This trivial prototype patch, completely untested and unbenchmarked, ends up
generating the code shown after it, which is probably the best we can do under
these circumstances.

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c (revision 171075)
+++ gcc/config/arm/arm.c (working copy)
@@ -5503,6 +5503,11 @@
        && (mode == DImode
     || (mode == DFmode && (TARGET_SOFT_FLOAT || TARGET_VFP))));

+ if (TARGET_HARD_FLOAT
+ && ((code == PRE_INC) || (code == PRE_DEC))
+ && ((mode == SFmode) || (mode == DFmode)))
+ return 0;
+
   if (code == POST_INC || code == PRE_DEC
       || ((code == PRE_INC || code == POST_DEC)
    && (use_ldrd || GET_MODE_SIZE (mode) <= 4)))

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov r3, #0
        fconsts s14, #112
.L2:
        fldmias r1!, {s15}
        add r3, r3, #1
        cmp r3, #100
        fadds s15, s15, s14
        fstmias r0!, {s15}
        bne .L2
        bx lr

cheers
Ramana
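
As a reading aid for the prototype patch in the comment above, the check it adds can be modelled in plain C roughly as follows. This is a toy sketch, not GCC code: the `toy_` enums and functions are invented stand-ins, the body of the existing `if` is not shown in the diff and so is treated here as simply "accept", and every other check in the real function is ignored.

```c
#include <assert.h>   /* for the checks in the usage note */
#include <stdbool.h>

/* Illustrative stand-ins for the RTL codes and machine modes named in the
   patch.  These enums are NOT the GCC definitions. */
enum toy_code { TOY_PRE_INC, TOY_PRE_DEC, TOY_POST_INC, TOY_POST_DEC };
enum toy_mode { TOY_SImode, TOY_SFmode, TOY_DImode, TOY_DFmode };

/* Size in bytes of each toy mode (models GET_MODE_SIZE). */
static int toy_mode_size(enum toy_mode mode)
{
  return (mode == TOY_DImode || mode == TOY_DFmode) ? 8 : 4;
}

/* Toy version of the auto-increment check with the prototype patch applied.
   hard_float models TARGET_HARD_FLOAT; use_ldrd models the use_ldrd value
   computed in the diff context. */
bool toy_autoinc_ok(enum toy_code code, enum toy_mode mode,
                    bool hard_float, bool use_ldrd)
{
  /* Added by the patch: reject pre-modify addresses for float/double. */
  if (hard_float
      && (code == TOY_PRE_INC || code == TOY_PRE_DEC)
      && (mode == TOY_SFmode || mode == TOY_DFmode))
    return false;

  /* Existing condition from the diff context.  Assumption: the sketch
     returns the condition itself, since the diff does not show what the
     real code does once it holds. */
  return code == TOY_POST_INC || code == TOY_PRE_DEC
         || ((code == TOY_PRE_INC || code == TOY_POST_DEC)
             && (use_ldrd || toy_mode_size(mode) <= 4));
}
```

Under this model, `assert(!toy_autoinc_ok(TOY_PRE_INC, TOY_SFmode, true, false));` and `assert(toy_autoinc_ok(TOY_POST_INC, TOY_SFmode, true, false));` both hold: a float store can no longer use the PRE_INC form, while the POST_INC form stays available.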

Changed in gcc-linaro:
assignee: nobody → Ramana Radhakrishnan (ramana)
Ramana Radhakrishnan (ramana) wrote :

More cases that can be improved / looked at.

void food (double *f, double *g)
{
  int i;
  for (i = 0; i < 100; i++)
    {
      f[i] = g[i] + 1.0;
    }
}

FSF trunk and Linaro 4.5 end up generating:

food:
 @ args = 0, pretend = 0, frame = 0
 @ frame_needed = 0, uses_anonymous_args = 0
 @ link register save eliminated.
 sub r1, r1, #8
 sub r0, r0, #8
 mov r3, #0
 fconstd d6, #112
 stmfd sp!, {r4, r5}
.L6:
 add r1, r1, #8
 add r3, r3, #1
 fldd d7, [r1, #0]
 cmp r3, #100
 faddd d7, d7, d6
 fmrrd r4, r5, d7
 strd r4, [r0, #8]!
 bne .L6
 ldmfd sp!, {r4, r5}
 bx lr
 .size food, .-food

This may well be the same code generation problem as above.

float foo (float *x, int i)
{
  *(x + i) = *(x + i) + 1.0f;
  return *(x + i);
}

We end up generating

 add r3, r0, r1, asl #2
 fconsts s15, #112
 flds s14, [r3, #0]
 fadds s15, s14, s15
 fmrs r3, s15
 str r3, [r0, r1, asl #2] @ float
 fmrs r0, s15
 bx lr

FSF trunk looks better with

 add r1, r0, r1, asl #2
 fconsts s15, #112
 flds s14, [r1, #0]
 fadds s15, s14, s15
 fsts s15, [r1, #0]
 fmrs r0, s15
 bx lr

Ramana Radhakrishnan (ramana) wrote :

Now fixed upstream.

Changed in gcc-linaro:
status: New → In Progress
Changed in gcc-linaro:
status: In Progress → Fix Committed
milestone: none → 4.7-2012.07
Changed in gcc-linaro:
status: Fix Committed → Fix Released