Bug #656373 “Try -fsched-pressure for ARM” : Bugs : Linaro GCC

Michael Hope (michaelh1) on 2010-10-15

tags:	added: speed task

Ramana Radhakrishnan (ramana) wrote on 2011-03-04:

#1

Is there any other information on which platform and which benchmarks showed a performance regression?

Ramana

Andrew Stubbs (ams-codesourcery) wrote on 2011-03-04:

#2

The benchmarking was done using EEMBC autcor00data_2, so the figures can't be posted here. The following is Daniel Jacobowitz's summary, except that I've replaced the numbers with relative descriptions. The compiler being tested was a relatively unmodified GCC 4.5.0 from last July (2010).

-fvariable-expansion-in-unroller: <speed as baseline>

  No code change - expected - loop unrolling is not on by default
  on this branch yet.

-funroll-loops: <large speed improvement>

-funroll-loops -fvariable-expansion-in-unroller: <considerably worse than -funroll-loops on its own, but better than baseline>

-fsched-pressure: <speed as baseline>

  No pressure?  No code change.

-funroll-loops -fsched-pressure: <worse than "-funroll-loops -fvariable-expansion-in-unroller">

  Frame size drops from 56 bytes to 16 bytes.  But performance goes down.
  Why?

  The code does not look all that different; there are fewer spills
  and different register allocation choices.  I did not find anywhere
  in the function that this introduced a memory access.  But, of
  course, the schedules are completely different.

  Without -fsched-pressure, no smulbb in the unrolled loop depends on
  an ldrh immediately before it.  With -fsched-pressure, two of the
  eight multiplies depend on a load in the previous instruction.  I'm
  reasonably sure that will stall, so that's probably why it is
  slower.  Both of those ldrh instructions have a destination register
  that was used in the instruction before the ldrh, so they could not
  be scheduled any earlier.  For example:

        ldrh    r8, [r9, #2]
        ldrh    r9, [r6], #2
        smulbb  sl, r9, sl
        add     sl, ip, sl, asr r4
        ldrh    ip, [r6, #2]
        smulbb  r6, ip, r8

  Conversely, the spills without -fsched-pressure did not hurt much,
  because they were around the large unrolled block.  The time of the
  block dominated the cost of the spills.
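To make the two observations about the excerpt concrete, here is a minimal C sketch that walks a hand-transcribed copy of those six instructions. The `Insn` record and the helper names are hypothetical and invented for illustration; they are not part of GCC's scheduler. The model only tracks named registers: it ignores base-register writeback on the post-indexed load and says nothing about actual pipeline latencies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified instruction record: one destination register
   and up to three source registers, all named by strings. */
typedef struct {
    const char *op;      /* mnemonic, e.g. "ldrh" or "smulbb" */
    const char *dst;     /* register written by the instruction */
    const char *src[3];  /* registers read; unused slots are NULL */
} Insn;

/* Return 1 if the instruction is a load (mnemonic starts with "ldr"). */
static int is_load(const Insn *insn)
{
    return strncmp(insn->op, "ldr", 3) == 0;
}

/* Return 1 if the instruction reads register reg. */
static int reads_reg(const Insn *insn, const char *reg)
{
    for (size_t k = 0; k < 3; k++)
        if (insn->src[k] != NULL && strcmp(insn->src[k], reg) == 0)
            return 1;
    return 0;
}

/* Count instructions that read the result of a load placed immediately
   before them: the pattern suspected of stalling. */
static size_t count_load_use_pairs(const Insn *seq, size_t n)
{
    assert(seq != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 1; i < n; i++)
        if (is_load(&seq[i - 1]) && reads_reg(&seq[i], seq[i - 1].dst))
            count++;
    return count;
}

/* Return 1 if the load at index i overwrites a register that the previous
   instruction still reads (write-after-read), so the load cannot move
   above that instruction without renaming the register. */
static int load_pinned_by_previous_read(const Insn *seq, size_t i)
{
    return i > 0 && is_load(&seq[i]) && reads_reg(&seq[i - 1], seq[i].dst);
}

/* The six-instruction excerpt, transcribed by hand. */
static const Insn excerpt[] = {
    { "ldrh",   "r8", { "r9", NULL, NULL } },
    { "ldrh",   "r9", { "r6", NULL, NULL } },  /* writeback to r6 not modeled */
    { "smulbb", "sl", { "r9", "sl", NULL } },
    { "add",    "sl", { "ip", "sl", "r4" } },
    { "ldrh",   "ip", { "r6", NULL, NULL } },
    { "smulbb", "r6", { "ip", "r8", NULL } },
};
```

Calling `count_load_use_pairs(excerpt, 6)` flags both smulbb instructions, and `load_pinned_by_previous_read` is true for the loads at indices 1 and 4, which matches the description: each of those loads writes a register that the instruction just before it still reads.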

-funroll-loops -fvariable-expansion-in-unroller -fsched-pressure: <better, but still worse than -funroll-loops alone>

  Here the win is straightforward.  There were spills in the middle of
  the large block without -fsched-pressure.  With it, there are none.
  There are still smulbb's which depend on the previous instruction
  (up to four, from two without variable expansion).

  There were four spills around the block with -funroll-loops; one
  with -funroll-loops -fsched-pressure; two with -funroll-loops
  -fvariable-expansion-in-unroller -fsched-pressure.  The four spilled
  variables are i (inner loop index), lag (outer loop index),
  LastIndex (data size - lag, bound of the inner loop), and InputData
  (array base, function invariant).  Derived induction variables (IVs)
  are used instead of these in the inner loop, even though the original
  variables appear at the source level in every iteration.  Good spill
  choice.
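The following C sketch illustrates why those four variables can be spilled cheaply. It is not the EEMBC source, which cannot be reproduced here; it is a hypothetical autocorrelation kernel that reuses the variable names from the summary, and the multiply, shift, and accumulate shape is an assumption inferred from the smulbb and `add ..., asr` instructions in the excerpt. The second function is written by hand to mimic what derived induction variables look like after optimization.

```c
#include <assert.h>
#include <stdint.h>

/* Source-level form: i, lag, LastIndex and InputData all appear in the
   loop body.  Right-shifting a negative product is implementation-defined
   in C; GCC performs an arithmetic shift. */
static int32_t autocorr_indexed(const int16_t *InputData, int DataSize,
                                int lag, int Scale)
{
    assert(lag >= 0 && lag <= DataSize);
    int LastIndex = DataSize - lag;
    int32_t acc = 0;
    for (int i = 0; i < LastIndex; i++)
        acc += ((int32_t)InputData[i] * (int32_t)InputData[i + lag]) >> Scale;
    return acc;
}

/* Hand-written equivalent using derived induction variables: the loop body
   touches only two walking pointers and an end pointer, so i, lag,
   LastIndex and InputData need not stay in registers inside the loop. */
static int32_t autocorr_derived(const int16_t *InputData, int DataSize,
                                int lag, int Scale)
{
    assert(lag >= 0 && lag <= DataSize);
    const int16_t *p = InputData;                      /* replaces InputData[i]       */
    const int16_t *q = InputData + lag;                /* replaces InputData[i + lag] */
    const int16_t *end = InputData + (DataSize - lag); /* replaces i < LastIndex      */
    int32_t acc = 0;
    while (p < end) {
        int32_t a = *p++;
        int32_t b = *q++;
        acc += (a * b) >> Scale;
    }
    return acc;
}
```

Both functions return the same value for any valid lag. In the second form the original four variables are dead inside the loop, which is the property that makes them good spill candidates around the unrolled block.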

Conclusions:

  * -fvariable-expansion-in-unroller increases register pressure.  We
    should not turn it on without -fsched-pressure.  Once that's
    resolved, it could be a big help.

  * -fsched-pressure should be more aggressive when there is a
    spillable register that is live across an expensive block but not
    used inside it.  That's easy to say, but I have no idea how to
    do it.  The trick would be detecting the situation; then
    we could discount (partially or entirely) such registers.

  * Currently, the combination does not (on this test) catch up with
    plain -funroll-loops.  But we are not far off.

Ulrich Weigand (uweigand) on 2012-02-09

Changed in gcc-linaro:
assignee:	nobody → Ulrich Weigand (uweigand)
status:	New → In Progress
importance:	Undecided → Medium

Ulrich Weigand (uweigand) wrote on 2012-07-24:

#3

The new and improved -fsched-pressure algorithm is now active by default in mainline and Linaro GCC 4.7, so this bug can be closed.

The related issue of re-investigating the performance of -fvariable-expansion-in-unroller is now tracked in a new blueprint:
https://blueprints.launchpad.net/gcc-linaro/+spec/investigate-loop-unrolling

Changed in gcc-linaro:
status:	In Progress → Fix Released
