[size] Failure to optimise (a/b) and (a%b) into single __aeabi_idivmod call

Bug #637797 reported by Yao Qi on 2010-09-14
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Linaro GCC
Triaged
Medium
Ramana Radhakrishnan
gcc
Fix Released
Wishlist

Bug Description

Track GCC PR http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43721

Once this bug is fixed, code size of CINT2000/252.eon/src/myrand.cc:myrand() can be reduced to some extent.

GCC FSF turnk (20100802) generates code like this,
0000000c <_Z7my_randv>:
   c: b570 push {r4, r5, r6, lr}
   e: 4c0d ldr r4, [pc, #52] ; (44 <_Z7my_randv+0x38>)
  10: 490d ldr r1, [pc, #52] ; (48 <_Z7my_randv+0x3c>)
  12: 6825 ldr r5, [r4, #0]
  14: 4e0d ldr r6, [pc, #52] ; (4c <_Z7my_randv+0x40>)
  16: 4628 mov r0, r5
  18: f7ff fffe bl 0 <__aeabi_idiv>
  1c: 4346 muls r6, r0
  1e: 490a ldr r1, [pc, #40] ; (48 <_Z7my_randv+0x3c>)
  20: 4628 mov r0, r5
  22: f7ff fffe bl 0 <__aeabi_idivmod>
  26: f244 13a7 movw r3, #16807 ; 0x41a7
  2a: fb03 6601 mla r6, r3, r1, r6
  2e: 2e00 cmp r6, #0
  30: bfdc itt le
  32: f106 4600 addle.w r6, r6, #2147483648 ; 0x80000000
  36: 3e01 suble r6, #1
  38: 6026 str r6, [r4, #0]
  3a: 6820 ldr r0, [r4, #0]
  3c: f3c0 200e ubfx r0, r0, #8, #15
  40: bd70 pop {r4, r5, r6, pc}
  42: bf00 nop

Once bug is fixed, code can be better,
00000006 <_Z7my_randv>:
   6: b510 push {r4, lr}
   8: 4c0a ldr r4, [pc, #40] ; (34 <_Z7my_randv+0x2e>)
   a: 490b ldr r1, [pc, #44] ; (38 <_Z7my_randv+0x32>)
   c: 6820 ldr r0, [r4, #0]
   e: f7ff fffe bl 0 <__aeabi_idivmod>
  12: f244 12a7 movw r2, #16807 ; 0x41a7
  16: 4351 muls r1, r2
  18: f46f 7231 mvn.w r2, #708 ; 0x2c4
  1c: 4350 muls r0, r2
  1e: eb01 0080 add.w r0, r1, r0, lsl #2
  22: 2800 cmp r0, #0
  24: dc02 bgt.n 2c <_Z7my_randv+0x26>
  26: f06f 4100 mvn.w r1, #2147483648 ; 0x80000000
  2a: 4408 add r0, r1
  2c: 6020 str r0, [r4, #0]
  2e: f3c0 200e ubfx r0, r0, #8, #15
  32: bd10 pop {r4, pc}

Still the case with 4.5.

> arm-none-linux-gnueabi-gcc -Os -S divmod.c
> cat divmod.s
 .cpu arm10tdmi
 .fpu softvfp
 .eabi_attribute 20, 1
 .eabi_attribute 21, 1
 .eabi_attribute 23, 3
 .eabi_attribute 24, 1
 .eabi_attribute 25, 1
 .eabi_attribute 26, 2
 .eabi_attribute 30, 4
 .eabi_attribute 18, 4
 .file "divmod.c"
 .global __aeabi_idivmod
 .global __aeabi_idiv
 .text
 .align 2
 .global divmod
 .type divmod, %function
divmod:
 @ args = 0, pretend = 0, frame = 0
 @ frame_needed = 0, uses_anonymous_args = 0
 stmfd sp!, {r4, r5, r6, lr}
 mov r6, r0
 mov r5, r1
 bl __aeabi_idivmod
 mov r0, r6
 mov r4, r1
 mov r1, r5
 bl __aeabi_idiv
 add r0, r4, r0
 ldmfd sp!, {r4, r5, r6, pc}
 .size divmod, .-divmod
 .ident "GCC: (GNU) 4.5.0 20100325 (experimental)"
 .section .note.GNU-stack,"",%progbits

Before confirming this bug, we need to determine if the following program has well defined behaviour:

int count_div0 = 0;

int __aeabi_div0()
{
  count_div0++;
  return 0;
}

int divmod(int a, int b)
{
    int q = a / b;
    int r = a % b;
    return q + r;
}

int main()
{
  divmod(33, 0);
  printf("%d", count_div0);
  return 0;
}

If the above program should always print "2", then the compiler is correct and this report is invalid. If it may print any value, then the compiler is wrong and should be fixed to optimize this case. Note that __aeabi_div0 will be called if either __aeabi_idiv or __aeabi_idivmod are called with 0 as the second parameter.

In , Rguenth (rguenth) wrote :

The policy has been AFAIR that we can collapse multiple traps into a single one,
but not remove traps completely (or of course not introduce new ones).

The C99 standard says this about division by zero:

  The result of the / operator is the quotient from the division
  of the first operand by the second; the result of the % operator
  is the remainder. In both operations, if the value of the second
  operand is zero, the behavior is undefined.

The ARM ABI states the following about the __aeabi_div0() function:

  The *div0 functions:
  - Return the value passed to them as a parameter.
  - Or, return a fixed value defined by the execution environment (such as 0).
  - Or, raise a signal (often SIGFPE) or throw an exception, and do not return.
  [...]
  The *div and *divmod functions may be inlined by a tool chain. It is
  Q-o-I whether an inlined version calls *div0 out of line or returns the
  values that would have been returned by a particular value-returning
  version of *div0.

I interpret these as saying the application may make no assumptions about
the behaviour after a division (or modulus) by zero.

I think that's enough evidence to confirm.

Yao Qi (yao-codesourcery) wrote :

Track GCC PR http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43721

Once this bug is fixed, code size of CINT2000/252.eon/src/myrand.cc:myrand() can be reduced to some extent.

GCC FSF turnk (20100802) generates code like this,
0000000c <_Z7my_randv>:
   c: b570 push {r4, r5, r6, lr}
   e: 4c0d ldr r4, [pc, #52] ; (44 <_Z7my_randv+0x38>)
  10: 490d ldr r1, [pc, #52] ; (48 <_Z7my_randv+0x3c>)
  12: 6825 ldr r5, [r4, #0]
  14: 4e0d ldr r6, [pc, #52] ; (4c <_Z7my_randv+0x40>)
  16: 4628 mov r0, r5
  18: f7ff fffe bl 0 <__aeabi_idiv>
  1c: 4346 muls r6, r0
  1e: 490a ldr r1, [pc, #40] ; (48 <_Z7my_randv+0x3c>)
  20: 4628 mov r0, r5
  22: f7ff fffe bl 0 <__aeabi_idivmod>
  26: f244 13a7 movw r3, #16807 ; 0x41a7
  2a: fb03 6601 mla r6, r3, r1, r6
  2e: 2e00 cmp r6, #0
  30: bfdc itt le
  32: f106 4600 addle.w r6, r6, #2147483648 ; 0x80000000
  36: 3e01 suble r6, #1
  38: 6026 str r6, [r4, #0]
  3a: 6820 ldr r0, [r4, #0]
  3c: f3c0 200e ubfx r0, r0, #8, #15
  40: bd70 pop {r4, r5, r6, pc}
  42: bf00 nop

Once bug is fixed, code can be better,
00000006 <_Z7my_randv>:
   6: b510 push {r4, lr}
   8: 4c0a ldr r4, [pc, #40] ; (34 <_Z7my_randv+0x2e>)
   a: 490b ldr r1, [pc, #44] ; (38 <_Z7my_randv+0x32>)
   c: 6820 ldr r0, [r4, #0]
   e: f7ff fffe bl 0 <__aeabi_idivmod>
  12: f244 12a7 movw r2, #16807 ; 0x41a7
  16: 4351 muls r1, r2
  18: f46f 7231 mvn.w r2, #708 ; 0x2c4
  1c: 4350 muls r0, r2
  1e: eb01 0080 add.w r0, r1, r0, lsl #2
  22: 2800 cmp r0, #0
  24: dc02 bgt.n 2c <_Z7my_randv+0x26>
  26: f06f 4100 mvn.w r1, #2147483648 ; 0x80000000
  2a: 4408 add r0, r1
  2c: 6020 str r0, [r4, #0]
  2e: f3c0 200e ubfx r0, r0, #8, #15
  32: bd10 pop {r4, pc}

Yao Qi (yao-codesourcery) wrote :

This optimization should be also useful to 256.bzip2/spec.c:ran().

Michael Hope (michaelh1) on 2010-09-28
tags: added: size task

Here's the problem: expand_divmod always prefers the straight "div" library call over the "divmod" library call, no exceptions.

Yes, "divmodsi4" in a .md is indeed preferred over "divsi4", so the optimization works fine on targets that have those, but ARM doesn't, so those rules are irrelevant.

ARM does not provide a separate library function for "mod", so expand_divmod does use "divmod" for mod operations, but never for div operations.

The reason for the apparent opposite rules appears to be that the divmodsi4 rule can be coded to auto-detect the most optimal kind of underlying operation to use, whereas the library call is fixed, once chosen.

I see two possible ways to fix this:
 1. Teach CSE (or somewhere) that div and divmod library calls have some overlap.
 2. Always prefer divmod, and find some other way to convert it to div, where appropriate.

I don't see any way to generate the RTL with the right call, so I'm pretty sure it does have to be fixed up after the fact.

Or, I could be way off. :)

Here's the problem: expand_divmod always prefers the straight "div" library
call over the "divmod" library call, no exceptions.

Yes, "divmodsi4" in a .md is indeed preferred over "divsi4", so the
optimization works fine on targets that have those, but ARM doesn't, so those
rules are irrelevant.

ARM does not provide a separate library function for "mod", so expand_divmod
does use "divmod" for mod operations, but never for div operations.

The reason for the apparent opposite rules appears to be that the divmodsi4
rule can be coded to auto-detect the most optimal kind of underlying operation
to use, whereas the library call is fixed, once chosen.

I see two possible ways to fix this:
 1. Teach CSE (or somewhere) that div and divmod library calls have some
overlap.
 2. Always prefer divmod, and find some other way to convert it to div, where
appropriate.

I don't see any way to generate the RTL with the right call, so I'm pretty sure
it does have to be fixed up after the fact.

Or, I could be way off. :)

On 10/21/2010 12:54 AM, Andrew Stubbs wrote:
> Here's the problem: expand_divmod always prefers the straight "div" library
> call over the "divmod" library call, no exceptions.
>

Can you explain a little bit further? I go through
expmed.c:expand_divmod, but didn't get such conclusion that
"expand_divmod always prefers the straight div over divmod".

> Yes, "divmodsi4" in a .md is indeed preferred over "divsi4", so the
> optimization works fine on targets that have those, but ARM doesn't, so those
> rules are irrelevant.
>

What is "divsi4"? Is it a RTL pattern or something else? I can't find
it in gcc source by grep.

> ARM does not provide a separate library function for "mod", so expand_divmod
> does use "divmod" for mod operations, but never for div operations.
>
> The reason for the apparent opposite rules appears to be that the divmodsi4
> rule can be coded to auto-detect the most optimal kind of underlying operation
> to use, whereas the library call is fixed, once chosen.
>
> I see two possible ways to fix this:
> 1. Teach CSE (or somewhere) that div and divmod library calls have some
> overlap.
> 2. Always prefer divmod, and find some other way to convert it to div, where
> appropriate.

Can we define a peephole2 here? or we can do something on tree level?

--
Yao Qi
CodeSourcery
<email address hidden>
(650) 331-3385 x739

> Can we define a peephole2 here? or we can do something on tree level?

I believe a peephole optimization would work, if we set expand_divmod to always prefer the full divmod library call. We'd also need to either arrange for the divmod/div decision to be target specific, or else fix up any other affected backends.

However, a peephole wouldn't get to do anything until after register allocation, I think. Would that not make it suboptimal?

I'm not sure a tree level change would be appropriate. Apart from anything else, it might get rejected upstream simply because the existing system works (for most people). Presumably you'd want it to convert div+mod to a builtin function, or something, that would then be translated to the right backend function later?

Ramana Radhakrishnan (ramana) wrote :

This is still a size and speed optimization at Os but not for this benchmark at O2 and O3. It's also a speed optimization in case you are generating quotients and remainders of 2 different numbers in a routine. It's not a speed optimization for this particular benchmark because you are still generating the multiplies at O2 and O3 but is alright for the same. For this case we won't get a further benefit with code generation and improvements in this particular benchmark for speed.

I've prototyped this on trunk with the branch

https://code.launchpad.net/~ramana/gcc-linaro/divmodsi4-experiments

Ramana

Changed in gcc-linaro:
assignee: nobody → Ramana Radhakrishnan (ramana)
milestone: none → 11.05-final
milestone: 11.05-final → none
Changed in gcc:
importance: Unknown → Wishlist
status: Unknown → Confirmed
Changed in gcc-linaro:
milestone: none → 4.5-2011.02-0
Michael Hope (michaelh1) on 2011-02-08
Changed in gcc-linaro:
milestone: 4.5-2011.02-0 → 4.5-2011.03-0
status: New → Confirmed

*** Bug 47777 has been marked as a duplicate of this bug. ***

Michael Hope (michaelh1) on 2011-03-08
Changed in gcc-linaro:
milestone: 4.5-2011.03-0 → none
tags: added: speed
Changed in gcc-linaro:
importance: Undecided → Medium
Changed in gcc-linaro:
status: Confirmed → Triaged

*** Bug 65668 has been marked as a duplicate of this bug. ***

Author: prathamesh3492
Date: Fri Oct 28 19:05:12 2016
New Revision: 241660

URL: https://gcc.gnu.org/viewcvs?rev=241660&root=gcc&view=rev
Log:
2016-10-28 Prathamesh Kulkarni <email address hidden>
     Kugan Vivekanandarajah <email address hidden>
     Jim Wilson <email address hidden>

 PR tree-optimization/43721
 * target.def: New hook expand_divmod_libfunc.
 * doc/tm.texi.in: Add hook for TARGET_EXPAND_DIVMOD_LIBFUNC
 * doc/tm.texi: Regenerate.
 * internal-fn.def: Add new entry for DIVMOD ifn.
 * internal-fn.c (expand_DIVMOD): New.
 * tree-ssa-math-opts.c: Include optabs-libfuncs.h, tree-eh.h,
 targhooks.h.
 (widen_mul_stats): Add new field divmod_calls_inserted.
 (target_supports_divmod_p): New.
 (divmod_candidate_p): Likewise.
 (convert_to_divmod): Likewise.
 (pass_optimize_widening_mul::execute): Call
 calculate_dominance_info(), renumber_gimple_stmt_uids() at
 beginning of function. Call convert_to_divmod()
 and record stats for divmod.
 * config/arm/arm.c (arm_expand_divmod_libfunc): Override hook
 TARGET_EXPAND_DIVMOD_LIBFUNC.
 * doc/sourcebuild.texi: Add items for arm_divmod_simode, divmod,
 divmod_simode.

testsuite/
 * lib/target-supports.exp (check_effective_target_divmod): New.
 (check_effective_target_divmod_simode): Likewise.
 (check_effective_target_arm_divmod_simode): Likewise.
 * gcc.dg/divmod-1-simode.c: New test.
 * gcc.dg/divmod-1.c: Likewise.
 * gcc.dg/divmod-2-simode.c: Likewise.
 * gcc.dg/divmod-2.c: Likewise.
 * gcc.dg/divmod-3-simode.c: Likewise.
 * gcc.dg/divmod-3.c: Likewise.
 * gcc.dg/divmod-4-simode.c: Likewise.
 * gcc.dg/divmod-4.c: Likewise.
 * gcc.dg/divmod-5.c: Likewise.
 * gcc.dg/divmod-6-simode.c: Likewise.
 * gcc.dg/divmod-6.c: Likewise.
 * gcc.dg/divmod-7.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/divmod-1-simode.c
    trunk/gcc/testsuite/gcc.dg/divmod-1.c
    trunk/gcc/testsuite/gcc.dg/divmod-2-simode.c
    trunk/gcc/testsuite/gcc.dg/divmod-2.c
    trunk/gcc/testsuite/gcc.dg/divmod-3-simode.c
    trunk/gcc/testsuite/gcc.dg/divmod-3.c
    trunk/gcc/testsuite/gcc.dg/divmod-4-simode.c
    trunk/gcc/testsuite/gcc.dg/divmod-4.c
    trunk/gcc/testsuite/gcc.dg/divmod-5.c
    trunk/gcc/testsuite/gcc.dg/divmod-6-simode.c
    trunk/gcc/testsuite/gcc.dg/divmod-6.c
    trunk/gcc/testsuite/gcc.dg/divmod-7.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/doc/sourcebuild.texi
    trunk/gcc/doc/tm.texi
    trunk/gcc/doc/tm.texi.in
    trunk/gcc/internal-fn.c
    trunk/gcc/internal-fn.def
    trunk/gcc/target.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/lib/target-supports.exp
    trunk/gcc/tree-ssa-math-opts.c

Fixed on trunk.

Can the bug be marked as resolved?

Fixed in GCC7 then.

Changed in gcc:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.