[thumb2,size] Redundant memory load

Bug #634731 reported by Yao Qi
This bug affects 1 person
Affects      Status         Importance   Assigned to   Milestone
Linaro GCC   Won't Fix      Undecided    Unassigned
gcc          Fix Released   Medium

Bug Description

This bug is also open in GCC Bugzilla: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730
It still exists in FSF GCC trunk when the test case is compiled with "-mthumb -Os -march=armv7-a".

Tags: size task

Carrot (carrot) wrote:

Created attachment 18183
test case

Rguenth (rguenth) wrote:

-fgcse-las should do the trick. Note that PRE would do this kind of
optimization on the tree-level, but it is disabled with -Os (so is gcse).

<bb 2>:
  D.1614_2 = p2_1(D)->front;
  p1_3(D)->head = D.1614_2;
  goto <bb 4>;

<bb 3>:
  D.1616_8 = D.1615_4->next;
  p1_3(D)->head = D.1616_8;

<bb 4>:
  D.1615_4 = p1_3(D)->head;
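
The attached test case is not reproduced in this report. A minimal sketch of a C function that would be consistent with the dump above might look like the following. Only the member names `head`, `front` and `next` come from the dump; the type names, the `marked` and `found` members, and the padding are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical types. Only head, front and next appear in the dump;
   everything else is invented so that the example compiles. */
struct node {
    int value;
    int marked;          /* assumed flag that ends the walk */
    struct node *next;
};

struct queue {
    struct node *front;
};

struct list {
    struct node *head;
    int pad[2];          /* assumed filler between head and found */
    struct node *found;  /* assumed destination of the final store */
};

/* Walk the list through p1->head. Every iteration stores to p1->head
   and then reads it back, which is the load discussed in this bug. */
void iterate(struct list *p1, const struct queue *p2)
{
    for (p1->head = p2->front; p1->head != NULL; p1->head = p1->head->next) {
        if (p1->head->marked)
            break;
    }
    p1->found = p1->head;
}
```

Both stores to `p1->head` (the loop initializer and the loop increment) are followed by the same load in the loop condition, which matches the shape of `<bb 2>`, `<bb 3>` and `<bb 4>`.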

stevenb (steven-gcc) wrote:

And no, it is *not* OK to remove this kind of redundant code in DCE. The load may be redundant, but it is not dead.

It is not clear to me why cleanup_cfg would move that insn. Perhaps you can show what is going on with the RTL dumps (use -fdump-rtl-all-slim for the most readable results).

Carrot (carrot) wrote:

At the tree level, the two stores are different statements. Only after register allocation do the two stores use the same register, which makes the load redundant.

try_crossjump_bb looks for the same instruction sequence in all predecessors of a basic block bb and moves that sequence to the head of bb. That function triggers here, so the store is moved to just before the load.

I tried -fgcse-las, but it did not remove the load.

(In reply to comment #2)
> -fgcse-las should do the trick. Note that PRE would do this kind of
> optimization on the tree-level, but it is disabled with -Os (so is gcse).
>
> <bb 2>:
> D.1614_2 = p2_1(D)->front;
> p1_3(D)->head = D.1614_2;
> goto <bb 4>;
>
> <bb 3>:
> D.1616_8 = D.1615_4->next;
> p1_3(D)->head = D.1616_8;
>
> <bb 4>:
> D.1615_4 = p1_3(D)->head;
>
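
To make the redundancy concrete, here is a hand-written sketch of the code after the two stores have been merged and the stored value is reused instead of being reloaded. It is not compiler output. The types, the `marked` and `found` members, and the function name `iterate_forwarded` are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical types; only head, front and next come from the dump. */
struct node {
    int value;
    int marked;
    struct node *next;
};

struct queue {
    struct node *front;
};

struct list {
    struct node *head;
    int pad[2];
    struct node *found;
};

/* One store serves both paths into the loop, and the value just stored
   stays in a local, so p1->head is never read back from memory. */
void iterate_forwarded(struct list *p1, const struct queue *p2)
{
    struct node *cur = p2->front;

    for (;;) {
        p1->head = cur;              /* the merged store */
        if (cur == NULL || cur->marked)
            break;
        cur = cur->next;
    }
    p1->found = cur;
}
```

The rewrite assumes that the store to `p1->head` cannot change `cur->marked` or `cur->next`, which holds as long as the list header and the nodes do not overlap in memory.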

stevenb (steven-gcc) wrote:

As you said, try_crossjump_bb tries to find the same instruction sequence in *all* predecessors of a basic block bb. Meaning that the load must have been redundant even before cross jumping occurred.

If you are right, and this redundancy is not exposed until after register allocation, then this may be another case that postreload-gcse should handle (but probably does not).

There are a couple of bugs about the need for a more powerful postreload-gcse. We should perhaps group them (in a meta-bug for example) and make a plan for a fix...

stevenb (steven-gcc) wrote:

Carrot, can you please try this test case with my patch "crossjump_abstract.diff" from Bug 20070 applied?

Carrot (carrot) wrote:

(In reply to comment #6)
> Carrot, can you please try this test case with my patch
> "crossjump_abstract.diff" from Bug 20070 applied?
>

I tried your patch, and it did remove the redundant memory load. The output follows:

        push {lr}
        ldr r3, [r1]
.L6:
        str r3, [r0]
        mov r2, r3 // M
        cmp r3, #0
        bne .L5
        b .L3
.L4:
        ldr r3, [r3, #8]
        b .L6
.L5:
        ldr r1, [r3, #4]
        cmp r1, #0
        beq .L4
.L3:
        str r2, [r0, #12]
        @ sp needed for prologue
        pop {pc}

The ifcvt pass noticed that the two stores differ only in their pseudo register numbers and that the two pseudo registers do not conflict, so it renamed one to match the other and cross-jumped the basic blocks earlier. The cse2 pass (dump iterate.c.161r.cse2) then detected the redundant load and removed it.

But it introduced another redundant instruction, the move marked M. Where r2 is used, r3 still holds the same value, so r3 could be used there instead. I think this is a separate problem.

stevenb (steven-gcc) wrote:

That redundant move has to be a separate issue, indeed. I would expect the register allocator to coalesce those registers.

I hadn't expected this. I thought the result would be just the removal of the redundant load, but the code that comes out is bigger (14 instructions instead of 13) and has a completely different structure.

I'll see if I can understand what is going on. Thus, mine.

stevenb (steven-gcc) wrote:

With "GCC: (GNU) 4.5.0 20100108 (experimental) [trunk revision 155731]" and my patch for bug 20070 applied, I get the following code:

iterate:
        push {lr}
        ldr r3, [r1]
.L6:
        str r3, [r0]
        sub r2, r3, #0
        bne .L5
        b .L3
.L4:
        ldr r3, [r3, #8]
        b .L6
.L5:
        ldr r1, [r3, #4]
        cmp r1, #0
        beq .L4
.L3:
        str r2, [r0, #12]
        @ sp needed for prologue
        pop {pc}

Carrot, could you please double-check that this is still correct code?

Carrot (carrot) wrote:

(In reply to comment #9)
> With "GCC: (GNU) 4.5.0 20100108 (experimental) [trunk revision 155731]" and my
> patch for bug 20070 applied, I get the following code:
>
> iterate:
> push {lr}
> ldr r3, [r1]
> .L6:
> str r3, [r0]
> sub r2, r3, #0
> bne .L5
> b .L3
> .L4:
> ldr r3, [r3, #8]
> b .L6
> .L5:
> ldr r1, [r3, #4]
> cmp r1, #0
> beq .L4
> .L3:
> str r2, [r0, #12]
> @ sp needed for prologue
> pop {pc}
>
> Carrot, could you please double-check that this is still correct code?
>

Yes, it is correct.
There are still 13 instructions; I think that is related to the unoptimized basic block order.

Stevenb-gcc (stevenb-gcc) wrote:

Subject: Re: redundant memory load

On Mon, Jan 11, 2010 at 7:47 AM, carrot at google dot com
<email address hidden> wrote:
>> iterate:
>> push {lr}
>> ldr r3, [r1]
>> .L6:
>> str r3, [r0]
>> sub r2, r3, #0
>> bne .L5
>> b .L3
>> .L4:
>> ldr r3, [r3, #8]
>> b .L6
>> .L5:
>> ldr r1, [r3, #4]
>> cmp r1, #0
>> beq .L4
>> .L3:
>> str r2, [r0, #12]
>> @ sp needed for prologue
>> pop {pc}
>>
>> Carrot, could you please double-check that this is still correct code?
>>
>
> Yes, it is correct.
> There are still 13 instructions, I think it is related to unoptimized basic
> block order.

Yes, I would have expected the block starting with .L4 to be *after*
the block starting with .L5, something like so:

iterate:
        push {lr}
        ldr r3, [r1]
.L6:
        str r3, [r0]
        sub r2, r3, #0
        beq .L3
.L5:
        ldr r1, [r3, #4]
        cmp r1, #0
        bne .L3
        ldr r3, [r3, #8]
        b .L6
.L3:
        str r2, [r0, #12]
        @ sp needed for prologue
        pop {pc}

Does that look correct? And if so, could you see if there is an open
bug report about this; or otherwise file a new PR and add me to the
CC-list?
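
One way to check a layout like this by hand is to transliterate it into C with one statement per instruction. The sketch below does that for the listing above. The types, the `marked` and `found` members, and the function name `iterate_layout` are assumptions made for illustration, and the register-named locals only mirror the listing.

```c
#include <stddef.h>

/* Hypothetical types; only head, front and next come from the dump. */
struct node {
    int value;
    int marked;          /* assumed to be the word at [r3, #4] */
    struct node *next;   /* assumed to be the word at [r3, #8] */
};

struct queue {
    struct node *front;
};

struct list {
    struct node *head;
    int pad[2];
    struct node *found;  /* assumed to be the word at [r0, #12] */
};

void iterate_layout(struct list *p1, const struct queue *p2)
{
    struct node *r3 = p2->front;     /* ldr r3, [r1] */
    struct node *r2;

L6:
    p1->head = r3;                   /* str r3, [r0] */
    r2 = r3;                         /* sub r2, r3, #0 */
    if (r2 == NULL)
        goto L3;                     /* beq .L3 */
    if (r3->marked != 0)             /* ldr r1, [r3, #4]; cmp r1, #0 */
        goto L3;                     /* bne .L3 */
    r3 = r3->next;                   /* ldr r3, [r3, #8] */
    goto L6;                         /* b .L6 */
L3:
    p1->found = r2;                  /* str r2, [r0, #12] */
}
```

With this ordering the loop body falls through from the null test into the flag test, so only one unconditional branch remains on the loop path.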

Carrot (carrot) wrote:

(In reply to comment #11)
> Yes, I would have expected the block starting with .L4 to be *after*
> the block starting with .L5, something like so:
>
> iterate:
> push {lr}
> ldr r3, [r1]
> .L6:
> str r3, [r0]
> sub r2, r3, #0
> beq .L3
> .L5:
> ldr r1, [r3, #4]
> cmp r1, #0
> bne .L3
> ldr r3, [r3, #8]
> b .L6
> .L3:
> str r2, [r0, #12]
> @ sp needed for prologue
> pop {pc}
>
> Does that look correct? And if so, could you see if there is an open
> bug report about this; or otherwise file a new PR and add me to the
> CC-list?
>

It is correct. The basic block ordering issue (-Os) has been observed several times. The following PRs are related:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41396

Michael Hope (michaelh1)
tags: added: size task
Changed in gcc:
importance: Unknown → Medium
status: Unknown → In Progress

Michael Collison (michael-collison) wrote:

Significantly improved code is generated by Linaro GCC 4.8 and 4.9.

Changed in gcc-linaro:
status: New → Won't Fix

stevenb (steven-gcc) wrote:

Not working on this...

Changed in gcc:
status: In Progress → Confirmed

Pinskia (pinskia) wrote:

Fixed for GCC 12 by r12-897-gde56f95afaaa22 (and r11-408-g84935c9822183c).

Pinskia (pinskia) wrote:

(In reply to Andrew Pinski from comment #14)
> Fixed for GCC 12 by r12-897-gde56f95afaaa22 (and r11-408-g84935c9822183c).

The first redundant load was fixed by r11-408-g84935c9822183c.

The extra store was fixed by r12-897-gde56f95afaaa22.

Either way, it is now fully fixed.

Changed in gcc:
status: Confirmed → Fix Released