Unnecessary generation of temporaries with tst type operations

Bug #851258 reported by Ramana Radhakrishnan
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linaro GCC
Triaged
Low
Unassigned

Bug Description

Consider the following testcase.

typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;
typedef unsigned long uintptr_t;
typedef unsigned short uint16_t;
void f2(char *d, char const *s, int flags)
{
  uint32_t tmp0, tmp1;

  if (flags & 1)
    tmp0 = *s++;

  if (flags & 2)
    {
      uint16_t *ss = (void *)s;
      tmp1 = *ss++;
      s = (void *)ss;
    }

  if (flags & 1)
    *d++ = tmp0;

  if (flags & 2)
    {
      uint16_t *dd = (void *)d;
      *dd++ = tmp1;
      d = (void *)dd;
    }
}

GCC currently generates

 push {r4, r5}
 ands r5, r2, #1
 it ne
 ldrbne r4, [r1], #1 @ zero_extendqisi2
 ands r2, r2, #2
 it ne
 ldrhne r3, [r1, #0]
 cbz r5, .L4
 strb r4, [r0], #1
.L4:
 cbz r2, .L1
 strh r3, [r0, #0] @ movhi
.L1:
 pop {r4, r5}
 bx lr

This could very well instead be :

 tst r2, #1
        it ne
 ldrneb ip, [r1], #1 @ zero_extendqisi2
 tst r2, #2
        it ne
 ldrneh r3, [r1, #0]
 tst r2, #1
        it ne
 strneb ip, [r0], #1
 tst r2, #2
 strneh r3, [r0, #0] @ movhi
 bx lr

This is also a problem on other ports given that the a & b tst operation is CSE'd and the result is compared against 0.

cheers
Ramana

Tags: size speed task
Michael Hope (michaelh1)
tags: added: size speed task
Revision history for this message
Andrew Stubbs (ams-codesourcery) wrote :

Current upstream trunk gives the following:

        ands r3, r2, #1
        push {r4, r5}
        itt ne
        ldrbne r5, [r1, #0] @ zero_extendqisi2
        addne r1, r1, #1
        ands r2, r2, #2
        it ne
        ldrhne r4, [r1, #0]
        cbz r3, .L4
        strb r5, [r0, #0]
        adds r0, r0, #1
.L4:
        cbz r2, .L1
        strh r4, [r0, #0] @ movhi
.L1:
        pop {r4, r5}
        bx lr

The unnecessary temporaries still exist, *and* it now fails to do the auto-increment correctly as well. :(

description: updated
Revision history for this message
Andrew Stubbs (ams-codesourcery) wrote :

The problem is introduced in the "fre1" tree pass (currently number 028).

I see three ways to fix this:

1. Fix "fre1" to not do it where the backend has a suitable compare insn. I suspect this is not going to happen since the pass probably does 'see' that much context.

2. Add another tree pass (in a similar vein to the widening-multiply passes, and assuming there isn't already one that should do this) that searches for test-and-branch cases and emits the whole thing as a single tree-code. Then teach expand to DTRT.

3. Get the "combine" pass to fix it at the RTL stage. However, I think combine currently refuses to do anything with pseudos that are used in more than one place, so maybe that's not going to happen.

Clearly, the optimization is only useful in the case where *both* the value being tested *and* the test result live on past the branch. In the case where the either value is dead the optimization should be harmless (combine would have dealt with them anyway, and/or register pressure would not be an issue).

Revision history for this message
Andrew Stubbs (ams-codesourcery) wrote :

Just to be clear, in option 2 the the pass would scan the optab, and/or use a target hook to ensure that the operation is available before making the transformation.

Changed in gcc-linaro:
status: New → Triaged
Changed in gcc-linaro:
importance: Undecided → Low
Revision history for this message
Simon Hosie (simon-hosie) wrote :

I should point out that if this case is optimised then what hangs off the back of it is the possibility to combine all of the booleans in a function into a single register, thereby freeing up all but one of those registers. Either automatically, or by manually bundling them in a local bitfield.

Revision history for this message
Simon Hosie (simon-hosie) wrote :

I should also mention that where the compiler is currently taking common comparison results and storing them in whole registers, it could potentially use the technique above to merge all of these -- its own temporary booleans -- into that same packed booleans register at the cost of one conditional bit-setting operation at comparison time. That's probably kind of obvious, but I thought maybe worth logging anyway.

Is there any simple way to estimate how many booleans the compiler might be keeping in flight at a time to see where this could save registers?

Revision history for this message
Kugan Vivekanandarajah (kugan-vivekanandarajah) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.