Slowdown in 175.vpr (using LDR from literal pool rather than MOVW/MOVT)

Bug #886124 reported by Kevin Jaget
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Linaro GCC
Fix Released
Medium
Zhenqiang Chen

Bug Description

Testing 2011-10 versions of 4.6 vs 4.5, and I'm seeing a slowdown in 175.vpr from SPEC2000. From profiling, at least part of the slowdown is in get_non_updateable_bb from place.c. It has code with multiple levels of array indirection e.g.:

 x = block[net[inet].pins[0]].x;
 y = block[net[inet].pins[0]].y;

in 4.5, most of the array addresses were loaded using MOVW/MOVT pairs :

    get_non_updateable_bb
        0x0001178c: e3083fc0 .?.. MOV r3,#0x8fc0
        0x00011790: e0800100 .... ADD r0,r0,r0,LSL #2
        0x00011794: e340300c .0@. MOVT r3,#0xc
        0x00011798: e5932000 . .. LDR r2,[r3,#0]
        0x0001179c: e3083fbc .?.. MOV r3,#0x8fbc
        0x000117a0: e92d05f0 ..-. PUSH {r4-r8,r10}
        0x000117a4: e340300c .0@. MOVT r3,#0xc
        0x000117a8: e5936000 .`.. LDR r6,[r3,#0]
        0x000117ac: e0822100 .!.. ADD r2,r2,r0,LSL #2
        0x000117b0: e9921080 .... LDMIB r2,{r7,r12}
        0x000117b4: e59c3000 .0.. LDR r3,[r12,#0]

In 4.6, the MOVW/MOVT pairs have been replaced by LDRs from a literal pool :

    get_non_updateable_bb
        0x00011bc0: e59f30f4 .0.. LDR r3,[pc,#244] ; [0x11cbc] = 0xd01c8
        0x00011bc4: e0800100 .... ADD r0,r0,r0,LSL #2
        0x00011bc8: e92d05f0 ..-. PUSH {r4-r8,r10}
        0x00011bcc: e59f20ec . .. LDR r2,[pc,#236] ; [0x11cc0] = 0xd0194
        0x00011bd0: e5933000 .0.. LDR r3,[r3,#0]
        0x00011bd4: e5926000 .`.. LDR r6,[r2,#0]
        0x00011bd8: e0830100 .... ADD r0,r3,r0,LSL #2
        0x00011bdc: e9901080 .... LDMIB r0,{r7,r12}
        0x00011be0: e59c3000 .0.. LDR r3,[r12,#0]

This leads to much slower code since a MOVW/MOVT is faster than even an L1 hit for the loads.

I'm building using

arm-unknown-linux-gnueabi-gcc -c -o place.o -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -O3 -g -DSPEC_CPU2000 place.c

I'm not sure of the legality of posting the full pre-processed source for this test, but hopefully enough people have access to it to reproduce this.

Tags: speed task
Revision history for this message
Michael Hope (michaelh1) wrote :

(hmm, vpr is under a non-free license so we can't post source)

tags: added: speed task
Revision history for this message
Michael Hope (michaelh1) wrote :

Thank you for the bug report. In a Thumb-2 A9 NEON -O3 configuration, vpr is mildly faster overall in gcc-linaro-4.6 than our 4.5 or FSF 4.5 or 4.6.

I've added this to the agenda for our next performance call:
 https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2011-11-15

so that we can look into it further. It might be ARM mode specific, or might show an improvement that we can pull in from ARM mode.

Michael Hope (michaelh1)
Changed in gcc-linaro:
assignee: nobody → Ramana Radhakrishnan (ramana)
importance: Undecided → Medium
Revision history for this message
Ramana Radhakrishnan (ramana) wrote :

Hmmm looks like arm_force_const_mem and arm_legitimate_constant_p probably need some updating for the movw / movt cases.

Changed in gcc-linaro:
status: New → Triaged
Revision history for this message
Ramana Radhakrishnan (ramana) wrote :

It's not as trivial as changing those functions alone but it needs a bit more work and massaging through various parts of the backend.

Revision history for this message
Ramana Radhakrishnan (ramana) wrote :

The root cause appears to be that the predicates for movsi are looser than what they should be. Because of this what you can see is that if the afflicted function is compiled with -fdbg-cnt=cprop:0 the movw's / movt's will reappear.

 The reason is that cprop1 notices a high and a lo_sum of the same register and notices that it can essentially replace lo_sum (high: symbol_ref bloc) (symbol_ref (block)) with a (symbol_ref (block)) and merge the 2 together - i.e.

(insn 57 56 58 2 (set (reg/f:SI 227)
        (high:SI (symbol_ref:SI ("block") [flags 0xc0] <var_decl 0xb72692a0 block>))) /tmp/lp88612.c:16 179 {*arm_movsi_insn}
     (nil))

(insn 58 57 59 2 (set (reg/f:SI 227)
        (lo_sum:SI (reg/f:SI 227)
            (symbol_ref:SI ("block") [flags 0xc0] <var_decl 0xb72692a0 block>))) /tmp/lp88612.c:16 178 {*arm_movt}
     (expr_list:REG_EQUAL (symbol_ref:SI ("block") [flags 0xc0] <var_decl 0xb72692a0 block>)
        (nil)))

into

(NOTE_INSN_DELETED 57)

(insn 58 56 59 2 (set (reg/f:SI 227)
        (symbol_ref:SI ("block") [flags 0xc0] <var_decl 0xb72692a0 block>)) /tmp/lp88612.c:16 179 {*arm_movsi_insn}
     (expr_list:REG_EQUAL (symbol_ref:SI ("block") [flags 0xc0] <var_decl 0xb72692a0 block>)
        (nil)))

It's not wrong in what it's doing but that causes reload to come along later and look for a reload of symbol_ref (block) detect that it can't do anything about it but put it into the common constant pool and then we later push all this into our mini-pools. All dreadfully inefficient. This is probably a good opportunity as any to tighten up the predicates in the movsi matching patterns and do a careful audit of the associated macros.

Ramana

Changed in gcc-linaro:
assignee: Ramana Radhakrishnan (ramana) → Zhenqiang Chen (zhenqiang-chen)
Revision history for this message
Zhenqiang Chen (zhenqiang-chen) wrote :

In linaro gcc 4.5, we ported a patch from codesourcery.
http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch

It adds a split as

(define_split
  [(set (match_operand:SI 0 "arm_general_register_operand" "")
        (match_operand:SI 1 "general_operand" ""))]
  "TARGET_32BIT
   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
   && !flag_pic && !target_word_relocations
   && !arm_tls_referenced_p (operands[1])"
  [(clobber (const_int 0))]
{
  arm_emit_movpair (operands[0], operands[1]);
  DONE;
})

With this patch, split1 will split it.

Changed in gcc-linaro:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.