Poorly optimised code generation for cortex M0/M0+/M1 vs M3/M4

Bug #1502611 reported by Strntydog on 2015-10-04
This bug affects 5 people
Affects: GNU Arm Embedded Toolchain
Importance: Undecided
Assigned to: Unassigned

Bug Description

I had believed I was being hit with the bug report https://bugs.launchpad.net/gcc-arm-embedded/+bug/1401316
However further testing leads me to believe this is a distinct problem.

I am using:
Release Version - gcc-arm-none-eabi-4_9-2015q3
Version is the binary release and was downloaded directly from launchpad.
Host is 64bit Ubuntu 15.04

On Cortex-M0 class cores (M0, M0+ and M1), code generation is far from optimal at -Os, and every other optimization level exhibits the same problem. The same code compiled for the M3 does not show these sub-optimal patterns, and testing has revealed that, for the test case presented, the code GCC generates for the M3 assembles without error for the M0.

The major problem is that accessing memory-mapped registers at known addresses causes each register to get its own entry in the literal table, when a single entry plus offset addressing would suffice. This makes the code significantly larger and slower than necessary. The problem occurs with several different declarations of the same memory: direct pointers, an array, and a structure. A smaller but related problem is that constants are not consistently calculated from known register contents when they could be, and instead create unnecessary literal table entries. I believe the root causes are related, which is why I have created one bug report for these two issues.

There are 6 tests in the attached test case. All are suboptimal compared to the M3 compile of the same tests, even though the M3 output contains no instructions that are not also legal M0 code.

Of gravest concern to me as a general pattern is test 6. I quote it here:

/* Write 8 bit values to known register locations - using an array */
void test6(void)
{
    volatile uint8_t* const r = (uint8_t*)(0x40002800U); // Register Array

    r[0] = 0xFF;
    r[1] = 0xFE;
    r[2] = 0xFD;
    r[3] = 0xFC;
    r[4] = 0xEE;
    r[8] = 0xDD;
    r[12] = 0xCC;
}

Which, at -Os for -mcpu=cortex-m0, results in:
000000ec <test6>:
  ec: 22ff movs r2, #255 ; 0xff
  ee: 4b0a ldr r3, [pc, #40] ; (118 <test6+0x2c>)
  f0: 701a strb r2, [r3, #0]
  f2: 4b0a ldr r3, [pc, #40] ; (11c <test6+0x30>)
  f4: 3a01 subs r2, #1
  f6: 701a strb r2, [r3, #0]
  f8: 4b09 ldr r3, [pc, #36] ; (120 <test6+0x34>)
  fa: 3a01 subs r2, #1
  fc: 701a strb r2, [r3, #0]
  fe: 4b09 ldr r3, [pc, #36] ; (124 <test6+0x38>)
 100: 3a01 subs r2, #1
 102: 701a strb r2, [r3, #0]
 104: 4b08 ldr r3, [pc, #32] ; (128 <test6+0x3c>)
 106: 3a0e subs r2, #14
 108: 701a strb r2, [r3, #0]
 10a: 4b08 ldr r3, [pc, #32] ; (12c <test6+0x40>)
 10c: 3a11 subs r2, #17
 10e: 701a strb r2, [r3, #0]
 110: 4b07 ldr r3, [pc, #28] ; (130 <test6+0x44>)
 112: 3a11 subs r2, #17
 114: 701a strb r2, [r3, #0]
 116: 4770 bx lr
 118: 40002800 .word 0x40002800
 11c: 40002801 .word 0x40002801
 120: 40002802 .word 0x40002802
 124: 40002803 .word 0x40002803
 128: 40002804 .word 0x40002804
 12c: 40002808 .word 0x40002808
 130: 4000280c .word 0x4000280c

Each element accessed in the byte array has resulted in the address of that element appearing in the literal table!

By comparison, the M3 build generates:

00000094 <test6>:
  94: 4b07 ldr r3, [pc, #28] ; (b4 <test6+0x20>)
  96: 22ff movs r2, #255 ; 0xff
  98: 701a strb r2, [r3, #0]
  9a: 22fe movs r2, #254 ; 0xfe
  9c: 705a strb r2, [r3, #1]
  9e: 22fd movs r2, #253 ; 0xfd
  a0: 709a strb r2, [r3, #2]
  a2: 22fc movs r2, #252 ; 0xfc
  a4: 70da strb r2, [r3, #3]
  a6: 22ee movs r2, #238 ; 0xee
  a8: 711a strb r2, [r3, #4]
  aa: 22dd movs r2, #221 ; 0xdd
  ac: 721a strb r2, [r3, #8]
  ae: 22cc movs r2, #204 ; 0xcc
  b0: 731a strb r2, [r3, #12]
  b2: 4770 bx lr
  b4: 40002800 .word 0x40002800

ALL of which is legal M0 Code.

There are 5 other tests in the test case; each exhibits this behavior for a different pattern of accessing the same memory-mapped registers. The ONLY test which uses STR with an offset is test 5, but it exhibits another, less severe optimization problem that creates excessive and unnecessary literal table entries.

I test accessing 4 contiguous 32-bit memory-mapped registers at a known address, as follows:
Test 1 - Fixed pointers to known locations
Test 2 - Accessing the registers as an array
Test 3 - Accessing the registers as a structure
Test 4 - Accessing as an array where the array element (register) is a union type.
Test 5 - Accessing the registers as a structure where each register is a union type.
Test 6 - Just writing contiguous bytes in memory as an array.

In each case the addresses are known at compile time and are fixed constants, and so is the data being written.
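
For readers without the attachment, the first three access patterns can be sketched roughly as follows. This is not the attached test.c: the function names, the values written and the REG_BASE_ADDR macro are assumptions made purely for illustration. When REG_BASE_ADDR is not defined, the sketch backs the "registers" with an ordinary array so it can also be run on a host; in that configuration the base address is no longer a compile-time constant, so the literal table behaviour will not be reproduced.

```c
#include <stdint.h>

/* Assumption: pass e.g. -DREG_BASE_ADDR=0x40002800U when building for a target.
   Without it, a RAM array stands in for the peripheral so the code runs on a host. */
#ifdef REG_BASE_ADDR
#define REG_BASE ((uintptr_t)(REG_BASE_ADDR))
#else
static volatile uint32_t fake_regs[4];
#define REG_BASE ((uintptr_t)fake_regs)
#endif

/* Four contiguous 32-bit registers viewed as a structure (hypothetical layout). */
typedef struct {
    volatile uint32_t reg0;
    volatile uint32_t reg1;
    volatile uint32_t reg2;
    volatile uint32_t reg3;
} regs_t;

/* Pattern of test 1: one fixed pointer per register. */
void write_via_pointers(void)
{
    volatile uint32_t *const p0 = (volatile uint32_t *)(REG_BASE + 0x0u);
    volatile uint32_t *const p1 = (volatile uint32_t *)(REG_BASE + 0x4u);
    volatile uint32_t *const p2 = (volatile uint32_t *)(REG_BASE + 0x8u);
    volatile uint32_t *const p3 = (volatile uint32_t *)(REG_BASE + 0xCu);

    *p0 = 0x1u;
    *p1 = 0x2u;
    *p2 = 0x3u;
    *p3 = 0x4u;
}

/* Pattern of test 2: the registers as an array. */
void write_via_array(void)
{
    volatile uint32_t *const r = (volatile uint32_t *)REG_BASE;

    r[0] = 0x10u;
    r[1] = 0x20u;
    r[2] = 0x30u;
    r[3] = 0x40u;
}

/* Pattern of test 3: the registers as a structure. */
void write_via_struct(void)
{
    regs_t *const r = (regs_t *)REG_BASE;

    r->reg0 = 0x100u;
    r->reg1 = 0x200u;
    r->reg2 = 0x300u;
    r->reg3 = 0x400u;
}
```

To look at the code generation discussed in this report, the same file would be compiled with the constant base defined, once with -mcpu=cortex-m0 and once with -mcpu=cortex-m3, and the two disassemblies compared.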

It seems strange that for the M3, GCC can generate so much better code than the M0, even when it does not emit any instructions specific to the M3.

The attached file contains my test source (test.c), a bash script to build the code as I tested it, and the output from my tests for both the M0 and M3.

Strntydog (strntydog) wrote :

Hmm,

I was playing with test6, which just assigns values consecutively to a byte array. If I make the base address low, 0x10 in my test case, the M0 build suddenly generates the code I would expect:

00000134 <test7>:
 134: 2310 movs r3, #16
 136: 22ff movs r2, #255 ; 0xff
 138: 701a strb r2, [r3, #0]
 13a: 3a01 subs r2, #1
 13c: 705a strb r2, [r3, #1]
 13e: 3a01 subs r2, #1
 140: 709a strb r2, [r3, #2]
 142: 3a01 subs r2, #1
 144: 70da strb r2, [r3, #3]
 146: 3a0e subs r2, #14
 148: 711a strb r2, [r3, #4]
 14a: 3a11 subs r2, #17
 14c: 721a strb r2, [r3, #8]
 14e: 3a11 subs r2, #17
 150: 731a strb r2, [r3, #12]
 152: 4770 bx lr

But strangely, M3 code generation goes bad:

000000b8 <test7>:
  b8: 22ff movs r2, #255 ; 0xff
  ba: 2310 movs r3, #16
  bc: 701a strb r2, [r3, #0]
  be: 22fe movs r2, #254 ; 0xfe
  c0: 2311 movs r3, #17
  c2: 701a strb r2, [r3, #0]
  c4: 22fd movs r2, #253 ; 0xfd
  c6: 2312 movs r3, #18
  c8: 701a strb r2, [r3, #0]
  ca: 22fc movs r2, #252 ; 0xfc
  cc: 2313 movs r3, #19
  ce: 701a strb r2, [r3, #0]
  d0: 22ee movs r2, #238 ; 0xee
  d2: 2314 movs r3, #20
  d4: 701a strb r2, [r3, #0]
  d6: 22dd movs r2, #221 ; 0xdd
  d8: 2318 movs r3, #24
  da: 701a strb r2, [r3, #0]
  dc: 22cc movs r2, #204 ; 0xcc
  de: 231c movs r3, #28
  e0: 701a strb r2, [r3, #0]
  e2: 4770 bx lr

BUT, if I move the start address higher in memory, to say 0x200:

The M0 code generation changes to this:
00000134 <test7>:
 134: 2380 movs r3, #128 ; 0x80
 136: 22ff movs r2, #255 ; 0xff
 138: 009b lsls r3, r3, #2
 13a: 701a strb r2, [r3, #0]
 13c: 4b07 ldr r3, [pc, #28] ; (15c <test7+0x28>)
 13e: 3a01 subs r2, #1
 140: 701a strb r2, [r3, #0]
 142: 4b07 ldr r3, [pc, #28] ; (160 <test7+0x2c>)
 144: 3a01 subs r2, #1
 146: 701a strb r2, [r3, #0]
 148: 4b06 ldr r3, [pc, #24] ; (164 <test7+0x30>)
 14a: 3a01 subs r2, #1
 14c: 701a strb r2, [r3, #0]
 14e: 3a0e subs r2, #14
 150: 705a strb r2, [r3, #1]
 152: 3a11 subs r2, #17
 154: 715a strb r2, [r3, #5]
 156: 3a11 subs r2, #17
 158: 725a strb r2, [r3, #9]
 15a: 4770 bx lr
 15c: 00000201 .word 0x00000201
 160: 00000202 .word 0x00000202
 164: 00000203 .word 0x00000203

And suddenly the code generator can calculate address 0x200 without the literal table, but can't calculate 0x201, 0x202 or 0x203. Once it has 0x203, though, it can calculate 0x204, 0x208 and 0x20C.

Something very strange is going on with the M0 code generation for these sequences.

Strntydog,

Thank you for reporting this. We're looking into the issue.

BR,
Andre

Changed in gcc-arm-embedded:
status: New → In Progress

FWIW, here's what I found (originally reported at http://www.avrfreaks.net/comment/1682166#comment-1682166)

The reorg pass places constant addresses in the literal pool and modifies instructions to load from them (arm_reorg in gcc/config/arm/arm.c). The code there sees if there is a constant operand involving memory access in an insn, and if yes, marks it for pooling later. This is why you're not seeing the problem if the address is not known at compile time (extern, in your example) - there is no constant operand.

So why doesn't it happen always? Turns out the postreload pass that runs before it adjusts some, but not all insns to reuse the existing value (address) in the register and just bump up the offset.

thumb1_size_rtx_costs sets the size to 8 if the immediate value being loaded satisfies the J or K constraint (constant is in the range -1 to -255, or is in the range 0 to 255 multiplied by any power of 2). This makes postreload prefer the alternative rtx of adding the offset to the existing register value. For other constants, the size cost (and speed cost) of both alternatives works out to be the same, i.e. 4, and postreload keeps the original rtx that loads the immediate constant. This explains why, in your example, some constant addresses were kept in the literal pool whereas others were computed.
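
The two ranges described above can be written as small predicates to check them against the addresses in this report. This is only a sketch of the ranges as worded in this comment, not code taken from GCC; the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* "J"-style range as described above: -255 .. -1 (hypothetical helper). */
bool in_negative_8bit_range(int32_t v)
{
    return v >= -255 && v <= -1;
}

/* "K"-style range as described above: a value 0..255 multiplied by a power
   of 2, i.e. an 8-bit value shifted left by some amount (hypothetical helper). */
bool is_shifted_8bit(uint32_t v)
{
    if (v == 0u) {
        return true;
    }
    /* Strip trailing zero bits; what remains must fit in 8 bits. */
    while ((v & 1u) == 0u) {
        v >>= 1;
    }
    return v <= 0xFFu;
}
```

For the 0x200 listing, is_shifted_8bit is true for 0x200, 0x204, 0x208 and 0x20C and false for 0x201, 0x202 and 0x203, which is consistent with which addresses were computed in a register and which were loaded from the literal pool. Every address from 0x10 to 0x1C passes, and every address from 0x40002800 to 0x4000280C fails, which is likewise consistent with the M0 listings above.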

Not sure what the best way is to fix this though. I don't know why costs are higher (8) for immediate constants that can actually be loaded easily, and are lower for those that require some computation to load, although that turned out to help improve code in this case. Best left to the ARM maintainers I guess.

Strntydog (strntydog) wrote :

I just tested the latest release of the ARM GCC Compiler: gcc-arm-none-eabi-5_2-2015q4.

It produces the SAME suboptimal code when compiling for Cortex M0 vs M3 as gcc-arm-none-eabi-4_9-2015q3 does.

Strntydog (strntydog) wrote :

Upstream has confirmed this bug exists, and I just tested this on the latest release "gcc-arm-none-eabi-6-2017-q1-update" and it still persists. I have updated the upstream report, but the code generation for Cortex M0 is abysmal, especially with regard to addressing registers at known locations, something microcontroller programs do all the time.

To put the Optimisation failure into perspective, this is the difference between the 6 tests in the test case upstream:

Test 1 - Code Size is 40% Bigger for M0, and the Function is 114% bigger.
Test 2 - Code Size is 20% bigger for M0, and the Function is 44% bigger.
Test 3 - Code Size is same between M0 and M3, but the Function is 43% bigger.
Test 4 - Code Size is 40% Bigger for M0, and the Function is 86% bigger.
Test 5 - Code Size is same between M0 and M3, but the Function is 14% bigger.
Test 6 - Code Size is 38% Bigger for M0, and the Function is 100% bigger.

These failures directly and significantly negatively affect program execution time AND Flash usage.
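
As a rough cross-check, the Test 6 figures can be recomputed from the test6 listings in the original description. This assumes that "Code Size" counts instruction bytes only and "Function" counts instructions plus the literal table; that reading is an assumption, as is the helper name below. From the listings, the M0 build has 44 bytes of instructions (0xec up to 0x118) and 72 bytes in total, while the M3 build has 32 bytes of instructions (0x94 up to 0xb4) and 36 bytes in total.

```c
#include <stdint.h>

/* Percentage by which m0_bytes exceeds m3_bytes, rounded to the nearest
   integer. Assumes m3_bytes > 0 and m0_bytes >= m3_bytes. */
uint32_t percent_bigger(uint32_t m0_bytes, uint32_t m3_bytes)
{
    return (100u * (m0_bytes - m3_bytes) + m3_bytes / 2u) / m3_bytes;
}
```

With those byte counts, percent_bigger(44, 32) gives 38 and percent_bigger(72, 36) gives 100, in line with the Test 6 entry above.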

Hi,

FYI the approach suggested by Richard Earnshaw upstream seems like a significant amount of work. Therefore I'm afraid it is unlikely to be solved anytime soon.

Best regards.
