Poor code gen for addr[offset] = val

Bug #1401316 reported by Gary Fuehrer on 2014-12-10
This bug affects 2 people
Affects: GNU Arm Embedded Toolchain
Importance: Low
Assigned to: Terry Guo

Bug Description

Versions: 4.9q4 (and back through at least 4.7q3)
Host: Windows

Symptom:
Sequences of constructs like "addr[offset] = val;" and/or "addr->offset = val;" generate poorly optimized "-Os" machine code.

Compile the following example, then compile it again with the macro body replaced by the commented-out inline assembly statement. The former is 30% larger than the latter.

Example:

#include <stdint.h>

#define setRegsWord(addr, offset, value) \
 *(uint32_t volatile*)(&((uint8_t volatile*)addr)[offset]) = value
/* asm volatile( \
 " str %[Rv], [%[Rbase], %[Immed]]\n\t" \
 : \
 : [Rv] "r" (value), [Rbase] "r" (addr), [Immed] "I" (offset) \
 )*/

void __attribute__((noreturn)) Reset_Handler()
{
 setRegsWord(0x48000000, 8, 0xc000000);
 setRegsWord(0x48000000, 4, 0);
 setRegsWord(0x48000000, 0, 0xaaa8f3a4);
 setRegsWord(0x48000000, 12, 0x64000000);
 setRegsWord(0x48000000, 32, 0x7700);
 setRegsWord(0x48000000, 36, 0x0002a770);
 setRegsWord(0x48000000, 24, 0x20000);

 setRegsWord(0x48000400, 8, 0x500000c0);
 setRegsWord(0x48000400, 4, 0);
 setRegsWord(0x48000400, 0, 0xa00a028c);
 setRegsWord(0x48000400, 12, 0x100);
 setRegsWord(0x48000400, 32, 0);
 setRegsWord(0x48000400, 36, 0x55000077);
 setRegsWord(0x48000400, 24, 0);

 setRegsWord(0x48000800, 8, 0x1100000);
 setRegsWord(0x48000800, 4, 0x8c0);
 setRegsWord(0x48000800, 0, 0x0260503f);
 setRegsWord(0x48000800, 12, 0x00405000);
 setRegsWord(0x48000800, 32, 0);
 setRegsWord(0x48000800, 36, 0x00060600);
 setRegsWord(0x48000800, 24, 0x8c0);

 setRegsWord(0x48000C00, 8, 0x10000);
 setRegsWord(0x48000C00, 4, 0);
 setRegsWord(0x48000C00, 0, 0x20000);
 setRegsWord(0x48000C00, 12, 0);
 setRegsWord(0x48000C00, 32, 0);
 setRegsWord(0x48000C00, 36, 5);
 setRegsWord(0x48000C00, 24, 0);

 setRegsWord(0x48001000, 8, 0);
 setRegsWord(0x48001000, 4, 0);
 setRegsWord(0x48001000, 0, 0x50000);
 setRegsWord(0x48001000, 12, 0);
 setRegsWord(0x48001000, 32, 0);
 setRegsWord(0x48001000, 36, 0);
 setRegsWord(0x48001000, 24, 0x3000000);

 setRegsWord(0x48001400, 8, 0);
 setRegsWord(0x48001400, 4, 0);
 setRegsWord(0x48001400, 0, 0);
 setRegsWord(0x48001400, 12, 0);
 setRegsWord(0x48001400, 32, 0);
 setRegsWord(0x48001400, 36, 0);
 setRegsWord(0x48001400, 24, 0);

 setRegsWord(0x40022000, 0, 18);

 setRegsWord(0xe000ed00, 0x88, 0xf00000);

 do {} while(1);
}

Terry Guo (terry.guo) on 2014-12-11
Changed in gcc-arm-embedded:
assignee: nobody → Terry Guo (terry.guo)
Terry Guo (terry.guo) on 2014-12-11
Changed in gcc-arm-embedded:
status: New → Confirmed
Terry Guo (terry.guo) wrote :

Thanks for reporting such an interesting issue. Basically the generated code does the following:

mov r3, #imme1
str r0, [r3]

Consider the case where #imme1 is too big to be encoded in a 'mov' instruction, but a value like '#imme1 - offset' can be encoded. The clever way would be:
mov r3, #imme1-offset
str r0, [r3, offset]

But it looks like current 4.9 and trunk are not so clever; the less efficient code below is generated instead:

ldr r3, .L1
str r0, [r3]

.L1:
      .word #imme1

Joey, does this look familiar to you?

Terry Guo (terry.guo) wrote :

A peephole2 pattern should be able to catch this opportunity, but that is rather backend specific. Or maybe the combine pass could do it.

Gary Fuehrer (gfuehrer) wrote :

Thanks for taking this on. As I wrote, it's been this way since I first started using GCC for ARM embedded, which was version 4.7.

My (very unknowledgeable) guess is that some front-end optimizations are scrambling things up too much for the backend peephole pass to work, because I'm presuming that the inline-assembly workaround hides things from most of those front-end stages.

From what I've seen, the generated code is already trying to do the "clever way", but doing it absurdly.

If the code is destined to use a constant pool (which I hope it always will, or at least have that as an option -- I love it so much!) then "addr[offset] = value" (and for structs, "addr->offset = value") would generally want to be coded as:

ldr r3, .L1 ;addr
mov r0, value ;or hopefully movs
str r0, [r3, offset] ;offset typically in range for, say, member of struct

.L1: .word addr

If it's clever (as exhibited when using the inline assembly) the "r3" can be re-used or even adjusted with an "adds" for a subsequent addr. If the front end is combining the addr and the offset into a single "#imme1" then a valuable hint from the programmer is getting thrown away -- different member lookups within the same struct are ripe for direct translation.

I know that that's nothing like how compilers go about doing their work, but this is just FWIW.

-Gary

Terry Guo (terry.guo) wrote :

The issue is easy to fix, but the fix may have side effects elsewhere in the compiler, which makes it difficult to get accepted by GCC upstream. Meanwhile, may I ask you to rewrite your code in the way below, which is more friendly to the compiler and will give you better code size:

typedef struct st
{
  uint32_t a0;
  uint32_t a1;
  uint32_t a2;
  uint32_t a3;
  uint32_t a4;
  uint32_t a5;
  uint32_t a6;
  uint32_t a7;
  uint32_t a8;
  uint32_t a9;
}Device;

void __attribute__((noreturn)) Reset_Handler()
{
#if 0
        // Configure the external pins immediately as some component may depend on them
        setRegsWord(0x48000000, 8, 0xc000000);
        setRegsWord(0x48000000, 4, 0);
        setRegsWord(0x48000000, 0, 0xaaa8f3a4);
        setRegsWord(0x48000000, 12, 0x64000000);
        setRegsWord(0x48000000, 32, 0x7700);
        setRegsWord(0x48000000, 36, 0x0002a770);
        setRegsWord(0x48000000, 24, 0x20000);
#endif
        volatile Device *p = (volatile Device *)(0x48000000);

        p->a2 = 0xc000000;
        p->a1 = 0;
        p->a0 = 0xaaa8f3a4;
        p->a3 = 0x64000000;
        p->a8 = 0x7700;
        p->a9 = 0x0002a770;
        p->a6 = 0x20000;

        do {} while(1);
}

Gary Fuehrer (gfuehrer) wrote :

My original code, from which the submitted example code was distilled, was all struct-pointer/member lookups. That's what I'm getting at when I say that both addr[offset] = value and addr->offset = value have the problem.

Gary Fuehrer (gfuehrer) wrote :

In my original code, this is the macro definition:

#define setRegsWord(regs, member, value) \
/* (regs)->member = value */ \
 asm volatile( \
 " str %[Rv], [%[Rbase], %[Immed]]\n\t" \
 : \
 : [Rv] "r" (value), [Rbase] "r" (regs), [Immed] "I" (__builtin_offsetof(typeof(*regs), member)) \
 )

Gary Fuehrer (gfuehrer) wrote :

Same example exhibiting the problem but with struct/member instead of array/offset. Code bloat is about half as bad.

Terry Guo (terry.guo) wrote :

For the original code mentioned in comment #8, the smaller code size comes from GccConstProb0.c, which uses inline assembly; the code generated at the O2 level is as below:

  18: f44f 41ee mov.w r1, #30464 ; 0x7700
  1c: 6211 str r1, [r2, #32]
  1e: 4928 ldr r1, [pc, #160] ; (c0 <Reset_Handler+0xc0>)
  20: 6251 str r1, [r2, #36] ; 0x24

Since you are using inline assembly, the compiler can't schedule those instructions. You can see there is a data dependence on register r1. When you actually run such a code pattern, the performance is not optimal. However, since only the low registers are used in this case, the code size is really better.

The bigger code size comes from GccConstProb1.c, which uses plain C code. At the O2 level, the compiler performs instruction scheduling to avoid such read/write data dependences as much as possible. The generated code looks like:

   2: 4b14 ldr r3, [pc, #80] ; (54 <Reset_Handler+0x54>)
   4: f8df 905c ldr.w r9, [pc, #92] ; 64 <Reset_Handler+0x64>
   8: f8df e05c ldr.w lr, [pc, #92] ; 68 <Reset_Handler+0x68>
   c: 4e12 ldr r6, [pc, #72] ; (58 <Reset_Handler+0x58>)
   e: 4d13 ldr r5, [pc, #76] ; (5c <Reset_Handler+0x5c>)
  10: 4813 ldr r0, [pc, #76] ; (60 <Reset_Handler+0x60>)
  ......................................................
    2c: f8c2 a008 str.w sl, [r2, #8]
  30: 6051 str r1, [r2, #4]
  32: f8c2 9000 str.w r9, [r2]
  36: f8c2 800c str.w r8, [r2, #12]
  3a: f8c2 c020 str.w ip, [r2, #32]
  3e: f8c2 e024 str.w lr, [r2, #36] ; 0x24

You can see the instructions are shuffled to avoid data dependences, so the performance will be better. The side effects are extended register live ranges and more registers being consumed, including the high registers. So the code size is increased.

If your code is not performance critical, I suggest you compile it with the -Og option. You will get the expected code size.

Gary Fuehrer (gfuehrer) wrote :

Thanks for taking the time to explain. I used -Og to distill the code that I submitted from my application. I certainly cannot use it for my whole application, though, unless I didn't care anything about size (or speed). But then I wouldn't still be here.

I'm unconvinced that the problem is instruction scheduling. First off, you didn't compile for Thumb-2. If you do, only registers r0-r4 are used. The wide instructions occur because of offsets that are too large. The instruction ordering is almost identical to what the inline assembly (or -Og) produces. Finally, do the Cortex-M0 through M4 have a pipeline? It doesn't seem like Cortex-M got a pipeline until the M7.

Terry Guo (terry.guo) wrote :

Hi Gary,

I am not suggesting you use -Og for whole projects; I know nothing about your projects. For some of my current projects, I group the project files into two groups: the performance-critical files and those that are not, like board initialization files. I then apply O2 to the first group and Os or Og to the second group. This is just personal experience; you can ignore it.

Let us take GccConstProb1.c, which uses plain C code, as an example: in my opinion there is no single set of options that gives you both good performance and small code size for the whole project.

I believe the commands and size results below show that instruction scheduling does play a role here:
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-gcc -c -std=gnu11 -mcpu=cortex-m4 -mthumb -O2 GccConstProb1.c
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-size GccConstProb1.o
   text data bss dec hex filename
    292 0 0 292 124 GccConstProb1.o
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-gcc -c -std=gnu11 -mcpu=cortex-m4 -mthumb -O2 GccConstProb1.c -fno-schedule-insns
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-size GccConstProb1.o
   text data bss dec hex filename
    268 0 0 268 10c GccConstProb1.o
terguo01@terry-pc01:GccConstProb struct$
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-gcc -c -std=gnu11 -mcpu=cortex-m4 -mthumb -Og GccConstProb1.c
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-size GccConstProb1.o
   text data bss dec hex filename
    228 0 0 228 e4 GccConstProb1.o

Not sure why you say I am not compiling for Thumb-2; I do compile for Thumb-2. Below is a Thumb-2 instruction from GccConstProb1.o:

f8c5 9008 str.w r9, [r5, #8]

Thumb-2 mode enables the use of high registers like r9, which requires 32 bits to encode this str instruction. In the inline assembly version, fewer high registers are used for load/store, so the code size is smaller.

Regarding whether there is a pipeline, please check the Arm website, e.g. http://www.arm.com/products/processors/cortex-m/cortex-m0.php: "Pipeline 3-stage" is clearly documented in the Specifications section. The same holds for the other M cores. Inside GCC, we have a dedicated pipeline description file for the M4 and generic pipeline descriptions for the others.

Gary Fuehrer (gfuehrer) wrote :

I said that you weren't compiling for Thumb-2 because I had incorrectly guessed why you were getting code that used registers (like r9, r10, etc.) that I wasn't seeing. I played around with changing the optimization level from -Os or -Og to -O2 or -O3, and now I'm getting those registers used too. Still quite different assembly than yours, though, but I'm not worried about that.

I'm grateful that you've taken so much time to explain things. It's both educational and interesting to me. Again thanks for the tips and the pointer to the arm specification bit. I don't want to give the impression that I think I know better because I know that I don't. And I feel bad and don't want to rope you into discussing this even more.

However, I believe that you've found this bug report interesting for slightly different reasons than I do (which is a good thing, too, I think). I'm focused on the code generated when I use -Os and only C -- no inline assembly. When I do that (see my last attachment which includes my listing files) there are no registers used above r4, so no wide instructions because of that, and the extra size comes about only because of unnecessarily large offsets ("unnecessary" at least going by what -Og seems to have no problem avoiding).

Have a happy holidays!

Terry Guo (terry.guo) on 2014-12-18
Changed in gcc-arm-embedded:
importance: Undecided → Low
Terry Guo (terry.guo) wrote :

Hi Gary,

Merry Christmas. What you shared is really valuable to us. Please allow me to wrap up and share some of my conclusions here:
1). Recommend accessing contiguous memory addresses via a structure, which is more friendly to compiler optimization, as shown in GccConstProb1.c.

2). Sometimes better performance comes at the cost of code size, for example with function inlining and loop unrolling.

3). The compiler generates less optimal code for the case below:
terguo01@terry-pc01:GccConstProb$ cat h.c
void __attribute__((noreturn)) Reset_Handler()
{
     *(unsigned int *)(&((unsigned char *)0x48000000)[8]) = 0xc000000;
}
terguo01@terry-pc01:GccConstProb$ arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -O2 h.c -S
terguo01@terry-pc01:GccConstProb$ cat h.s

Reset_Handler:
 @ Volatile: function does not return.
 @ args = 0, pretend = 0, frame = 0
 @ frame_needed = 0, uses_anonymous_args = 0
 @ link register save eliminated.
 ldr r3, .L2
 mov r2, #201326592
 str r2, [r3]
 bx lr
.L3:
 .align 2
.L2:
 .word 1207959560
 .size Reset_Handler, .-Reset_Handler
 .ident "GCC: (GNU Tools for ARM Embedded Processors) 4.9.3 20141119 (release) [ARM/embedded-4_9-branch revision 218278]"

We could take advantage of the offset field in the str instruction to avoid the use of the literal pool.

4). For -Os, the code below can be improved to get smaller code size:
arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -Os GccConstProb1.c -c
arm-none-eabi-objdump -d GccConstProb1.o

  a4: f8c2 340c str.w r3, [r2, #1036] ; 0x40c
  a8: f8c2 3420 str.w r3, [r2, #1056] ; 0x420
  ac: f8c2 3424 str.w r3, [r2, #1060] ; 0x424
  b0: f8c2 1418 str.w r1, [r2, #1048] ; 0x418

All four instructions are 32 bits long, so the total code size is 16 bytes. It would be smaller if we did it like below:

add.w r2, r2, #1000
str r3, [r2, #36]
str r3, [r2, #56]
str r3, [r2, #60]
str r3, [r2, #48]

The total code size will be 12 bytes.

These cases are easy to understand, and it would be easy to come up with a hot fix to handle them. But considered in the overall picture of the compiler, such a hot fix would have side effects. One of my colleagues is working on another compiler optimization task which is supposed to cover cases like these. Let us wait and see.

Terry Guo (terry.guo) wrote :

You are not seeing high registers because you are using -Os while I am using -O2. The bigger-offset issue is covered by item 4) in my conclusions above.

Strntydog (strntydog) wrote :

I am also seeing similarly bad code optimisation on an M0+ target, like this.

Using gcc-arm-none-eabi-4_9-2015q3, for a Cortex-M0+ target.

In the process of writing code, I notice that the literal tables are full of redundant entries, at ANY optimization level.

Here is an example at level -O2 :

This is a disassembly from objdump of a section from my test main:

 148: 4b0b ldr r3, [pc, #44] ; (178 <main+0x34>)
 14a: 601a str r2, [r3, #0]
 14c: 4b0b ldr r3, [pc, #44] ; (17c <main+0x38>)
 14e: 3215 adds r2, #21
 150: 705a strb r2, [r3, #1]
....

 170: 4b09 ldr r3, [pc, #36] ; (198 <main+0x54>)
 172: 0512 lsls r2, r2, #20
 174: 601a str r2, [r3, #0]
 176: e7fe b.n 176 <main+0x32>
 178: 40002800 .word 0x40002800
 17c: 40002804 .word 0x40002804
 180: 40002808 .word 0x40002808
 184: 4000280c .word 0x4000280c
 188: 00001234 .word 0x00001234
 18c: 40002810 .word 0x40002810
 190: 40002818 .word 0x40002818
 194: 98760000 .word 0x98760000
 198: 4000281c .word 0x4000281c

At 148, r3 is loaded with the address 0x40002800.
At 14a, a word is stored using an offset of 0 from that address.
At 14c, r3 is loaded with the address 0x40002804.
At 150, a byte is stored using an offset of 1 from that address, when it could just have used an offset of 5 from the address already loaded into r3 and skipped the load at 14c AND the table entry, saving 6 bytes and producing faster code.

When you are writing code which is bit-banging GPIO registers, this sort of thing makes a huge difference to performance and code density. And as the GPIO registers are usually (on most micros) clustered in a tight bunch, they can typically all be reached by offsets from a fixed base. I am not sure whether this is exactly the same problem reported in this bug report; if not, it is very similar. As can be seen, the literal table is full of closely spaced addresses; even though they are not used contiguously, there is no need to store anything but 0x40002800, because all the rest can be reached with offsets.

I tried declaring the memory as an array or as a bunch of pointers; neither seems to make any difference to this problem.

Strntydog (strntydog) wrote :

Follow up:
I tried changing my register accesses to use a structure (as recommended here because it's "more friendly to the compiler"), rather than using the register addresses directly. It made no difference to the redundancy generated in the literal tables.

One thing I notice is that the addresses are constants the compiler knows at compile time, as are the values I am writing to them. The compiler will generate the constant values (where it can) using an 8-bit immediate load, and then transform one constant into another using a single math instruction (add/subtract with an 8-bit immediate). But for addresses it does not do the same optimization, where it could just add to or subtract from the base address to generate every address in the literal table except the first. This would not be as optimal as using appropriate offsets in the loads/stores, but it would be significantly more optimal than the current code generation. Even if it can't reach the new destination address with an offset, because it's further away than the offset allows, this kind of optimization should happen before a new literal table entry is created (at least at -Os), as it is denser.

Also, here is a test case:

#include <stdint.h>
#include <stdbool.h>

int main (void)
{
    const uint16_t p1 = 0x1234;
    const uint16_t p2 = 0x9876;
    const uint32_t p3 = 0x12349876;

    const uint32_t p4 = 0x21;
    const uint32_t p5 = 0x33;

    volatile uint16_t* const first16 = (uint16_t*)(0x40002800U);
    volatile uint32_t* const first = (uint32_t*)(0x40002800U);
    volatile uint32_t* const second = (uint32_t*)(0x40002804U);
    volatile uint32_t* const third = (uint32_t*)(0x40002808U);
    volatile uint32_t* const fourth = (uint32_t*)(0x4000280CU);

    *first16 = p1;
    *first = p1;
    *first16 = p2;
    *second = p3;
    *third = p4;
    *fourth = p5;

    while (true) {}
}

Which, using gcc-arm-none-eabi-4_9-2015q3, generates this on a Cortex-M0+ (compiler options: -Os -mcpu=cortex-m0plus -mthumb):

00000118 <main>:
    volatile uint32_t* const first = (uint32_t*)(0x40002800U);
    volatile uint32_t* const second = (uint32_t*)(0x40002804U);
    volatile uint32_t* const third = (uint32_t*)(0x40002808U);
    volatile uint32_t* const fourth = (uint32_t*)(0x4000280CU);

    *first16 = p1;
 118: 4b07 ldr r3, [pc, #28] ; (138 <main+0x20>)
 11a: 4a08 ldr r2, [pc, #32] ; (13c <main+0x24>)
 11c: 801a strh r2, [r3, #0]
    *first = p1;
 11e: 601a str r2, [r3, #0]
    *first16 = p2;
 120: 4a07 ldr r2, [pc, #28] ; (140 <main+0x28>)
 122: 801a strh r2, [r3, #0]
    *second = p3;
 124: 4a07 ldr r2, [pc, #28] ; (144 <main+0x2c>)
 126: 4b08 ldr r3, [pc, #32] ; (148 <main+0x30>)
 128: 601a str r2, [r3, #0]
    *third = p4;
 12a: 2221 movs r2, #33 ; 0x21
 12c: 4b07 ldr r3, [pc, #28] ; (14c <main+0x34>)
 12e: 601a str r2, [r3, #0]
    *fourth = p5;
 130: 4b07 ldr r3, [pc, #28] ; (150 <main+0x38>)
 132: 3212 adds r2, #18
 134: 601a str r2, [r3, #0]
 136: e7fe b.n 136 <main+0x1e>
 138: 40002800 .word 0x40002800
 13c: 00001234 .word 0x00001234
 140: ffff9876 .word 0xffff9876
 144: 1234987...


Strntydog (strntydog) wrote :

Further information: I just tried compiling the test case for Cortex-M3 and M4, and they do not generate literal tables like this; it seems to happen only for Cortex-M0, M0+ and M1.

David Brown (davidbrown) wrote :

Just for fun, I've tried this with a newer version of gcc (9.2.1 - the latest on https://godbolt.org), as I have seen similar code myself. When compiling with -Os, you get bad code (big literal tables) for Cortex-M0, but it's fine for Cortex-M3 and above. When compiling with -O2, the Cortex-M0 is still bad; for the Cortex-M3 and above you get code that is not quite as bad as for the M0, but still much worse than the M3 with -Os.
