Poor code gen for addr[offset] = val

Bug #1401316 reported by Gary Fuehrer
This bug affects 2 people
Affects: GNU Arm Embedded Toolchain
Status: Confirmed
Importance: Low
Assigned to: Terry Guo

Bug Description

Versions: 4.9q4 (and back through at least 4.7q3)
Host: Windows

Symptom:
Sequences of constructs like "addr[offset] = val;" and/or "addr->offset = val;" generate poorly optimized "-Os" machine code.

Compile the following example, then compile it again with the macro body replaced by the inline assembly statement instead (a sample command line follows the example). The former is 30% larger than the latter.

Example:

#include <stdint.h>

#define setRegsWord(addr, offset, value) \
 *(uint32_t volatile*)(&((uint8_t volatile*)addr)[offset]) = value
/* asm volatile( \
 " str %[Rv], [%[Rbase], %[Immed]]\n\t" \
 : \
 : [Rv] "r" (value), [Rbase] "r" (addr), [Immed] "I" (offset) \
 )*/

void __attribute__((noreturn)) Reset_Handler()
{
 setRegsWord(0x48000000, 8, 0xc000000);
 setRegsWord(0x48000000, 4, 0);
 setRegsWord(0x48000000, 0, 0xaaa8f3a4);
 setRegsWord(0x48000000, 12, 0x64000000);
 setRegsWord(0x48000000, 32, 0x7700);
 setRegsWord(0x48000000, 36, 0x0002a770);
 setRegsWord(0x48000000, 24, 0x20000);

 setRegsWord(0x48000400, 8, 0x500000c0);
 setRegsWord(0x48000400, 4, 0);
 setRegsWord(0x48000400, 0, 0xa00a028c);
 setRegsWord(0x48000400, 12, 0x100);
 setRegsWord(0x48000400, 32, 0);
 setRegsWord(0x48000400, 36, 0x55000077);
 setRegsWord(0x48000400, 24, 0);

 setRegsWord(0x48000800, 8, 0x1100000);
 setRegsWord(0x48000800, 4, 0x8c0);
 setRegsWord(0x48000800, 0, 0x0260503f);
 setRegsWord(0x48000800, 12, 0x00405000);
 setRegsWord(0x48000800, 32, 0);
 setRegsWord(0x48000800, 36, 0x00060600);
 setRegsWord(0x48000800, 24, 0x8c0);

 setRegsWord(0x48000C00, 8, 0x10000);
 setRegsWord(0x48000C00, 4, 0);
 setRegsWord(0x48000C00, 0, 0x20000);
 setRegsWord(0x48000C00, 12, 0);
 setRegsWord(0x48000C00, 32, 0);
 setRegsWord(0x48000C00, 36, 5);
 setRegsWord(0x48000C00, 24, 0);

 setRegsWord(0x48001000, 8, 0);
 setRegsWord(0x48001000, 4, 0);
 setRegsWord(0x48001000, 0, 0x50000);
 setRegsWord(0x48001000, 12, 0);
 setRegsWord(0x48001000, 32, 0);
 setRegsWord(0x48001000, 36, 0);
 setRegsWord(0x48001000, 24, 0x3000000);

 setRegsWord(0x48001400, 8, 0);
 setRegsWord(0x48001400, 4, 0);
 setRegsWord(0x48001400, 0, 0);
 setRegsWord(0x48001400, 12, 0);
 setRegsWord(0x48001400, 32, 0);
 setRegsWord(0x48001400, 36, 0);
 setRegsWord(0x48001400, 24, 0);

 setRegsWord(0x40022000, 0, 18);

 setRegsWord(0xe000ed00, 0x88, 0xf00000);

 do {} while(1);
}
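
For reproduction, a command line along the lines used later in this thread should work. The report itself does not state the exact flags, so "GccConstProb.c" is only a placeholder file name and the -mcpu=cortex-m4 -mthumb flags simply match the later comments:

arm-none-eabi-gcc -c -std=gnu11 -mcpu=cortex-m4 -mthumb -Os GccConstProb.c
arm-none-eabi-size GccConstProb.o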

Terry Guo (terry.guo)
Changed in gcc-arm-embedded:
assignee: nobody → Terry Guo (terry.guo)
Terry Guo (terry.guo)
Changed in gcc-arm-embedded:
status: New → Confirmed
Terry Guo (terry.guo) wrote :

Thanks for reporting such an interesting issue. Basically the code is doing the following:

mov r3, #imme1
str r0, [r3]

Consider the case where #imme1 is too big to be encoded in a 'mov' instruction, but a value like '#imme1 - offset' can be encoded. The clever way would be:
mov r3, #imme1-offset
str r0, [r3, offset]

But it looks like current 4.9 and trunk are not so clever; the less efficient code below is generated:

ldr r3, .L1
str r0, [r3]

.L1:
      .word #imme1

Joey, does this look familiar to you?

Terry Guo (terry.guo) wrote :

A peephole2 pattern should be able to catch this opportunity, but that is rather backend specific. Or maybe the combine pass could do it.

Gary Fuehrer (gfuehrer) wrote :

Thanks for taking this on. As I wrote, it's been this way since I first started using GCC for ARM embedded, which was version 4.7.

My (very unknowledgeable) guess is that some front-end optimizations are scrambling things up too much for the backend peephole pass to work, because I'm presuming that the inline assembly work-around hides things from most of these front-end stages.

From what I've seen, the generated code is already trying to do the "clever way", but doing it absurdly.

If the code is destined to use a constant pool (which I hope it always will, or at least have that as an option -- I love it so much!) then "addr[offset] = value" (and for structs, "addr->offset = value") would generally want to be coded as:

ldr r3, .L1 ;addr
mov r0, value ;or hopefully movs
str r0, [r3, offset] ;offset typically in range for, say, member of struct

.L1: .word addr

If it's clever (as exhibited when using the inline assembly) the "r3" can be re-used or even adjusted with an "adds" for a subsequent addr. If the front end is combining the addr and the offset into a single "#imme1" then a valuable hint from the programmer is getting thrown away -- different member lookups within the same struct are ripe for direct translation.

I know that that's nothing like how compilers go about doing their work, but this is just FWIW.

-Gary

Terry Guo (terry.guo) wrote :

The issue is easy to fix but may have other side effects from the compiler's point of view, which makes it difficult to get the fix accepted by GCC upstream. Meanwhile, may I ask you to rewrite your code in the way shown below, which is more friendly to the compiler and will give you better code size:

typedef struct st
{
  uint32_t a0;
  uint32_t a1;
  uint32_t a2;
  uint32_t a3;
  uint32_t a4;
  uint32_t a5;
  uint32_t a6;
  uint32_t a7;
  uint32_t a8;
  uint32_t a9;
}Device;

void __attribute__((noreturn)) Reset_Handler()
{
#if 0
        // Configure the external pins immediately as some component may depend on them
        setRegsWord(0x48000000, 8, 0xc000000);
        setRegsWord(0x48000000, 4, 0);
        setRegsWord(0x48000000, 0, 0xaaa8f3a4);
        setRegsWord(0x48000000, 12, 0x64000000);
        setRegsWord(0x48000000, 32, 0x7700);
        setRegsWord(0x48000000, 36, 0x0002a770);
        setRegsWord(0x48000000, 24, 0x20000);
#endif
        volatile Device *p = (volatile Device *)(0x48000000);

        p->a2 = 0xc000000;    /* offset 8  */
        p->a1 = 0;            /* offset 4  */
        p->a0 = 0xaaa8f3a4;   /* offset 0  */
        p->a3 = 0x64000000;   /* offset 12 */
        p->a8 = 0x7700;       /* offset 32 */
        p->a9 = 0x0002a770;   /* offset 36 */
        p->a6 = 0x20000;      /* offset 24 */

        do {} while(1);
}
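
The same approach extends to the other register blocks in the original example by declaring one pointer per block. The sketch below is an illustration rather than code from the report; it reuses the Device layout above, whose a0..a9 fields are only placeholders for byte offsets 0 through 36:

#include <stdint.h>

/* Same placeholder layout as above: a0..a9 cover byte offsets 0..36. */
typedef struct { uint32_t a0, a1, a2, a3, a4, a5, a6, a7, a8, a9; } Device;

void __attribute__((noreturn)) Reset_Handler(void)
{
        volatile Device *portA = (volatile Device *)0x48000000;
        volatile Device *portB = (volatile Device *)0x48000400;

        portA->a2 = 0xc000000;   /* was setRegsWord(0x48000000, 8, 0xc000000) */
        portA->a1 = 0;           /* was setRegsWord(0x48000000, 4, 0) */
        portA->a0 = 0xaaa8f3a4;  /* was setRegsWord(0x48000000, 0, 0xaaa8f3a4) */

        portB->a2 = 0x500000c0;  /* was setRegsWord(0x48000400, 8, 0x500000c0) */
        portB->a1 = 0;           /* was setRegsWord(0x48000400, 4, 0) */
        portB->a0 = 0xa00a028c;  /* was setRegsWord(0x48000400, 0, 0xa00a028c) */

        do {} while (1);
}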

Gary Fuehrer (gfuehrer) wrote :

My original code, from which my submitted exhibition code was distilled, was all struct-pointer/member lookups. That's what I'm getting at when I say that both addr[offset] = value and addr->offset have the problem.

Gary Fuehrer (gfuehrer) wrote :

In my original code, this is the macro definition:

#define setRegsWord(regs, member, value) \
/* (regs)->member = value */ \
 asm volatile( \
 " str %[Rv], [%[Rbase], %[Immed]]\n\t" \
 : \
 : [Rv] "r" (value), [Rbase] "r" (regs), [Immed] "I" (__builtin_offsetof(typeof(*regs), member)) \
 )
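
For readers who want to try this pattern, below is a self-contained sketch of how such a macro is used. The GpioRegs layout and its field names are hypothetical, invented only for illustration; the macro body is the one above:

#include <stdint.h>

/* Hypothetical register layout, for illustration only. */
typedef struct {
    uint32_t MODER;   /* offset 0  */
    uint32_t OTYPER;  /* offset 4  */
    uint32_t OSPEEDR; /* offset 8  */
    uint32_t PUPDR;   /* offset 12 */
} GpioRegs;

#define setRegsWord(regs, member, value) \
 /* (regs)->member = value */ \
 asm volatile( \
 " str %[Rv], [%[Rbase], %[Immed]]\n\t" \
 : \
 : [Rv] "r" (value), [Rbase] "r" (regs), [Immed] "I" (__builtin_offsetof(typeof(*regs), member)) \
 )

void __attribute__((noreturn)) Reset_Handler(void)
{
    GpioRegs volatile *gpioA = (GpioRegs volatile *)0x48000000;

    /* The asm statement itself is a single "str rX, [rY, #offset]". */
    setRegsWord(gpioA, OSPEEDR, 0xc000000);
    setRegsWord(gpioA, OTYPER, 0);
    setRegsWord(gpioA, MODER, 0xaaa8f3a4);

    do {} while (1);
}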

Gary Fuehrer (gfuehrer) wrote :

Same example exhibiting the problem but with struct/member instead of array/offset. Code bloat is about half as bad.

Terry Guo (terry.guo) wrote :

For the original code mentioned in comment #8, the smaller code size comes from GccConstProb0.c, which uses inline assembly. The code generated at the O2 level is as below:

  18: f44f 41ee mov.w r1, #30464 ; 0x7700
  1c: 6211 str r1, [r2, #32]
  1e: 4928 ldr r1, [pc, #160] ; (c0 <Reset_Handler+0xc0>)
  20: 6251 str r1, [r2, #36] ; 0x24

Since you are using inline assembly, the compiler can't schedule those instructions. You can see there is a data dependence on register r1. When you actually run such a code pattern, the performance is not optimal. However, since only the low registers are used in this case, the code size is indeed better.

The bigger code size comes from GccConstProb1.c, which uses plain C code. At the O2 level, the compiler schedules instructions to avoid such read/write data dependences as much as possible. The generated code looks like:

   2: 4b14 ldr r3, [pc, #80] ; (54 <Reset_Handler+0x54>)
   4: f8df 905c ldr.w r9, [pc, #92] ; 64 <Reset_Handler+0x64>
   8: f8df e05c ldr.w lr, [pc, #92] ; 68 <Reset_Handler+0x68>
   c: 4e12 ldr r6, [pc, #72] ; (58 <Reset_Handler+0x58>)
   e: 4d13 ldr r5, [pc, #76] ; (5c <Reset_Handler+0x5c>)
  10: 4813 ldr r0, [pc, #76] ; (60 <Reset_Handler+0x60>)
  ......................................................
    2c: f8c2 a008 str.w sl, [r2, #8]
  30: 6051 str r1, [r2, #4]
  32: f8c2 9000 str.w r9, [r2]
  36: f8c2 800c str.w r8, [r2, #12]
  3a: f8c2 c020 str.w ip, [r2, #32]
  3e: f8c2 e024 str.w lr, [r2, #36] ; 0x24

You can see the instructions are shuffled to avoid data dependences. The performance will be better. The side effects are extended register live ranges and more registers consumed, including the high registers. So the code size is increased.

If your code is not performance critical, I suggest compiling it with the -Og option. You will get the expected code size.

Gary Fuehrer (gfuehrer) wrote :

Thanks for taking the time to explain. I used -Og to distill the code that I submitted from my application. I certainly cannot use it for my whole application, though, unless I didn't care at all about size (or speed). But then I wouldn't still be here.

I'm unconvinced that the problem is instruction scheduling. First off, you didn't compile for thumb2. If you do, only registers r0-r4 are used. The wide instructions occur because of offsets that are too large. The instruction ordering is almost identical to what the inline assembly (or -Og) produces. Finally, do the Cortex-M0-4 have a pipeline? It doesn't seem like Cortex-M got a pipeline until M7.

Terry Guo (terry.guo) wrote :

Hi Gary,

I am not suggesting you use -Og for whole projects. I know nothing about your projects. For some of my current projects, I group the project files into two groups: the performance-critical files and those that are not, like board initialization files. I then apply O2 to the first group and Os or Og to the second group. This is just personal experience; you can ignore it.

Let us take GccConstProb1.c, which uses plain C code, as an example. In my opinion, there is no single set of options that gives you both good performance and good code size for the whole project.

I believe the commands and size results below show that instruction scheduling does play a role here:
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-gcc -c -std=gnu11 -mcpu=cortex-m4 -mthumb -O2 GccConstProb1.c
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-size GccConstProb1.o
   text data bss dec hex filename
    292 0 0 292 124 GccConstProb1.o
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-gcc -c -std=gnu11 -mcpu=cortex-m4 -mthumb -O2 GccConstProb1.c -fno-schedule-insns
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-size GccConstProb1.o
   text data bss dec hex filename
    268 0 0 268 10c GccConstProb1.o
terguo01@terry-pc01:GccConstProb struct$
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-gcc -c -std=gnu11 -mcpu=cortex-m4 -mthumb -Og GccConstProb1.c
terguo01@terry-pc01:GccConstProb struct$ arm-none-eabi-size GccConstProb1.o
   text data bss dec hex filename
    228 0 0 228 e4 GccConstProb1.o

Not sure why you say I am not compiling for thumb2; I do compile for thumb2. Below is a thumb2 instruction from GccConstProb1.o:

f8c5 9008 str.w r9, [r5, #8]

Thumb2 mode enables the use of high registers like r9, which requires 32 bits to encode this str instruction. In the inline assembly version, fewer high registers are used for load/store, so the code size is smaller.

About whether there is a pipeline, please check the Arm website, e.g. http://www.arm.com/products/processors/cortex-m/cortex-m0.php; "Pipeline 3-stage" is clearly documented in the Specifications section. Similar things apply to the other M cores. Inside GCC, we have a dedicated pipeline description file for the M4 and generic pipeline descriptions for the others.

Gary Fuehrer (gfuehrer) wrote :

I said that you weren't compiling thumb2 because I incorrectly guessed why you were getting code that used registers (like r9, r10 etc) that I wasn't seeing. I played around with changing the compiler optimization level from -Os or -Og to -O2 or -O3 and now I'm getting those registers used, too. Still quite different assembly than yours, though, but I'm not worried about that.

I'm grateful that you've taken so much time to explain things. It's both educational and interesting to me. Again thanks for the tips and the pointer to the arm specification bit. I don't want to give the impression that I think I know better because I know that I don't. And I feel bad and don't want to rope you into discussing this even more.

However, I believe that you've found this bug report interesting for slightly different reasons than I do (which is a good thing, too, I think). I'm focused on the code generated when I use -Os and only C -- no inline assembly. When I do that (see my last attachment which includes my listing files) there are no registers used above r4, so no wide instructions because of that, and the extra size comes about only because of unnecessarily large offsets ("unnecessary" at least going by what -Og seems to have no problem avoiding).

Have a happy holidays!

Terry Guo (terry.guo)
Changed in gcc-arm-embedded:
importance: Undecided → Low
Terry Guo (terry.guo) wrote :

Hi Gary,

Merry Christmas. What you shared is really valuable to us. Please allow me to wrap up and share some of my conclusions here:
1). I recommend accessing contiguous memory addresses via a structure, which is more friendly to compiler optimization, as shown in GccConstProb1.c.

2). Sometimes better performance comes at the cost of code size, for example function inlining and loop unrolling.

3). The compiler generates less optimal code for the case below:
terguo01@terry-pc01:GccConstProb$ cat h.c
void __attribute__((noreturn)) Reset_Handler()
{
     *(unsigned int *)(&((unsigned char *)0x48000000)[8]) = 0xc000000;
}
terguo01@terry-pc01:GccConstProb$ arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -O2 h.c -S
terguo01@terry-pc01:GccConstProb$ cat h.s

Reset_Handler:
 @ Volatile: function does not return.
 @ args = 0, pretend = 0, frame = 0
 @ frame_needed = 0, uses_anonymous_args = 0
 @ link register save eliminated.
 ldr r3, .L2
 mov r2, #201326592
 str r2, [r3]
 bx lr
.L3:
 .align 2
.L2:
 .word 1207959560
 .size Reset_Handler, .-Reset_Handler
 .ident "GCC: (GNU Tools for ARM Embedded Processors) 4.9.3 20141119 (release) [ARM/embedded-4_9-branch revision 218278]"

We can take advantage of the offset field in the str instruction to avoid the use of a literal pool entry; a hand-written sketch of this follows item 4) below.

4). For -Os, the code below can be improved to get smaller code size:
arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -Os GccConstProb1.c -c
arm-none-eabi-objdump -d GccConstProb1.o

  a4: f8c2 340c str.w r3, [r2, #1036] ; 0x40c
  a8: f8c2 3420 str.w r3, [r2, #1056] ; 0x420
  ac: f8c2 3424 str.w r3, [r2, #1060] ; 0x424
  b0: f8c2 1418 str.w r1, [r2, #1048] ; 0x418

All instructions are 32 bits long. The total code size is 16 bytes. It would be smaller if we did it like below:

add.w r2, r2, #1000
str r3, [r2, #36]
str r3, [r2, #56]
str r3, [r2, #60]
str r3, [r2, #48]

The total code size will be 12 bytes.
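
For illustration, here is a hand-written sketch (not actual compiler output) of the code that item 3) asks for: keep the base address in a register and fold the +8 into the store. Both constants happen to be encodable as Thumb-2 modified immediates, so the literal pool word and its pc-relative load are not needed at all:

Reset_Handler:
 mov.w r3, #0x48000000
 mov.w r2, #0x0c000000 @ 201326592
 str r2, [r3, #8]
 bx lr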

These points are easy to understand, and it would be easy to come up with a hot fix for them. But considered in the overall picture of the compiler, such a hot fix will have side effects. One of my colleagues is working on another compiler optimization task which is supposed to cover cases like these. Let us wait and see.

Terry Guo (terry.guo) wrote :

You are not seeing high registers because you are using -Os while I am using -O2. The bigger offset issue is covered by item 4) of my conclusions above.

Strntydog (strntydog) wrote :

I am also seeing similarly bad code optimisation on a Cortex-M0+ target, like this.

Using gcc-arm-none-eabi-4_9-2015q3, for a Cortex-M0+ target.

In the process of writing code, I notice that the literal tables are full of redundant entries, at ANY Optimization Level.

Here is an example at level -O2 :

This is a disassembly from objdump of a section from my test main:

 148: 4b0b ldr r3, [pc, #44] ; (178 <main+0x34>)
 14a: 601a str r2, [r3, #0]
 14c: 4b0b ldr r3, [pc, #44] ; (17c <main+0x38>)
 14e: 3215 adds r2, #21
 150: 705a strb r2, [r3, #1]
....

 170: 4b09 ldr r3, [pc, #36] ; (198 <main+0x54>)
 172: 0512 lsls r2, r2, #20
 174: 601a str r2, [r3, #0]
 176: e7fe b.n 176 <main+0x32>
 178: 40002800 .word 0x40002800
 17c: 40002804 .word 0x40002804
 180: 40002808 .word 0x40002808
 184: 4000280c .word 0x4000280c
 188: 00001234 .word 0x00001234
 18c: 40002810 .word 0x40002810
 190: 40002818 .word 0x40002818
 194: 98760000 .word 0x98760000
 198: 4000281c .word 0x4000281c

At 148, R3 is loaded with the address 0x40002800.
At 14a, a word is stored using an offset of 0 from that address.
At 14c, R3 is loaded with the address 0x40002804.
At 150, a byte is stored using an offset of 1 from that address, when it could have just used an offset of 5 from the address already loaded into R3 and skipped the load at 14c, AND the table entry, saving 6 bytes and producing faster code.

When you are writing code which is bit-banging GPIO registers, this sort of thing makes a huge difference to performance and code density. And as the GPIO registers are usually (on most micros) clustered in a tight bunch, they can typically all be reached by offsets from a fixed base. I am not sure if this is exactly the same problem reported in this bug report, but if not then it is very similar. It can be seen that the literal table is full of closely placed addresses; even though they are not used contiguously, there is no need to store anything but 0x40002800, because all of the rest can be reached with offsets.

I tried declaring the memory as an array or as a bunch of pointers; neither seems to make any difference to this problem.

Strntydog (strntydog) wrote :

Follow up:
I tried changing my register access to use a structure (as recommended here because it's "more friendly to the compiler"), rather than using the register addresses directly. It made no difference to the redundancy generated in the literal tables.

One thing I notice is that the addresses are constants the compiler knows at compile time, and so are the values I am writing to them. The compiler will generate the constant values (where it can) using an 8-bit immediate load, and then it will transform one constant into another using a single math instruction (add/subtract with 8-bit immediate). But for addresses it does not do the same optimization, where it could just add to and subtract from the base address to generate all of the addresses in the literal table except the first. This would not be as optimal as using appropriate offsets in the loads/stores, but would be significantly more optimal than the current code generation. Even if it can't reach the new destination address with an offset, because it's further away than the offset allows, this kind of optimization should happen before a new literal table entry is created (at least at -Os), as it is denser.

Also, here is a test case:

#include <stdint.h>
#include <stdbool.h>

int main (void)
{
    const uint16_t p1 = 0x1234;
    const uint16_t p2 = 0x9876;
    const uint32_t p3 = 0x12349876;

    const uint32_t p4 = 0x21;
    const uint32_t p5 = 0x33;

    volatile uint16_t* const first16 = (uint16_t*)(0x40002800U);
    volatile uint32_t* const first = (uint32_t*)(0x40002800U);
    volatile uint32_t* const second = (uint32_t*)(0x40002804U);
    volatile uint32_t* const third = (uint32_t*)(0x40002808U);
    volatile uint32_t* const fourth = (uint32_t*)(0x4000280CU);

    *first16 = p1;
    *first = p1;
    *first16 = p2;
    *second = p3;
    *third = p4;
    *fourth = p5;

    while (true) {}
}

Which, using gcc-arm-none-eabi-4_9-2015q3, generates this on a Cortex-M0+ (compiler options: -Os -mcpu=cortex-m0plus -mthumb):

00000118 <main>:
    volatile uint32_t* const first = (uint32_t*)(0x40002800U);
    volatile uint32_t* const second = (uint32_t*)(0x40002804U);
    volatile uint32_t* const third = (uint32_t*)(0x40002808U);
    volatile uint32_t* const fourth = (uint32_t*)(0x4000280CU);

    *first16 = p1;
 118: 4b07 ldr r3, [pc, #28] ; (138 <main+0x20>)
 11a: 4a08 ldr r2, [pc, #32] ; (13c <main+0x24>)
 11c: 801a strh r2, [r3, #0]
    *first = p1;
 11e: 601a str r2, [r3, #0]
    *first16 = p2;
 120: 4a07 ldr r2, [pc, #28] ; (140 <main+0x28>)
 122: 801a strh r2, [r3, #0]
    *second = p3;
 124: 4a07 ldr r2, [pc, #28] ; (144 <main+0x2c>)
 126: 4b08 ldr r3, [pc, #32] ; (148 <main+0x30>)
 128: 601a str r2, [r3, #0]
    *third = p4;
 12a: 2221 movs r2, #33 ; 0x21
 12c: 4b07 ldr r3, [pc, #28] ; (14c <main+0x34>)
 12e: 601a str r2, [r3, #0]
    *fourth = p5;
 130: 4b07 ldr r3, [pc, #28] ; (150 <main+0x38>)
 132: 3212 adds r2, #18
 134: 601a str r2, [r3, #0]
 136: e7fe b.n 136 <main+0x1e>
 138: 40002800 .word 0x40002800
 13c: 00001234 .word 0x00001234
 140: ffff9876 .word 0xffff9876
 144: 1234987...


Strntydog (strntydog) wrote :

Further information: I just tried compiling the test case for Cortex-M3 and M4, and it does not generate literal tables like this; it seems to only do this for Cortex-M0, M0+ and M1.

David Brown (davidbrown) wrote :

Just for fun, I've tried this with a newer version of gcc (9.2.1 - the latest on https://godbolt.org), as I have seen similar code myself. When compiling with -Os, you get bad code (big literal tables) for Cortex-M0 but it's fine for Cortex-M3 and above. When compiling with -O2, the Cortex-M0 is still bad but for the Cortex-M3 and above you get code that is not quite as bad as for the M0, but still much worse than M3 with -Os.
