Much poorer code generated at -O2 than -Os for accessing an array through a pointer
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
GNU Arm Embedded Toolchain | New | Undecided | Unassigned |
Bug Description
These tests have all been done with flags "-mcpu=cortex-m4 -mthumb", in the context of writing to the bit-band region on a microcontroller. The example here simply writes a number of fixed values to fixed addresses, in a non-sequential order.
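For readers unfamiliar with bit-banding, here is a minimal sketch of how an alias address such as 0x43fe1800 relates to the underlying peripheral byte, using the standard Cortex-M3/M4 mapping (alias = alias base + byte offset * 32 + bit number * 4). The macro and function names are hypothetical and are not part of the test case. The byte address 0x400ff0c0 is only what the arithmetic yields for BB_ADDRESS; the report does not say which peripheral it belongs to.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M3/M4 peripheral bit-band region and its alias region. */
#define PERIPH_BASE    0x40000000u
#define PERIPH_BB_BASE 0x42000000u

/* Alias word address for bit `bit` (0..7) of the byte at `byte_addr`. */
static uint32_t bb_alias(uint32_t byte_addr, uint32_t bit)
{
    assert(bit < 8u);
    return PERIPH_BB_BASE + (byte_addr - PERIPH_BASE) * 32u + bit * 4u;
}

/* Inverse mapping: the peripheral byte that an alias word refers to. */
static uint32_t bb_byte_addr(uint32_t alias)
{
    return PERIPH_BASE + (alias - PERIPH_BB_BASE) / 32u;
}

/* Inverse mapping: the bit number (0..7) within that byte. */
static uint32_t bb_bit(uint32_t alias)
{
    return ((alias - PERIPH_BB_BASE) / 4u) % 8u;
}
```

Under this mapping, p[0] to p[7] in the functions below would be bits 0 to 7 of the single byte at 0x400ff0c0, so each store sets or clears one bit.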
Three versions of the function are:
    #include <stdint.h>

    #define BB_ADDRESS 0x43fe1800

    void test1(void) {
        volatile uint32_t * const p = (uint32_t *) BB_ADDRESS;
        p[3] = 1;
        p[4] = 2;
        p[1] = 3;
        p[7] = 4;
        p[0] = 6;
    }

    void test2(void) {
        typedef struct {
            uint32_t b[8];
        } bb_t;
        volatile bb_t * const p = (bb_t *) BB_ADDRESS;
        p->b[3] = 1;
        p->b[4] = 2;
        p->b[1] = 3;
        p->b[7] = 4;
        p->b[0] = 6;
    }

    void test3(void) {
        typedef struct {
            uint32_t b0, b1, b2, b3, b4, b5, b6, b7;
        } bb_t;
        volatile bb_t * const p = (bb_t *) BB_ADDRESS;
        p->b3 = 1;
        p->b4 = 2;
        p->b1 = 3;
        p->b7 = 4;
        p->b0 = 6;
    }
Code generated for test2 and test3 is identical with the tests I tried. Testing for test1 and test2 was done with -Os and -O2 (-O3 gave the same results as -O2), using gcc 4.6 (from https:/
// test1, -O2
test1:
push {r4, r5, r6, r7}
ldr r2, .L3
ldr r6, .L3+4
ldr r4, .L3+8
ldr r1, .L3+12
ldr r3, .L3+16
movs r0, #1
str r0, [r2]
movs r7, #2
movs r5, #3
movs r0, #4
movs r2, #6
str r7, [r6]
str r5, [r4]
str r0, [r1]
pop {r4, r5, r6, r7}
str r2, [r3]
bx lr
.L3:
.word 1140725772
.word 1140725776
.word 1140725764
.word 1140725788
.word 1140725760
// test2, -O2
test2:
push {r4, r5}
ldr r3, .L7
movs r5, #1
movs r4, #2
movs r0, #3
movs r1, #4
movs r2, #6
str r5, [r3, #12]
str r4, [r3, #16]
str r0, [r3, #4]
pop {r4, r5}
str r1, [r3, #28]
str r2, [r3]
bx lr
.L7:
.word 1140725760
// test1, -Os
test1:
ldr r3, .L2
movs r2, #1
str r2, [r3]
movs r2, #2
str r2, [r3, #4]
movs r2, #3
str r2, [r3, #-8]
movs r2, #4
str r2, [r3, #16]
movs r2, #6
str r2, [r3, #-12]
bx lr
.L2:
.word 1140725772
// test2, -Os
test2:
ldr r3, .L5
movs r2, #1
str r2, [r3, #12]
movs r2, #2
str r2, [r3, #16]
movs r2, #3
str r2, [r3, #4]
movs r2, #4
str r2, [r3, #28]
movs r2, #6
str r2, [r3]
bx lr
.L5:
.word 1140725760
The code for test1 and test2 with -Os is the same (barring an irrelevant difference in the choice of base address: test1 uses &p[3] as its base, test2 uses &p[0]). It is optimal in its use of registers and of the register+offset addressing modes. The interleaving of the stores with the other instructions is also good, especially for bit-band region access, where each store takes several bus cycles (a detail the compiler does not know).
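To make the literal-pool values in the listings easier to read, here is a small sketch that reproduces them from BB_ADDRESS. The helper names are hypothetical; the decimal constants and offsets are copied from the listings above. It is meant to run on a host machine, not on the target.

```c
#include <assert.h>
#include <stdint.h>

#define BB_ADDRESS 0x43fe1800u

/* Address of element `index` of the uint32_t array at BB_ADDRESS. */
static uint32_t element_address(uint32_t index)
{
    return BB_ADDRESS + index * (uint32_t)sizeof(uint32_t);
}

/* Byte offset of element `index` relative to &p[3], the base that the
   -Os version of test1 loads into r3. */
static int32_t offset_from_p3(uint32_t index)
{
    return (int32_t)element_address(index) - (int32_t)element_address(3u);
}

/* Check the .word values and store offsets seen in the listings. */
static void check_literal_pools(void)
{
    /* test1 at -O2: one literal per store (.L3 to .L3+16). */
    assert(element_address(3u) == 1140725772u);
    assert(element_address(4u) == 1140725776u);
    assert(element_address(1u) == 1140725764u);
    assert(element_address(7u) == 1140725788u);
    assert(element_address(0u) == 1140725760u);

    /* test1 at -Os: a single base (&p[3]) plus signed offsets. */
    assert(offset_from_p3(3u) == 0);
    assert(offset_from_p3(4u) == 4);
    assert(offset_from_p3(1u) == -8);
    assert(offset_from_p3(7u) == 16);
    assert(offset_from_p3(0u) == -12);
}
```

Calling check_literal_pools() from a host-side main() should pass silently. It shows that the five words in the -O2 pool for test1 are just BB_ADDRESS plus 12, 16, 4, 28 and 0, which are the same offsets that test2 encodes directly in its str instructions.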
For -O2, the code for test1 takes a good deal more space, more registers (leading to more pushes and pops), and more time due to the additional instructions and poorer scheduling. Code for test2 is not quite as bad, but still has poorer scheduling and register usage than with -Os.
It is normal for -Os to sometimes give faster code than -O2. But the differences should not be this large, nor should the pointer version and the struct version differ like this. This is not a correctness bug, since all the generated code is correct, but it is a missed optimisation.
The behaviour is the same with gcc 7.