Simple OR'ed assignment breaks the code silently

Bug #1738730 reported by Murat Ursavas on 2017-12-18
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
GNU Arm Embedded Toolchain
Undecided
Unassigned

Bug Description

Hi,

Normally I am working on our application project with ARM-GCC 4.9. While adding a bootloader, I switched to 5.4 and things started to go out of track. Trace narrowed down the problem to an assignment into a peripheral register.

I've seen that the 4.9 version creates 6 lines of assembly but the same simple statement was creating 15 lines of assembly with 5.4. Then I tried the most recent one, 6-2017-q2-update. The result was the same. Although version 6 creates substantially smaller code compared to 5.4, the particular peripheral was not working at all.

Here's the working output:

101 NVM_SPI->CTRL = USART_CTRL_SYNC | USART_CTRL_MSBF;
00023fd6: ldr r2,[pc,#0x74] ; 0x24048
00023fd8: ldr r3,[r2]
00023fda: movs r3,#0x0
00023fdc: orr r3,r3,#0x400
00023fe0: orr r3,r3,#0x1
00023fe4: str r3,[r2]

And here's the problematic output (both 5 and 6 branch):

101 NVM_SPI->CTRL = USART_CTRL_SYNC | USART_CTRL_MSBF;
20001682: ldr r3,[pc,#0xf8] ; 0x20001778
20001684: ldrb r2,[r3]
20001686: movs r2,#0x0
20001688: orr r2,r2,#0x1
2000168c: strb r2,[r3]
2000168e: ldrb r2,[r3,#0x1]
20001690: movs r2,#0x0
20001692: orr r2,r2,#0x4
20001696: strb r2,[r3,#0x1]
20001698: ldrb r2,[r3,#0x2]
2000169a: movs r2,#0x0
2000169c: strb r2,[r3,#0x2]
2000169e: ldrb r2,[r3,#0x3]
200016a0: movs r2,#0x0
200016a2: strb r2,[r3,#0x3]

****
NVM_SPI: (void*) (0x4000C400UL)
CTRL: volatile uint32_t at offset 0
USART_CTRL_SYNC: (0x1UL << 0)
USART_CTRL_MSBF: (0x1UL << 10)
****

There is no optimization (-O0). Things could get better with other optimization options due to assignment simplification, but I haven't tried it, yet.

Regards,

Additional Info:

Compilers: 5.4.3 and 6-2017-q2-update
OS: Ubuntu 17.10

P.S: Second assembly could be seen to work on RAM. The result is the same when set to work on flash.

john (jkovach) wrote :

Looks like the compiler is accessing NVM_SPI->CTRL one byte at a time. There must be something in the declaration of NVM_SPI->CTRL and maybe in the compiler command line options that make it do it.

Murat Ursavas (murat-ursavas) wrote :

Could be. I'm switching back and forth between 4.9 and 5.4, using the same options. The thing is, why the newer versions try to access a 32bit variable byte by byte on a 32bit CPU? Let's say they have their own causes. So why is the functionality broken when accessing one byte at a time?

john (jkovach) wrote :

Changes in major compiler version number are known to change the way some pragmas, attributes and so on are interpreted. You should examine your code and see if anything in the changelog from gcc 4 to gcc 5 affects it.

Murat Ursavas (murat-ursavas) wrote :

Couldn't find something definite. Just have seen this one:

****************************

2015-12-17 Thomas Preud'homme <email address hidden>

  * config/arm/arm.h (TARGET_USE_MOVT): Check MOVT/MOVW availability
  with TARGET_HAVE_MOVT.
  (TARGET_HAVE_MOVT): Define.
  * config/arm/arm.c (const_ok_for_op): Check MOVT/MOVW
  availability with TARGET_HAVE_MOVT.
  * config/arm/arm.md (arm_movt): Use TARGET_HAVE_MOVT to check movt
  availability.
  (addsi splitter): Use TARGET_USE_MOVT to check whether to use
  movt + movw.
  (symbol_refs movsi splitter): Remove TARGET_32BIT check.
  (arm_movtas_ze): Use TARGET_HAVE_MOVT to check movt availability.
  * config/arm/constraints.md (define_constraint "j"): Use
  TARGET_HAVE_MOVT to check movt availability.
***********************

Maybe removing TARGET_32BIT could have an effect on this issue. It's a long shot, but I don't have further ideas based on this document:

https://gcc.gnu.org/svn/gcc/branches/ARM/embedded-5-branch/gcc/ChangeLog.arm

john (jkovach) wrote :

Maybe you should show us the declarations of the structure and the variable in your code. Compiler options too.

Murat Ursavas (murat-ursavas) wrote :

Sure,

This is the most obvious failing line:

    NVM_SPI->CTRL = USART_CTRL_SYNC | USART_CTRL_MSBF;

NVM_SPI is defined in a header file like this:

    #define NVM_SPI USART1

USART1 is defined in a vendor library like this:

#define USART1 ((USART_TypeDef *) USART1_BASE) /**< USART1 base pointer */

USART1_BASE is defined in another file like this:

#define USART1_BASE (0x4000C400UL) /**< USART1 base address */

Here's the struct definition for USART_TypeDef:

typedef struct
{
  __IOM uint32_t CTRL; /**< Control Register */
  __IOM uint32_t FRAME; /**< USART Frame Format Register */
  __IOM uint32_t TRIGCTRL; /**< USART Trigger Control register */
  __IOM uint32_t CMD; /**< Command Register */
  __IM uint32_t STATUS; /**< USART Status Register */
  __IOM uint32_t CLKDIV; /**< Clock Control Register */
  __IM uint32_t RXDATAX; /**< RX Buffer Data Extended Register */
  __IM uint32_t RXDATA; /**< RX Buffer Data Register */
  __IM uint32_t RXDOUBLEX; /**< RX Buffer Double Data Extended Register */
  __IM uint32_t RXDOUBLE; /**< RX FIFO Double Data Register */
  __IM uint32_t RXDATAXP; /**< RX Buffer Data Extended Peek Register */
  __IM uint32_t RXDOUBLEXP; /**< RX Buffer Double Data Extended Peek Register */
  __IOM uint32_t TXDATAX; /**< TX Buffer Data Extended Register */
  __IOM uint32_t TXDATA; /**< TX Buffer Data Register */
  __IOM uint32_t TXDOUBLEX; /**< TX Buffer Double Data Extended Register */
  __IOM uint32_t TXDOUBLE; /**< TX Buffer Double Data Register */
  __IM uint32_t IF; /**< Interrupt Flag Register */
  __IOM uint32_t IFS; /**< Interrupt Flag Set Register */
  __IOM uint32_t IFC; /**< Interrupt Flag Clear Register */
  __IOM uint32_t IEN; /**< Interrupt Enable Register */
  __IOM uint32_t IRCTRL; /**< IrDA Control Register */
  __IOM uint32_t ROUTE; /**< I/O Routing Register */
  __IOM uint32_t INPUT; /**< USART Input Register */
  __IOM uint32_t I2SCTRL; /**< I2S Control Register */
} USART_TypeDef; /** @} */

And __IOM is defined as:

#define __IOM volatile

The C++ compiler options are:

-g3 -gdwarf-2 -mcpu=cortex-m3 -mthumb -std=c++1y '-DDEBUG=1' -O0 -pedantic -Wall -Wextra -c -fmessage-length=0 -fno-rtti -fno-exceptions -mno-sched-prolog -fno-builtin -fpack-struct -fshort-enums -ffunction-sections -fdata-sections

Note:I've removed folder inclusions and some simple preprocessor definitions to hide some sensitive data about our project. I'm sure they are not related.

Do you need anything else?

john (jkovach) wrote :

-fpack-struct must be responsible for this. Why would you want this option anyway? If you really insist, you might want to have a look at the option "-munaligned-access" as well. The manual says " If unaligned access is not enabled then words in packed data structures are accessed a byte at a time."

Murat Ursavas (murat-ursavas) wrote :

John,

You're right. -fpack-struct option definitely changes the behavior for 6.3.1.

Here's enabled and disabled assembly output (got with 6 branch):

Enabled:
 ldr r3, .L11+4
 ldrb r2, [r3]
 movs r2, #0
 orr r2, r2, #1
 strb r2, [r3]
 ldrb r2, [r3, #1]
 movs r2, #0
 orr r2, r2, #4
 strb r2, [r3, #1]
 ldrb r2, [r3, #2]
 movs r2, #0
 strb r2, [r3, #2]
 ldrb r2, [r3, #3]
 movs r2, #0
 strb r2, [r3, #3]

Disabled:

 ldr r3, .L11+4
 movw r2, #1025
 str r2, [r3]

Looks like unaligned access is enabled for v7-m architecture by default.

One thing remains. OK, it has to access byte by byte to the packed struct members, but why with wrong values? This smells like a bug, isn't it?

[ After some short time ]

By the way, I've looked at the code again and looks like it has to set the register with correct values even with -fpacked-struct option. So why I'm seeing different values in the debug session? Let's say GDB reports wrong values, why the peripheral is not working as expected.

Note: I need packed structs for serializing communication and storage.

Murat Ursavas (murat-ursavas) wrote :

And additionally, why is still accessing words one byte at a time even unaligned access is enabled? This behavior is correct on 4.9 but not correct on 5 and 6 branch.

Leo Havmøller (leh-p) wrote :

New tools can reveal old problems, in this case the use of -fpacked-struct on the entire project.
> Note: I need packed structs for serializing communication and storage.
We all do. Use #pragma pack(push, 1) and #pragma pack(pop) in the affected header files.

Murat Ursavas (murat-ursavas) wrote :

Hi Leo, yes I was going to revise the project and put pragma's for the necessary structs. But this still does not clarify what bothers me, byte by byte method is used even with enabled unaligned access. What's worse is, this creates issues on the real hardware, with an unknown cause, not just making the final image longer.

Clearly there's a regression on 5+ but I don't know the internals enough to find the root cause/commit.

john (jkovach) wrote :

I think you should try to construct a minimal example demonstrating the problem. Who knows, maybe you'll find something in your setup that causes this.

Murat Ursavas (murat-ursavas) wrote :

I guess the problem lies under the -fpack-struct option. No matter unaligned access state is, the compiler switches one byte at a time mode with seeing -fpack-struct option.

I tried this on a small file with the options I said before. Therefore my setup should not be the cause. I will try to build a really small reproducible example, tomorrow.

Murat Ursavas (murat-ursavas) wrote :

I've found the holy grail. Problem is happening if you are using a pointer to a struct. Normally, while accessing to that struct members, nothing is wrong, it accesses word by word. But if you declare a pointer to that struct and do your business with it, problem arises and GCC starts using byte by byte method even while unaligned access is enabled.

Here's worse part: It doesn't matter whether you use -fpack-struct flag or #pragma pack(push, 1). You'll get the same result.

Please find the simplest example as attachment. And use this line (don't forget to add your toolchains path in the beginning)

arm-none-eabi-g++ -gdwarf-2 -mcpu=cortex-m3 -mthumb -std=c++1y '-DDEBUG=1' -O0 -pedantic -Wall -Wextra -c -fmessage-length=0 -fno-rtti -fno-exceptions -mno-sched-prolog -fno-builtin -fshort-enums -ffunction-sections -fdata-sections -MMD -MP -MF"main.d" -MT"main.o" -S -o "main.s" "main.cpp"

Here's the output of the important part:

.LCFI2:
 .cfi_def_cfa_register 7
 .loc 1 17 0
 ldr r2, .L3
 mov r3, r7
 str r3, [r2]
 .loc 1 19 0
 ldr r3, .L3
 ldr r3, [r3]
 ldrb r2, [r3]
 movs r2, #0
 orr r2, r2, #1
 strb r2, [r3]
 ldrb r2, [r3, #1]
 movs r2, #0
 orr r2, r2, #4
 strb r2, [r3, #1]
 ldrb r2, [r3, #2]
 movs r2, #0
 strb r2, [r3, #2]
 ldrb r2, [r3, #3]
 movs r2, #0
 strb r2, [r3, #3]

I would be glad if anyone can pinpoint the root cause. Things should be gone bad between 4.9 and 5.4.

Regards,

Murat Ursavas (murat-ursavas) wrote :

Update: Problem still exists on the fresh release, 7-2017-q4-major.

Murat Ursavas (murat-ursavas) wrote :

Update: The problem also exists on the first official 5 branch release, 5.2 (5-2015-q4-major). Now I'm trying to isolate the change from 4.9 to 5.2, with my non-existing GCC infrastructure knowledge :) I feel like I'm entering adark cave with full of bats :)

john (jkovach) wrote :

Just tested with the latest release. GCC: (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204

gcc -mcpu=cortex-m3 -mthumb -O0 -munaligned-access -S test.c -o test.s

test.c:

#pragma pack(1)
struct s { short i; } volatile *p;
void f(void) { p->i = 0; }

output:

ldr r3, .L2
ldr r3, [r3]
ldrb r2, [r3]
movs r2, #0
strb r2, [r3]
ldrb r2, [r3, #1]
movs r2, #0
strb r2, [r3, #1]

There two problems:
1) Byte-wise access. One would expect "-munaligned-access" to help avoid this.
2) The variable is read before it is written for no apparent reason. This is in violation of the volatile semantics, as far as I can see. And this problem goes away when I remove "volatile" from source.

Murat Ursavas (murat-ursavas) wrote :

Hi John,

Thanks for your interest and effort. I appreciate it. By the way, I've checked the output with 5.2 and the result is the same.

And here's the 4.9 output:

 ldr r3, .L2
 ldr r3, [r3]
 ldrh r2, [r3] @ movhi
 movs r2, #0
 strh r2, [r3] @ movhi
 mov sp, r7

Another boring thing, there is a redundant nop instruction before popping sp, in 5,6 and 7 versions. I don't know the necessity.

For me, the most suspicious thing is the addition of ARM-v8 architec. This is the major change for the 5. Even though the change is announced in 4.7, I've seen significant changes in 5.2 comparing to 4.9. Maybe mixing 64bit logic with 32bit logic could have an effect on this issue. Who knows? (Definitely not me :)

Regards,

john (jkovach) wrote :

My guess is, the nop is there to make sure the literal pool is word-aligned.
As for the bug, our hope is that the guys from GNU ARM Embedded pick it up and report upstream. I don't think there is a point for you to try and pinpoint the exact commit that introduced it.

Murat Ursavas (murat-ursavas) wrote :

Today, I really tried, but fell short. Meld is still open for the source directories, 4.9 and 5.2.

I guess the time is not right. Maybe in the new year. Will be waiting for the team.

Thanks again.

Murat Ursavas (murat-ursavas) wrote :

By the way, could you click on "affects me" link to heat the issue up? This way the team may give it a better priority?

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers