Compiled application from GCC v7 differs and does not work in comparison to GCC v4.8 version

Bug #1743765 reported by Jamie
This bug affects 1 person
Affects: GNU Arm Embedded Toolchain
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I am compiling the source code from https://github.com/NordicPlayground/nRF52840-ble-secure-bootloader/ using ARM GCC v7-2017-q4-major (7.2.1) and comparing the result with builds from ARM GCC v4.8.4, Segger Embedded Studio (with its built-in GCC, version unknown) and Keil v5.

The outputs from Keil, Segger Embedded Studio and ARM GCC v4.8.4 all work on the target device and produce a hex file of 66KB. The ARM GCC v7.2.1 build, however, produces a smaller hex file of 57KB. It illuminates the LED on the target board, but no other part of the application functions. Those parts call 'softdevice' functions, i.e. functions in an application that already resides on the embedded hardware. So something, possibly a lot, seems to be missing from the build.

I don't know how to investigate what is causing the problem, e.g. whether code is missing. The compilation flags are the same for all builds: debug enabled, optimise for size. I've attached a zip of 3 hex files: the larger ones from GCC v4.8 and SES's version of GCC, and the smaller non-functioning one from GCC v7. I also made a quick script to compare which functions are in the disassembly of both versions; it is in the zip file too, along with the assembly files themselves in case they're of any use.

Toolchains being used are from the binary releases on the launchpad site.
Host machine is Windows 7 64-bit.
Testcase:
1. Download the nRF52 SDK v14.2.0 from http://developer.nordicsemi.com/nRF5_SDK/nRF5_SDK_v14.x.x/ and extract it.
2. Clone micro-ecc from https://github.com/kmackay/micro-ecc into the external/micro-ecc directory and build the library using GCC in external/micro-ecc/nrf52hf_armgcc/armgcc.
3. Clone https://github.com/NordicPlayground/nRF52840-ble-secure-bootloader/ into the examples directory.
4. Remove the #if check from examples/nRF52840-ble-secure-bootloader/dfu_req_handling/dfu_public_key.c.
5. Build using GCC in the directory examples/nRF52840-ble-secure-bootloader/bootloader_secure_ble/pca10056/armgcc.
The resultant hex file should be 57KB using GCC v7 from launchpad, which is seemingly missing some part of the code.

Jamie (lairdj) wrote :
Leo Havmøller (leh-p) wrote :

New toolchains can reveal old bugs.
Start by generating and comparing .map files for the builds.

john (jkovach) wrote :

Sounds like you found a bug in a GitHub project and you are using the wrong bug tracker to report it.

Jamie (lairdj) wrote :

Why would this be a bug in the project and not GCC? Earlier versions of GCC work fine and Keil works fine. I haven't tried IAR, but there is a project for that too, so I assume it also works fine. The only build with a problem, and a 9KB smaller file size, is the GCC 7 one, so I'm more inclined to believe it's a problem with GCC unless shown otherwise.

I've found a workaround: disabling the link-time optimisation flag passed to GCC (-flto). The GCC 7 hex output is now 65KB and the application works.

john (jkovach) wrote :

It could be a GCC bug, of course. But don't expect GCC people to start looking for it in the files that you posted. I've had my share of debugging GCC output with LTO enabled and high optimization level. I must say, it's not easy. Someone familiar with this GitHub code would be more appropriate for this kind of work. In the end, it could still end up being a bug in the code, not GCC. The fact that it works with other compilers doesn't prove anything.

Thomas Preud'homme (thomas-preudhomme) wrote :

Hi Jamie,

It is very common for new compiler versions to expose bugs in projects, because they optimize more aggressively based on undefined behavior. It could of course also be a bug in the compiler; that is impossible to know without a full analysis.

To do this, we usually expect a minimal reproducer testcase with a clear explanation of what is going wrong. I.e. we would need a smaller example and an explanation of the form "function foo is not called" or "line X does not seem to be executed" or, even better, "problem happens in this instruction". As it stands, we cannot investigate the issue you are facing.

One way to reduce the testcase would be to compile only one file using the newer toolchain and the others using the old toolchain and see if it still works. If yes, repeat on another file until you have isolated the file that has the bug. Then try to isolate the function (you can split the file into several and put each function in a different file so you can reuse the method I've just described). Once that's done, you can compare the assembly and see if you can spot what's wrong. If not, at least try to make a testcase with just that function, such that we have only a single file to look at. Use -save-temps when compiling that testcase, which will give a .i file with all preprocessing done (i.e. we don't need any special headers to compile that file).

Hope this helps.

Best regards.

David Brown (davidbrown) wrote :

Since the code works when LTO is disabled, the most likely cause is code or data that is not being linked because its use cannot be traced back to main(). In C, all code other than the library startup routines is expected to be called directly or indirectly from main(). LTO traces this to see what code is actually used, and to drop code and data that can't legally be accessed from C. You might need to add "externally_visible" attributes to certain functions, or use KEEP in the linker script, in order to avoid dropping such code. Likely candidates include interrupt functions, "naked" functions, and parts of context switch routines in an RTOS.

Examining the map files is very useful for seeing such cases.
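
As a rough illustration of the attribute approach, here is a minimal sketch of an interrupt handler marked so that LTO does not discard it. The macro name KEEP_FOR_LTO, the counter and the accessor are hypothetical and not taken from the bootloader project; SysTick_Handler is used only as a typical handler name, and the real names depend on the device startup file. The "externally_visible" attribute is GCC-specific, so the sketch falls back to "used" alone on other compilers.

```c
#include <stdint.h>

/* Hypothetical helper macro: ask the compiler to keep a symbol even if
 * no C-level caller is visible during link-time optimisation. */
#if defined(__GNUC__) && !defined(__clang__)
#define KEEP_FOR_LTO __attribute__((used, externally_visible))
#else
#define KEEP_FOR_LTO __attribute__((used))
#endif

/* Written by the handler, read by normal code, hence volatile. */
static volatile uint32_t tick_count;

/* Only referenced from the vector table (assembly or linker script),
 * so LTO cannot see any call to it from main(). */
KEEP_FOR_LTO void SysTick_Handler(void)
{
    tick_count++;
}

/* Hypothetical accessor used by application code. */
uint32_t tick_count_get(void)
{
    return tick_count;
}
```

Whether a given handler was actually kept can then be checked by looking for its symbol in the map file of each build.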

The other thing about LTO is that it can be a lot more sensitive to bugs in the C code: code that appears to work on more limited compilers (or with more limited optimisations enabled) despite incorrect C usage. Common cases are getting type aliasing wrong (compile with "-fno-strict-aliasing" to turn off this optimisation), signed integer overflow leading to different code generation when merging functions across modules (try "-fwrapv" to test this hypothesis), and assuming that calling functions in other modules acts as a memory barrier (so that missing "volatile" qualifiers or memory barriers go unnoticed).
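
To make these three pitfalls concrete, the following is a small sketch of one well-defined alternative for each. All function and variable names are hypothetical and do not come from the bootloader source; the code assumes a 32-bit IEEE-754 float, as on Cortex-M targets.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

/* Type aliasing: reading a float through a uint32_t pointer breaks the
 * strict aliasing rule; copying the bytes with memcpy is well defined. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Signed overflow: test the operands before adding, instead of adding
 * first and checking for wrap-around, which is undefined behaviour. */
bool add_would_overflow(int a, int b)
{
    return (b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b);
}

/* Shared flag: without volatile, an optimiser that can see across
 * modules may cache the value instead of re-reading it. */
static volatile bool data_ready;

/* Hypothetical interrupt-side setter. */
void on_data_isr(void)
{
    data_ready = true;
}

/* Hypothetical main-loop side: report and clear the flag. */
bool take_data_ready(void)
{
    if (data_ready) {
        data_ready = false;
        return true;
    }
    return false;
}
```

If building with "-fno-strict-aliasing" or "-fwrapv" makes the GCC 7 build work, that points at the corresponding class of bug in the source rather than at the compiler.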

Dominik Vogel (windoze) wrote :

This is probably related to #1747966
