gcc-10 breaks on armhf (flaky): internal compiler error: Segmentation fault
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gcc | In Progress | Medium | |
gcc-10 (Ubuntu) | Confirmed | Medium | Unassigned |
Bug Description
Hi,
This could be the same as bug 1887557, but as I don't have enough data I'm filing it as an individual issue for now.
I have only seen this happening on armhf so far.
In 2 of 5 groovy builds of qemu 5.0 this week I have hit the issue, but it is flaky.
Flakiness:
1. different file
first occurrence
/<<PKGBUILDDIR>
second occurrence
/<<PKGBUILDDIR>
Being so unreliable, I can't provide much more yet.
I filed it mostly for awareness and so that it can be dup'ed onto the right bug if there is a better one.
Christian Ehrhardt (paelzer) wrote : | #1 |
Christian Ehrhardt (paelzer) wrote : | #2 |
Christian Ehrhardt (paelzer) wrote : | #3 |
There was another one in Groovy as of yesterday.
https:/
https:/
...
qapi/qapi-
qapi/qapi-
6570 | }
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:/
...
The bug is not reproducible, so it is likely a hardware or OS problem.
So the compiler itself recognizes that it isn't the source code (alone) but some flaky external factor.
It seems qemu builds in groovy hit this in ~1/3 of the builds we do on armhf - not sure if that is enough for you to debug?
Matthias Klose (doko) wrote : | #4 |
No, try a local build until you have a reproducer. When DEB_BUILD_OPTIONS is set, the compiler driver retries up to three times to see if it's reproducible.
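A local rebuild loop of the kind suggested here could look like the sketch below. The package choice, the DEB_BUILD_OPTIONS value, and the loop structure are illustrative assumptions, not taken from the report:

```shell
# Rebuild the package locally in a loop until the flaky ICE reproduces.
# Package name and DEB_BUILD_OPTIONS value are illustrative only.
set -u
apt-get source qemu
cd qemu-*/
i=1
while DEB_BUILD_OPTIONS="parallel=4" dpkg-buildpackage -us -uc -b; do
    echo "build $i succeeded; retrying to catch the flaky ICE"
    i=$((i + 1))
done
echo "build $i failed - inspect the log for the internal compiler error"
```

Each successful iteration restarts the build; the loop only exits when a build fails, which on a flaky ICE can take many hours.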
description: | updated |
Balint Reczey (rbalint) wrote : | #5 |
Found it again in glibc 2.32-0ubuntu3 build.
vfscanf-internal.c: In function ‘__vfscanf_
vfscanf-
3057 | }
| ^
Christian Ehrhardt (paelzer) wrote : | #6 |
I'm building qemu (known to be able to trigger it) on Canonistack armhf LXD container in an arm64 VM (the setup that should be closest to the failing builders).
I also installed whoopsie and apport to catch even a single crash.
But I've been building for quite some hours by now and nothing has happened.
I'll let it run in a loop for the rest of the day, but if it won't trigger we need a better approach to corner this bug.
Christian Ehrhardt (paelzer) wrote : | #7 |
I've been compiling for almost 24h now; it just won't crash :-/
Not sure what else I could do to more likely reproduce this ...
Christian Ehrhardt (paelzer) wrote : | #8 |
Another breakage at
https:/
I had to retry it; we will see if it works on retry as before.
Christian Ehrhardt (paelzer) wrote : | #9 |
And again on the same :-/
cc -iquote /<<PKGBUILDDIR>
The bug is not reproducible, so it is likely a hardware or OS problem.
There seems to be no pattern to it (e.g. which source file it breaks on), just a probability that likely increases with source size. But I wonder what else I could do on top of the canonistack build that I have tried - maybe concurrency?
Christian Ehrhardt (paelzer) wrote : | #10 |
cc -iquote /<<PKGBUILDDIR>
during RTL pass: reload
/<<PKGBUILDDIR>
/<<PKGBUILDDIR>
2936 | }
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
Now hit at 3/3 retries, which is exactly what we were afraid might happen ...
Changed in gcc-10 (Ubuntu): | |
importance: | Undecided → Critical |
Christian Ehrhardt (paelzer) wrote : | #11 |
Bumping the priority since - as we feared - this is starting to become a service problem (what if we can't rebuild anymore?)
Christian Ehrhardt (paelzer) wrote : | #12 |
I reduced the CPU/Mem of my canonistack system that I try to recreate on (to be more similar).
Also I now do run with DEB_BUILD_
/me hopes this might help to finally trigger it in a debuggable environment.
P.S.: I'm now at 4/4 retries that failed for the real build ... :-/ Gladly, it worked on the fifth retry.
P.P.S.: Note to myself: 4 CPU/8G memory is the real size used (I have 4/4 atm since I set it up before I could reach anyone)
Seth Forshee (sforshee) wrote : | #13 |
We're also seeing this in kernel builds.
Christian Ehrhardt (paelzer) wrote : | #14 |
I got the crash in the repro env.
dmesg holds no OOM which is good - also no other dmesg/journal entry that would be related.
It might depend on concurrent execution, as that was the primary change compared to last time.
Unfortunately I had not set up apport/whoopsie to catch the crash :-/
I've installed them now and run the formerly breaking command in a loop.
For the sake of "just eating cpu cycles" I have spawned some cpu hogs in the background.
But with all that in place it ran the compile 300 times without a crash :-/
It seems I have to re-run in the build env and hope that apport will catch it into /var/crash this time :-/
Christian Ehrhardt (paelzer) wrote : | #15 |
Finally:
cc -iquote /root/qemu-
during RTL pass: reload
/root/qemu-
/root/qemu-
12479 | }
| ^
...
The bug is not reproducible, so it is likely a hardware or OS problem.
make[2]: *** [/root/
make[2]: Leaving directory '/root/
make[1]: *** [Makefile:527: i386-linux-
make[1]: *** Waiting for unfinished jobs....
Still nothing in /var/crash to report :-/
Why is that? I have apport/whoopsie installed and the kernel is set up:
$ sysctl -a | grep core_patt
kernel.
Also I have set
$ cat ~/.config/
[main]
unpackaged=true
This is armhf lxd on arm64 host - maybe apport has a guest/host problem here?
@Doko - do you happen to know if there are any extra hoops to jump through to get a crash report from gcc when it crashes in debuild?
Christian Ehrhardt (paelzer) wrote : | #16 |
Ok, apport through the stack of LXD is ... not working.
I have used a more mundane core pattern and a C test program to ensure I will get crash dumps.
$ cat /proc/sys/
/var/crash/
$ gcc test.c ; ./a.out ; ll /var/crash/
Segmentation fault (core dumped)
total 3
drwxrwsrwt 2 root whoopsie 3 Sep 24 05:48 ./
drwxr-xr-x 13 root root 15 Sep 23 09:40 ../
-rw------- 1 root whoopsie 208896 Sep 24 05:48 core.a.
Trying to run into the real gcc crash again with this ensured ...
Christian Ehrhardt (paelzer) wrote : | #17 |
Three reruns later I got
cc -iquote /root/qemu-
during RTL pass: reload
/root/qemu-
/root/qemu-
12479 | }
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
The bug is not reproducible, so it is likely a hardware or OS problem.
Again no gcc crash dump to find - how is it disabling that ... ?!?
I was reading through /usr/share/
Christian Ehrhardt (paelzer) wrote : | #18 |
Interim Summary:
- hits armhf compiles of various large source projects; chances are it is completely random
  and just hits those more likely because they compile more
- the build system auto-retries the compiles and they work on retry, eventually reported as "The bug
  is not reproducible, so it is likely a hardware or OS problem."
- the bug always occurs on different source files; retrying a failed one works hundreds of
  times, so it seems to be sort of random when it hits and not tied to the source
- it seems we need concurrency to trigger it, but again that might just have increased the
  likelihood
- I can trigger it reliably now in ~2-8h of compile time on Canonistack when building qemu
  in an armhf LXD container on an arm64 host (same as the builders)
- despite my tries I'm unable to gather a crash dump of the gcc segfault and would be happy
  about a hint/advice on that
Christian Ehrhardt (paelzer) wrote : | #19 |
Not sure if it is entirely random; it hit for the second time on
/<<PKGBUILDDI
in like 2/8 of the hits I've had so far. Given how much code it builds, that is unlikely to be an accident.
Christian Ehrhardt (paelzer) wrote : | #20 |
I tried to isolate what was running concurrently and found 7 gcc calls.
I have set them up to run concurrently in endless loops each.
That way they reached a lot of iterations without triggering the issue :-/
I don't know how to continue :-/
But I can share a login to this system and show how to trigger the bug.
The following will get you there and trigger the bug usually in 1-2 loops (~4h on average)
$ ssh ubuntu@10.48.130.69
$ lxc exec groovy-gccfail bash
# cd qemu-5.0/
# i=1; export DEB_BUILD_
@Doko could you take over from here as I'd hope you know how to force gcc to give you a dump?
I imported your key to the system mentioned above.
Christian Ehrhardt (paelzer) wrote : | #21 |
This was brought up with foundations last week in our sync, and it was mentioned that someone would look into it for further guidance on the case. Since nothing happened, I'll add the rls-gg-incoming tag to make sure it is re-visited in your bug meetings.
I beg your pardon, I know it is your tag and please feel free to remove it if it really is incorrect here - but I just want a response (more or less any) from someone able to decide whether this is actually critical (or not) and how to go on.
tags: | added: rls-gg-incoming |
Christian Ehrhardt (paelzer) wrote : | #22 |
There is a new gcc-10 version from two days ago in groovy now.
I was talking with doko and we wanted to try different gcc-10 versions in general, trying to corner the issue down to when it started to appear.
https:/
https:/
https:/
https:/
I usually had the crash in 1-2 runs, so I will consider 4 good runs as the issue not being present. Although there is some raciness to this, I just can't wait much longer without a single test growing beyond a day :-/
I'll update once I've got more results.
Christian Ehrhardt (paelzer) wrote : | #23 |
Christian Ehrhardt (paelzer) wrote : | #24 |
Downloaded the other two as well and running on https:/
Christian Ehrhardt (paelzer) wrote : | #25 |
FYI: This has passed two runs cleanly by now, but that isn't enough. I need to have it running overnight to be sure about 10.1.
Christian Ehrhardt (paelzer) wrote : | #26 |
https:/
Now on https:/
Christian Ehrhardt (paelzer) wrote : | #27 |
So all 10.x that I could get fail:
https:/
https:/
https:/
https:/
Now looking which 9.x I could try ...
Christian Ehrhardt (paelzer) wrote : | #28 |
https:/
So the breakage was between 9.3.0-18ubuntu1 and 10-20200425-
How do we continue from here - will you throw me PPA builds, and/or do you still have debs anywhere that I should try?
Christian Ehrhardt (paelzer) wrote : | #29 |
Trying gcc-snapshot 1:20200917-1ubuntu1 now
Christian Ehrhardt (paelzer) wrote : | #30 |
gcc-snapshot 1:20200917-1ubuntu1 fails in other places.
/root/qemu-
0xf0afc3 internal_error(char const*, ...)
???:0
0x8fa705 verify_
???:0
0x5f644b rest_of_
???:0
0x1f61c7 finish_
???:0
0x246ef9 c_parser_
???:0
0x254d81 c_parse_file()
???:0
0x2a3305 c_common_
???:0
So gcc-snapshot is no good to try this :-/
Christian Ehrhardt (paelzer) wrote : | #31 |
Doko passed me gcc-10 - 10.2.0-14ubuntu0.1 from https:/
Still building on armhf, but I'll give those a try once complete.
Christian Ehrhardt (paelzer) wrote : | #32 |
As expected, installing the non-stripped build removed the dbgsym:
The following packages will be REMOVED:
gcc-10-dbgsym
The following packages will be upgraded:
cpp-10 g++-10 gcc-10 gcc-10-base gcc-10-multilib libasan6 libatomic1 libcc1-0 libgcc-10-dev libgcc-s1 libgomp1 libsfasan6 libsfatomic1 libsfgcc-10-dev libsfgcc-s1 libsfgomp1 libsfubsan1
libstdc++-10-dev libstdc++-10-pic libstdc++6 libubsan1
21 upgraded, 0 newly installed, 1 to remove and 0 not upgraded.
This is now running and likely to crash later today.
But since I failed to get a crash dump before, that (how to get one) will be the remaining issue we need to solve.
Christian Ehrhardt (paelzer) wrote : | #33 |
With this build the crash still does not leave a .crash file, but it is more verbose:
cc -iquote /root/qemu-
during RTL pass: reload
/root/qemu-
/root/qemu-
12479 | }
| ^
0x532d6b crash_signal
../../
0x523a5b avoid_constant_
../../
0x4f6f9d commutative_
../../
0x4f705b swap_commutativ
../../
0x51deb3 simplify_
../../
0x51df01 simplify_
../../
0x42c191 lra_constraints
../../
0x41f483 lra(_IO_FILE*)
../../
0x3f0915 do_reload
../../
0x3f0915 execute
../../
Does this help you in any way?
Christian Ehrhardt (paelzer) wrote : | #34 |
I'll re-run and dump a few of them just to help you to get to the root cause:
cc -iquote /root/qemu-
0x532d6b crash_signal
../../
0x41d0c7 add_regs_
../../
0x41d1c9 add_regs_
../../
0x41d1c9 add_regs_
../../
0x41e28f lra_update_
../../
0x41e3d5 lra_update_
../../
0x41e3d5 lra_push_insn_1
../../
0x436bb5 spill_pseudos
../../
0x436bb5 lra_spill()
../../
0x41f4ef lra(_IO_FILE*)
../../
0x3f0915 do_reload
../../
0x3f0915 execute
../../
Christian Ehrhardt (paelzer) wrote : | #35 |
cc -iquote /root/qemu-
0x532d6b crash_signal
../../
0x71769f thumb2_
../../
0x717c15 arm_legitimate_
../../
0x717c15 arm_legitimate_
../../
0x427eef valid_address_p
../../
0x427eef simplify_
../../
0x4287ed curr_insn_transform
../../
0x42c133 lra_constraints
../../
0x41f483 lra(_IO_FILE*)
../../
0x3f0915 do_reload
../../
0x3f0915 execute
../../
Christian Ehrhardt (paelzer) wrote : | #36 |
gcc-snapshot still has various issues - but not the crash
/root/qemu-
44 | };
| ^
...
/root/qemu-
Can't continue with gcc-snapshot due to those (even with the newer version).
Christian Ehrhardt (paelzer) wrote : | #37 |
Defaults:
# gcc -Q --help=target | grep -e '-marm' -e '-mthumb'
-marm [disabled]
-mthumb [enabled]
-mthumb-interwork [enabled]
Doko suggested changing that by using -marm.
This has been running for a while, but needs some more time to trigger ...
Christian Ehrhardt (paelzer) wrote : | #38 |
@Doko - I can confirm that with -marm the issue is gone.
I have had 6 full runs yesterday and overnight.
We can conclude that -mthumb is a requirement to trigger the issue.
Christian Ehrhardt (paelzer) wrote : | #39 |
I spoke too soon - after ~7.5 runs I got the following with -marm:
cc -iquote /root/qemu-
during RTL pass: reload
/root/qemu-
/root/qemu-
12519 | }
| ^
cc -iquote /root/qemu-
Christian Ehrhardt (paelzer) wrote : | #40 |
FYI now Testing 10.2.0-14ubuntu0.2 from https:/
I've stopped setting -marm to trigger the issue "faster"; please let me know if you want me to continue to use -marm for those tests.
Christian Ehrhardt (paelzer) wrote : | #41 |
The extra checks that are enabled trigger the same issues I was seeing with gcc-snapshot (maybe they have it enabled as well?).
/root/qemu-
/root/qemu-
1935 | static abi_long do_setsockopt(int sockfd, int level, int optname,
| ^~~~~~~~~~~~~
...
/root/qemu-
/root/qemu-
7674 | static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
| ^~~~~~~~~~~
/root/qemu-
/root/qemu-
...
I see many of those, but all are only "note:" level. When searching for the actual issue I now find this more verbose output (next comment for readability):
Christian Ehrhardt (paelzer) wrote : | #42 |
Does the following help at all? Do you want the source and preprocessed source for it?
cc -iquote /root/qemu-
/root/qemu-
44 | };
| ^
<array_type 0xf64ca660
type <integer_type 0xf7af2420 unsigned int asm_written public unsigned SI
size <integer_cst 0xf729fe58 constant 32>
unit-size <integer_cst 0xf729fe70 constant 4>
align:32 warn_if_not_align:0 symtab:-146335680 alias-set -1 canonical-type 0xf7af2420 precision:32 min <integer_cst 0xf72b00f0 0> max <integer_cst 0xf72b00d8 4294967295>
SI size <integer_cst 0xf729fe58 32> unit-size <integer_cst 0xf729fe70 4>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf64ca660
domain <integer_type 0xf70dbd20
type <integer_type 0xf7af2060 sizetype public unsigned SI size <integer_cst 0xf729fe58 32> unit-size <integer_cst 0xf729fe70 4>
SI size <integer_cst 0xf729fe58 32> unit-size <integer_cst 0xf729fe70 4>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf70dbd20 precision:32 min <integer_cst 0xf729fe88 0> max <integer_cst 0xf729fe88 0>>>
/root/qemu-
<array_type 0xf64ca660
type <integer_type 0xf7af2420 unsigned int ...
Christian Ehrhardt (paelzer) wrote : | #43 |
The last one now is reproducible (not sure if that is what the segfault was), but still useful.
$ cd /root/qemu-
$ cc -iquote /root/qemu-
/root/qemu-
44 | };
| ^
<array_type 0xf6ba1a80
type <integer_type 0xf7edb420 unsigned int asm_written public unsigned SI
size <integer_cst 0xf7688e58 constant 32>
unit-size <integer_cst 0xf7688e70 constant 4>
align:32 warn_if_not_align:0 symtab:-142400624 alias-set -1 canonical-type 0xf7edb420 precision:32 min <integer_cst 0xf76990f0 0> max <integer_cst 0xf76990d8 4294967295>
SI size <integer_cst 0xf7688e58 32> unit-size <integer_cst 0xf7688e70 4>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf6ba1a80
domain <integer_type 0xf73a0660
type <integer_type 0xf7edb060 sizetype public unsigned SI size <integer_cst 0xf7688e58 32> unit-size <integer_cst 0xf7688e70 4>
SI size <integer_cst 0xf7688e58 32> unit-size <integer_cst 0xf7688e70 4>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf73a0660 precision:32 min <integer_cst 0xf7688e88 0> max <integer_cst 0xf7688e88 0>>>
/root/qemu-
<array_type 0...
Christian Ehrhardt (paelzer) wrote : | #44 |
Christian Ehrhardt (paelzer) wrote : | #45 |
In GCC Bugzilla #97323, Matthias Klose (doko) wrote : | #46 |
seen on the gcc-10 branch and trunk 20201003 on arm-linux-
$ cat signal.i
typedef int a __attribute_
a b[1];
$ gcc -c -g -O0 signal.i
signal.i:2:1: error: 'TYPE_CANONICAL' is not compatible
2 | a b[1];
| ^
<array_type 0xf751d7e0
type <integer_type 0xf7a4f3c0 int public SI
size <integer_cst 0xf7426e58 constant 32>
unit-size <integer_cst 0xf7426e70 constant 4>
align:32 warn_if_not_align:0 symtab:-144899760 alias-set -1 canonical-type 0xf7a4f3c0 precision:32 min <integer_cst 0xf74370a8 -2147483648> max <integer_cst 0xf74370c0 2147483647>
SI size <integer_cst 0xf7426e58 32> unit-size <integer_cst 0xf7426e70 4>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf751d7e0
domain <integer_type 0xf751d6c0
type <integer_type 0xf7a4f060 sizetype public unsigned SI size <integer_cst 0xf7426e58 32> unit-size <integer_cst 0xf7426e70 4>
SI size <integer_cst 0xf7426e58 32> unit-size <integer_cst 0xf7426e70 4>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf751d6c0 precision:32 min <integer_cst 0xf7426e88 0> max <integer_cst 0xf7426e88 0>>>
signal.i:2:1: error: 'TYPE_MODE' of 'TYPE_CANONICAL' is not compatible
<array_type 0xf751d7e0
type <integer_type 0xf7a4f3c0 int public SI
size <integer_cst 0xf7426e58 constant 32>
unit-size <integer_cst 0xf7426e70 constant 4>
align:32 warn_if_not_align:0 symtab:-144899760 alias-set -1 canonical-type 0xf7a4f3c0 precision:32 min <integer_cst 0xf74370a8 -2147483648> max <integer_cst 0xf74370c0 2147483647>
SI size <integer_cst 0xf7426e58 32> unit-size <integer_cst 0xf7426e70 4>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf751d7e0
domain <integer_type 0xf751d6c0
type <integer_type 0xf7a4f060 sizetype public unsigned SI size <integer_cst 0xf7426e58 32> unit-size <integer_cst 0xf7426e70 4>
SI size <integer_cst 0xf7426e58 32> unit-size <integer_cst 0xf7426e70 4>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xf751d6c0 precision:32 min <integer_cst 0xf7426e88 0> max <integer_cst 0xf7426e88 0>>>
<array_type 0xf751d600
type <integer_type 0xf751d660 a SI
size <integer_cst 0xf7426e58 constant 32>
unit-size <integer_cst 0xf7426e70 constant 4>
user align:16 warn_if_not_align:0 symtab:-144899808 alias-set -1 canonical-type 0xf7a4f3c0 precision:32 min <integer_cst 0xf74370a8 -2147483648> max <integer_cst 0xf74370c0 2147483647>>
no-force-blk BLK size <integer_cst 0xf7426e58 32> unit-size <integer_cst 0xf7426e70 4>
user align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-typ...
Changed in groovy: | |
importance: | Unknown → Medium |
status: | Unknown → New |
In GCC Bugzilla #97323, Rguenth (rguenth) wrote : | #50 |
works on x86_64-linux
In GCC Bugzilla #97323, Ktkachov (ktkachov) wrote : | #51 |
confirmed on trunk with the extra checking enabled
In GCC Bugzilla #97323, Fabio Pedretti (pedretti-fabio) wrote : | #52 |
*** Bug 97368 has been marked as a duplicate of this bug. ***
Launchpad Janitor (janitor) wrote : | #47 |
Status changed to 'Confirmed' because the bug affects multiple users.
Changed in gcc-10 (Ubuntu): | |
status: | New → Confirmed |
In GCC Bugzilla #97323, Matthias Klose (doko) wrote : | #53 |
this started showing up in a regression hunt between GCC 9 and GCC 10. However, a compiler configured with the extra and RTL checking already produces this ICE on the test case with the 2018-01-01 and 2019-01-01 builds.
In GCC Bugzilla #97323, Fabio Pedretti (pedretti-fabio) wrote : | #54 |
This issue is also reproducible (but more rarely) when using -g0; see the full build log here:
https:/
dann frazier (dannf) wrote : | #48 |
I was talking to Matthias and he mentioned that this seems to be correlated with the LP builder upgrade to bionic:
https:/
I'm running some tests to see if there might be a lower level issue:
https:/
In GCC Bugzilla #97323, Matthias Klose (doko) wrote : | #55 |
this is triggered by:
2015-05-19 Jan Hubicka <email address hidden>
* tree.c (verify_
* tree.h (gimple_
In GCC Bugzilla #97323, Matthias Klose (doko) wrote : | #56 |
commit 872d5034baa1007
dann frazier (dannf) wrote : | #49 |
I've been able to reproduce reliably on X-Gene gear when running in a KVM instance. I have not been able to reproduce outside of KVM, nor on an alternate SoC (Hi1616). I *can* reproduce on a xenial kvm guest running on a xenial X-Gene host - which suggests that the correlation with the LP builder upgrade is likely just coincidence. I also tried an older xenial guest kernel in case there was a kernel patch backported to all releases that may have broken things - but I was also able to reproduce there.
If I were to draw a conclusion at this stage, it would be that there may very well be a low level issue causing this but, if so, it is unlikely a new one.
Changed in groovy: | |
status: | New → Confirmed |
In GCC Bugzilla #97323, Mkuvyrkov (mkuvyrkov) wrote : | #57 |
Hi Richard,
Interested in checking out this bug? The original testcase is from QEMU source: https:/
In GCC Bugzilla #97323, Rth-d (rth-d) wrote : | #63 |
As a data point, this problem can be seen with any
strict-alignment target -- e.g. sparc.
In GCC Bugzilla #97323, Rth-d (rth-d) wrote : | #64 |
Created attachment 49473
rfc patch
The following fixes the ICE.
It seems like a hack, done at the wrong level.
Should we have in fact set TYPE_STRUCTURAL
back on the unaligned 'a' type, before we even try to create an
array of 'a'? If so, that would have properly triggered the test
here in build_array_type_1 that would have bypassed the problem.
Christian Ehrhardt (paelzer) wrote : | #58 |
I got a ping from Doko (thanks) to try
https:/
ii cpp-10 10.2.0-16ubuntu1.1 armhf GNU C preprocessor
ii g++-10 10.2.0-16ubuntu1.1 armhf GNU C++ compiler
ii gcc-10 10.2.0-16ubuntu1.1 armhf GNU C compiler
ii gcc-10-base:armhf 10.2.0-16ubuntu1.1 armhf GCC, the GNU Compiler Collection (base package)
ii libasan6:armhf 10.2.0-16ubuntu1.1 armhf AddressSanitizer -- a fast memory error detector
ii libatomic1:armhf 10.2.0-16ubuntu1.1 armhf support library providing __atomic built-in functions
ii libcc1-0:armhf 10.2.0-16ubuntu1.1 armhf GCC cc1 plugin for GDB
ii libgcc-10-dev:armhf 10.2.0-16ubuntu1.1 armhf GCC support library (development files)
ii libgcc-s1:armhf 10.2.0-16ubuntu1.1 armhf GCC support library
ii libgomp1:armhf 10.2.0-16ubuntu1.1 armhf GCC OpenMP (GOMP) support library
ii libstdc+
ii libstdc+
ii libstdc++6:armhf 10.2.0-16ubuntu1.1 armhf GNU Standard C++ Library v3
ii libubsan1:armhf 10.2.0-16ubuntu1.1 armhf UBSan -- undefined behaviour sanitizer (runtime)
I started my loop with that build and will report back later if that triggered the issue again (or another one).
Christian Ehrhardt (paelzer) wrote : | #59 |
FYI: as one would expect, this continues to affect Hirsute just as much; I just had a broken qemu-5.1 build on armhf.
/<<BUILDDIR>
P.S. I'm still unsure if that "TYPE_CANONICAL" issue IS the formerly seen crash or just a new issue on top with either the debug builds and/or newer versions. Did anybody track the crashes down enough to know if those are really "the same"?
Christian Ehrhardt (paelzer) wrote : | #60 |
And FYI the test build with the new compiler by Doko is still running, but we know that not failing in the first rounds isn't a 100% win. I'll let it continue for some hours and ping back later once it has passed e.g. 5 rounds or so, which we never achieved before.
Christian Ehrhardt (paelzer) wrote : | #61 |
Over night it made 5 complete runs and all worked.
@Doko - I think we can call this fix of yours a good one, at least from my current tests' POV.
Christian Ehrhardt (paelzer) wrote : | #62 |
I've learned that the gcc in hirsute has the checking enabled atm.
That explains why any qemu 5.1 build I do (merging) and Doko's rebuild on [1] fail atm.
A new GCC build is coming in [2] that has the fix applied but the checking no longer enabled.
Once built I'll re-test that one as well.
[1]: https:/
[2]: https:/
Christian Ehrhardt (paelzer) wrote : | #65 |
FYI - Started a build run with 10.2.0-16ubuntu1.2
Christian Ehrhardt (paelzer) wrote : | #66 |
Failed on the second run with:
during RTL pass: reload
/root/qemu-
/root/qemu-
1535 | }
| ^
...
0x532aeb crash_signal
../../
0x41ccd7 add_regs_
../../
0x41cdd9 add_regs_
../../
0x41cdd9 add_regs_
../../
0x41de9f lra_update_
../../
0x42be95 lra_constraints
../../
0x41f093 lra(_IO_FILE*)
../../
0x3f0405 do_reload
../../
0x3f0405 execute
../../
Please submit a full bug report,
...
The bug is not reproducible, so it is likely a hardware or OS problem.
So we surely fixed the "TYPE_CANONICAL" issue in the checker builds.
But is the one I hit now the same original issue we had before, or a different one?
Can you derive that from the traceback?
Christian Ehrhardt (paelzer) wrote : | #67 |
For the sake of a potential upstream change I was trying qemu from git up to current master, but it still fails with the same error: ‘TYPE_CANONICAL’ is not compatible.
Due to that - as long as the checking is enabled - qemu is unbuildable in Hirsute.
At the same time Doko and I began with tests for a potential bisect of gcc.
Christian Ehrhardt (paelzer) wrote : | #68 |
Bisect step #1 - Expected to ICE and indeed does so.
gcc-20200507.tar.xz
/root/qemu-
/root/qemu-
12479 | }
| ^
0x5518cb crash_signal
../../
0x51299f extract_
../../
0x51299f extract_
../../
0x5185d1 decompose_
../../
0x5185d1 decompose_
../../
0x51892f decompose_
../../
0x44822f process_address_1
../../
0x449803 process_address
../../
0x449803 curr_insn_transform
../../
0x44cfd5 lra_constraints
../../
0x440653 lra(_IO_FILE*)
../../
0x411f05 do_reload
../../
0x411f05 execute
../../
Christian Ehrhardt (paelzer) wrote : | #69 |
FYI currently on 20190425.
First build passed, but we need a few more to be sure.
Christian Ehrhardt (paelzer) wrote : | #70 |
20190425 can be considered good; it completed 4.5 runs before I scheduled the next one.
Next is r10-4054 which has -v of:
gcc version 10.0.0 20191022 (experimental) (GCC)
Christian Ehrhardt (paelzer) wrote : | #71 |
r10-4054 failed with
during RTL pass: reload
/root/qemu-
/root/qemu-
670 | }
| ^
0x4b7443 crash_signal
../../
0x66b7c7 thumb2_
../../
0x66bc31 arm_legitimate_
../../
0x458667 memory_
../../
0x85b013 nonimmediate_
../../
0x85b013 nonimmediate_
../../
0x85b013 nonimmediate_
../../
I'm unsure how to deal with this. It is an ICE, and it happened during the RTL reload stage as before. But the signature seems to be a different one.
It could be "bad" with the actual issue we look for hidden behind this.
It could be "good" as well with the actual issue we look fixed but hidden behind this.
Or it could be "bad" as in, this is the same issue but with a different signature.
For the time being I'll handle it as bad; with some luck we'll get our old trace back further down the bisect. But @Doko please have a look at the trace above (maybe it is a known one) and advise.
Starting on r10-2027
Christian Ehrhardt (paelzer) wrote : | #72 |
r10-2027 seems to be good, passing 4 runs without a failure.
Continuing with r10-3040 next.
Christian Ehrhardt (paelzer) wrote : | #73 |
r10-3040 got 4.5 good passes before I aborted it - it seems to be good as well.
That means next is r10-3400
Christian Ehrhardt (paelzer) wrote : | #74 |
r10-3400 is good
I need to switch to a different overview to make sure I can track this :-)
20190425 good
r10-1014
r10-2027 good
r10-2533
r10-3040 good
r10-3220
r10-3400 good
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 next
r10-3727
r10-4054 bad
r10-6080
20200507 bad
Christian Ehrhardt (paelzer) wrote : | #75 |
r10-3657 had 5 good runs
Status:
20190425 good
r10-1014
r10-2027 good
r10-2533
r10-3040 good
r10-3220
r10-3400 good
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good
r10-3727 next
r10-4054 bad
r10-6080
20200507 bad
I might want to do a re-run of r10-4054 if r10-3727 is also good. Just to ensure we are not doing 6 more steps on something that won't fail.
Christian Ehrhardt (paelzer) wrote : | #76 |
1.5 builds good on r10-3727, but that is not enough to make a decision. Right now there is some machine downtime due to a datacenter move. Back on Monday I guess.
Christian Ehrhardt (paelzer) wrote : | #77 |
FYI: Systems are back up, restarted tests on r10-3727
Christian Ehrhardt (paelzer) wrote : | #78 |
r10-3727 had another 2.5 good runs, overall it LGTM now.
I'll re-run r10-4054 just to be sure not to hunt a ghost.
20190425 good
r10-1014
r10-2027 good
r10-2533
r10-3040 good
r10-3220
r10-3400 good
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good
r10-3727 good
r10-4054 bad next
r10-6080
20200507 bad
Christian Ehrhardt (paelzer) wrote : | #79 |
On this re-check r10-4054 had 7 complete runs without a fail.
So, as I was already afraid of in comment 71, there might have been another, much rarer ICE hidden in there as well. Or OTOH we are cursed by some very bad statistical odds :-/
I'll check r10-6080 next to see if it
a) reproduces an ICE faster
b) will show the same signature we saw more often before
20190425 good
r10-1014
r10-2027 good
r10-2533
r10-3040 good
r10-3220
r10-3400 good
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good
r10-3727 good
r10-4054 other kind of bad?
r10-6080 next
20200507 bad
Christian Ehrhardt (paelzer) wrote : | #80 |
r10-6080 now had 10 good runs.
I'm going back to test 20200507 next - we had bad states with that version so often that this MUST trigger, IMHO.
Reminder: this runs in armhf LXD containers on arm64 VMs (like our builds do).
I'm slowly getting the feeling it could be an issue with the underlying virtualization or bare metal.
We had a datacenter move, so the cloud runs on the same bare metal overall, but my instance could run on different hardware today than last week. If 20200507 no longer triggers it, we have to investigate where the code is running.
20190425 good
r10-1014
r10-2027 good
r10-2533
r10-3040 good
r10-3220
r10-3400 good
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good
r10-3727 good
r10-4054 other kind of bad?
r10-6080 good
20200507 bad ?next?
Christian Ehrhardt (paelzer) wrote : | #81 |
FYI - the inquiry about the underlying HW/SW is in RT 128805 - I CC'd Doko and Rick on that.
Christian Ehrhardt (paelzer) wrote : | #82 |
Ok, 20200507 almost immediately triggered the ICE
/root/qemu-
/root/qemu-
12479 | }
| ^
0x5518cb crash_signal
../../
0x542673 avoid_constant_
../../
0x515cad commutative_
../../
0x515d6b swap_commutativ
../../
0x53cacb simplify_
../../
0x53cb19 simplify_
../../
0x44d033 lra_constraints
../../
0x440653 lra(_IO_FILE*)
../../
0x411f05 do_reload
../../
0x411f05 execute
../../
This triggered on the first build. While waiting for some builds between r10-6080 and 20200507 I'll rerun this version to get some stats on how early to expect it.
Christian Ehrhardt (paelzer) wrote : | #83 |
Another crash with 20200507 at first try:
/root/qemu-
/root/qemu-
7504 | }
| ^
0x5518cb crash_signal
../../
0x43e363 add_regs_
../../
0x43e465 add_regs_
../../
0x43e465 add_regs_
../../
0x43e51b add_regs_
../../
0x43f497 lra_update_
../../
0x43f5dd lra_update_
../../
0x43f5dd lra_push_insn_1
../../
0x4579fb spill_pseudos
../../
0x4579fb lra_spill()
../../
0x4406bf lra(_IO_FILE*)
../../
0x411f05 do_reload
../../
0x411f05 execute
../../
Christian Ehrhardt (paelzer) wrote : | #84 |
Doko is so kind and builds r10-7093 for me.
20190425 good
r10-1014
r10-2027 good
r10-2533
r10-3040 good
r10-3220
r10-3400 good
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good
r10-3727 good
r10-4054 other kind of bad?
r10-6080 good
r10-7093 next
20200507 bad bad bad
Christian Ehrhardt (paelzer) wrote : | #85 |
3 full runs good with r10-7093 but then I got:
/root/qemu-
/root/qemu-
7969 | }
| ^
0x602fa7 crash_signal
../../
0x4f1f47 add_regs_
../../
0x4f203b add_regs_
../../
0x4f3061 lra_update_
../../
0x505ca7 process_
../../
0x505ca7 lra_eliminate(bool, bool)
../../
0x500877 lra_constraints
../../
0x4f4237 lra(_IO_FILE*)
../../
0x4c5c59 do_reload
../../
0x4c5c59 execute
../../
Christian Ehrhardt (paelzer) wrote : | #86 |
We again need to ask: is this the one we are hunting for, or might it be another issue in between?
Doko ?
Christian Ehrhardt (paelzer) wrote : | #87 |
To be sure I ran r10-7093 again and so far got 8 good runs in a row :-/
If only we could have a better trigger :-/
Christian Ehrhardt (paelzer) wrote : | #88 |
14 runs and going ...
It was never "so rare" with the gcc that is in hirsute, or with 20200507.
I'll let it continue to run for now
Christian Ehrhardt (paelzer) wrote : | #89 |
Failed on #17
during RTL pass: reload
/root/qemu-
/root/qemu-
1535 | }
| ^
cc -iquote /root/qemu-
0x527c2f crash_signal
../../
0x4147bf add_regs_
../../
0x4148b3 add_regs_
../../
0x4148b3 add_regs_
../../
0x4158d9 lra_update_
../../
0x415a29 lra_update_
../../
0x415a29 lra_push_insn_1
../../
0x42dd53 spill_pseudos
../../
0x42dd53 lra_spill()
../../
0x416b1b lra(_IO_FILE*)
../../
0x3e84d1 do_reload
../../
0x3e84d1 execute
../../
Christian Ehrhardt (paelzer) wrote : | #90 |
I'm not yet sure what we should learn from that - do we need 30 runs of each step to be somewhat sure? That makes an already slow bisect even slower ...
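A rough answer to "how many runs?" follows from treating each build as an independent coin flip. This is my own back-of-the-envelope sketch, not part of the bisect tooling; the failure rates plugged in are the ones observed in this thread:

```python
import math

def runs_needed(fail_rate, alpha=0.05):
    """Clean runs required so that the chance of seeing zero failures
    on a genuinely bad revision drops below alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - fail_rate))

# A bad revision failing ~1 in 10 builds (roughly r10-7093's rate)
# needs ~29 clean runs for 95% confidence it is actually good;
# a 3-in-7 rate (like 20200507) needs only 6.
print(runs_needed(0.10))   # 29
print(runs_needed(3 / 7))  # 6
```

So the "~30 runs per step" gut feeling is about right for a ~10% trigger rate, and the requirement grows fast as the rate drops.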
Christian Ehrhardt (paelzer) wrote : | #91 |
FYI - another 8 runs without a crash on r10-7093.
My current working theory is that the root cause of the crash might have been introduced as early as r10-4054, but one or more later changes increased the chance for the issue to trigger (think of widening a race window or similar).
If that assumption is true then, with the current testcase, it is nearly impossible to properly bisect the original root cause. It is also still hard to find the change that widened the race window, since crashing early does not reliably tell us whether we are in the high- or low-chance area.
We've had many runs with the base versions so that one is really good.
But any other good result we've had so far could - in theory - be challenged and needs ~30 good runs to be somewhat sure (phew, that will be a lot of time).
I'm marking the old runs that are debatable with good?<count-
Also we might want to look for just the "new" crash signature.
20190425 good
r10-1014
r10-2027 good?4
r10-2533
r10-3040 good?4
r10-3220
r10-3400 good?4
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good?5
r10-3727 good?3
r10-4054 other kind of bad - signature different, and rare?
r10-6080 good?10
r10-7093 bad, but slow to trigger
20200507 bad bad bad
Signatures:
r10-4054 arm_legitimate_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Of course it could be that the same root cause surfaces as two different signatures - but it could just as well be a multitude of issues. Therefore - for now - "add_regs_
With some luck (do we have any in this?) the 10 runs on 6080 are sufficient.
Let us try r10-6586 next and plan for 15-30 runs to be sure it is good.
If hitting the issue I'll still re-run it so we can compare multiple signatures.
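Comparing signatures across runs is easier if each ICE backtrace is reduced to a canonical key first. A minimal sketch (my own; the sample backtrace below is constructed for illustration, with frame names guessed from the truncated traces in this thread):

```python
import re

def ice_signature(stderr, depth=3):
    """Collapse a GCC ICE backtrace to a short signature: the first few
    frame names after crash_signal, ignoring addresses and source paths."""
    frames = re.findall(r'^0x[0-9a-f]+ (\S+)', stderr, re.MULTILINE)
    if 'crash_signal' in frames:
        frames = frames[frames.index('crash_signal') + 1:]
    return '>'.join(frames[:depth])

# Illustrative input shaped like the traces in this bug (names guessed):
sample = """during RTL pass: reload
0x56715f crash_signal
../../src/gcc/toplev.c:328
0x4599ad add_regs_to_insn_regno_info
0x459ab9 add_regs_to_insn_regno_info
0x45abc7 lra_update_insn_regno_info
"""
print(ice_signature(sample, depth=2))
```

Bucketing crashes by such a key would make the "same signature?" question in the tables above mechanical instead of eyeballed.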
Christian Ehrhardt (paelzer) wrote : | #92 |
Since this seems to become a reproducibility
Christian Ehrhardt (paelzer) wrote : | #93 |
r10-6586 - passed 27 good runs, no fails
Updated Result Overview:
20190425 good
r10-1014
r10-2027 good?4
r10-2533
r10-3040 good?4
r10-3220
r10-3400 good?4
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good?5
r10-3727 good?3
r10-4054 other kind of bad - signature different, and rare?
r10-6080 good?10
r10-6586 good?27
r10-7093 bad, but slow to trigger (2 of 19)
20200507 bad bad bad
Signatures:
r10-4054 arm_legitimate_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Next I'll run r10-7093 in this new setup.
@Doko - It would be great to have ~6760 built for the likely next step.
Christian Ehrhardt (paelzer) wrote : | #94 |
Add another 1/3 fails to r10-7093
Now I am on the next two
- r10-6760
- r10-6839
Christian Ehrhardt (paelzer) wrote : | #95 |
2/7 runs of r10-6839 failed with
r10-6839 add_regs_
Next will be r10-6760
Christian Ehrhardt (paelzer) wrote : | #96 |
Updated Result Overview:
20190425 good
r10-1014
r10-2027 good?4
r10-2533
r10-3040 good?4
r10-3220
r10-3400 good?4
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good?5
r10-3727 good?3
r10-4054 other kind of bad - signature different, and rare?
r10-6080 good?10
r10-6586 good?27
r10-6760 next
r10-6839 bad (2 of 9)
r10-7093 bad, but slow to trigger (2 of 19)
20200507 bad bad bad
Signatures:
r10-4054 arm_legitimate_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Christian Ehrhardt (paelzer) wrote : | #97 |
We'll need more runs to be sure, but so far r10-6760 seems good.
In preparation - could I request builds between r10-6760 and r10-6839, please?
Christian Ehrhardt (paelzer) wrote : | #98 |
Ok, r10-6760 reached 20 good runs and is considered good.
Doko was so kind to build 6779 6799 6819 for me - of which 6799 will be next.
Note: I've aligned the comments to all have the same style and dropped the untested revisions.
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6779 untested
r10-6799 next
r10-6819 untested
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Christian Ehrhardt (paelzer) wrote : | #99 |
FYI: r10-6799 had 14 good runs so far, I'll let it run for a bit longer to be sure.
Then - later today - if nothing changes r10-6819 will be next.
Christian Ehrhardt (paelzer) wrote : | #100 |
Completed 20 good runs on r10-6799, continuing with r10-6819 as planned.
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 next
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Christian Ehrhardt (paelzer) wrote : | #101 |
r10-6819 had 22 good runs.
r10-6829 will be the next to try.
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6829 next
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #102 |
r10-6829 has 2 fails in 35 runs
Signature matches, both are: add_regs_
r10-6824 = next
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6824 next
r10-6829 bad 2 of 35
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6829 add_regs_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #103 |
r10-6824 bad 1 of 24, signature matches
We have only a few steps to go and need to increase the number of runs to be sure, so I'll let it run for a while longer.
Also - eventually - I'll re-run what we consider to be the last good, quite a few times to be sure.
Most likely I'll switch later today and test r10-6822 next.
Christian Ehrhardt (paelzer) wrote : | #104 |
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6822 next
r10-6824 bad 1 of 33
r10-6829 bad 2 of 35
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6824 add_regs_
r10-6829 add_regs_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #105 |
r10-6822 so far has 0 of 20, but I'll let it run another ~24h
Christian Ehrhardt (paelzer) wrote : | #106 |
r10-6822 seems good.
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6822 good 0 of 37
r10-6823 next
r10-6824 bad 1 of 33
r10-6829 bad 2 of 35
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6824 add_regs_
r10-6829 add_regs_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #107 |
r10-6823 bad 1 of 28
during RTL pass: reload
/root/qemu-
/root/qemu-
3298 | }
| ^
0x524cf3 crash_signal
0x411e07 add_regs_
0x411efb add_regs_
0x411efb add_regs_
0x411efb add_regs_
0x412f21 lra_update_
0x413071 lra_update_
0x413071 lra_push_insn_1
0x42b373 spill_pseudos
0x42b373 lra_spill()
0x414163 lra(_IO_FILE*)
0x3e5b9d do_reload
0x3e5b9d execute
Please submit a full bug report,
with preprocessed source if appropriate
I'll give the hopefully good r10-6822 another few chances to fail, because - as it is obvious by now - it seems we can't rely much on these bisect results.
Afterwards I'll give 10.2.1-1 in Hirsute a try (requested by Doko)
Christian Ehrhardt (paelzer) wrote : | #108 |
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6822 good 0 of 37 <- giving this more runs now
r10-6823 bad 1 of 28
r10-6824 bad 1 of 33
r10-6829 bad 2 of 35
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6823 add_regs_
r10-6824 add_regs_
r10-6829 add_regs_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #109 |
As mentioned before - I didn't trust this result.
And with the likelihood of this triggering being so low, we all know that results are unreliable.
Due to that now r10-6822 is
r10-6822 - bad 2 of 67
The signature was the same "add_regs_
What to do from here ...
We could bisect again starting with r10-6822 and 20190425 and use at least ~100 runs each.
But that would be a last resort, as I'm at ~1 run/h which means ~4 days per step.
I have a few "maybe we are lucky" things to try first:
- 10.2.1-1 in hirsute
- trunk gcc-r11-5879.tar.xz
- Doing a run with -O1
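How unreliable is a "good" verdict here? A quick stdlib-only sketch (my own aside, not part of the bisect) that compares the observed counts from the table above under a simple binomial model:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(at most k failures in n runs) if each run fails independently
    with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Had r10-6822 really failed at 20200507's observed ~3-of-7 rate,
# seeing only 2 failures in 67 runs would be essentially impossible:
print(binom_cdf(2, 67, 3 / 7))  # far below 1e-12
# Rule of three: 0 failures in n runs only bounds the true rate
# to below roughly 3/n at 95% confidence.
```

This supports the observation that the trigger rate genuinely differs between revisions, while "0 of 20" style results still leave a sizable residual failure rate unexcluded.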
Dimitri John Ledkov (xnox) wrote : | #110 |
"just retry the build" is our solution to this issue. It's a bit a waste of time hunting this all down at this point, unfortunately.
maybe we can try reproducing this on some publicly available hardware, i.e. graviton2 on aws. But also not sure how much value there is in doing this.
tags: |
added: rls-gg-notfixing removed: rls-gg-incoming |
Changed in gcc-10 (Ubuntu): | |
status: | Confirmed → Won't Fix |
affects: | groovy → gcc |
Christian Ehrhardt (paelzer) wrote : Re: [Bug 1890435] Re: gcc-10 breaks on armhf (flaky): internal compiler error: Segmentation fault | #112 |
On Thu, Dec 10, 2020 at 5:31 PM Dimitri John Ledkov
<email address hidden> wrote:
>
> "just retry the build" is our solution to this issue.
It is not - in hirsute the builds of the actual package on LP hit 100%
fail-rate.
Unfortunately not in the repro, but due to the above the workaround
currently is to build with gcc-9 on armhf.
But that is not a long term solution.
Therefore also this IMHO can't be won't fix
Changed in gcc-10 (Ubuntu): | |
status: | Won't Fix → New |
Christian Ehrhardt (paelzer) wrote : | #113 |
I'll give things a try in current Hirsute (gcc on 10.2.1, qemu on 5.2) building with gcc-10.
If we are back at a level where retries work I'm ok to lower severity.
I'll let you know about these results in a few days.
But since we have had the case of it reaching 100% breakage (the package then being e.g. un-serviceable) I'm unsure if we should - even then - fully close it.
Christian Ehrhardt (paelzer) wrote : | #114 |
In the test env (not LP build infra, but canonistack) I've got 30 good runs on 10.2.1 which gives me some hope ...
Christian Ehrhardt (paelzer) wrote : | #115 |
Indeed, gcc-10.2.1 with qemu 5.2 no longer breaks 100% of the time.
Here is a good build log:
https:/
I'll need a few more builds anyway and will let you know.
As mentioned before that does lower severity, but not close the bug.
Changed in gcc-10 (Ubuntu): | |
status: | New → Confirmed |
importance: | Critical → Medium |
Christian Ehrhardt (paelzer) wrote : | #116 |
r11-5879 - bad 8 of 10
So we know:
a) the bug has not been fixed yet
b) as we've seen with later GCC-10 runs, the chance to trigger has further increased
Christian Ehrhardt (paelzer) wrote : | #117 |
I left r11-5879 running over the weekend and it concluded with 37 of 75 runs failing
That is ~50%
I'll look at -O1 next
Christian Ehrhardt (paelzer) wrote : | #118 |
Fails with -O1 as well, although I have to admit that different -O levels are deeply embedded in qemu's build system, so it is hard to override all of them. Therefore - while I set -O1 and that affected some builds - it does not imply that all compiler calls used -O1.
I know dannf has made some bare-metal tests and so far none of those have failed.
Unfortunately our builders are VM based, so that isn't very helpful anyway.
Nevertheless I've transported my test container over to a box to build there.
Trying to maas-deploy a few more chip types didn't work out, but maybe it will eventually with some help from the HWE team.
Christian Ehrhardt (paelzer) wrote : | #119 |
I haven't been able to trigger the issue on my rpi4 yet, but as you'd imagine it is rather slow.
But (thanks dannf) I got access to an X-Gene - and carrying my known-bad setup there (LXD container export FTW) I was able to recreate this on bare metal as well.
(Host) Kernel: 5.4.0-58-generic
Model: X-Gene - 8 cores
The guest is Hirsute building qemu 5.0 with r11-5879
I got two known bug signatures: once the common one we see most often, and once a different one (that we've seen before with 20200507).
This happened on the first two runs; once it has run for some hours I'll post the success-vs-fail rate as well.
--- ---
during RTL pass: reload
/root/qemu-
/root/qemu-
1535 | }
| ^
0x56715f crash_signal
0x4599ad add_regs_
0x459ab9 add_regs_
0x459ab9 add_regs_
0x45abc7 lra_update_
0x468985 lra_constraints
0x45bc15 lra(_IO_FILE*)
0x42d463 do_reload
0x42d463 execute
Please submit a full bug report,
--- ---
during RTL pass: reload
/root/qemu-
/root/qemu-
12479 | }
| ^
0x56715f crash_signal
0x527e35 extract_
0x52d84b extract_
0x52d84b decompose_
0x52d84b decompose_
0x52dbc3 decompose_
0x463551 process_address_1
0x464c47 process_address
0x464c47 curr_insn_transform
0x468913 lra_constraints
0x45bc15 lra(_IO_FILE*)
0x42d463 do_reload
0x42d463 execute
Please submit a full bug report,
Christian Ehrhardt (paelzer) wrote : | #120 |
The canonistack machines I used to crash it (and likely the LP builders) are X-Gene as well.
So we might have a chance to pin this down to specific HW if there are other chip types I could use.
Christian Ehrhardt (paelzer) wrote : | #121 |
So far 2/4 failed of r11-5879 on X-Gene BareMetal.
Doko asked me to try if I could get these to fail with -j1 as well (in the past I was unable to do so, but it is worth a try).
Christian Ehrhardt (paelzer) wrote : | #122 |
On BareMetal now also triggered with -j1 (but there were multiple LXD containers each running -j1 to increase the chance to find it).
/root/qemu-
/root/qemu-
485 | }
| ^
0x56715f crash_signal
0x4599ad add_regs_
0x459ab9 add_regs_
0x459ab9 add_regs_
0x45abc7 lra_update_
0x468985 lra_constraints
0x45bc15 lra(_IO_FILE*)
0x42d463 do_reload
0x42d463 execute
Please submit a full bug report,
with preprocessed source if appropriate.
Christian Ehrhardt (paelzer) wrote : | #123 |
Just FYI - as we feared - this now starts to break SRUs and other service actions for qemu in Groovy. https:/
And without a better solution I'll need to trigger retry with fingers crossed.
In GCC Bugzilla #97323, Rguenth (rguenth) wrote : | #124 |
GCC 10.3 is being released, retargeting bugs to GCC 10.4.
Changed in gcc: | |
status: | Confirmed → In Progress |
Oibaf (oibaf) wrote : | #125 |
Is this still an issue? I was only able to reproduce it on groovy, which is now EoL.
In GCC Bugzilla #97323, Jakub-gcc (jakub-gcc) wrote : | #126 |
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
In GCC Bugzilla #97323, Rguenth (rguenth) wrote : | #127 |
GCC 10 branch is being closed.
In GCC Bugzilla #97323, Pinskia (pinskia) wrote : | #128 |
*** Bug 112791 has been marked as a duplicate of this bug. ***
I've today seen this on DPDK:
https://launchpadlibrarian.net/497142982/buildlog_ubuntu-groovy-armhf.dpdk_20.08-1ubuntu1~ppa1_BUILDING.txt.gz
And recently also on qemu again (but that was in the main archive and I could not hold back from hitting retry, which worked).
Is there anything in the pipeline that could address this and make it worth running a few re-compiles?