Windows build unstable with mingw tools installed via msys2

Bug #1921141 reported by Eric Timmons
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
SBCL | Fix Released | Undecided | Unassigned |

Bug Description

Building SBCL 2.1.2 on Windows results in an executable that randomly crashes. It crashes frequently enough that it's nearly impossible to get through building the contribs (but tends to fail on different contribs each time).

uname -a:

MINGW64_NT-10.0-18363 DESKTOP-O8J2MCI 3.1.7-340.x86_64 2020-11-08 12:32 UTC x86_64 Msys

A typical crash looks like the following:

C:/msys64/tmp/tmp.rVLDFNelEc/sbcl-2.1.2-daed6bff426f7b9653d6de1d318cfaaf04c74724515644b8851ec3140d75af6b/src/runtime/sbcl --noinform --core C:/msys64/tmp/tmp.rVLDFNelEc/sbcl-2.1.2-daed6bff426f7b9653d6de1d318cfaaf04c74724515644b8851ec3140d75af6b/output/sbcl.core --lose-on-corruption --disable-debugger --no-sysinit --no-userinit --load ../asdf-stub.lisp \
 --eval '(asdf::test-asdf-contrib "sb-bsd-sockets")'
Unhandled SB-SYS:MEMORY-FAULT-ERROR in thread #<SB-THREAD:THREAD "main thread" RUNNING
                                                 {10057301C3}>:
  Unhandled memory fault at #xFFFFFFFFFFFFFFFA.
Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {10057301C3}>
0: (SB-KERNEL::%SIGNAL #<TYPE-ERROR expected-type: SB-C::VOP datum: #<UNPRINTABLE instance of #<SB-C::VOP ..> {100568AD03}>>)
1: (ERROR TYPE-ERROR :DATUM #<UNPRINTABLE instance of #<SB-C::VOP :INFO SB-C:CLOSURE-INIT :ARGS #<SB-C:TN-REF :TN #<SB-C:TN t1[RAX] :NORMAL> :WRITE-P NIL :VOP SB-C:CLOSURE-INIT> :RESULTS NIL :CODEGEN-INFO (0)> {100568AD03}> :EXPECTED-TYPE SB-C::VOP :CONTEXT NIL)
2: (SB-KERNEL:INTERNAL-ERROR #.(SB-SYS:INT-SAP #XFC07FD520) #<unused argument>)
3: ("foreign function: #x7FF6DD2FC323")
4: ("foreign function: #x7FF6DD2D41F8")
unhandled condition in --disable-debugger mode, quitting
Unhandled SB-SYS:MEMORY-FAULT-ERROR in thread #<SB-THREAD:THREAD "main thread" RUNNING
                                                 {10057301C3}>:
  Unhandled memory fault at #xFFFFFFFFFFFFFFFA.
Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {10057301C3}>
0: (SB-KERNEL::%SIGNAL #<TYPE-ERROR expected-type: HASH-TABLE datum: #<UNPRINTABLE instance of #<FUNCTION SB-IMPL::REMHASH/EQ> {10053EBEC3}>>)
1: (ERROR TYPE-ERROR :DATUM #<UNPRINTABLE instance of #<FUNCTION SB-IMPL::REMHASH/EQ> {10053EBEC3}> :EXPECTED-TYPE HASH-TABLE :CONTEXT HASH-TABLE)
2: (SB-KERNEL:INTERNAL-ERROR #.(SB-SYS:INT-SAP #XFC07FB6E0) #<unused argument>)
3: ("foreign function: #x7FF6DD2FC323")
4: ("foreign function: #x7FF6DD2D41F8")
unhandled condition in --disable-debugger mode, quitting

Douglas Katzman (dougk) wrote :

There are literally zero occurrences of #+/-sb-linkable-runtime in the Lisp code, so it's clearly some sort of link-time problem, but I don't think there are any developers in a position to diagnose it.

The original commit (https://sourceforge.net/p/sbcl/sbcl/ci/402a8fab62) said "Support this feature on ... Windows" but if I had to guess, "support" only meant that it's not expressly disallowed to attempt to build it that way.

Eric Timmons (daewok) wrote :

Actually, I think the :sb-linkable-runtime was a red herring. I just got the exact same behavior trying a vanilla build.

I normally build a vanilla SBCL on Windows via a CI process. I'll look back at those builds to figure out whether the bug was there all along and my script just gobbled up the error, or whether there's some sort of build environment effect.

I didn't originally think this was related to 1917481, largely because the error is remarkably consistent (even across computers) and is different than the error in that ticket. But I could be wrong in that assessment.

Eric Timmons (daewok) wrote :

Yep, my CI scripts were gobbling up the error. My logs are kind of sparse, but there are hints the bug started appearing when MSYS2 updated from 20201109 -> 20210215. I'll keep digging, but this is definitely not caused by :sb-linkable-runtime so I'll update the description.

summary: - Windows build unstable with :sb-linkable-runtime
+ Windows build unstable
description: updated
Eric Timmons (daewok)
summary: - Windows build unstable
+ Windows build unstable with mingw tools installed via msys2
Eric Timmons (daewok) wrote :

OK, this problem *actually* appears to be related to mingw-w64 tools installed via msys2's pacman. When I install mingw-w64 directly from the project, everything appears to work. Tests pass, and contribs built fine several times in a row (with and without :sb-linkable-runtime). I believe this setup is more in line with how the GitHub-hosted Windows runners are configured.

Feel free to close if you don't want to support or track issues with building using the mingw-w64 toolchain installed via msys2.

Douglas Katzman (dougk) wrote :

But I build with mingw installed through msys2 pacman and I've not seen an error.
How did you reach this conclusion?
As far as solving it, it may help to rid win32-os.c of all remnants of bogus pthread functions.
I removed most in https://sourceforge.net/p/sbcl/sbcl/ci/544548c21c but it seems like sigaddset, sigdelset, etc. are still there. At some point in the halfway-done state, we released an SBCL that did not work at all for some users. I'm not sure how it progressed from "didn't work at all" to "seemed to work" to "unreliably works". But anyway, removing the legacy crap is how I think we should proceed, not that I have any preconceived notion of whether it'll fix anything.

Eric Timmons (daewok) wrote :

I didn't mean to imply it was an issue with msys2 in general, just with recent versions. It's the only explanation I can come up with that fits all the data I have.

+ This error started cropping up in my CI logs around March 1. I just didn't notice until now because my Windows scripts apparently suck.

+ I've been able to build successfully using msys2's mingw-w64 toolchain in the past, but have been having issues on new installs of msys2 over the past two weeks as I've been poking at things. Using the same msys2 install and just munging my PATH to point to mingw-w64 8.1 from Sourceforge caused everything to work.

+ I managed to dig up an instance of msys2 and its mingw-w64 toolchain I installed on Feb 18. I was able to build SBCL successfully several times in a row. I did the pacman -Syu dance and the very first build after that failed with this issue.

Looking at the timing of when packages were released, there was a new version of pthreads released into msys2's repos on Feb 27. Would not surprise me if that's the culprit.

Douglas Katzman (dougk) wrote :

I've seen it randomly fail now under VMWare + win10 eval + msys2. I guess that's progress.

Douglas Katzman (dougk) wrote :

Commenting out all the signal-related junk does nothing to improve the situation.

My intuition now is that garbage-collection does not see all of the lisp stack.
Turning on verify_gens passes, which means GC believes that it has done the right thing, but upon returning control to Lisp, there are telltale signs of pointers which weren't updated.
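
The symptom described above (heap verification passes, yet Lisp later trips over a bad pointer) can be illustrated with a toy model. This is only a sketch of the general failure mode, not SBCL's collector: the dict-based heap, `collect`, `verify`, and `missed_slot` are all hypothetical names invented for the illustration.

```python
def collect(heap, roots, new_base):
    """Toy moving GC: relocate everything reachable from `roots`.

    `heap` maps an address to the list of addresses the object points to.
    `roots` is the list of root slots the collector actually scans; it is
    rewritten in place with the new addresses.
    """
    forward = {}          # old address -> new address
    work = list(roots)
    while work:
        addr = work.pop()
        if addr in forward:
            continue
        forward[addr] = new_base + len(forward)
        work.extend(heap[addr])
    new_heap = {forward[a]: [forward[c] for c in heap[a]] for a in forward}
    roots[:] = [forward[r] for r in roots]
    return new_heap


def verify(heap):
    """Heap-internal check: every pointer stored in the heap hits an object."""
    return all(child in heap for refs in heap.values() for child in refs)


heap = {1: [2, 3], 2: [], 3: []}
scanned_roots = [1]   # stack slots the collector knows about
missed_slot = 3       # a live pointer in a stack region that was not scanned

new_heap = collect(heap, scanned_roots, new_base=100)
heap_is_consistent = verify(new_heap)        # True: the heap itself is fine
slot_is_valid = missed_slot in new_heap      # False: the slot was never updated
```

The verifier only walks what the collector walked, so it cannot notice a root that was never scanned; the stale slot only shows up once the mutator uses it.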

Eric Timmons (daewok) wrote :

It seems like ASLR is a contributing factor. See the 2021-01-31 entry on https://www.msys2.org/news/ . After adding -Wl,--disable-dynamicbase,--disable-high-entropy-va,--default-image-base-low to LINKFLAGS I couldn't reproduce the issue on four consecutive builds.

My Feb 18 install wasn't affected because, even though I installed it after that announcement, it still had binutils 2.35 for whatever reason.
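
One way to check whether a given runtime binary was linked with the ASLR settings discussed above is to read the flags straight out of its PE header. The following is a minimal sketch under stated assumptions: `pe_aslr_info` is a hypothetical helper, not part of SBCL or binutils. The field offsets and the two bit values (`DYNAMIC_BASE` = 0x0040, `HIGH_ENTROPY_VA` = 0x0020) come from the PE/COFF specification.

```python
import struct

# DllCharacteristics bits defined by the PE/COFF specification.
HIGH_ENTROPY_VA = 0x0020
DYNAMIC_BASE = 0x0040


def pe_aslr_info(data):
    """Return the ASLR-related header fields of a PE image given as bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE image")
    # The 4-byte field at 0x3C holds the offset of the "PE\0\0" signature.
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    opt = pe_off + 4 + 20  # skip the signature and the 20-byte COFF header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x20B:      # PE32+ (64-bit): 8-byte ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", data, opt + 24)
    elif magic == 0x10B:    # PE32 (32-bit): 4-byte ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", data, opt + 28)
    else:
        raise ValueError("unknown optional-header magic")
    # DllCharacteristics sits at offset 70 in both header variants.
    (flags,) = struct.unpack_from("<H", data, opt + 70)
    return {
        "image_base": image_base,
        "dynamic_base": bool(flags & DYNAMIC_BASE),
        "high_entropy_va": bool(flags & HIGH_ENTROPY_VA),
    }
```

Calling it on the bytes of the built runtime executable (for example `pe_aslr_info(open(path, "rb").read())`) should report both flags as false if the three linker options took effect, and both as true on a binary produced with the newer binutils defaults.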

Stas Boukarev (stassats)
Changed in sbcl:
status: New → Fix Committed
Changed in sbcl:
status: Fix Committed → Fix Released