cannot compile sbcl 2.1.3 with 2.1.0 on windows 32-bit

Bug #1923325 reported by alexis rivera
Affects: SBCL
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm failing to compile sbcl 2.1.3 with 2.1.0 on a 32-bit Windows machine, following the directions at https://solarianprogrammer.com/2019/08/20/building-sbcl-steel-bank-common-lisp-windows/
I've attached the compilation log in the file dump.txt
I've also failed to compile it with 1.4.14.

After doing the step unset SBCL_HOME and sh install.sh, I get the following error:
output/sbcl.core not found, aborting installation.
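
Since install.sh stops when output/sbcl.core is missing, it can help to check for the core before installing. A minimal sketch, assuming it is run from the top of the SBCL source tree; the helper name check_core is hypothetical and not part of the SBCL build scripts:

```shell
# Hypothetical helper: report whether the build left behind the core
# file that install.sh looks for (output/sbcl.core).
check_core() {
    build_dir=${1:-.}   # top of the SBCL source tree, defaults to "."
    if [ -f "$build_dir/output/sbcl.core" ]; then
        echo "core present"
    else
        echo "core missing"
        return 1
    fi
}
```

A possible use is `check_core . && sh install.sh`, so the install step only runs once the build has actually produced a core.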

Something failed during compilation, but I'm not enough of a Lisp expert to decode the error.

I had been successful in compiling 2.1.0 on a 32-bit windows machine using those instructions. Also, I have been successful in compiling 2.1.3 with 2.1.0 on a 64-bit windows machine.

I'm using MINGW32 to compile. The result of uname -a is
MINGW32_NT-10.0-18363 DESKTOP-0ETM2HC 3.1.7-340.i686 2020-09-29 08:19 UTC i686 Msys

Revision history for this message
alexis rivera (riveraah) wrote :

Attached is the compilation log using 1.4.14

Revision history for this message
Douglas Katzman (dougk) wrote :

Some users have reported problems with address space randomization.
See https://bugs.launchpad.net/sbcl/+bug/1921141

Revision history for this message
Stas Boukarev (stassats) wrote :

I added the --disable-dynamicbase flag, can you check that the git version builds for you?

Revision history for this message
alexis rivera (riveraah) wrote :

I downloaded version sbcl-2.1.3-95-g389e012d7. I wasn't able to compile it in mingw32 on the 32-bit windows machine. Attached is the compile log.

Revision history for this message
Stas Boukarev (stassats) wrote :

After updating mingw I can reproduce this. But no idea what they broke there. Maybe it's time to drop 32-bit windows support.

Revision history for this message
alexis rivera (riveraah) wrote :

If it helps, I was able to debug the executable
./src/runtime/sbcl --core output/cold-sbcl.core --lose-on-corruption $SBCL_MAKE_TARGET_2_OPTIONS --no-sysinit --no-userinit --eval '(sb-fasl::!warm-load "src/cold/warm.lisp")' --quit

Before it crashes, these are the stack traces

689
690 /* Doing this immediately after the core has been located
691 * and before any random malloc() calls occur improves the chance
692 * of mapping dynamic space at our preferred address (if movable).
693 * If not movable, it was already mapped in allocate_spaces(). */
694 initial_function = load_core_file(core, embedded_core_offset,
695 merge_core_pages);
696 if (initial_function == NIL) {
697 lose("couldn't find initial function");
698 }

Then inside load_core_file,
1044 case BUILD_ID_CORE_ENTRY_TYPE_CODE:
1045 stringlen = *ptr++;
1046 --remaining_len;
1047 gc_assert(remaining_len * sizeof (core_entry_elt_t) >= stringlen);
1048 if (stringlen+1 != sizeof build_id || memcmp(ptr, build_id, stringlen))
1049 lose("core was built for runtime \"%.*s\" but this is \"%s\"",
1050 (int)stringlen, (char*)ptr, build_id);
1051 break;

Then, inside lose(),

 lose(char *fmt, ...)
123 {
124 va_list ap;
125 /* Block signals to prevent other threads, timers and such from
126 * interfering. If only all threads could be stopped somehow. */
127 block_blockable_signals(0);
128 fprintf(stderr, "fatal error encountered");
129 va_start(ap, fmt);
130 print_message(fmt, ap);
131 va_end(ap);

In block_blockable_signals,

 void
572 block_blockable_signals(sigset_t *old)
573 {
574 thread_sigmask(SIG_BLOCK, &blockable_sigset, old);
575 }

old is 0,

In sb_pthread_sigmask,
2021 if (oldset)
2022 *oldset = self->blocked_signal_set;
2023 if (set) {
2024 switch (how) {
2025 case SIG_BLOCK:
2026 self->blocked_signal_set |= *set;
2027 break;
2028 case SIG_UNBLOCK:
2029 self->blocked_signal_set &= ~(*set);
2030 break;

oldset = 0
gdb cannot access memory at address 0x403c when reading self->blocked_signal_set,
even though self itself has the address 0x4000

Let me know how else I can help troubleshoot.

Revision history for this message
Stas Boukarev (stassats) wrote :

OK, if it dies in lose() then that memory fault isn't exactly relevant (but lose should work anyway).

Revision history for this message
alexis rivera (riveraah) wrote :

I was successful compiling the latest sbcl on mingw64. Below are some differences that I noted between the compilation log on mingw64 (attached here for you to review) and the compilation log on mingw32. I don't know if they are relevant to the problem but I'll mention them anyway.

- mingw64 compiles -g -O5, mingw32 compiled with -g -O3
- mingw32 compiles with -D_WIN32_WINNT=0x600 vs mingw64 compiles with -DWINVER=0x0501. Intuitively it feels it should be the opposite?
- mingw32 compiles with -mpreferred-stack-boundary=2, this option is not in mingw64. What is the purpose of this option?
- mingw64 compiles with -Wno-unused-function -Wno-unused-parameter -Wno-cast-function-type
- Compiling cross-float.fasl-tmp in win32 generated the following message (there were other occurrences that didn't occur in mingw64)
//CROSS-FLOAT DISCREPANCY!
// CACHE: (COMMON-LISP:EXPT #.(MAKE-DOUBLE-FLOAT #x40240000 #x0) -309) -> (#.(MAKE-DOUBLE-FLOAT #xB815 #x7268FDAF))
// HOST : (9.999928873692877d-310)
- mingw32 warns the following
in gc-common.c,
gc-common.c:1851:16: warning: 'can_invoke_post_gc' defined but not used [-Wunused-function]
 1851 | static boolean can_invoke_post_gc(__attribute__((unused)) struct thread* th,
in interrupt.c,
interrupt.c:1260:1: warning: 'run_deferred_handler' defined but not used [-Wunused-function]
 1260 | run_deferred_handler(struct interrupt_data *data, os_context_t *context)
      | ^~~~~~~~~~~~~~~~~~~~
interrupt.c:710:1: warning: 'check_interrupt_context_or_lose' defined but not used [-Wunused-function]
  710 | check_interrupt_context_or_lose(os_context_t *context)
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
interrupt.c:167:13: warning: 'sigmask_logior' defined but not used [-Wunused-function]
  167 | static void sigmask_logior(sigset_t *dest, const sigset_t *source)
      | ^~~~~~~~~~~~~~
in monitor.c:507
monitor.c:507:1: warning: 'sigint_handler' defined but not used [-Wunused-function]
  507 | sigint_handler(int __attribute__((unused)) signal,
      | ^~~~~~~~~~~~~~
in win32-os.c
At top level:
win32-os.c:729:15: warning: 'qpcMultiplier' defined but not used [-Wunused-variable]
  729 | static double qpcMultiplier;
      | ^~~~~~~~~~~~~
win32-os.c:728:22: warning: 'lisp_init_time' defined but not used [-Wunused-variable]
  728 | static LARGE_INTEGER lisp_init_time;
      | ^~~~~~~~~~~~~~

- in fd-stream.lisp
; file: D:/Downloads/sbcl_latest/sbcl/src/code/fd-stream.lisp
; in: DEFUN SB-IMPL::!MAKE-COLD-STDERR-STREAM
; (SB-WIN32::GET-STD-HANDLE-OR-NULL SB-WIN32::+STD-ERROR-HANDLE+)
;
; caught WARNING:
; undefined variable: SB-WIN32::+STD-ERROR-HANDLE+
;
; compilation unit finished
; Undefined variable:
; SB-WIN32::+STD-ERROR-HANDLE+
; caught 1 WARNING condition
; printed 12 notes

Likely suspicious calls:
   1 SB-EXT:GET-TIME-OF-DAY

Possibly suspicious calls:
 194 SB-KERNEL:%COERCE-CALLABLE-TO-FUN
 181 SB-KERNEL:SPECIFIER-TYPE
 162 SB-KERNEL:%DOUBLE-FLOAT
  94 SB-KERNEL:%SINGLE-FLOAT
  33 SB-KERNEL:%UNARY-TRUNCATE/DOUBLE-FLOAT
  24 (SETF SB-INT:INFO)
  15 SB-KERNEL:%UNARY-TRUNCATE/SINGLE-FLOAT
   9 SB-INT:INFO
   8 SB-KERNEL:VALUES-SPECIFIER-TYPE
   5 SB-C::...


Revision history for this message
Stas Boukarev (stassats) wrote :

I don't think you'll find the culprit by comparing it with the 64-bit version of the current mingw. The previous version of 32-bit mingw successfully compiles the current revision of SBCL.

Revision history for this message
alexis rivera (riveraah) wrote :

My thought was that if mingw64 and mingw32 are both the latest versions, they perhaps have the same fixes (the address randomization in this case). Yet, the 64-bit one works. So is there something different in the way we compile the code that can cause the difference in behavior (here I was thinking in terms of compiler flags)? What do you recommend I focus on? Any suggestions?

Revision history for this message
Stas Boukarev (stassats) wrote :

We still don't know where it fails. So that would be a start. But neither gdb nor windbg are elucidating.

Revision history for this message
alexis rivera (riveraah) wrote :

I changed the Config.x86-win32 and removed -O3 and the disable-dynamicbase options to compile with a debug version. The code still crashes. But I saw the following things.
- at startup I get the following message
warning: C:\Windows\System32\win32u.dll: import table's virtual address (0x0) is outside .idata section's range [0x19000, 0x19004[.
- The code makes it to load_core_file:1048 (the BUILD_ID_CORE_ENTRY_TYPE_CODE case)
- It enters the if because the value of ptr and the value of build_id are different
  build_id = "DESKTOP-0ETM2HC-ahriv-2021-04-23-23-14-17"
  ptr = "DESKTOP-0ETM2HC-ahriv-2021-01-22-21-48-01"

I believe the build_id is the value from the core that is compiled (sbcl-2.1.3-95-g389e012d7) and ptr is the value from the core that was loaded (2.1.0). Is this correct?
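
The condition at line 1048 passes only when the ID stored in the core has the same length and the same bytes as the ID compiled into the runtime. A rough shell analogue, using the two values printed above; build_ids_match is a hypothetical name for illustration, not an SBCL function:

```shell
# Shell analogue of the check in load_core_file: the IDs must have
# equal length and equal contents, otherwise the runtime calls lose().
build_ids_match() {
    core_id=$1      # ID recorded in the loaded core (what ptr points at)
    runtime_id=$2   # ID compiled into the runtime (build_id)
    [ "${#core_id}" -eq "${#runtime_id}" ] && [ "$core_id" = "$runtime_id" ]
}

# The two IDs seen in the debugger differ only in their timestamps.
build_ids_match "DESKTOP-0ETM2HC-ahriv-2021-01-22-21-48-01" \
                "DESKTOP-0ETM2HC-ahriv-2021-04-23-23-14-17" \
    || echo "core and runtime come from different builds"
```

Both strings have the same length here, so it is the byte comparison that fails, which is what one would expect when a freshly built runtime picks up a core from an earlier build.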

Then it enters the lose function where the segfault happens.

Is this execution path expected? Am I running the new executable with the incorrect SBCL_HOME?

Thanks

Revision history for this message
Stas Boukarev (stassats) wrote :

The actual problem isn't with lose (although lose does segfault if invoked early). You're somehow miscompiling it and getting different build ids.

Revision history for this message
alexis rivera (riveraah) wrote :

never mind. rookie mistake. I wasn't passing the --core path in the debugger.

Revision history for this message
alexis rivera (riveraah) wrote :

Now, I get all the way to create_main_lisp(function=598799389) at thread.c:342 to the
function funcall0(function)
 call_into_lisp(function,args,0)
   x86-assem.S:463 call *CLOSURE_FUN_OFFSET(%eax) where it dies

eax has the value 598796925
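
gdb prints addresses in hexadecimal, while the values quoted here are decimal, so converting them makes it easier to line them up with a backtrace. A small sketch; to_hex is a hypothetical helper name:

```shell
# Hypothetical helper: print a decimal address the way gdb shows it.
to_hex() {
    printf '0x%x\n' "$1"
}

to_hex 598799389   # function argument of create_main_lisp -> 0x23b0f41d
to_hex 598796925   # value of eax at the call              -> 0x23b0ea7d
```

Both values fall in the same 0x23b0.... range, so they can be compared directly with the hexadecimal frame addresses that gdb reports.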

Revision history for this message
Stas Boukarev (stassats) wrote :

Dies how? From what I got from gdb is that it does get into lisp, maybe your debugger can't deal with lisp code, which doesn't have any debug info.

Revision history for this message
alexis rivera (riveraah) wrote :

It says the following:
Thread 1 received signal SIGSEGV, Segmentation fault.
0x22ab5a77 in ?? ()
(gdb) up
#1 0x23b0f485 in ?? ()
(gdb) up
#2 0x0018996d in call_into_lisp () at x86-assem.S:463
463 call *CLOSURE_FUN_OFFSET(%eax)

I'm using gcc (Rev10, Built by MSYS2 project) 10.2.0 and gdb 10.1, for what it's worth.

Is there anything I need to specify to gdb so it's able to deal with the lisp code?

Revision history for this message
alexis rivera (riveraah) wrote :

I was able to compile SBCL "sbcl-2.1.4-54-g36d9e877a"
on my 32-bit windows MINGW32_NT-10.0-19042 DESKTOP-0ETM2HC 3.1.7-340.i686 2020-09-29 08:19 UTC i686 Msys

I added the following linker flags to the Config.x86-win32

LINKFLAGS += -Wl,--disable-dynamicbase,--disable-nxcompat

Apparently they go together. This document gave a clue
https://lists.gnu.org/archive/html/bug-binutils/2015-09/msg00204.html

There was another discussion that was more direct about it but I can't find the link again. I'm going to try to compile 2.1.3 with this change and report back.

Revision history for this message
alexis rivera (riveraah) wrote :

Added

ifeq ($(shell $(LD) --disable-dynamicbase 2>&1 | grep disable-dynamicbase),)
LINKFLAGS += -Wl,--disable-dynamicbase,--disable-nxcompat
endif

to the Config.x86-win32 for version 2.1.3 and was able to compile and install it also.
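
The ifeq probe above appears to rely on the linker echoing an option it does not recognize: if the option name shows up in the linker's output, the option was rejected; if the output does not mention it, the flags are added. The same test can be sketched in shell. The function name linker_accepts_flag is hypothetical, and the assumption that an unsupported option is reported by name is taken from how the Makefile snippet is written:

```shell
# Hypothetical helper mirroring the ifeq test: succeed when the linker's
# output does NOT mention the flag, i.e. the flag was not rejected.
linker_accepts_flag() {
    linker=$1
    flag=$2
    # The linker exits non-zero either way (no input files), so ignore
    # its status and look only at what it printed.
    out=$("$linker" "$flag" 2>&1) || true
    case $out in
        *"${flag#--}"*) return 1 ;;   # flag echoed back: not supported
        *)              return 0 ;;   # flag not mentioned: accepted
    esac
}
```

For example, `linker_accepts_flag ld --disable-dynamicbase && echo "add the flags"` would print the message only with a linker that knows the option.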

Stas Boukarev (stassats)
Changed in sbcl:
status: New → Fix Committed
Changed in sbcl:
status: Fix Committed → Fix Released