Frequent startup hangs on MacOS 14.5 (Sonoma) amd64

Bug #2067313 reported by Richard M Kreuter
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
SBCL
Fix Released
Undecided
Unassigned

Bug Description

After a software updates yesterday to the latest MacOS (Sonoma 14.5) and latest Macports on an amd64, all versions of SBCL I have access to frequently hang during startup, but with differing frequencies under different terminal emulators and user privilege levels.

SBCL versions tested: 2.2.9 from sbcl.org/platform-table.html, 2.4.4 from Macports, and this weekend's git head (with build commands 'make.sh --xc-host=ccl64' and sometimes trying --with-sb-devel).

uname -a: Darwin flint.local 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:09:52 PDT 2024; root:xnu-10063.121.3~5/RELEASE_X86_64 x86_64

Observable behavior: SBCL prints the banner, never gets to the prompt. Externally, top shows an SBCL process consuming 100% of a CPU with a single thread and a flat virtual memory utilization for however long I wait (I've waited up to 5 minutes).

Every time I send a SIGABRT, the backtrace looks like the one below; the hang always happens in SB-THREAD::INIT-MAIN-THREAD. I instrumented that function to printf after each assignment (diff attached); the hang seems to be happening in the first MAKE-MUTEX, for "Exit Lock".

Frequency of occurrence: the hang has occurred 100% of the ~30 times I've tried running SBCL as an unprivileged user in an XTerm or Rxvt; once out of ~10 times I've run SBCL inside an Emacs *shell* buffer or *inferior-lisp* buffer; perhaps once every 5 times I've tried running SBCL in Terminal.app. (Environment variables and ulimit settings below.) I've been unable to reproduce the hang while running SBCL as root in any of these terminal emulators (about 20 tries). I haven't been terribly methodical here, could be moreso if needed. All 3 SBCL versions mentioned above appear to evince the same propensity to hang within each of terminal emulator as an unprivileged user.

If anybody has suggestions for how to debug this, please let me know.

Representative backtrace:
--
$ ./run-sbcl.sh --no-userinit --no-sysinit
This is SBCL 2.4.4.144-1b9a5ed6c-WIP, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
dcount: 0. <---------------------------------------------- this is my instrumentation, see attached
fatal error encountered in SBCL pid 50640 pthread 0x7ff850490fc0:
SIGABRT received.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> ba
Backtrace:
Warning: lisp package array is not initialized for C
   0: fp=0x18dfd48 pc=0x54878b52 <??? type 45>::INIT-MAIN-THREAD
   1: fp=0x18dfd90 pc=0x5488e8cb (FLET "WITHOUT-GCING-BODY-" :IN <??? type 45>::REINIT)
   2: fp=0x18dfdd8 pc=0x5488e584 <??? type 45>::REINIT
   3: fp=0x18dfe70 pc=0x54baf271 (FLET <??? type 45>::BODY :IN <??? type 45>::START-LISP)
   4: fp=0x18dff30 pc=0x54baf09b (FLET "WITHOUT-INTERRUPTS-BODY-3" :IN <??? type 45>::START-LISP)
   5: fp=0x18dffc8 pc=0x54baeec8 <??? type 45>::%START-LISP
ldb> q
--

The rest of this report is environment variables and ulimits under various terminal emulators, in case they're relevant.

UXTerm (100% frequency)
--
$ env; ulimit -a
TERM=xterm
SHELL=/bin/bash
TMPDIR=/var/folders/sk/rg4zxn_14910jmbfy2k1fhb80000z9/T/
WINDOWID=12582924
XAPPLRESDIR=/Users/kreuter/.app-defaults
XTERM_SHELL=/bin/bash
USER=kreuter
COMMAND_MODE=unix2003
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.soAPlG0PI2/Listeners
__CF_USER_TEXT_ENCODING=0x3E9:0x0:0x0
PATH=/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/usr/local/zfs/bin
__CFBundleIdentifier=org.macports.X11
PWD=/Users/kreuter/zzz/sbcl
XPC_FLAGS=0x0
PS1=\h\$
XTERM_VERSION=XTerm(388)
XTERM_LOCALE=en_US.UTF-8
XPC_SERVICE_NAME=0
TEXINPUTS=/Users/kreuter/src/texmf:
SHLVL=1
HOME=/Users/kreuter
PARINIT=rTbgq B=.,?_A_a Q=_s>|
LOGNAME=kreuter
LC_CTYPE=en_US.UTF-8
DISPLAY=/private/tmp/com.apple.launchd.FLzKPCnFVv/org.macports:0
X11_PREFS_DOMAIN=org.macports.X11
_=/usr/bin/env
OLDPWD=/Users/kreuter/zzz
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 4864
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 2784
virtual memory (kbytes, -v) unlimited
--

URxvt (100% frequency)
--
$ env; ulimit -a
TERM=rxvt-unicode-256color
SHELL=/bin/bash
TMPDIR=/var/folders/sk/rg4zxn_14910jmbfy2k1fhb80000z9/T/
WINDOWID=25165831
XAPPLRESDIR=/Users/kreuter/.app-defaults
XTERM_SHELL=/bin/bash
USER=kreuter
COMMAND_MODE=unix2003
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.soAPlG0PI2/Listeners
__CF_USER_TEXT_ENCODING=0x0:0:0
PATH=/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/usr/local/zfs/bin
STY=15994.ttys003.flint
__CFBundleIdentifier=org.macports.X11
PWD=/Users/kreuter/zzz/sbcl
XPC_FLAGS=0x0
PS1=\h\$
XTERM_VERSION=XTerm(388)
XTERM_LOCALE=en_US.UTF-8
XPC_SERVICE_NAME=0
TEXINPUTS=/Users/kreuter/src/texmf:
COLORFGBG=0;15
SHLVL=3
HOME=/Users/kreuter
PARINIT=rTbgq B=.,?_A_a Q=_s>|
LOGNAME=kreuter
WINDOW=2
LC_CTYPE=en_US.UTF-8
DISPLAY=/private/tmp/com.apple.launchd.FLzKPCnFVv/org.macports:0
X11_PREFS_DOMAIN=org.macports.X11
COLORTERM=rxvt
_=/usr/bin/env
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 4864
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 2784
virtual memory (kbytes, -v) unlimited
--

Terminal.app (low-ish frequency)
--
$ env; ulimit -a
TERM_PROGRAM=Apple_Terminal
SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/sk/rg4zxn_14910jmbfy2k1fhb80000z9/T/
TERM_PROGRAM_VERSION=453
OLDPWD=/Users/kreuter
TERM_SESSION_ID=BE3FE097-5386-4D95-9FB5-EF3C835F7624
USER=kreuter
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.soAPlG0PI2/Listeners
PATH=/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/usr/local/zfs/bin
__CFBundleIdentifier=com.apple.Terminal
PWD=/Users/kreuter/zzz/sbcl
LANG=en_US.UTF-8
XPC_FLAGS=0x0
XPC_SERVICE_NAME=0
SHLVL=1
HOME=/Users/kreuter
LOGNAME=kreuter
DISPLAY=/private/tmp/com.apple.launchd.FLzKPCnFVv/org.macports:0
_=/usr/bin/env
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 2560
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 2784
virtual memory (kbytes, -v) unlimited
--

Emacs *shell* buffer (very low frequency)
--
$ env; ulimit -a
SHELL=/bin/bash
TERM=dumb
TMPDIR=/var/folders/sk/rg4zxn_14910jmbfy2k1fhb80000z9/T/
WINDOWID=14680076
OLDPWD=/Users/kreuter
XAPPLRESDIR=/Users/kreuter/.app-defaults
USER=kreuter
XTERM_SHELL=/bin/bash
COMMAND_MODE=unix2003
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.soAPlG0PI2/Listeners
TERMCAP=
__CF_USER_TEXT_ENCODING=0x3E9:0x0:0x0
COLUMNS=80
PATH=/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/usr/local/zfs/bin
STY=15994.ttys003.flint
__CFBundleIdentifier=org.macports.X11
PWD=/Users/kreuter/zzz/sbcl
XPC_FLAGS=0x0
PS1=\h\$
XPC_SERVICE_NAME=0
XTERM_LOCALE=en_US.UTF-8
XTERM_VERSION=XTerm(388)
TEXINPUTS=/Users/kreuter/src/texmf:
HOME=/Users/kreuter
SHLVL=3
PARINIT=rTbgq B=.,?_A_a Q=_s>|
LOGNAME=kreuter
WINDOW=0
LC_CTYPE=en_US.UTF-8
DISPLAY=/private/tmp/com.apple.launchd.FLzKPCnFVv/org.macports:0
X11_PREFS_DOMAIN=org.macports.X11
INSIDE_EMACS=29.3,comint
_=/usr/bin/env
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 4864
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 9788
cpu time (seconds, -t) unlimited
max user processes (-u) 2784
virtual memory (kbytes, -v) unlimited
--

Revision history for this message
Richard M Kreuter (kreuter) wrote :
Revision history for this message
Stas Boukarev (stassats) wrote :

Can you get a C backtrace for this?

Revision history for this message
Richard M Kreuter (kreuter) wrote :

Sure:

$ lldb -p 54968
(lldb) process attach --pid 54968
Process 54968 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x50944490)
    frame #0: 0x0000000054878b52
-> 0x54878b52: movq %rdx, 0x50944490
    0x54878b5a: subq $0x10, %rsp
    0x54878b5e: xorl %ecx, %ecx
    0x54878b60: movq %rbp, (%rsp)
Target 0: (sbcl) stopped.
Executable module set to "/Users/kreuter/zzz/sbcl/src/runtime/sbcl".
Architecture set to: x86_64h-apple-macosx-.
(lldb) bt
bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x50944490)
  * frame #0: 0x0000000054878b52
    frame #1: 0x000000005488e8cb
    frame #2: 0x000000005488e584
    frame #3: 0x0000000054baf271
    frame #4: 0x0000000054baf09b
    frame #5: 0x0000000054baeec8
    frame #6: 0x0000000005a4808d sbcl`call_into_lisp_ at x86-64-assem.S:209
    frame #7: 0x0000000005a2c4e8 sbcl`create_main_lisp_thread(function=<unavailable>) at thread.c:393:5 [opt]
    frame #8: 0x0000000005a27ed8 sbcl`initialize_lisp(argc=5, argv=0x00007ff7ba5f6660, envp=0x00007ff7ba5f6690) at runtime.c:812:5 [opt]
    frame #9: 0x0000000005a1c029 sbcl`main(argc=<unavailable>, argv=<unavailable>, envp=<unavailable>) at main.c:11:5 [opt]
    frame #10: 0x00007ff80cbc4366 dyld`start + 1942
(lldb)

Revision history for this message
Stas Boukarev (stassats) wrote :

Can you create a sample in the activity monitor when it's spinning?

Revision history for this message
Douglas Katzman (dougk) wrote :

fwiw, I'm running Sonoma. amd64 and have never seen this.
Can you apply the following diffs and see if it changes anything about the failure rate? These are not fixes, just diagnostic tools for when all hope seems lost:

1) disable memory protection
diff --git a/src/runtime/GNUmakefile b/src/runtime/GNUmakefile
index 15e12a995..6b9269a54 100644
--- a/src/runtime/GNUmakefile
+++ b/src/runtime/GNUmakefile
@@ -33,7 +33,7 @@ __LDFLAGS__ =

 include ../../output/prefix.def

-CFLAGS += -g -Wall -Wundef -Wsign-compare -Wpointer-arith -O3
+CFLAGS += -g -Wall -Wundef -Wsign-compare -Wpointer-arith -O3 -DENABLE_PAGE_PROTECTION=0
 ASFLAGS += $(CFLAGS)
 CPPFLAGS += -I.

2) indicate whether startup heap-relocation was needed:
diff --git a/src/runtime/coreparse.c b/src/runtime/coreparse.c
index 817d2ecc1..53ab2a2ef 100644
--- a/src/runtime/coreparse.c
+++ b/src/runtime/coreparse.c
@@ -289,7 +289,7 @@ set_adjustment(struct cons* pair, int id, uword_t actual_addr)
     adj->n_ranges = j+1;
 }

-#define SHOW_SPACE_RELOCATION 0
+#define SHOW_SPACE_RELOCATION 1
 #if SHOW_SPACE_RELOCATION > 1
 # define FIXUP(expr, addr) fprintf(stderr, "%p: (a) %lx", addr, *(long*)(addr)), \
    expr, fprintf(stderr, " -> %lx\n", *(long*)(addr)), ++adj->n_relocs_abs

Revision history for this message
Richard M Kreuter (kreuter) wrote :

@stas sample attached.

@doug will try that now.

Thanks both!

Revision history for this message
Stas Boukarev (stassats) wrote :

>fwiw, I'm running Sonoma.

But is it 14.5?

Revision history for this message
Richard M Kreuter (kreuter) wrote :

I can't reproduce the hang in a build with -DENABLE_PAGE_PROTECTION=0: several hundred successful runs of --eval (print :foo) --quit.

Not seeing any indication of heap relocation with SHOW_SPACE_RELOCATION at 1 (or 2, FWIW).

Also, I should mention/emphasize, I had never seen this hang before updating the OS yesterday.

Revision history for this message
Stas Boukarev (stassats) wrote :

That kinda makes sense. Since it looks like it's constantly faulting on 0x50944490, although the sample doesn't show this. But why, is mprotect broken? It's not broken on arm64.

Revision history for this message
Stas Boukarev (stassats) wrote :

When I make it constantly fault the signal handling does show up:

Call graph:
    2334 Thread_6179410 DispatchQueue_1: com.apple.main-thread (serial)
    + 2253 start (in dyld) + 1942 [0x102c6a366]
    + ! 2253 main (in sbcl) + 9 [0x2b020a9] main.c:11
    + ! 2253 initialize_lisp (in sbcl) + 2888 [0x2b0df18] runtime.c:812
    + ! 2253 create_main_lisp_thread (in sbcl) + 280 [0x2b12528] thread.c:393
    + ! 2253 call_into_lisp_ (in sbcl) + 77 [0x2b2e0ad] x86-64-assem.S:209
    + ! 2253 ??? (in <unknown binary>) [0x54bae4d8]
    + ! 2253 ??? (in <unknown binary>) [0x54bae6ab]
    + ! 2253 ??? (in <unknown binary>) [0x54bae881]
    + ! 2253 ??? (in <unknown binary>) [0x5488e324]
    + ! 2253 ??? (in <unknown binary>) [0x5488e66b]
    + ! 1208 _sigtramp (in libsystem_platform.dylib) + 51 [0x7ff804872ff3]
    + ! : 1142 ??? (in Rosetta JIT) [0x102e780ec]
    + ! : 66 ??? (in Rosetta JIT) [0x102e780e4]
    + ! 1045 0x0
    + ! 541 _sigtramp (in libsystem_platform.dylib) + 51 [0x7ff804872ff3]
    + ! | 538 __sigreturn (in libsystem_kernel.dylib) + 10 [0x7ff80480da8e]
    + ! | + 497 ??? (in runtime) load address 0x7ff7ffd77000 + 0x22788 [0x7ff7ffd99788]
    + ! | + 39 ??? (in <unknown binary>) [0x7ff894ae6d34]
    + ! | + 1 ??? (in <unknown binary>) [0x7ff894ae6d68]
    + ! | + ! 1 ??? (in runtime) load address 0x7ff7ffd77000 + 0x1befc [0x7ff7ffd92efc]
    + ! | + 1 ??? (in runtime) load address 0x7ff7ffd77000 + 0x1bec0 [0x7ff7ffd92ec0]
    + ! | 2 _sigtramp (in libsystem_platform.dylib) + 27 [0x7ff804872fdb]
    + ! | + 1 ??? (in <unknown binary>) [0x7ff894ae419c]
    + ! | + 1 ??? (in <unknown binary>) [0x7ff894ae41c0]
    + ! | 1 _sigtramp (in libsystem_platform.dylib) + 7 [0x7ff804872fc7]
    + ! 468 __sigreturn (in libsystem_kernel.dylib) + 10 [0x7ff80480da8e]
    + ! | 468 ??? (in runtime) load address 0x7ff7ffd77000 + 0x22794 [0x7ff7ffd99794]
    + ! | 435 ??? (in runtime) load address 0x7ff7ffd77000 + 0x1bce4 [0x7ff7ffd92ce4]

Revision history for this message
Stas Boukarev (stassats) wrote :

Maybe there's no signal handling at all so it just keeps faulting on that address without going to user space?

Revision history for this message
Douglas Katzman (dougk) wrote :

yes I'm running 14.5

From the sound of it, there is a situation where a store to a protected page gets handled in the sigsegv handler by unprotecting the page and marking it as touched for GC, then resuming at the faulting instruction, which gets another fault right away, forever. It is as if page fault handling does not work until after the main thread is fully initialized in Lisp, which we can confirm via the following diff. Mind you this "should not" make a difference, but I think it will:

diff --git a/src/runtime/immobile-space.c b/src/runtime/immobile-space.c
index 95d6acee7..f231f680f 100644
--- a/src/runtime/immobile-space.c
+++ b/src/runtime/immobile-space.c
@@ -1185,8 +1185,6 @@ void immobile_space_coreparse(uword_t fixedobj_len,
                 sword_t size = object_size2(obj, header);
                 fixedobj_pages[page].attr.parts.obj_align = size;
                 fixedobj_pages[page].gens |= 1 << immobile_obj_gen_bits(obj);
- if (gen != 0 && ENABLE_PAGE_PROTECTION)
- fixedobj_pages[page].attr.parts.flags = WRITE_PROTECT;
                 break;
             }
         }
diff --git a/src/runtime/runtime.c b/src/runtime/runtime.c
index 274687c8f..ea1315bb8 100644
--- a/src/runtime/runtime.c
+++ b/src/runtime/runtime.c
@@ -778,10 +778,6 @@ initialize_lisp(int argc, char *argv[], char *envp[])
     extern void callback_wrapper_trampoline();
     SYMBOL(CALLBACK_WRAPPER_TRAMPOLINE)->value = (lispobj)callback_wrapper_trampoline;
 #endif
- /* Delayed until after dynamic space has been mapped, fixups made,
- * and/or immobile-space linkage entries written,
- * since it was too soon earlier to handle write faults. */
- write_protect_immobile_space();
 #endif

     /* Requires STATIC_SPACE_START to have been calculated. */

Revision history for this message
Stas Boukarev (stassats) wrote :

If I do sa.sa_handler = SIG_IGN; then I get similar symptoms.

Revision history for this message
Douglas Katzman (dougk) wrote :

(sidebar: I'm fairly near to removing sigsegv-based card marking for immobile-space, but i did not want to be forced to yet. It would be good to understand how mprotect, or our use of it, is broken on macOS)

Revision history for this message
Stas Boukarev (stassats) wrote :

I changed run-program to do

sigfillset(&sset);
sigprocmask(SIG_SETMASK, &sset, NULL);

and then launched an x86-64 sbcl, and I reproduce the exact same behavior without any modifications.
It also explains why it happens more often with some terminal emulators.

Revision history for this message
Stas Boukarev (stassats) wrote :

Can you try this?

Revision history for this message
Richard M Kreuter (kreuter) wrote (last edit ):

That seems to have done it. Can't reproduce the hang in any of the terminal emulators as an unprivileged user.

P.S., for clarity, with the patch to clear the sigmask alone, i.e., without any of the other diagnostic changes posted earlier.

Stas Boukarev (stassats)
Changed in sbcl:
status: New → Fix Committed
Changed in sbcl:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.