SBCL and .Net Core runtimes stomp on each other's signals when loaded in one process

Bug #1834964 reported by Dmitry Ignatiev on 2019-07-01
This bug affects 1 person
Affects: SBCL
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hello! I'm currently working on a library that embeds the .Net Core runtime into a Common Lisp implementation, specifically SBCL, and allows Lisp code to interact with it.

https://github.com/Lovesan/bike

The problem that I have encountered is that the SBCL and .Net Core runtimes do not coexist well on Linux.

In particular, the runtimes seem to be stomping on each other's signals. This causes the SBCL process to crash with the following message:

------------
fatal error encountered in SBCL pid 8500(tid 0x7f54a6096b80):
blockable signals partially blocked: {1,2,3,13,14,15,17,20,23,24,25,26,27,28,29}
------------

The message can also mention "blockables unblocked".

I've spent almost a week trying to dig into this issue but had no luck with any kind of a solution.

What I do have, however, is a piece of code that reproduces the issue in 100% of cases. It does not use my library; it is purely SBCL-specific Lisp code that can be loaded using 'sbcl --script'.

https://gist.github.com/Lovesan/d8c184c8a2d286255af03852e3019bb2

https://launchpad.net/~stassats mentioned that something may be wrong with the .Net Core signal handlers, but maybe there could be a workaround on the SBCL side?

I've put an initial bounty of $1000 on bountysource for solving the problem or finding a workaround:

https://www.bountysource.com/issues/75904410-sbcl-crashes-while-net-is-here

--------------
Tested on:

SBCL 1.5.3

Linux vpn 4.15.0-1043-aws #45-Ubuntu SMP Mon Jun 24 14:07:03 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
--------------

Dmitry Ignatiev (love5an) wrote :

Here's a dump provided by <email address hidden>

---------
[(nil)/(nil)] /entering interrupt_init()
[(nil)/(nil)] /returning from interrupt_init()
code scavenged: 0 total, 0 skipped
Next gc when 75009331 bytes have been consed
Loading /usr/share/dotnet/shared/Microsoft.NETCore.App/2.2.5/libcoreclr.so
Initialized CoreCLR
fatal error encountered in SBCL pid 19111(tid 0x7fb16a2f0b80):
blockable signals partially blocked: {1,2,3,13,14,15,17,20,23,24,25,26,27,28,29}

0: Foreign function (null), pc = 0x41866a, fp = 0x7fb169385e50
1: Foreign function (null), pc = 0x4187f8, fp = 0x7fb169385f40
2: Foreign function all_signals_blocked_p, pc = 0x4193fd, fp = 0x7fb1693860d0
3: Foreign function interrupt_handle_pending, pc = 0x41af2d, fp = 0x7fb169386100
4: Foreign function handle_trap, pc = 0x41c47d, fp = 0x7fb169386140
5: Foreign function (null), pc = 0x4190de, fp = 0x7fb169386180
6: Foreign function (null), pc = 0x7fb168e10359, fp = 0x7fb1693861b0
7: Foreign function (null), pc = 0x7fb169cca0e0, fp = 0x7fb169386778
8: SB-IMPL::%MAKE-HASH-TABLE, pc = 0x52456f7f, fp = 0x7fb1693867f8
9: MAKE-HASH-TABLE, pc = 0x5211b21c, fp = 0x7fb1693868a8
10: SB-C::MAKE-IR2-COMPONENT, pc = 0x52420bbd, fp = 0x7fb169386958
11: SB-C::GTN-ANALYZE, pc = 0x524ecdb7, fp = 0x7fb169386988
12: SB-C::%COMPILE-COMPONENT, pc = 0x5230ebc8, fp = 0x7fb169386a10
13: SB-C::COMPILE-COMPONENT, pc = 0x5221884f, fp = 0x7fb169386a38
14: SB-C::%COMPILE, pc = 0x52310bdf, fp = 0x7fb169386ad0
15: (FLET "LAMBDA0" :IN "SYS:SRC;COMPILER;TARGET-MAIN.LISP"), pc = 0x521eb1d8, fp = 0x7fb169386ba0
16: (FLET SB-C::WITH-IT :IN SB-C::%WITH-COMPILATION-UNIT), pc = 0x5226625d, fp = 0x7fb169386c80
17: SB-C::COMPILE-IN-LEXENV, pc = 0x521ebda4, fp = 0x7fb169386da8
18: COMPILE, pc = 0x5216fd7f, fp = 0x7fb169386de0
19: (LAMBDA () :IN "/home/test/crash.lisp"), pc = 0x52b53ef4, fp = 0x7fb169386e00
20: SB-INT::SIMPLE-EVAL-IN-LEXENV, pc = 0x52227f8a, fp = 0x7fb169386ec8
21: SB-EXT::EVAL-TLF, pc = 0x52353efd, fp = 0x7fb169386ef0
22: (LABELS SB-FASL::EVAL-FORM :IN SB-INT::LOAD-AS-SOURCE), pc = 0x52294ff8, fp = 0x7fb169387120
23: (LAMBDA (SB-KERNEL::FORM &KEY :CURRENT-INDEX &ALLOW-OTHER-KEYS) :IN SB-INT::LOAD-AS-SOURCE), pc = 0x522947f1, fp = 0x7fb169387270
24: SB-C::%DO-FORMS-FROM-INFO, pc = 0x52266d90, fp = 0x7fb169387330
25: SB-INT::LOAD-AS-SOURCE, pc = 0x522942c8, fp = 0x7fb1693874f0
26: (FLET SB-FASL::THUNK :IN LOAD), pc = 0x521cf93a, fp = 0x7fb1693875f8
27: SB-FASL::CALL-WITH-LOAD-BINDINGS, pc = 0x5236f36b, fp = 0x7fb169387680
28: (FLET SB-FASL::LOAD-STREAM :IN LOAD), pc = 0x521cfaed, fp = 0x7fb169387788
29: LOAD, pc = 0x521cf567, fp = 0x7fb169387880
30: (FLET SB-IMPL::LOAD-SCRIPT :IN SB-IMPL::PROCESS-SCRIPT), pc = 0x522edb84, fp = 0x7fb169387960
31: (FLET SB-UNIX::BODY :IN SB-IMPL::PROCESS-SCRIPT), pc = 0x522ed304, fp = 0x7fb169387a00
32: (FLET "WITHOUT-INTERRUPTS-BODY-2" :IN SB-IMPL::PROCESS-SCRIPT), pc = 0x522ed035, fp = 0x7fb169387ac0
33: SB-IMPL::PROCESS-SCRIPT, pc = 0x522ece2a, fp = 0x7fb169387b60
34: SB-IMPL::TOPLEVEL-INIT, pc = 0x52256a8d, fp = 0x7fb169387d30
35: (FLET SB-UNIX::BODY :IN SB-EXT::SAVE-LISP-AND-DIE), pc = 0x523b98e2, fp = 0x7fb169387e00
36: (FLET "WITHOUT-INTERRUPTS-BODY-7" :IN SB-EXT::SAVE-LISP-...


Stas Boukarev (stassats) wrote :

That trace just says "it fails when it fails".

I added a restore_sbcl_signals function in b656602a309fc9647dd01255154c1068305f12f7.
You can call it with (sb-alien:alien-funcall (sb-alien:extern-alien "restore_sbcl_signals" (function (values))))

It will erase any third party signal handlers, which the third party installing them might or might not like. It's possible to restore only the sa_mask but, after trying that with ImageMagick, that's not enough and SBCL crashes.

There's also a possibility of race conditions, so all signals need to be blocked around both whatever overwrites the handlers and the call to restore_sbcl_signals, and that only helps provided there are no other threads.
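
The "block everything around the call" idea can be sketched in C as follows. This is a generic illustration, not the implementation of restore_sbcl_signals: `init_preserving_handlers`, `intruding_init`, `intruder_handler` and `MAX_SAVED` are hypothetical names, and the sketch assumes a single-threaded process, as the comment above requires.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

#define MAX_SAVED 32

/* Hypothetical stand-in for a third-party initializer that installs its
 * own handler, as a foreign runtime might do during startup. */
static void intruder_handler(int sig) { (void)sig; }

static int intruding_init(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = intruder_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Runs init() with every blockable signal blocked, then reinstalls the
 * actions that were in place for sigs[0..n-1] before the call.
 * Returns init()'s result, or -1 if n is out of range. */
static int init_preserving_handlers(int (*init)(void), const int *sigs, int n)
{
    struct sigaction saved[MAX_SAVED];
    sigset_t all, old_mask;
    int rc, i;

    if (n < 0 || n > MAX_SAVED)
        return -1;

    /* Block first, so nothing is delivered while handlers are in flux. */
    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &old_mask);

    for (i = 0; i < n; i++)
        sigaction(sigs[i], NULL, &saved[i]);   /* remember current actions */

    rc = init();

    for (i = 0; i < n; i++)
        sigaction(sigs[i], &saved[i], NULL);   /* put them back */

    /* Pending signals are delivered here, to the restored handlers. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    return rc;
}
```

Two caveats apply to this sketch: blocking does not help with synchronous faults such as SIGSEGV raised by the code itself, and any thread created inside `init()` inherits the fully blocked mask, which the third party may not expect.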

Stas Boukarev (stassats) wrote :

Restoring just the signal masks seems to work with dotnet, but it still has the issue of race conditions.

Whichever solution is chosen, clearing the signal handlers or restoring sa_mask, it should be done by dotnet: either by not installing the signal handlers (there's an option for that, as mentioned on the GitHub ticket) or by copying the correct sa_mask, somewhere in https://github.com/dotnet/coreclr/blob/4f6cc4aff1356802f04a7d6bb7a40c1642f7e96e/src/pal/src/exception/signal.cpp

Dmitry Ignatiev (love5an) wrote :

I did several ad-hoc tests in a container, and it seems like your hack solves the problem.
That is, I'm calling your function right after the coreclr_initialize foreign function, if it succeeds.

I'm very concerned about what that could do to their runtime, however. I have not seen any direct impact, for example on the GC (finalizers are working and, for example, collecting the LispObject instances that come from Lisp). But what are your thoughts on that? I'm currently studying their runtime, but I haven't dug deep enough to understand that.

Also, please elaborate on possible race conditions.

Also please apply for bounty: https://www.bountysource.com/issues/75904410-sbcl-crashes-while-net-is-here

Stas Boukarev (stassats) wrote :

> I'm very concerned about what that could do to their runtime, however. I have not seen any direct impact, for example on the GC (finalizers are working and, for example, collecting the LispObject instances that come from Lisp). But what are your thoughts on that? I'm currently studying their runtime, but I haven't dug deep enough to understand that.

The main concern is what they do with SIGSEGV: are they just translating it into exceptions, or are they using it to implement garbage collection?
They have Initialize entry points that do not establish signal handlers, so, presumably, the runtime can cope with that.

> Also, please elaborate on possible race conditions.

Between the calls to coreclr_initialize and restore_sbcl_signals, a signal may arrive and trip over the blocked signals.

My order of preference, as far as SBCL is concerned:
a) never establishing any signals
b) clearing the signals with restore_sbcl_signals
c) copying sa_mask inside dotnet
d) copying sa_mask with a similar function to restore_sbcl_signals (which I can add if needed).

"a" requires asking the dotnet people to provide an entry point which can avoid passing PAL_INITIALIZE_REGISTER_SIGNALS to Initialize.
And "c" requires getting the old sa_mask from sigaction() and using it at https://github.com/dotnet/coreclr/blob/4f6cc4aff1356802f04a7d6bb7a40c1642f7e96e/src/pal/src/exception/signal.cpp#L863
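
Getting the old sa_mask from sigaction() and reusing it when installing a new handler could look roughly like the C sketch below. It is a generic illustration and not taken from the dotnet sources; `install_inheriting_mask` and `noop_handler` are hypothetical names, and the flags chosen here are an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Placeholder handler; a real one would do its own work and possibly
 * chain to the previously installed action. */
static void noop_handler(int sig, siginfo_t *info, void *context)
{
    (void)sig; (void)info; (void)context;
}

/* Installs handler for sig while inheriting the sa_mask of the action
 * that was installed before. The previous action is returned through
 * *previous so the new handler can chain to it or restore it later.
 * Returns 0 on success and -1 on failure. */
static int install_inheriting_mask(int sig,
                                   void (*handler)(int, siginfo_t *, void *),
                                   struct sigaction *previous)
{
    struct sigaction sa;

    /* Query the existing action first. */
    if (sigaction(sig, NULL, previous) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;

    /* Block the same signals the previous owner asked to have blocked. */
    sa.sa_mask = previous->sa_mask;

    return sigaction(sig, &sa, NULL);
}
```

With this approach, whatever invariant the earlier owner of the signal relies on for its signal mask still holds while the new handler runs.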
