Steel Bank Common Lisp

sb-safepoint and sb-sprof memory corruption

Reported by Stas Boukarev on 2013-02-25
Affects: SBCL
Importance: Low
Assigned to: Unassigned

Bug Description

(require :sb-sprof)

(sb-sprof:start-profiling :sample-interval 0.000001)

(sb-thread:make-thread
 (lambda ()
   (loop repeat 1000
         do
         (compile nil '(lambda ()))
         (sb-ext:gc :full t))))

Running this eventually ends up in LDB, the low-level debugger.

Just to clarify, this is not an sb-safepoint (= using safepoints) problem as such, but rather an sb-safepoint-strictly (= not having pseudo-atomic) problem, right?

The issue that was discussed on #sbcl is that the signal handler trampoline for this particular signal (the only asynchronous signal allowed to strike asynchronously during what used to be p/a) isn't careful enough about preventing the handler from being called when it's not appropriate to do so. (During GC? In a safepoint? Not clear. Need to find out.)

FWIW, stacktraces would be a nice first step towards fixing this.

Stas Boukarev (stassats) wrote :

Looking at this again: no, it is an sb-safepoint problem; sb-safepoint-strictly is not enabled.

Stas Boukarev (stassats) wrote :

The attached trace shows that SIGPROF is delivered to another thread during GC. The GC is initiated from a different thread, so the signal is not deferred and causes trouble.
The questions are:
Should the safepoint version of stop_the_world block deferrable signals?
Should maybe_defer_handler take into account that GC is happening at the moment? Currently it checks for without-interrupts and pseudo-atomic.
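
To make the second question concrete, here is a minimal sketch in C of the two deferral policies. It is a toy model, not SBCL's code: the struct, the flag, and every function name below are hypothetical, and the real maybe_defer_handler works on the runtime's own thread structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-thread state; not SBCL's real struct thread. */
struct sketch_thread_state {
    bool interrupts_disabled;  /* thread is inside without-interrupts */
    bool pseudo_atomic;        /* thread is inside a pseudo-atomic section */
};

/* Hypothetical global flag, assumed to be set while a collection runs. */
bool sketch_gc_running = false;

/* Policy as described in the comment above: only two conditions are checked. */
bool sketch_defer_current(const struct sketch_thread_state *th)
{
    assert(th != NULL);
    return th->interrupts_disabled || th->pseudo_atomic;
}

/* Policy raised by the question: additionally defer while GC is running. */
bool sketch_defer_with_gc(const struct sketch_thread_state *th)
{
    return sketch_defer_current(th) || sketch_gc_running;
}
```

With a thread that is in neither without-interrupts nor pseudo-atomic, and with the GC flag set, the first predicate says "run the handler now" while the second says "defer", which is the difference the question is about.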

Stas Boukarev (stassats) wrote :

More is actually going on. When running sbcl --load test.lisp, the main thread ends up in a C function call waiting for REPL input, while the second thread invokes GC. Since the main thread is inside a C function, it cannot be stopped by a safepoint and keeps running until it is interrupted by SIGPROF, whose handler calls a Lisp function, and only then does it encounter a safepoint. Somewhere between receiving the signal and calling the Lisp function, something bad happens.

Stas Boukarev (stassats) wrote :

In reality, the safepoint is not triggered during the Lisp function call but just before it, by WITH_GC_AT_SAFEPOINTS_ONLY() in interrupt_handle_now. The problem turns out to be that, before using that macro, it does

        context_sap = alloc_sap(context);
        info_sap = alloc_sap(info);
which allocates a SAP (system area pointer), breaking the GC in progress.
Fixed in c07b621a73f9580a32d27d94e301c01c5dad5f4e.
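
The ordering problem can be illustrated with a small toy model in C. This is only a sketch of why the order matters, not the content of the commit above: all names are hypothetical stand-ins, the "safepoint" is modelled by simply clearing a flag, and the actual change in the runtime may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are SBCL's real symbols. */
bool toy_gc_in_progress = false;  /* assumed set by the thread running GC */
int  toy_heap_violations = 0;     /* allocations made while GC was running */

/* Models alloc_sap(): allocating while GC runs is counted as a violation. */
void toy_alloc_sap(const void *ptr)
{
    assert(ptr != NULL);
    if (toy_gc_in_progress)
        toy_heap_violations++;
}

/* Models reaching a safepoint: the thread blocks until the collection is
   over. The toy version just clears the flag. */
void toy_wait_at_safepoint(void)
{
    toy_gc_in_progress = false;
}

/* Order described in the comment above: allocate, then hit the safepoint. */
void toy_handler_alloc_first(const void *context, const void *info)
{
    toy_alloc_sap(context);
    toy_alloc_sap(info);
    toy_wait_at_safepoint();
}

/* Safe order in this model: pass the safepoint first, allocate afterwards. */
void toy_handler_safepoint_first(const void *context, const void *info)
{
    toy_wait_at_safepoint();
    toy_alloc_sap(context);
    toy_alloc_sap(info);
}
```

In this model, calling toy_handler_alloc_first while toy_gc_in_progress is set records two violations (one per SAP), whereas toy_handler_safepoint_first records none.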

Changed in sbcl:
status: Triaged → Fix Committed
Changed in sbcl:
status: Fix Committed → Fix Released