Broken binary search in sb-vm::%simple-fun-from-entrypoint

Bug #1942191 reported by Marco Heisig
This bug affects 1 person
Affects: SBCL
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

To reproduce: (sb-vm::find-called-object (sb-vm::get-lisp-obj-address #'+)) on any SBCL from 1.4.12 to 2.1.8.

Code objects whose code-n-entries is 1 trigger the error "The value -1 is not of type (unsigned-byte 16).", because the MAX variable takes the value -1 after the first iteration of the binary search.
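
To illustrate the failure mode, here is a minimal sketch of a closed-interval binary search whose upper bound must stay unsigned. It is not the SBCL source: FIND-ENTRY-INDEX, FIND-ENTRY-INDEX-SAFE and the ENTRIES vector are hypothetical names, and CHECK-TYPE stands in for the (unsigned-byte 16) declaration on MAX. With a single entry and a target below it, MAX becomes -1 after the first iteration.

```lisp
;; Hypothetical sketch, not taken from SBCL.  ENTRIES is a sorted vector
;; of entry addresses; the result is the index of ADDRESS, or NIL.
(defun find-entry-index (entries address)
  (let ((min 0)
        (max (1- (length entries))))
    (loop
      ;; Stand-in for a type declaration on MAX: signals a TYPE-ERROR
      ;; as soon as MAX has dropped to -1.
      (check-type max (unsigned-byte 16))
      (when (> min max) (return nil))
      (let* ((mid (ash (+ min max) -1))
             (entry (aref entries mid)))
        (cond ((= entry address) (return mid))
              ((< entry address) (setq min (1+ mid)))
              ;; With one entry, MID is 0, so this sets MAX to -1.
              (t (setq max (1- mid))))))))

;; One possible variant: a half-open interval [LOW, HIGH) never needs a
;; negative bound, so a miss simply returns NIL.
(defun find-entry-index-safe (entries address)
  (let ((low 0)
        (high (length entries)))
    (loop while (< low high)
          do (let* ((mid (ash (+ low high) -1))
                    (entry (aref entries mid)))
               (cond ((= entry address) (return mid))
                     ((< entry address) (setq low (1+ mid)))
                     (t (setq high mid))))
          finally (return nil))))
```

In this sketch a lookup that hits the single entry succeeds in both versions; only a miss below the entry drives MAX negative in the first one.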

This bug breaks SAVE-LISP-AND-DIE when statically linking the core. It was introduced by 278ad85561efdd73b86848a9ea4e9653e86a59a0.

Douglas Katzman (dougk) wrote :

That's interesting, but there are literally thousands of code objects with 1 entry, so this can't be the whole story. Do you have an example of breakage without directly calling it?
If I put a trace on those two functions and call save-lisp-and-die, they work fine.

 0: (SB-VM::FIND-CALLED-OBJECT 1386265968)
    1: (SB-VM::%SIMPLE-FUN-FROM-ENTRYPOINT #<code id=3605 [1] SB-KERNEL:TWO-ARG-GCD {52A0BD1F}> 1386265968)
    1: SB-VM::%SIMPLE-FUN-FROM-ENTRYPOINT returned
         #<FUNCTION SB-KERNEL:TWO-ARG-GCD>
  0: SB-VM::FIND-CALLED-OBJECT returned #<FUNCTION SB-KERNEL:TWO-ARG-GCD>
etc

I think the error stems from: if the callee is not found, then binary-search fails.
However, not finding a callee, in the context of save-lisp-and-die, is an illegal state.
So I don't mind the failure of binary search. It's catching some other problem.

Marco Heisig (marco-heisig-h) wrote :

You are absolutely right, Doug, this isn't the whole story :)

This bug occurs when I dump an image containing my sb-simd library. So chances are I botched some VOP definitions or whatnot. Let me try to narrow this problem down to a minimal test case.

Marco Heisig (marco-heisig-h) wrote :

OK, this is a very weird bug. I have narrowed it down to a single top level PROGN with several dozen defuns that all use some of my newly defined VOPs. After I compile and load this code, SAVE-LISP-AND-DIE fails when statically linking the core as described above. Once I remove a few of these defuns, SAVE-LISP-AND-DIE works as intended, and it doesn't matter which defuns I remove. Another interesting observation is that SAVE-LISP-AND-DIE also works as intended when I don't compile and load the code in the same image, but compile first, restart SBCL, and load the FASLs into the new image.

Speculation:
- This is somehow related to the size of a single top level expression.
- This is either caused as a side-effect of compilation, or somehow GC related.

I will investigate further and try to find a minimal example that causes this behavior.

Marco Heisig (marco-heisig-h) wrote :

I narrowed the bug down to the following sequence of events:

1. SAVE-LISP-AND-DIE calls sb-vm::statically-link-core.

2. sb-vm::statically-link-core calls sb-vm::find-called-object on each target address of a call that points outside of its own code component. sb-vm::find-called-object first uses the alien function "search_all_gc_spaces" to locate the target, and then turns the located address into a Lisp object.

3. After several hundred successful invocations, and while processing the code object of one of my SIMD instructions, we end up with a target address that gets resolved to the code object of swank/source-file-cache::read-file. Then, when calling %simple-fun-from-entrypoint to turn that code object into a function, we end up with the bug.

Now I just need to figure out why search_all_gc_spaces returns an address that is most certainly not called by my code.

Marco Heisig (marco-heisig-h) wrote :

I think I found the bug. The instruction definition of some AVX2 instructions was wrong, so when emitting them, the code object's instructions were off by one byte. This went unnoticed so far, because this instruction was one of the few I hadn't written tests for, but it ended up breaking sb-vm::statically-link-core by confusing map-segment-instructions, which then saw calls that weren't really there.
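
A toy sketch of how a one-byte length mismatch can make a scanner see a call that isn't there. Everything here is made up for illustration: the three-opcode instruction set, SCAN-CALLS, and the length tables have nothing to do with the real x86-64 encoding or with map-segment-instructions. The only point is that once the scanner is one byte out of step, an operand byte can be read as a CALL opcode.

```lisp
;; Hypothetical toy instruction set (not x86-64):
;;   #x01 = NOP,  1 byte
;;   #x02 = MOV,  3 bytes (opcode + 2 operand bytes)
;;   #xE8 = CALL, 5 bytes (opcode + 4 operand bytes)
(defparameter *correct-lengths* '((#x01 . 1) (#x02 . 3) (#xE8 . 5)))

;; Same table, but MOV is believed to be one byte shorter than it is.
(defparameter *wrong-lengths* '((#x01 . 1) (#x02 . 2) (#xE8 . 5)))

(defun scan-calls (bytes lengths)
  "Return the offsets in BYTES at which a CALL opcode is decoded,
stepping through the stream with the instruction LENGTHS table."
  (let ((offset 0)
        (calls '()))
    (loop while (< offset (length bytes))
          do (let* ((opcode (aref bytes offset))
                    (len (cdr (assoc opcode lengths))))
               (unless len
                 (error "Unknown opcode ~X at offset ~D" opcode offset))
               (when (= opcode #xE8)
                 (push offset calls))
               (incf offset len)))
    (nreverse calls)))

;; MOV whose second operand byte happens to be #xE8, then four NOPs.
(defparameter *code* (vector #x02 #x00 #xE8 #x01 #x01 #x01 #x01))
```

With *CORRECT-LENGTHS* the scan of *CODE* finds no calls; with *WRONG-LENGTHS* it reports a call at offset 2, whose "target" is just the four NOP bytes that follow.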

That was the weirdest bug I've had to track down in a long time. Thanks Doug for pointing me in the right direction!

Stas Boukarev (stassats)
Changed in sbcl:
status: New → Fix Committed
Stas Boukarev (stassats)
Changed in sbcl:
status: Fix Committed → Fix Released