sbcl 2.2.6 x86-64: no scavenge function for object 0xfe0301fd (widetag 0xfd)

Bug #1982608 reported by Andrew Berkley
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
SBCL
Fix Released
Undecided
Unassigned

Bug Description

fatal error encountered in SBCL pid 2269027 tid 2607849:
no scavenge function for object 0xfe0301fd (widetag 0xfd)

This was under heavy multi-threaded use. It has only happened once so far in a couple weeks of working, so not easily reproducible. There is a discussion in sbcl-dev mailing list with the same odd widetag back in April. I don't have any more information.

Revision history for this message
Douglas Katzman (dougk) wrote :

I'm trying to understand when you say "once so far in X time" and "not easily reproducible" whether you mean "only once ever" versus "once recently and possibly as well before that"
GC bugs are taken very seriously, but with only a single example of something that might have been anything from a neutrino strike to hardware error, there is literally nothing actionable here.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

Fair enough, I wasn't entirely clear. I've only been using 2.2.6 for a couple of weeks as I am testing whether it is stable enough to upgrade our production and development systems to. The error I reported has only happened once so far. If it happens again, I will of course update this bug description and do some more poking around at it.

We run sbcl 2.0.9 as our development and production setup and it gets a lot of use and we've never seen something like this. So while I agree it's possible this was a one off issue, my gut feel is it's a real problem.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

While I still cannot rule out user problems (we have a large codebase in which we may do some evil things), but we weren't seeing these before upgrading from 2.0.9.

fatal error encountered in SBCL pid 49163 tid 104422:
no scavenge function for object 0xff04ff01 (widetag 0x1)

Any hints on how to debug this would be useful...

Revision history for this message
Douglas Katzman (dougk) wrote :

you can set the following C variables from lisp:
(setf (extern-alien "pre_verify_gen_0" int) 1)
(setf (extern-alien "verify_gens" char) 0) ; 0 is unintuitive but right

The first one checks that lisp did not corrupt the heap just before GC is started and the second checks for heap consistency after the GC. Or make the equivalent assignments in C and rebuild the runtime if you want them always enabled. These slow down GC by a lot, so be prepared for that.

Also I usually do
 (setf (extern-alien "gencgc_verbose" char) 1)

Revision history for this message
Patrick Poitras (pfpoitras) wrote :

As a follow up to the previous, we've observed this issue 4 times at this point. We've rolled back to 2.0.9 for the moment, and are working to try to get some sort of reproducible test case out.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

OK, we did the above and see:

Ptr 0x10099029c7 @ 11860a40d8 (lispobj 11860a40d7,pg199700) sees unallocated space
Ptr 0x1009c3a997 @ 11864f3848 (lispobj 11864f3847,pg199838) sees younger obj from WP page
Ptr 0x100a227637 @ 1189fb47c8 (lispobj 1189fb47c7,pg201718) sees unallocated space
Ptr 0x100a227637 @ 1189fb47c8 (lispobj 1189fb47c7,pg201718) sees forwarding ptr
Ptr 0x100a227637 @ 1189fb47c8 (lispobj 1189fb47c7,pg201718) sees younger obj from WP page
Ptr 0x100a227637 @ 1189fb47c8 (lispobj 1189fb47c7,pg201718) sees junk
Ptr 0x10025c3e77 @ 12037010a8 (lispobj 12037010a7,pg263904) sees younger obj from WP page
Ptr 0x10025c3e77 @ 12037010a8 (lispobj 12037010a7,pg263904) sees junk
fatal error encountered in SBCL pid 144 tid 4353:
Verify failed: 8 errors

Any further hints? Thank you!

Revision history for this message
Andrew Berkley (ajberkley) wrote :

Sorry left out the first part of the log
Verify before GC [immobile] [threads] [RO] [static] [dynamic]

Revision history for this message
Andrew Berkley (ajberkley) wrote :

This might be more helpful, or not. Here is another crash
Verify before GC [immobile] [threads] [RO] [static] [dynamic]
Ptr 0x1050319667 @ 1256f34948 (lispobj 1256f34947,pg306662) sees younger obj from WP page
Ptr 0x1052e61117 @ 1257abc428 (lispobj 1257abc427,pg307031) sees younger obj from WP page
Ptr 0x1052bd3bc7 @ 1257d4ee78 (lispobj 1257d4ee77,pg307113) sees younger obj from WP page

We had uncommented the call to hexdump_spaces (the comment has a typo in it) after this fails and we see:

range 306662:306662 (0x1256f30000:0x1256f37f00) t5 g1 (-)
 f30000: fill to 0x42a6f30008
 f33050: 10094a59ef (>g1)
 f33060: fill to 0x1266f33068
 f33070: 101cb6b64f (>g2)
 f33080: fill to 0x2486f33088
 f342b0: 1243abe6f3 (>g1)
 f342c0: 505f2baf (>g1)
 f342d0: fill to 0x1816f342d8
 f34890: 5010014f (>g1)

Honestly I don't know how to interpret this yet.

Revision history for this message
Douglas Katzman (dougk) wrote :

hi Andrew. Just confirm something for me - did you set BOTH of
(setf (extern-alien "pre_verify_gen_0" int) 1)
(setf (extern-alien "verify_gens" char) 0) ; 0 is unintuitive but right
?
I ask because it is highly unusual to see a failure in verify before GC on a page in generation 1 where it didn't _also_ have an error after the prior GC. A pre-GC failure places the blame on Lisp but a post-GC failure places the blame on C, so we have to establish "who" caused the next failure - Lisp or C - but a pre-GC failure can be the same as a post-GC failure where it didn't actually run the verify, so it was a silent failure.

I assume that to induce this problem you had to give some huge amount of time for a process to soak and you don't have a standalone reproducer?

Also a couple other questions: do you suspect that you're doing any of the following?
* manipulating Lisp objects from C
* manipulating Lisp objects using (setf (sap-ref-word ...))
* running in safety 0
* violating any rules around dynamic-extent

While I'm certainly happy to explain the format of the hexdump, it's probably not the best way to go about this.
How large a heap are you running, and do you think that you'd be able to collect a gc crashdump (which is different from a hexdump) and provide it to me for analysis? The answer has to be "no" if it contains anything whatsoever that is confidential or personally identifying.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

The first case we captured had just had the features turned on, so no post-gc verification had run. The second case did not have any post gc errors.

We use dynamic-extent often. I don't think we do much manipulation of lisp objects from C (we do use C allocated arrays and manipulate them from Lisp). We use safety 0 in only a very very small number of places.

We don't have a standalone reproducer and I'm pretty sure we won't be able to provide an image without passing it through the legal stack which will take some time. dynamic-space-size is 64GB with some static arrays living around. It seems we can reproduce our problem in about thirty minutes or so, so I will try removing all our uses of dynamic-extent and then safety 0 and if that doesn't work I will bisect sbcl releases between 2.0.9 and 2.2.6.

Thank you very much for your help --- I will update this ticket when we learn something.

Revision history for this message
Stas Boukarev (stassats) wrote :

You can set sb-ext:*stack-allocate-dynamic-extent* to NIL and recompile.

Revision history for this message
Douglas Katzman (dougk) wrote :

In that every problematic object, its address ends in 7 which is sb-vm:list-pointer-lowtag.
And the word which holds the bad pointer is at object+1 which is the CDR of a cons.
So I'm trying to think what do you do that is so usual with cons cells that this is the first I've heard of it. I thought maybe there is a bug related to SETF on the CDR of a cell in a &REST list, but the assembly code looks right. Does this ring any bells for you that you're doing something with lists that is somehow atypical though permissible ?

Revision history for this message
Andrew Berkley (ajberkley) wrote :

We do a lot of atomic-push, atomic-pop, and atomic-incf.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

Just randomly looking through code with cdr in it, in one place we atomic-push an sb-alien:alien to the cdr of a cons and pop them off. This is a buffer pool for some operations.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

atomic-incf/decf on the cdr of a cons in several places.

Removing all dynamic-extent and (safety 0) stuff didn't fix anything, so I've been bisecting sbcl releases. It's slow and I think not 100% reliable since I have an arbitrary patience limited wait time for failures, but so far I have 2.0.9, 2.2.0 and 2.2.3 are fine while 2.2.5 and 2.2.7 are not.

Revision history for this message
Douglas Katzman (dougk) wrote :

I suspect that I can possibly save you some bisection time by guessing that 9394492fc644b14fa857f39281ac34085deed68b will be where the problem started appearing for you.
So at one revision prior, you should not be able to elicit the problem.
If we can establish that, it would help. But I still have no idea what's wrong.

Revision history for this message
Douglas Katzman (dougk) wrote :

wait, ignore what I just said. I was overly pessimistic.

I think the problem may be f4e01a5a6dd76f27be435ebc1cbb662dcf377b08 because that went in in April and you think you're OK through March 28 at release 2.2.3

Revision history for this message
Stas Boukarev (stassats) wrote :

f4e01a5a6dd76f27be435ebc1cbb662dcf377b08 was later rectified, so testing just that wouldn't tell much.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

Using the gc verification method of detecting when things go wrong (as opposed to waiting for the world to blow up):

2.2.3 runs fine
2.2.3.187-3b78f2bf4 runs fine
2.2.3.188-f4e01a5a6 fails
2.2.4 fails
2.2.5 fails
2.2.7 fails
2.2.9 fails

Thinking of other weird things we do: atomic-push to a simple-vector element.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

2.2.5.178-9b417e65d fails (this was my attempt to find what the later rectification was for f4e01a5a6)

Revision history for this message
Douglas Katzman (dougk) wrote :

thanks Andrew, 9b417e65d was exactly the right thing to try.
As you probably noticed, 2.2.6 already had that fix.

It is valid (but suboptimal) for the suspect logic to return NIL, so would it be possible to apply this diff at some revision to be chosen by you, and see if passes the GC checker?
I plan to put some additional runtime checks in to my current development version and see if I can make the regression suite fail, but it's good to have a multi-pronged attack.

diff --git a/src/compiler/ir1util.lisp b/src/compiler/ir1util.lisp
index d513d8625..afe5407ad 100644
--- a/src/compiler/ir1util.lisp
+++ b/src/compiler/ir1util.lisp
@@ -429,6 +429,7 @@
           (dominates-p block1 block2)))))

 (defun set-slot-old-p (node)
+ (return-from set-slot-old-p nil)
   (let ((args (combination-args node)))
     (multiple-value-bind (object-lvar value-lvar)
         (if (lvar-fun-is (combination-fun node) '(%%primitive))

Revision history for this message
Andrew Berkley (ajberkley) wrote :

I ran with 2.2.9-WIP with the change above from Douglas and it works fine! There are only a handful of places where this gets triggered in our code base, so I'll dig through them tomorrow and find out which is doing the wrong thing.

Revision history for this message
Andrew Berkley (ajberkley) wrote :

The places where set-slot-old-p returns true and seems wrong are something of the following sort:

(defun test-it ()
  (let ((p (cons nil nil)))
    (values
     (lambda (blarg) (setf (cdr p) blarg))
     (lambda () p))))

My understanding, which I am writing just for my benefit to get you to correct it if it is wrong is: The first function (a 'setter') erroneously does not emit a storage barrier / mark the page p lives on when updating (cdr p) which means whatever blarg is will not be kept alive because p will not be scanned next gc. gc verification catches this early because p, on a still write protected page, points to the younger blarg and won't be scanned to keep it alive. The crash occurs when gc comes along later and deletes blarg and then someone triggers the gc to scan p and it wanders off into the deleted land pointed at by (cdr p). In our code, most of the time these are all short lived objects and so it doesn't crash all the time (they all live in the nursery happily).

Paste the following into a repl to trigger this. It's probably not minimal but it's late here.

(defun test-it ()
  (let ((p (cons nil nil)))
    (values
     (lambda (blarg) (setf (cdr p) blarg))
     (lambda () p))))

(progn (setf (extern-alien "verify_gens" char) 0)
       (setf (extern-alien "pre_verify_gen_0" int) 1)
       (setf (extern-alien "gencgc_verbose" char) 1))

(multiple-value-bind (setter getter)
    (test-it)
  (defparameter *setter* setter)
  (defparameter *getter* getter))
(gc :full t)
(progn (funcall *setter* (list 1 2 3)) nil)
(gc :full t)

Stas Boukarev (stassats)
Changed in sbcl:
status: New → Fix Committed
Changed in sbcl:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.