poorly distributed SXHASH results for compound data

Bug #309443 reported by Nikodemus Siivola on 2008-12-18
Affects Status Importance Assigned to Milestone

Bug Description

  SBCL's SXHASH could probably try a little harder. ANSI: "the
  intent is that an implementation should make a good-faith
  effort to produce hash-codes that are well distributed
  within the range of non-negative fixnums". But
 (let ((hits (make-hash-table)))
   (dotimes (i 16)
     (dotimes (j 16)
       (let* ((ij (cons i j))
                     (newlist (push ij (gethash (sxhash ij) hits))))
         (when (cdr newlist)
           (format t "~&collision: ~S~%" newlist))))))
  reports lots of collisions in sbcl-0.7.8. A stronger MIX function
  would be an obvious way of fix. Maybe it would be acceptably efficient
  to redo MIX using a lookup into a 256-entry s-box containing
  29-bit pseudorandom numbers?

Changed in sbcl:
importance: Undecided → Medium
status: New → Confirmed

After some experiments, I have found the optimal (well, I hope it to be so) MIX function definition. The above test shows no collisions when using this function, and all regression tests pass.

Please review and propose further improvements (if any) or commit the current variant.

Changed in sbcl:
assignee: nobody → Roman Marynchak (roman-marynchak)
status: Confirmed → In Progress
tags: added: review

I have evaluated the performance degradation with this fix, and the results are bad: more than 2x degradation for the code which looks like the initial example in this issue.

So, the more complex MIX is a wrong way.

Changed in sbcl:
assignee: Roman Marynchak (roman-marynchak) → nobody
status: In Progress → Confirmed
tags: removed: review
Gustavo (gugamilare) wrote :

I question the validity of this bug. The code in the description is not a good detector of hash collisions because it uses only small fixnums. For instance, the following code gives no collisions at all.

(let ((hits (make-hash-table)))
  (dotimes (i 2500)
    (dotimes (j 2500)
      (let* ((ij (cons (random most-positive-fixnum)
                       (random most-positive-fixnum)))
             (newlist (push ij (gethash (sxhash ij) hits))))
        (when (cdr newlist)
          (format t "~&collision: ~S~%" newlist))))))

What causes the amount of collisions in the example is that SBCL's implementation of SXHASH for fixnums has the obvious drawback that the lower bits of the fixnums only affect the lower bits of the result and MIX doesn't do a good job in changing that.

Of course, if you want to make a better case for small numbers (which are probably more used in the code), then what needs to be done is to bring the lower bits higher in the hash. For instance, the following solves the "problem" completely:

(433494437 is a prime number, I got it from Wikipedia)

--- a/src/code/target-sxhash.lisp
+++ b/src/code/target-sxhash.lisp
@@ -67,7 +67,7 @@
   ;; (These are vaguely like the ideas used in many cryptographic
   ;; algorithms, but we're not pushing them hard enough here for them
   ;; to be cryptographically strong.)
- (let* ((xy (+ (* x 3) y)))
+ (let* ((xy (+ (logand most-positive-fixnum (* x 433494437)) y)))
     (logand most-positive-fixnum
             (logxor 441516657

Changed in sbcl:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers