Debian 64t transition causes build failure on 32bit arch

Bug #2063340 reported by Peter Van Eynde
This bug affects 1 person
Affects   Status   Importance   Assigned to   Milestone
SBCL      New      Undecided    Unassigned

Bug Description

Hi,

During a rebuild of sbcl 2.3.7 on Debian/unstable, after the https://wiki.debian.org/ReleaseGoals/64bit-time transition, we see a problem for armhf. See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1069520

> "obj/from-xc/src/code/late-globaldb.lisp-obj"
>
> debugger invoked on a TYPE-ERROR:
> The value
> 542868266893180929
> is not of type
> (UNSIGNED-BYTE 32)
> when setting slot SB-THREAD::OBSERVED-INTERNAL-REAL-TIME-DELTA-SEC of structure SB-THREAD:THREAD
...
> (GET-INTERNAL-REAL-TIME)
> 0]

This seems to be related to the 64-bit time_t transition: when I rebuild on armhf, the groveled data differs from the checked-in copy:

(sid_armhf-dchroot)pvaneynd@abel:~/sbcl-2.3.7$ diff -u ./crossbuild-runner/backends/arm/stuff-groveled-from-headers.lisp output/stuff-groveled-from-headers.lisp
--- ./crossbuild-runner/backends/arm/stuff-groveled-from-headers.lisp 2023-07-29 07:59:39.000000000 +0000
+++ output/stuff-groveled-from-headers.lisp 2023-07-29 07:59:39.000000000 +0000
@@ -30,7 +30,7 @@
 (define-alien-type off-t (signed 64))
 (define-alien-type size-t (unsigned 32))
 (define-alien-type ssize-t (signed 32))
-(define-alien-type time-t (signed 32))
+(define-alien-type time-t (signed 64))
 (define-alien-type suseconds-t (signed 32))
 (define-alien-type uid-t (unsigned 32))
 ;; Types in src/runtime/wrap.h. See that file for explantion.
@@ -141,6 +141,7 @@
 (defconstant clock-process-cputime-id 2) ; #x2
 (defconstant clock-realtime-alarm 8) ; #x8
 (defconstant clock-realtime-coarse 5) ; #x5
+(defconstant clock-tai 11) ; #xb
 (defconstant clock-monotonic-coarse 6) ; #x6
 (defconstant clock-monotonic-raw 4) ; #x4
 (defconstant clock-boottime 7) ; #x7
@@ -149,11 +150,11 @@
 ;;; structures
 (define-alien-type nil
   (struct timeval
-          (tv-sec (signed 32))
-          (tv-usec (signed 32))))
+          (tv-sec (signed 64))
+          (tv-usec (signed 64))))
 (define-alien-type nil
   (struct timespec
-          (tv-sec (signed 32))
+          (tv-sec (signed 64))
           (tv-nsec (signed 32))))

I think we need to change

❯ git grep observed-internal-real-time-delta-sec
src/code/thread-structs.lisp: #-64-bit (observed-internal-real-time-delta-sec 0 :type sb-vm:word)
src/code/unix.lisp: (sb-thread::thread-observed-internal-real-time-delta-sec thr))

to account for the 64bit nature of tv-sec and friends, even on 32 bit architectures.

Best regards, Peter

Douglas Katzman (dougk) wrote :

The easiest solution may be to put some C code in wrap.c that returns a timespec the way lisp expects it to look because I don't see anything in the Lisp side that seems wrong. Nanoseconds are immediately scaled down by 10^6 after calling clock_gettime so that get-internal-real-time remains a fixnum. Therefore anything stored in the 'observed-delta' slot should not be too large.
I'll bet it's a discrepancy in what Lisp thinks a 64-bit integer looks like in a C structure in this ABI (is it most-significant half first or second? I don't know). I see no other instance where Lisp cared about that for a 32-bit build. A few kernel types were 64-bit already, but none of them very interesting to Lisp: dev_t and ino_t, for example.
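
The wrap.c idea above could look roughly like the following sketch. It is only an illustration: `lisp_timespec`, its field names, and `wrapped_clock_gettime` are hypothetical and are not taken from SBCL. The point is that C code compiled against the system headers always sees the real `struct timespec`, and can copy it into fixed-width fields whose layout does not change with the width of `time_t`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical fixed layout handed to Lisp: three 32-bit words,
 * independent of how wide time_t is on this platform. */
struct lisp_timespec {
    uint32_t sec_lo;  /* low 32 bits of tv_sec */
    uint32_t sec_hi;  /* high 32 bits of tv_sec after widening to 64 bits */
    int32_t  nsec;    /* tv_nsec */
};

/* Hypothetical wrapper: calls the real clock_gettime through the C headers
 * and repacks the result. Returns 0 on success, -1 on failure, exactly as
 * clock_gettime does. */
int wrapped_clock_gettime(clockid_t clock_id, struct lisp_timespec *out)
{
    struct timespec ts;
    int rc;

    assert(out != NULL);
    rc = clock_gettime(clock_id, &ts);
    if (rc != 0)
        return rc;

    uint64_t sec = (uint64_t)ts.tv_sec;
    out->sec_lo = (uint32_t)(sec & 0xffffffffu);
    out->sec_hi = (uint32_t)(sec >> 32);
    out->nsec   = (int32_t)ts.tv_nsec;
    return 0;
}
```

With a layout like this, the Lisp side would only ever read 32-bit fields, so the question of how a 64-bit integer is laid out in the struct would not arise.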

Sean Whitton (spwhitton) wrote :

Hello,

The error output is slightly different with sbcl 2.4.5 but it would seem to be the same bug:

    "obj/from-xc/src/code/late-globaldb.lisp-obj"
    Internal error #88 "Object is not of type UNSIGNED-BYTE-32." at 0x4f859610
    SC: 0, Offset: 0 $1= 0x51267e0f: other pointer
    fatal error encountered in SBCL pid 31092:
    internal error too early in init, can't recover

    Welcome to LDB, a low-level debugger for the Lisp runtime environment.
    ldb>

Douglas Katzman (dougk) wrote :

As mentioned in comment #2 the values in the fields haven't actually become large enough for this to matter yet, so this can't be "actual" overflow; it has to be just a calling convention issue that Lisp is getting wrong.
I imagine that no active SBCL committer has access to a machine using the wider time types (i.e. afflicted with the problem) to figure it out; I certainly don't.
There should be a straightforward way to diagnose further if not repair: intercept any of the time-related calls in C and print the arguments/results in hex; similarly print them in Lisp and visually compare to figure out what's going wrong. The fix should become self-evident. I don't know who can do it for you.
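
The claim that this is not a genuine overflow can be checked against the value in the original report. The sketch below splits 542868266893180929 into 32-bit halves; the helper names are made up for illustration. The low half comes out as 1 and the high half as 126396368, which is below 10^9. That would be consistent with a small seconds value and a nanoseconds value sitting in adjacent 32-bit words and being read together as one 64-bit field, although nothing in this thread confirms that reading.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Value from the TYPE-ERROR in the original report. */
#define REPORTED_VALUE UINT64_C(542868266893180929)

/* Upper and lower 32 bits of a 64-bit value. */
uint32_t high_word(uint64_t v) { return (uint32_t)(v >> 32); }
uint32_t low_word(uint64_t v)  { return (uint32_t)(v & 0xffffffffu); }

/* Print both halves in hex and decimal for visual comparison. */
void print_halves(uint64_t v)
{
    uint32_t hi = high_word(v);
    uint32_t lo = low_word(v);

    /* Sanity check: the two halves must recombine to the input. */
    assert((((uint64_t)hi << 32) | lo) == v);
    printf("high=%08" PRIx32 " (%" PRIu32 ")  low=%08" PRIx32 " (%" PRIu32 ")\n",
           hi, hi, lo, lo);
}
```

Calling `print_halves(REPORTED_VALUE)` gives the two words to compare against what the C side prints for the same call.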

Douglas Katzman (dougk) wrote :

Please paste in the C definitions of 'struct timespec' and 'struct timeval'.

Peter Van Eynde (ubuntu-pvaneynd) wrote :

Hi,

The definition is a bit complex because there are many indirections. I can give the `gdb` view of the world, which should be the ground truth.

For reference, on x86_64:

❯ uname -a
Linux frost 6.9.9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.9.9-1 (2024-07-13) x86_64 GNU/Linux

we have:

(gdb) ptype /r test_timeval
type = struct timeval {
    __time_t tv_sec;
    __suseconds_t tv_usec;
}
(gdb) ptype /o test_timeval
/* offset | size */ type = struct timeval {
/* 0 | 8 */ __time_t tv_sec;
/* 8 | 8 */ __suseconds_t tv_usec;

                               /* total size (bytes): 16 */
                             }
(gdb) ptype /r test_timespec
type = struct timespec {
    __time_t tv_sec;
    __syscall_slong_t tv_nsec;
}
(gdb) ptype /o test_timespec
/* offset | size */ type = struct timespec {
/* 0 | 8 */ __time_t tv_sec;
/* 8 | 8 */ __syscall_slong_t tv_nsec;

                               /* total size (bytes): 16 */
                             }
(gdb) ptype /r __time_t
type = long
(gdb) ptype /r __suseconds_t
type = long
(gdb) ptype /r __syscall_slong_t
type = long

❯ getconf LONG_BIT
64

On armv8l, the target architecture:

(sid_armhf-dchroot)pvaneynd@amdahl:~$ uname -a
Linux amdahl 6.1.0-23-arm64 #1 SMP Debian 6.1.99-1 (2024-07-15) armv8l GNU/Linux

we have:

(gdb) ptype /r test_timeval
type = struct timeval {
    __time64_t tv_sec;
    __suseconds64_t tv_usec;
}
(gdb) ptype /o test_timeval
/* offset | size */ type = struct timeval {
/* 0 | 8 */ __time64_t tv_sec;
/* 8 | 8 */ __suseconds64_t tv_usec;

                               /* total size (bytes): 16 */
                             }
(gdb) ptype /r test_timespec
type = struct timespec {
    __time64_t tv_sec;
    long tv_nsec;
}
(gdb) ptype /o test_timespec
/* offset | size */ type = struct timespec {
/* 0 | 8 */ __time64_t tv_sec;
/* 8 | 4 */ long tv_nsec;
/* XXX 4-byte padding */

                               /* total size (bytes): 16 */
                             }

(gdb) ptype /r __time64_t
type = long long
(gdb) ptype /r __suseconds64_t
type = long long

(sid_armhf-dchroot)pvaneynd@amdahl:~$ getconf LONG_BIT
32

Best regards, Peter

Douglas Katzman (dougk) wrote :

Your timespec on 32-bit arm is 16 bytes but the Lisp definition was 12 bytes.
Try building at https://sourceforge.net/p/sbcl/sbcl/ci/3fd92b8a which might have fixed this.
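
The size mismatch described above can be modelled in C. This is a sketch under assumptions: the two struct names are invented, `#pragma pack` is used only to mimic a definition with no tail padding, and the real Lisp-side layout is whatever SBCL's alien type machinery computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Model of a 64-bit tv_sec followed by a 32-bit tv_nsec with no tail
 * padding: 8 + 4 = 12 bytes. */
#pragma pack(push, 4)
struct timespec_unpadded {
    int64_t tv_sec;
    int32_t tv_nsec;
};
#pragma pack(pop)

/* Model of the armhf layout shown by gdb: the same two fields plus
 * 4 bytes of padding, 16 bytes in total. */
struct timespec_padded {
    int64_t tv_sec;
    int32_t tv_nsec;
    int32_t pad;
};

/* Print what the C compiler on this machine believes about the real type. */
void report_layout(void)
{
    assert(sizeof(struct timespec_unpadded) == 12);
    assert(sizeof(struct timespec_padded) == 16);
    printf("system struct timespec: size=%zu, tv_nsec at offset %zu\n",
           sizeof(struct timespec), offsetof(struct timespec, tv_nsec));
}
```

Running `report_layout()` in the armhf chroot and comparing its output with the size Lisp computes for its own definition would show whether the two sides agree.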

Peter Van Eynde (ubuntu-pvaneynd) wrote :

Hi,

I've tried rebuilding with version 3fd92b8ababa65527d8534425eb744a926bbbf96 but it still fails during build:

...
"obj/from-xc/src/code/target-format.lisp-obj"
"obj/from-xc/src/code/late-globaldb.lisp-obj"
fatal error encountered in SBCL pid 4150551:
internal error too early in init, can't recover

Internal error #88 "Object is not of type UNSIGNED-BYTE-32." at 0x4f85c700
    SC: 0, Offset: 0 $1= 0x5128947f: other pointer
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> backtrace
Backtrace:
   0: [I]0xf736f228 pc=0x4f85c700 {0x4f85c000+0700} GET-INTERNAL-REAL-TIME
   1: 0xf736f1f0 pc=0x4f684a48 {0x4f6845b8+0490} SB-C::MAKE-SOURCE-INFO
   2: 0xf736f1a0 pc=0x4fb704b8 {0x4fb6f000+14b8} (LAMBDA () :IN SB-C::COMPILE-IN-LEXENV)
   3: 0xf736f128 pc=0x4fb3e820 {0x4fb3e000+0820} (FLET SB-C::WITH-IT :IN SB-C::%WITH-COMPILATION-UNIT)
   4: 0xf736f0a4 pc=0x4fb6f7a0 {0x4fb6f000+07a0} SB-C::COMPILE-IN-LEXENV
   5: 0xf736f078 pc=0x4fb6c630 {0x4fb6c320+0310} COMPILE
   6: 0xf736f068 pc=0x506c2560 {0x506c2000+0560} SB-PRETTY::!PPRINT-COLD-INIT
   7: 0xf736f000 pc=0x509c39e8 {0x509c2000+19e8} SB-KERNEL::!COLD-INIT
Note: [I] = interrupted
ldb> print 0x5128947f
$1= 0x5128947f: other pointer
            header: 0x0000020a: bignum
          0x00000000_0acaa91d_00000001

I also tried with a clean sbcl rebuild (except for the patch for the arm architecture [1]) and get the same error.

Best regards, Peter

[1] https://salsa.debian.org/common-lisp-team/sbcl/-/blob/master/debian/patches/armhf-is-not-v5.patch?ref_type=heads

Douglas Katzman (dougk) wrote :

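OK, can you try compiling and running this:
#include <time.h>
#include <stdio.h>

/* Four zero-initialized 32-bit words; any word the call leaves untouched stays 0. */
unsigned int x[4];
int main(void)
{
  /* View the buffer as a timespec so its raw words can be dumped afterwards. */
  struct timespec* ts = (void*)x;
  clock_gettime(CLOCK_REALTIME_COARSE, ts);
  /* First line: words 0 and 1; second line: words 2 and 3 (hex, then decimal). */
  printf("%08x %08x = %u %u\n%08x %08x = %u %u\n",
         x[0], x[1], x[0], x[1],
         x[2], x[3], x[2], x[3]);
  return 0;
}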

On cfarm27.cfarm.net which is 32-bit x86 with 64-bit seconds + 32-bit nanoseconds I see:
$ ./a.out
669f11ab 00000000 = 1721700779 0
20d3f941 00000000 = 550762817 0

Here are some possibilities:
- If this resembles your output, then it probably means the arm 32-bit C->Lisp convention is screwed up.
- If the first number on the second line exceeds 1 billion, then the nanoseconds are really screwed up.
- If the second hex number on the second row is nonzero, then the padding is in the wrong place.
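
The three checks above can be written down as a small function. This is a sketch: `check_words` and the `verdict` names are invented here, and it assumes a little-endian target with the four 32-bit words laid out as in the test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum verdict {
    C_SIDE_LOOKS_SANE,  /* resembles the expected output; suspect the C->Lisp side */
    NSEC_OUT_OF_RANGE,  /* third word is 1 billion or more */
    PADDING_MISPLACED   /* fourth word is nonzero */
};

/* Classify the four words printed by the test program.
 * x[0], x[1]: low and high halves of tv_sec; x[2]: tv_nsec; x[3]: padding. */
enum verdict check_words(const uint32_t x[4])
{
    assert(x != NULL);
    if (x[3] != 0)
        return PADDING_MISPLACED;
    if (x[2] >= 1000000000u)
        return NSEC_OUT_OF_RANGE;
    return C_SIDE_LOOKS_SANE;
}
```

Fed the four words from the cfarm27 run (669f11ab, 00000000, 20d3f941, 00000000), it returns `C_SIDE_LOOKS_SANE`.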

Peter Van Eynde (ubuntu-pvaneynd) wrote :

Hello,

Running the test I get:

(sid_armhf-dchroot)pvaneynd@amdahl:~/t$ ./a.out
669f365d 00000000 = 1721710173 0
2f530d4a 00000000 = 793972042 0

I'm guessing the `long long` support might be the cause?

Best regards, Peter

Douglas Katzman (dougk) wrote :

There could be something else going on, such as linker or C preprocessor tricks.

I wrote a test on a known-good 32-bit arm setup showing that passing 64-bit int seems ok.
Specifically I compiled this C function into the SBCL runtime:
int faketime(int ignore, unsigned int* foo)
{
  printf("Hi in faketime, ptr=%p\n", foo);
  // low part // high part
  foo[0] = 0x669f365d; foo[1] = 0x76543210;
  foo[2] = 0x2f530d4a; foo[3] = 0;
  return 0;
}
and then evaluated:
* (define-alien-type nil
  (struct mock-timespec
          (tv-sec (signed 64))
          (tv-nsec (signed 32))
          (pad (signed 32))))
NIL
* (defun tryit ()
    (with-alien ((ts (struct mock-timespec)))
      (alien-funcall (extern-alien "faketime"
                                   (function int int (* (struct mock-timespec))))
                     0 (addr ts))
      (format t "~x ~x ~x~%"
              (slot ts 'tv-sec)
              (slot ts 'tv-nsec) ...
