wanted: by-value structures in SB-ALIEN
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
SBCL |
Confirmed
|
Wishlist
|
Unassigned |
Bug Description
Passing and returning structures by value is not supported by SB-ALIEN.
As of 07.06.2019, there's a bounty on this feature: https:/
Changed in sbcl: | |
importance: | Undecided → Wishlist |
status: | New → Confirmed |
tags: |
added: alien removed: sb-alien |
Ekaterina Vaartis (vaartis) wrote : | #1 |
Drew Crampsie (drewc) wrote : | #2 |
Now? 02 January 2009 is not "now", it's over 10 years ago. Why do you point attention to it now? Has any progress been made over the last 10 years 5 months and 5 days?
If not, does inflation matter? $83.58 would be what my local bank says.
Also, "bunty" is probably not what you meant. :)
Stas Boukarev (stassats) wrote : | #3 |
What are you talking about?
Ekaterina Vaartis (vaartis) wrote : | #4 |
The bounty has been created just today. I posted this bounty here for anyone who would find this here and would want to claim the bounty.
Drew Crampsie (drewc) wrote : | #5 |
That's so very confusing, as the bounty page says :
"wanted: by-value structures in SB-ALIEN
02 January 2009 Posted by Nikodemus Siivola
Passing and returning structures by value is not supported by SB-ALIEN."
Can you please update it to say that you are offering the money, and this is not 2009 but 2019?
Can you see where people may get confused?
Ekaterina Vaartis (vaartis) wrote : | #6 |
Bountysource is a site where you offer money for solving bugs, this would be the only purpose for the bug to be there. The page there reflects the state of the bug on launchpad, i cannot alter neither the bug description, nor the date, as far as i know. The fact remains that the bounty is unclaimed and the feature is not implemented, the date does not mean much in the context, only the fact that it hasn't been implemented for ten years.
Ekaterina Vaartis (vaartis) wrote : | #7 |
Actually, i found a way to edit the bug description, so i will add the bounty note to the description.
description: | updated |
Drew Crampsie (drewc) wrote : | #8 |
Thank you for explaining that to me. Hopefully your editing will make people less confused about who is offering what and when. Many blessings.
Hey, already started working on this; I got the return value partially working. It seems like a lot of other code on sb-alien is guess-work, should i also guess optimiztions or should i require some flag when compiling?
Thanks in advance
Update here: I'm almost finished, problem is I need to call a function i made that fixes something on alien-funcall. I made prints but somehow they won't show when i run a test lisp program? i might need some help
Ekaterina Vaartis (vaartis) wrote : | #11 |
Did you try printing to a different stream with (print obj :output-stream stream)? Maybe stdout isn't set up there or something. Also, does anything that might show if it works work in there? Something like raising a condition?
I tried printing to different stream (stderr) to no avail; not sure what you mean by "raising a condition" but i can't think of a simple condition to conclude - i practically have 1 input in my control - the args, tried to do length mismatch but it's caught beforehand; the only hypothesis i have is that it might be optimized by a transform from aliencomp or something.. though i really am not knowledgeable in sbcl's optimizations to be honest.
so somehow the print shows on some contrib tests ... [visible confusion]
Update: somehow using defvar to extern-alien fixes the problem. I'll try to get rid of the cause root by the time I send the patch
Semester started so progress will be a bit slow, though it really is almost done
Changed in sbcl: | |
assignee: | nobody → g_o (go0) |
Changed in sbcl: | |
assignee: | g_o (go0) → nobody |
so i guess i'll just update that i got back to the issue and it partially works. what i have left is to deal with default alignment (not packing/inner padding), i.e: initial offset of the struct when passed. can anyone give me some pointers as to how to guess that offset on the sbcl side?
nevermind it seems like the optimization of passing args by registers is a little different for structs; and so i don't think it was really an offset problem. i'll try to fix it this week, hopefully it goes well and i'll finally have a patch
alright, so it works but now i get back to the problem i discussed with @vaartis where if i use dynamic binding (i.e: defvar/
Ekaterina Vaartis (vaartis) wrote : | #23 |
Could you perhaps upload your version somewhere with a simple test?
@vaartis i can send you my patch but the problem is that it's incredibly important to fix it because define-
anyway what i sent should suffice for pass and return, if you want to do both i can send an example file of how it'd be done under this patch from user standpoint. it's not a good solution though, so i really prefer an explanation as to why alien-funcall problematic for lexical vars and if it's a separate issue - open another bug.
to be clear: it doesn't matter *how* i change alien-funcall as far as i could tell, it just uses whatever i had previously. this last edge case allegedly can be solved simply by changing a let to a defvar in define-
- 0002-foreign-functions-added-struct-support-test.patch Edit (11.4 KiB, text/plain)
alright i think i fixed most of the problems
Changed in sbcl: | |
assignee: | nobody → g_o (go0) |
umm @vaartis mind giving some sort of feedback on the latest patch i sent?
Ekaterina Vaartis (vaartis) wrote : | #30 |
Sorry for not responding before. I'll try your patch tomorrow with the use case I had in mind initially and get back to you.
Ekaterina Vaartis (vaartis) wrote : | #31 |
- Test lisp file Edit (544 bytes, text/plain)
When building latest SBCL (d0243a9f9961f0
[314/314] src/code/last-file
; file: /home/vaartis/
; in: DEFUN SB-ALIEN:
; (T (ERROR "~S is not an alien function." SB-ALIEN:ALIEN))
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP:T
;
; compilation unit finished
; Undefined function:
; T
; caught 1 STYLE-WARNING condition
; printed 4 notes
While evaluating the form starting at line 23, column 0
of #P"/home/
However, afterwards the rest builds fine.
When trying the code you have and following the way you call the function in the tests, it seems like alien-funcall'ing the same externed function only works once, because afterwards the externed function value is modified to have an additional struct parameter (i'm trying a functon that returns a struct and has no arguments).
The tests do pass, but when I tried writing a function of my own and calling it, the data in the struct was all messed up. Not sure if i'm doing something wrong. The test lisp file is attached to this post. I'll also attach a C file to the following post since it seems like only one attachment is allowed.
Ekaterina Vaartis (vaartis) wrote : | #32 |
about the compilation pause - i did not experience it so that's weird, i'll see if i can replicate that.
about multiple calls - it should inject arguments to a copy but i probably forgot to change it in the one of the steps, i'll see what's up
about your own test - my first impression is that it's an edge case i didn't notice. i think i know what it is and it shouldn't take too long to fix
stay tuned i guess
- 0003-foreign-functions-added-struct-support-test.patch Edit (13.7 KiB, text/plain)
i think this should do the job
- 0004-foreign-functions-added-struct-support-test.patch Edit (13.5 KiB, text/plain)
cleaned up a bit and changed something tiny. i think it's done
Stas Boukarev (stassats) wrote : | #36 |
(defun alien-void-
(string-equal "(VALUES)" (princ-to-string type)))
doesn't look good...
Ekaterina Vaartis (vaartis) wrote : | #37 |
With the latest patch, the first value in the passed truct did work as expected. However, in the test file I attached before there were two values and the second one (y) is now always zero when passed into C. In fact, every second field is equal to zero, and the next field after them has their value. Seems like an issue with offsets, perhaps. When I added a third field, it had the value that the third field was supposed to have. This is all despite unsigned int and unsigned-int (in lisp) both being 32bits. Attached is a modified test lisp file. I will also attach the modifed C test file in the next post.
Another thing i've noticed is when I try to create a boolean field in lisp, i always get
debugger invoked on a TYPE-ERROR in thread
#<THREAD "main thread" RUNNING {1004AE0993}>:
The value
T
is not of type
CHARACTER
when binding CHAR
when declaring it
Ekaterina Vaartis (vaartis) wrote : | #38 |
Ekaterina Vaartis (vaartis) wrote : | #39 |
> third field was supposed to have
* the second field
@stassats oops forgot about that "^^, that hack should be easily replaceable though
@vaartis about the boolean - i'm not responsible for the typing system. i think i saw some
sbcl kludges that said that alien system doesn't really support less than byte types
(in case you're hoping for bitfield); i'm pretty sure i tested char,
try using it instead. in any case i think it might be out of the scope of the issue.
(i made no real distinction between types on my end anyways).
hmm about 32-bit types on 64 bit machine i'm not sure if there's an elegant way to solve
this; see, i let sbcl's existing argument packer deal with padding and offsetting.
i prefer being verbose as a workaround using prefered offsetting/
on gcc's end or on the lisp alien type end.
tbh at this point i had my fun with this issue and while it's nice to have i
don't care all that much about the beer money. so if anyone wanna improve this thing or try
something different he/she is very welcome, hopefully we'll share both.
i'm just really out of time for this issue, as another semester is coming my way
and with things being crazy as they are right now.
Changed in sbcl: | |
assignee: | g_o (go0) → nobody |
Rongcui Dong (rongcuid) wrote : | #41 |
- Milestone 1: prevent silent register/stack clobber Edit (1.2 KiB, text/plain)
Hello,
I would like to have a stab at this bug. I've spent a good part of last week to go through SBCL's source code, and I have 75% confidence that I can at least add support for struct passing/returning by argument for the few platforms I have experience on: ARM64/Darwin, X86-64 Windows/Unix, and Risc-V. However, since this will be a fairly large investment of time, I would like to break my patching into multiple milestones, so I can get some feedback from the devs, and patch up on my knowledge of the internals of SBCL (which I only used for a while).
I will try to make each patch I submit here stable enough to go into a release, gradually fixing problems and adding supports of platforms. My goal is that even if I cannot follow through the very end of this bug fix, each patch should leave the code base in a good state for other devs to pick up on my progress. I will also leave extensive documentation to all my changes.
## The patch
Attached in this comment is the first milestone. It's a very simple BREAKING CHANGE:
- Any attempt to define a foreign routine that passes or returns struct by value raises an error.
I consider this a necessary change that should be merged as soon as possible, even though it doesn't fix this bug. As of current release, when you define a function that passes or returns structs by value, SBCL generates assembly that can potentially corrupt the stack silently. This happen due to the following implementation:
- SB-ALIEN handles struct values by the RECORD alien type class
- The RECORD class inherits from this hierarchy: RECORD -> MEM-BLOCK -> ALIEN-VALUE -> SYSTEM-AREA-POINTER
- All platforms define how to pass a SYSTEM-AREA-POINTER as an argument
- No platform define how to pass a RECORD
- Therefore, RECORD is passed as a SYSTEM-AREA-POINTER
This behavior is almost always incorrect. It clobbers registers and stack by misaligning arguments and return values, which is potentially very dangerous. This patch throws an error to prevent this from happening. While I consider this the correct thing to do, please have a review and see if it breaks anything.
I did not add a test for this patch, as this milestone is only temporary, until I add concrete implementation to platforms.
## Some more proposals
These are some proposals which I have in mind, but since they will touch code that don't directly correlate to this bug, I think I should ask for permission to change anything:
1. The ARM64 c-call implementation is structured far differently from the ARM documentation, AAPCS64. This makes it pretty difficult to verify against the official documentation. May I refactor this part to more closely reflect the official docs?
2. I don't think RECORD should inherit from MEMORY-BLOCK. Actually, this class hierarchy is pretty problematic for the :ARG-TN and :RESULT-TN methods because there are so many platform-specific behavior. Perhaps they should be split into another class, or just not use OOP all together?
- Plus it took me about five days to figure out which method was called due to the explicit OOP system...
And that's all for the first patch.
Thank you,
Rongcui Dong
Rongcui Dong (rongcuid) wrote (last edit ): | #42 |
I am currently struggling with structures that can return two TNs in the :ARG-TN method. This can happen on ARM64 when a structure of size between 8 and 16 bytes is passed by value: it can take up two registers, or two slots on the stack.
However, currently each argument cannot seem to return multiple TNs. Would this mean I need to also change aliencomp.lisp's alien funcall optimizer? Or am I missing something obvious, such as an existing way to assign mutliple TNs to the same argument? Any guidance would help greatly.
EDIT:
I figured it out. I've implemented passing small structs by value on ARM64, and I'll put up a patch once I have full support of ARM64.
Christophe Rhodes (csr21-cantab) wrote (last edit ): | #43 |
Thanks. I'd like to take the milestone 1 patch, but I think I'd also like the test infrastructure around it if we can. What's the minimal possible working example that we can imagine? Something like:
struct foo {
int a;
int b;
};
struct foo bar(struct foo x) {
x.a ^= x.b;
x.b ^= x.a;
x.a ^= x.b;
return x;
}
and then
(define-alien-type foo (struct foo (int a) (int b)))
(with-alien ((foo foo))
(setf (slot foo 'a) 1 (slot foo 'b) 2)
(let ((rfoo (alien-funcall (extern-alien "bar" (function foo foo)) foo)))
(list (slot foo 'a) (slot foo 'b) (slot rfoo 'a) (slot rfoo 'b))))
expecting (1 2 2 1), corresponding to the C code
int main() {
struct foo foo;
foo.a = 1;
foo.b = 2;
struct foo rfoo = bar(foo);
printf("%d %d %d %d", foo.a, foo.b, rfoo.a, rfoo.b);
}
If I've understood all that correctly, could we have a test with something like this code that is marked as :fails-on :sbcl, as well as the patch in comment #41? (If I've not understood correctly, please help! :-)
Rongcui Dong (rongcuid) wrote : | #44 |
- struct-val.patch Edit (112.3 KiB, text/plain)
I have added preliminary pass-by-value struct for ARM64. Currently I've only tested passing struct as the first argument. There also seems to be some inefficient stack allocation, and the code can use a lot of refactoring.
This patch adds:
* Passing a struct aligned to 8-bytes by value as first argument
** Works for small structs (copied to registers) and large structs (copied to stack, then pass pointer)
* Test cases for the above
Major changes:
* In `aliencomp.lisp`, `%alien-funcall`, I added one additional return value, `preprocess-tns`, which contain functions which are called _before_ any arguments are copied. This is necessary due to ARM64 calling convention's Stage B, where arguments may be copied to memory and replaced by a pointer.
This change will not affect existing backends, but it is still in a common code path. If there's a better option, let me know!
Rongcui Dong (rongcuid) wrote : | #45 |
I cannot figure out how to make test cases for milestone 1. Neither `:fails-on` nor `assert-error` can catch the error, for some reason. I am guessing that's because it's at macro expansion time? I don't know.
Take this as reference:
(define-alien-type nil (struct tiny-align-8 (m0 (integer 64))))
(with-test (:name :struct-
(assert-error (define-
(assert-error (define-
(assert-error (define-
None of these errors are caught.
Rongcui Dong (rongcuid) wrote : | #46 |
I cannot find any way to catch the error, so I don't know how to write a test case for it.
To try out the error without merging from the later patch, just try this in REPL (after patching milestone 1):
(define-
This will throw an error that somehow cannot be caught by any handler. Maybe I just missed something.
Rongcui Dong (rongcuid) wrote : | #47 |
- 313202-milestone-1.patch Edit (6.9 KiB, text/plain)
Attached is the Milestone 1 patch with test cases. I reused tests from my feature branch, but simply asserted that defining routines will always fail.
However:
* Currently the test fixture isn't great at adding support on `:fails-on platform`, because I just asserted error
** I'll probably make the `(eval '(define...` changes in the development branch as well
* Error thrown is a SIMPLE-ERROR. I looked at CL's standard error types and didn't find a suitable one. Any suggestions?
Also, is `format-patch` the recommended way to submit patches? I grouped all of my commits into a single file, but not sure if I should do it.
The patch should be mergeable if there there are no problem.
Rongcui Dong (rongcuid) wrote : | #48 |
- 313202-milestone-2-preview.patch Edit (133.9 KiB, text/plain)
Attached is my first full implementation of struct pass by value on ARM64 backend. I have implemented everything in the ABI for passing arguments, except for scalable types, vector types, half float, quad floats, and quad integers, which are not supported by SBCL. However, the test framework has not been updated, and I will only consider this milestone "done" when the tests cover all of AAPCS64's cases.
Returning by value is not implemented yet.
Rongcui Dong (rongcuid) wrote : | #49 |
@csr21-cantab Sorry I missed the last sentence in your comment. The patch in #47 includes a few tests for Milestone 1.
Christophe Rhodes (csr21-cantab) wrote : | #50 |
Thanks.
What if we were to change the :arg-tn and :result-tn methods to error at runtime? I think it would be more in SBCL's normal style to do something like:
(define-
(declare (ignore type state))
(warn "Passing structs by value is unsupported on this platform.")
`(error "Passing structs by value is unsupported on this platform."))
instead? This should make it easier to test, too.
Rongcui Dong (rongcuid) wrote : | #51 |
How does this work? What you show returns a quoted error, but it gets returned into `make-call-
Rongcui Dong (rongcuid) wrote (last edit ): | #52 |
@csr21-cantab I think it would be difficult to make something as suggested in #50, based on my understanding of the backend.
Currently, the error is generated when defining an alien routine, which is called in the following order:
- define-
- make-call-out-tns
- :ARG-TN/:RETURN-TN -> error
However, to cause an error at runtime instead, this error must be emitted to the backend. So, it has to go through two stages.
When defining:
- define-
- make-call-out-tns
- :ARG-TN/:RETURN-TN, somehow returning the error, unevaluated
- make-call-out-tns returning this error
When running:
- %alien-funcall catching this error and throw it
- Or generate some sort of VOPs that throw an error
The main problem is that make-call-out-tns and :arg-tn/:return-tn are architecture-
Currently in my tests, I am wrapping all define-
Christophe Rhodes (csr21-cantab) wrote : | #53 |
Yeah, I agree, it's difficult. (I think we would have to do something weird like make a special kind of TN that errored if used. Almost certainly not worth it).
I've merged the patch from #47 (milestone 1). Thank you. format-patch and all-in-one-file is nice and easy for me to merge with just `git am`, so that works fine.
Stas Boukarev (stassats) wrote : | #54 |
The windows build fails with:
Passing structs by value is unsupported on this platform.
Backtrace for: #<HOST-
0: (HOST-SB-
1: (HOST-SB-
2: (INVOKE-DEBUGGER #<SIMPLE-ERROR "Passing structs by value is unsupported on this platform." {100578E8B3}>)
3: (ERROR "Passing structs by value is unsupported on this platform.")
4: (SB-ALIEN:
5: (SB-C:MAKE-
6: (SB-C::
7: (SB-C::
8: (SB-C:IR2-CONVERT #<SB-C:COMPONENT :NAME "DEFUN" {100543A3B3}>)
9: (SB-C::
10: (SB-C::
11: (SB-C::
12: (SB-C::
Christophe Rhodes (csr21-cantab) wrote : Re: [Bug 313202] Re: wanted: by-value structures in SB-ALIEN | #55 |
Stas Boukarev <email address hidden> writes:
> The windows build fails with:
>
> Passing structs by value is unsupported on this platform.
>
> Backtrace for: #<HOST-
> 0: (HOST-SB-
> 1: (HOST-SB-
> 2: (INVOKE-DEBUGGER #<SIMPLE-ERROR "Passing structs by value is unsupported on this platform." {100578E8B3}>)
> 3: (ERROR "Passing structs by value is unsupported on this platform.")
> 4: (SB-ALIEN:
> 5: (SB-C:MAKE-
> 6: (SB-C::
> :FUN #<SB-C::REF :LEAF #<SB-C::GLOBAL-VAR :%SOURCE-NAME
> SB-C:%ALIEN-FUNCALL :TYPE #1=#<SB-
> SB-ALIEN-
> :WHERE-FROM :DECLARED :KIND :GLOBAL-FUNCTION {10056B3F03}>
> {10056B4023}> :ARGS (#<SB-C::REF :LEAF #<SB-KERNEL:
> "GetFileAttribu
Thanks. Looks like some arguments should have been declared as pointers
to structs but aren't. I'll fix the compilation and verify that it's OK
with github's CI.
Christophe
Stas Boukarev (stassats) wrote : | #56 |
Will that unearth problems elsewhere?
Christophe Rhodes (csr21-cantab) wrote : | #57 |
> Will that unearth problems elsewhere?
Might there be code out there that mistakenly declares interfaces like
(define-
where they meant
(define-
and no-one noticed because sbcl does the right/wrong thing? Maybe. There was only a small amount of it in SBCL's code itself, for what it's worth: the diff that at least builds all cores is at <https:/
Rongcui Dong (rongcuid) wrote : | #58 |
In current Stable, (struct foo) is synonymous to (* (struct foo)) when passing arguments due to struct (or record) inheriting methods from SAP. On some platforms (like windows X86, I believe), this could work due to their ABI. However, this "working" is just accidental, and on many other platforms it will not work.
I am not familiar with Win32 API, but the last argument is supposed to be a struct pointer: https:/
Rongcui Dong (rongcuid) wrote : | #59 |
How is it possible that the expected error is not thrown:
https:/
::: UNEXPECTED-FAILURE :STRUCT-
"(EVAL
::: Running :STRUCT-
Is it the --without-sb-eval flag? How to get around this?
Christophe Rhodes (csr21-cantab) wrote : | #60 |
It's just the fact that it's being interpreted, not compiled. You can probably reproduce locally with
cd tests/
sh ./run-tests.sh --evaluator-mode interpret alien-struct-
Presumably this is because the function that is generated is not compiled and hence not macroexpanded.
We could probably force the issue by attempting to *call* the alien-routine as well as define it. Testing a fix now.
Rongcui Dong (rongcuid) wrote : | #61 |
%ALIEN-FUNCALL is defined under a `defoptimizer` block. I am not completely sure what it does, but does that contribute to the difference in interpreted vs compiled mode? So, in compiled mode, MAKE-OUT-CALL-TNS is called at compile time while in interpret mode it's called at run time?
Christophe Rhodes (csr21-cantab) wrote : | #62 |
Basically, yes: the only guarantee about make-call-out-tns is that it must be called by the time the alien function itself is called.
(define-
(defun foo ()
(with-alien ((...))
(values (alien-funcall ...))))
If you are in compiled mode, then this function body gets compiled, including macroexpansion all the optimizations that happen at compile-time for things like alien-funcall.
If you are in interpret mode, then this function body is simply saved, and only when (foo) is called does any macroexpansion happen. So at the point of (eval '(define-
I adjusted the test to do both definition and call, for now.
Rongcui Dong (rongcuid) wrote (last edit ): | #63 |
- 313202-milestone-2-preview.patch Edit (171.0 KiB, text/plain)
I am almost done with the ARM64 support, but I encountered a very strange crash, and I think I need some help here. Attached is the current progress. You can run the alien-struct-
The crash happens when a large struct is passed by value and not as the first argument. A memory fault is generated at -0x7, in a lambda returned from the preprocessing step:
- PRE-PADDING-
- In MAKE-CALL-OUT-TNS, this PP-G is called with the current frame pointer offset, emitting another closure, PP
- In %ALIEN-FUNCALL, PP is called to emit assembly code required for copying the struct to stack
However, the generated assembly contains something very strange:
; 1E0: BF2A00B9 STR WZR, [THREAD, #40] ; pseudo-atomic-bits
; 1E4: BE2E40B9 LDR WLR, [THREAD, #44] ; pseudo-atomic-bits
; 1E8: 5E0000B4 CBZ LR, L2
; 1EC: 200120D4 BRK #9 ; Pending interrupt trap
; 1F0: L2: 43915FF8 LDR NL3, [R0, #-7]
; 1F4: E00303AA MOV NL0, NL3
See 1F0 here. This caused the memory fault. I don't understand why R0 is here, or what it represents, because this assembly seems to happen before my copying code, or MOVE-STRUCT-
Stas Boukarev (stassats) wrote (last edit ): | #64 |
I have no idea what's wrong with the code above. But just style notes.
You can use LDP instead of two LDR.
You don't need (load-store-offset 8), 8 is an immediate.
Don't put non-ascii stuff like ® into the code.
Rongcui Dong (rongcuid) wrote : | #65 |
I'll try to use LDP once I deal with the problem. I also don't see any ® so I am not sure what you are referring to? I know there's @ which is an address offset. I can clean up load-store-offset for anything that are immediates, though.
I triaged the problematic code to be generated from my vop MOVE-TO-STACK, but I am still figuring out why it generates wrong code only when the struct is not the first argument. Somehow, when struct is the first arg, it uses R1 instead of R0 for the crashing instruction, which doesn't result in a crash.
Rongcui Dong (rongcuid) wrote : | #66 |
The offending code is actually generated even if I don't define anything in the body of the vop:
(define-vop (move-to-stack)
(:args (from-ptr :scs (sap-reg)) ;; NL2
(fp :scs (any-reg))) ;; NL1
(:info size fpoff)
(:temporary (:scs (non-descriptor
(:generator 0))
Stas Boukarev (stassats) wrote : | #67 |
I don't know how to link to ®, so just grep for it?
Rongcui Dong (rongcuid) wrote : | #68 |
I figured out the bug. It was my mistake because I pushed some list in the wrong order.
I found the ®, it's in the comments. I've replaced it with (R).
Rongcui Dong (rongcuid) wrote : | #69 |
- Milestone 2 Edit (211.6 KiB, text/plain)
This is the Milestone 2 patch. I think this patch can be reviewed and prepare to merge.
This patch includes:
- Passing struct by value on ARM64, including Darwin
- Test cases for struct passing
What it currently lacks:
- Returning struct by value
- Bulletproof handling of alignment-adjusted types
Other notes:
- I highly doubt varargs is implemented correctly for ARM64. I didn't implement it, and I don't see any code handling them. In addition, there is no test case on variadic foreign functions that I know of.
Stas Boukarev (stassats) wrote : | #70 |
Varargs work on arm64, so I doubt your doubts.
Stas Boukarev (stassats) wrote : | #71 |
And they don't need a test because they are used throughout the runtime and contribs.
Rongcui Dong (rongcuid) wrote (last edit ): | #72 |
- 313202-milestone-2.zip Edit (81.0 KiB, application/zip)
Interesting. I'll take your word, then. I removed the comments about them.
Rongcui Dong (rongcuid) wrote : | #73 |
I have two failed tests. I will fix them
Rongcui Dong (rongcuid) wrote : | #74 |
- 0001-Bugfix-313202-Milestone-2.patch Edit (51.1 KiB, text/plain)
This patch passes all tests, on my computer at least.
Charles (karlosz) wrote : | #75 |
In aliencomp.lisp, you added reader conditionals for arm (not arm64!) guarding preprocess-tns. Is that intentional?
Rongcui Dong (rongcuid) wrote : | #76 |
In aliencomp.lisp, the arm conditionals were there originally. Since it flips the argument order, and thus tns, and I am adding a preprocess list, I need to flip the preprocess list as well.
Rongcui Dong (rongcuid) wrote : | #77 |
This is just a side note thing. I find that the %ALIEN-FUNCALL routine quite difficult to go through, because although it's a common path for all the architectures, it contains lots of architecture-
For instance, currently ARG-TNS can be a function, a TN, or a list, depending on the arch and type. However, for a "general" implementation, a function is strictly more powerful than a list or a name. Perhaps keeping only the TN for convenience and the function for advance emitting would make it easier to port to other architectures. Similar can be said for RESULT-TNS, where there are many architecture-
Of course, a refactor like this will affect code in every architecture, so I am not rushing to do it. However, I do think it will be a good thing to do at some point.
Rongcui Dong (rongcuid) wrote : | #78 |
Return by value proves to be much more complicated than passing by value. I will stop working on that for now and move to X86-64 struct pass by value.
The main difficulty is because a return struct could have unlimited extent. SBCL, as far as I know, doesn't have garbage-collected alien objects/system memory, so I am unsure how to deal with the return semantics. Should a returned value stay on (lisp) stack and go out of scope on return? Or should it be inside system area and require freeing? Neither seem very safe to me.
Rongcui Dong (rongcuid) wrote : | #79 |
- 0002-Removed-some-debug-comments.patch Edit (1005 bytes, text/plain)
I removed some leftover debug comments.
Charles (karlosz) wrote : | #80 |
Having the returned value go on the Lisp stack makes the most sense to me: This matches how strict by values are treated in C, and we don’t try to protect against use-after-free for heap allocated points passed to Lisp either. Users could safely use the return value by binding it to an alien declared via with-alien which copies the return value slots into the alien with a lexically delimited scope.
Also +1 for refactoring the alien code. You may want to do that before porting to x86-64 to save some work: ideally every supported architecture should support the same alien primitives to avoid platform feature conditional explosion.
Rongcui Dong (rongcuid) wrote : | #81 |
Ok. How do I test whether my refactors break other architectures? I don't have every machine on the planet. Can I make a PR or something to use the CI?
There's also something I cannot figure out, and earlier an email sent to the sbcl-devel mailing list somehow got sent back because it couldn't be delivered, so I have to ask here.
SBCL represents alien structs as a SAP. Therefore, when I return a struct by value, I have to store the returned struct somewhere that can be pointed to by the SAP. However, I don't really know where. Currently, I've tried pushing the value to control stack, but apparently the wrong value was read back to lisp. Perhaps control stack is not the right place for this? From CMUCL documentation, it's used for garbage collected lisp objects.
I think number stack is the right place to push, but then I don't know how to push it without clobbering the caller's stack frame. I don't think I can just write something to NSP and increment it, because then when the caller returns its frame pointer would be reset incorrectly.
If I need to make the caller allocate stack space, where should I look for? Or alternatively, is CSP the right thing to use, or is there already a way to return something from stack?
Stas Boukarev (stassats) wrote : | #82 |
Can we not refactor anything? I equate refactoring with "let's break code without any benefits".
Rongcui Dong (rongcuid) wrote : | #83 |
Rongcui Dong (rongcuid) wrote : | #84 |
I disagree with that view. If refactor breaks things, it means that tests are not good enough. Easing maintenance is a very good benefit: I cite https:/
If you really don't want that, I can skip it, but I personally don't think that's a good idea. The original backend code has a lot of architecture specific code mixed into the common path, and it's quite a lot of work to reverse engineer their mini-DSLs embedded as lists. You can have a look at my draft PR for a comparison. I've only done it for ARM64 and X86 with an exact translation, and I am using the PR only for the CI.
I am going to pause until we reach an agreement of what to do, but the ARM64 patches are complete and can be tested. The patches I sent do not include the refactor.
Charles (karlosz) wrote : | #85 |
I agree, it’s absurd to ask for not refactoring anything when it’s directly beneficial for adding a large feature.
The number stack should be used for allocating space for struct return values. You can’t have random numbers on the descriptor stack, especially for precise platforms like arm, where the GC could treat them as Lisp pointers.
It’s okay to use the number stack writing to the stack and incrementing the NSP. The NSP gets reset correctly to the NFP. Just don’t change the NFP and things should work.
Stas Boukarev (stassats) wrote : | #86 |
Of course you would disagree, because your refactorings is how I came to that conclusion.
Christophe Rhodes (csr21-cantab) wrote : | #87 |
> If refactor breaks things, it means that tests are not good enough.
OK. I'm perfectly happy to go on record to say that our tests are not good enough, in general, but certainly in the specific area of %alien-funcall.
That doesn't mean that refactorings / reworkings / rearrangements / rearchitecting can't happen, but it does raise the burden on the refactorer, and they carry with them an implicit promise to deal with any subsequently-
As for how to test: the cfarm compile farm project offers ssh access to hosts on a variety of architectures and operating systems; it should be possible to build and test things there. Github CI is inadequate for this, as it does not cover most supported architectures or most supported operating systems.
I'd strongly advise first auditing the existing tests (and possibly the implicit tests from e.g. loading and testing systems in quicklisp) to get a sense for what documented functionality is tested at all, what is not, and how big a job it is to write tests to cover a reasonable amount of the difference.
Charles (karlosz) wrote : | #88 |
Re: testing and CI: there's a cross-build runner you (Rongcui) can use which tests if you break building for other architectures (the parts which only exercise the cross-compiler). You can also push to a local fork if you want github CI test runners to work. Unfortunately, this doesn't actually run the tests on architecures beside x86-64 and arm64. You would have to do that yourself. I can recommend requesting access for the gcc farm machines which are available to free software porters if you want machines; they have machines for every platform available. It's annoying for sure, but I think that's the best you can do to make sure you don't break code on other machines. Of course, the less tested platforms naturally bitrot anyway.
Stas, I appreciate your judgements on my own work, but I don't know how valuable that is to the general discussion. To say "don't refactor" to someone who's clearly spent a lot of work in the code and gotten a big and involved feature working on one platform who has insights on what exactly is making it difficult to write the code to port to other architectures doesn't make much sense in this case. I don't know if I could convince you, maybe this needs to come from someone else.
Rongcui Dong (rongcuid) wrote : | #89 |
Would you mind elaborating on my refactors? I am open to API change if you think it's not a good encapsulation. The PR is a draft PR, and is meant to break. I have only two archs done so far, and I am merely using the CI for machines I don't have.
Stas Boukarev (stassats) wrote : | #90 |
It's not a judgement. Writing code without bugs is nigh impossible. So refactoring introduces bugs but provides no benefits of faster/smaller code.
Stas Boukarev (stassats) wrote : | #91 |
>Would you mind elaborating on my refactors?
Why was multiple values changed to a list?
>and I am merely using the CI for machines I don't have.
The CI can be run in a personal fork, which I do all the time. Pull requests seem to eat up Cirrus CI compute credits. I'll have to investigate how to disable that.
Rongcui Dong (rongcuid) wrote : | #92 |
I am not going to touch anything outside ALIEN-FUNCALL, so hopefully I don't need to audit everything else... Also, I will make test cases for anything I add as thoroughly as possible.
If I were to continue, I will still start with Github CI, as it's the simplest thing to use. I'll take the advice of machine farms.
Rongcui Dong (rongcuid) wrote : | #93 |
> Why was multiple values changed to a list?
This is because every architecture has a different set of requirements. For instance, only ARM64 currently has a preprocessing step. Only X86 needs to set/reset FPU state. I am using destructuring-bind with keywords to make intention clear. Changing back to multiple-
I have no problem with using values. I start with readable code when refactoring, and should any performance issue arise, I can optimize.
> Pull requests seem to eat up Cirrus CI compute credits.
I see. I can move to personal fork for testing.
Stas Boukarev (stassats) wrote : | #94 |
Multiple values can be used with keyword arguments too.
Rongcui Dong (rongcuid) wrote : | #95 |
> Multiple values can be used with keyword arguments too.
Can they? I don't see it mentioned in CLHS
Stas Boukarev (stassats) wrote : | #96 |
(multiple-
Rongcui Dong (rongcuid) wrote : | #97 |
> (multiple-
Ok, I see. That's an easy enough change.
So, do I continue with the refactor?
Rongcui Dong (rongcuid) wrote : | #98 |
@karlosz
> The number stack should be used for allocating space for struct return values
Does this actually work? I tried to push the return value by pushing to number stack and decrementing NSP in ALIEN-FUNCALL. It works when I have a C function of form `struct foo return_foo(void);`, but whenever I add an argument SBCL will crash due to memory fault. The PC pointer looks nothing like the disassembly suggests, so I assume that the crash happened after the alien function returns.
Basically, is this correct?
* Lisp function
* Calls alien routine
** Call C function
** Push result to number stack
* ???
That is, if I push the number stack _inside_ the alien function, is that still valid after the function returns?
Charles (karlosz) wrote : | #99 |
No, I don't think allocating space on the number stack only inside ALIEN-FUNCALL can work. There needs to be coordination with the caller so that the values can be copied before return.
I think you have to have the caller allocate the space for the return-type as well, which unfortunately means you have to restrict alien-funcall for struct return types to only work when the alien function type is known. This is OK because the majority of alien-funcall uses fall under this case. (For the case where you don't know the type in advance, alien-funcall checks the return type and if it's a struct, it barfs).
Look at the transforms for alien-funcall in aliencomp.lisp. You can analyze the struct return type there and allocate the proper amount of space on the number stack there, which is going to be the real backing sap. The sap to that space can then be passed for alien-funcall to copy the struct it gets from C into the backing storage.
Does that plan make sense?
Yukari Hafner (shinmera) wrote : | #101 |
Hey, really excited to finally see this make some headway!
I wanted to talk about an additional use-case that I need, and which not even cffi-libffi can help me with: structs-by-value in callback arguments and returns.
I've so far gotten by without needing this, but I recently started on the Framebuffers [1] project, which I would like to have a backend to Apple's Cocoa library. Apple looooves callbacks, so you have to implement a bunch of those to change and implement certain behaviours. Some of those callbacks look like this:
struct rect update_rect(struct rect input)
Which in CFFI would be defined something like this:
(cffi:
So I need to be able to return a struct to C somehow.
For the struct-returns of a call, CFFI usually gives you a "translated" version of the struct in the form of a plist. That way lifetime questions are avoided, since it's just a regular lisp object. I'm not sure if sb-alien should take the same approach though.
Rongcui Dong (rongcuid) wrote : | #102 |
There are two transforms:
deftransform alien-funcall ((function &rest args)
deftransform alien-funcall ((function &rest args) * * :node node)
Which one should I use?
Charles (karlosz) wrote : | #103 |
You should use the second one, that tells what the function type is at compile time so you can decide how much space to allocate on the number stack at compile time in the caller. This space can then be passed via a SAP to %alien-funcall which does the copying into that region when it gets the returned struct from C. This is how you extend the lifetime of the stack memory to the duration of the caller frame.
Rongcui Dong (rongcuid) wrote : | #104 |
Also, the crash was caused by misaligned stack. My original implementation which pushed NSP from %ALIEN-FUNCALL "worked", but is that just accidental?
Charles (karlosz) wrote : | #105 |
I'd have to check again whether on arm64 (or the other platforms where the Lisp stack is not the number stack), function calls tear down the number stack frame. It might be the case that on arm64, its OK. But on x86-64, the C stack is the Lisp stack, so when %ALIEN-FUNCALL returns that memory definitely no longer protected by the stack pointer, so it would be dangerous.
Rongcui Dong (rongcuid) wrote : | #106 |
I see. I am still reading the transform function and trying to figure out how to reserve stack space there. Do you have some quick pointers/examples for doing so?
For callbacks, I may look into them later. I would guess that they are a bit easier, but I'll of course need to figure out how to copy struct data from C to Lisp.
Charles (karlosz) wrote : | #107 |
Look at the %alien-funcall ir2-convert method to get a feeling of how the number stack is manipulated by %alien-funcall itself. You would have to lift up some machinery from there to get it to work in the caller. You can see that the number stack space is reset there once %alien-funcall returns, so probably how you have it working is only accidental.
For an initial prototype, you can reserve the space in the caller with (%primitive alloc-alien-
Charles (karlosz) wrote : | #108 |
MAKE-LOCAL-ALIEN is how WITH-ALIEN works, by the way. (which reserves number stack space in the current frame)
Rongcui Dong (rongcuid) wrote : | #109 |
> You can see that the number stack space is reset there once %alien-funcall returns, so probably how you have it working is only accidental.
Actually, I am aware that %alien-funcall resets NSP (if you are referring to dealloc-
I will see what make-local-alien does.
Rongcui Dong (rongcuid) wrote : | #110 |
What's the difference between SB-C::VOP and SB-C::%PRIMITIVE? I was doing exactly what make-local-alien does, except with VOP. I also see a FIXME saying that %PRIMITIVE means the same thing as VOP?
Charles (karlosz) wrote : | #111 |
I think what you're doing at that line looks fine then, actually. As long as the NSP protects it, the memory is valid. It will get cleaned up when the Lisp function returns (see e.g. compiler/
Charles (karlosz) wrote : | #112 |
VOP and %PRIMITIVE work at different levels. %PRIMITIVE is for normal CL code that is written, while VOP is used as part of codegen methods (hence it takes IR2 structures as part of its arguments, which %PRIMITIVE doesn't do; those are supplied by the codegen phase of the compiler)
Rongcui Dong (rongcuid) wrote : | #113 |
> It will get cleaned up when the Lisp function returns (see e.g. compiler/
That's the part I am not 100% certain about: does this happen at the end of an alien call as well? Or is %alien-funcall all there is? Or, are alien functions wrapped inside a lisp function call which I am not seeing in %alien-funcall routine?
Charles (karlosz) wrote : | #114 |
The primitive %ALIEN-FUNCALL is not wrapped in its own Lisp function when the function type is known; this is what the transform does (it changes a call to ALIEN-FUNCALL directly into that sequence of steps in the caller via %ALIEN-FUNCALL, which is compiled as a primitive, not as a function call). However, when the function type is not known, the primitive is wrapped by the Lisp function ALIEN-FUNCALL. That's why I mentioned that it's probably only possible to do struct returns when the function type is known to the compiler.
Rongcui Dong (rongcuid) wrote : | #115 |
I see. But doesn't the compiler always know the function types? DEFINE-
Charles (karlosz) wrote : | #116 |
No. You could do something like (defun f (x) (alien-funcall x ...)). So yes, the compiler knows what the types are for a given alien function, but it doesn't necessarily know what alien function is being alien-funcalled. So if you did (f <alien-
(defun bar (x)
(alien-funcall (extern-alien "returns_struct" ...) x))
that gets rewritten by the compiler since it knows the alien function being invoked and hence th return type (and so it can rewrite alien-funcall to the stuff in the transform with %alien-funcall; now it works to allocate space in the caller). Does that make sense?
Rongcui Dong (rongcuid) wrote : | #117 |
I see. Thank you very much.
Rongcui Dong (rongcuid) wrote : | #118 |
- 0004-ARM64-struct-return-by-value.patch Edit (24.5 KiB, text/plain)
This patch adds the ability to return structure by value on ARM64.
Rongcui Dong (rongcuid) wrote : | #119 |
- 0005-Document-memory-layout-for-return-struct.patch Edit (1.6 KiB, text/plain)
Updated doc strings for return by value
Charles (karlosz) wrote : | #120 |
Could you update the manual as well (if you haven't already?) It still says struct pass by value is unsupported.
Rongcui Dong (rongcuid) wrote : | #121 |
- 0006-Update-manual.patch Edit (948 bytes, text/plain)
Maybe more can be explained, but I'll just add a few sentence.
Rongcui Dong (rongcuid) wrote : | #122 |
Summer semester starts so my work on this will be slow. My next target is alien callbacks on ARM64 to complete full support for ARM64. Then, I will move on to X86-64.
Rongcui Dong (rongcuid) wrote : | #123 |
I no longer have time to work on this issue, so I will stop development for now. I will still provide any help if existing patches need to be merged.
There's now a bunty on this: https:/ /www.bountysour ce.com/ issues/ 75202399- wanted- by-value- structures- in-sb-alien.