VOP for product of unsigned-reg and constant isn't used

Bug #1946470 reported by Hugo Sansaqua
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
SBCL
New
Undecided
Unassigned

Bug Description

Here my focus is on SBCL 2.1.9 for x86-64.

The following function multiplies unsigned-reg and constant, and returns the 63rd bit.

(defun test (x)
  (declare (optimize (speed 3) (safety 0))
           ((unsigned-byte 64) x))
  (ldb (byte 1 63) (ldb (byte 64 0) (* x (expt 10 15)))))

I think sb-vm::fast-*-c/unsigned=>unsigned is intended to be used in this kind of situation. Looking at the disassembly for TEST, however, that's not what happens:

;; Size: 35 bytes. Origin: #x52DE0272 ; TEST
;; 72: 48BB0080C6A47E8D0300 MOV RBX, 1000000000000000
;; 7C: 488BC2 MOV RAX, RDX
;; 7F: 48F7E3 MUL RAX, RBX
;; 82: 488BC8 MOV RCX, RAX
;; 85: 48C1E93F SHR RCX, 63
;; 89: 48D1E1 SHL RCX, 1
;; 8C: 488BD1 MOV RDX, RCX
;; 8F: 488BE5 MOV RSP, RBP
;; 92: F8 CLC
;; 93: 5D POP RBP
;; 94: C3 RET

https://github.com/sbcl/sbcl/blob/20790cb3013cebd2d48e0ef716e93ea8ba5634d8/src/compiler/x86-64/arith.lisp#L515
As shown in this line, the cost for the "constant" VOP is equal to the "non-constant" one. So I reduced the former cost by 1 and the disassembly became as intended:

;; Size: 29 bytes. Origin: #x53639042 ; TEST
;; 42: 488BC2 MOV RAX, RDX
;; 45: 48F725CCFFFFFF MUL RAX, [RIP-52] ; [#x53639018] = #x38D7EA4C68000
;; 4C: 488BC8 MOV RCX, RAX
;; 4F: 48C1E93F SHR RCX, 63
;; 53: 48D1E1 SHL RCX, 1
;; 56: 488BD1 MOV RDX, RCX
;; 59: 488BE5 MOV RSP, RBP
;; 5C: F8 CLC
;; 5D: 5D POP RBP
;; 5E: C3 RET

Currently other "constant" VOPs for four arithemtic operations have less cost by 1 than their "non-constant" equivalents. My question is, is there any reason why this VOP is special?

https://github.com/sbcl/sbcl/commit/2230ea0c1765a95fd2aa0a8996b3555b93ba3745
These two VOPs, fast-*/unsigned=>unsigned and fast-*-c/unsigned=>unsigned, were independently added in this commit.

Revision history for this message
Stas Boukarev (stassats) wrote :

I'm not sure loading constants from memory is a good idea, though.

Revision history for this message
Hugo Sansaqua (privet-kitty) wrote :

https://gist.github.com/privet-kitty/12d301145803faf175490e33b397fa4e

Thanks, I did a micro-benchmark as above and confirmed the current setting is better at least in my environment.

with fast-*/unsigned=>unsigned:
; disassembly for TEST1
; Size: 57 bytes. Origin: #x533D0779 ; TEST1
; 79: 31DB XOR EBX, EBX
; 7B: 31C9 XOR ECX, ECX
; 7D: EB25 JMP L1
; 7F: 90 NOP
; 80: L0: 488BC1 MOV RAX, RCX
; 83: 48D1F8 SAR RAX, 1
; 86: 48BA0780C6A47E8D0300 MOV RDX, 1000000000000007
; 90: 48F7E2 MUL RAX, RDX
; 93: 48C1E83F SHR RAX, 63
; 97: 48D1E0 SHL RAX, 1
; 9A: 4831D8 XOR RAX, RBX
; 9D: 488BD8 MOV RBX, RAX
; A0: 4883C102 ADD RCX, 2
; A4: L1: 4839F1 CMP RCX, RSI
; A7: 7CD7 JL L0
; A9: 488BD3 MOV RDX, RBX
; AC: 488BE5 MOV RSP, RBP
; AF: F8 CLC
; B0: 5D POP RBP
; B1: C3 RET
Evaluation took:
  3.900 seconds of real time
  3.901911 seconds of total run time (3.895302 user, 0.006609 system)
  100.05% CPU
  8,178,361,002 processor cycles
  0 bytes consed

with fast-c-*/unsigned=>unsigned:
; disassembly for TEST2
; Size: 51 bytes. Origin: #x533D1339 ; TEST2
; 39: 31DB XOR EBX, EBX
; 3B: 31C9 XOR ECX, ECX
; 3D: EB1F JMP L1
; 3F: 90 NOP
; 40: L0: 488BC1 MOV RAX, RCX
; 43: 48D1F8 SAR RAX, 1
; 46: 48F725CBFFFFFF MUL RAX, [RIP-53] ; [#x533D1318] = #x38D7EA4C68007
; 4D: 48C1E83F SHR RAX, 63
; 51: 48D1E0 SHL RAX, 1
; 54: 4831D8 XOR RAX, RBX
; 57: 488BD8 MOV RBX, RAX
; 5A: 4883C102 ADD RCX, 2
; 5E: L1: 4839F1 CMP RCX, RSI
; 61: 7CDD JL L0
; 63: 488BD3 MOV RDX, RBX
; 66: 488BE5 MOV RSP, RBP
; 69: F8 CLC
; 6A: 5D POP RBP
; 6B: C3 RET
Evaluation took:
  4.240 seconds of real time
  4.246100 seconds of total run time (4.246071 user, 0.000029 system)
  100.14% CPU
  8,899,773,071 processor cycles
  0 bytes consed

I'm a bit curious to see if this will happen for other "constant" VOPs as well, but for this issue I think the status quo is fine.

Revision history for this message
Stas Boukarev (stassats) wrote :

But it should load small constants directly, that part is true.

Revision history for this message
Hugo Sansaqua (privet-kitty) wrote :

Let me clarify my understanding. You mean `imul reg, reg, imm32' can be used for some constants?

Revision history for this message
Stas Boukarev (stassats) wrote :

Yes, signed 32 bits fit into one instruction.
You can see what gcc or clang generate:
https://godbolt.org/z/MMT5Mojf8

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.