11.2.0-rc1: llvmpipe tests fail if built on skylake

Bug #1549849 reported by Timo Aaltonen
This bug affects 1 person
Affects        Status        Importance  Assigned to    Milestone
Mesa           Confirmed     Medium
mesa (Ubuntu)  Fix Released  Undecided   Timo Aaltonen

Bug Description

If built on Broadwell, all is well. 11.1 was fine, so this is a regression in 11.2.x.

Timo Aaltonen (tjaalton)
Changed in mesa (Ubuntu):
assignee: nobody → Timo Aaltonen (tjaalton)
status: New → Triaged
Timo Aaltonen (tjaalton) wrote:

Building on Skylake fails in the llvmpipe tests:

make check-TESTS
make[5]: Entering directory '/«PKGBUILDDIR»/build/src/gallium/drivers/llvmpipe'
make[6]: Entering directory '/«PKGBUILDDIR»/build/src/gallium/drivers/llvmpipe'
PASS: lp_test_printf
../../../../../bin/test-driver: line 107: 32508 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: lp_test_conv
../../../../../bin/test-driver: line 107: 32509 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: lp_test_blend
../../../../../bin/test-driver: line 107: 32513 Illegal instruction (core dumped) "$@" > $log_file 2>&1
../../../../../bin/test-driver: line 107: 32511 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: lp_test_arit
FAIL: lp_test_format

Broadwell is fine, and using the same llvm version (3.8-rc) with 11.1.x works fine, so this is a regression in the 11.2 branch.

Changed in mesa:
importance: Unknown → Medium
status: Unknown → Confirmed
Sroland-vmware (sroland-vmware) wrote:

Could you show the instruction where it crashed (and the disassembly)?

Timo Aaltonen (tjaalton) wrote:

how exactly? I've tried gdb:

(gdb) run
Starting program: /home/tjaalton/src/pkg-xorg/lib/mesa.git/build/src/gallium/drivers/llvmpipe/lp_test_format
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Testing PIPE_FORMAT_B8G8R8A8_UNORM (float) ...

Program received signal SIGILL, Illegal instruction.
0x00007ffff7ff5004 in ?? ()
(gdb) bt
#0 0x00007ffff7ff5004 in ?? ()
#1 0x0000000000000000 in ?? ()
(gdb) bt full
#0 0x00007ffff7ff5004 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.

Timo Aaltonen (tjaalton) wrote:

Also, compiz doesn't run on SKL/KBL when using llvmpipe; it keeps restarting with 'trap: invalid opcode'. I guess these are related.

Sroland-vmware (sroland-vmware) wrote:

(In reply to Timo Aaltonen from comment #2)
> how exactly? I've tried gdb:

Usually you could use x/i <address> if it's in jit code when gdb can't figure out the function (or just follow up to the caller and disassemble from there). But it looks like the stack got smashed, so I don't know if that really would provide much insight.
Is that a debug build?

Timo Aaltonen (tjaalton) wrote:

llvm-3.8 misdetects Skylake features; this is fixed in the 3.9 snapshot.

Jfonseca-e (jfonseca-e) wrote:

It's not the first time LLVM misidentifies modern CPUs.

I thought that all the logic in src/gallium/auxiliary/gallivm/lp_bld_misc.cpp for setting +/-foo mattrs would save us from this sort of grief.

On the other hand, I suppose that actually knowing the exact CPU model allows it to better model instruction latency/throughput.

Sroland-vmware (sroland-vmware) wrote:

(In reply to Jose Fonseca from comment #6)
> It's not the first time LLVM misidentifies modern CPUs.
>
> I thought that all the logic in
> src/gallium/auxiliary/gallivm/lp_bld_misc.cpp for setting +/-foo mattrs
> would save us from this sort of grief.

For features we already know about (I think I even mentioned that back then, hoping it wouldn't be a problem)...
If I look at the list of skylake features, I'd nearly bet the winner is avx512 (and/or any subvariant).

Timo Aaltonen (tjaalton) wrote:

Actually it wasn't avx512, that was the first one I tried :) It's enabled also on 3.7 and that version works fine. Only one that was added in 3.8 is PKU, but dropping just that didn't help.

I did try dropping all non-client features (AVX512, CDI, DQI, BWI, VLX, PKU) and that worked. Maybe one of CDI/DQI/BWI/VLX is somewhat broken on 3.8?
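One way to cross-check an experiment like this is to compare the dropped features against what the kernel itself reports for the CPU. The following is only a minimal sketch: the mapping from the names above to `/proc/cpuinfo` flag spellings (AVX512 to `avx512f`, CDI to `avx512cd`, and so on) is an assumption, the helper names are hypothetical, and the `SAMPLE` text is illustrative rather than captured from the affected machine.

```python
# Assumed /proc/cpuinfo spellings for the features dropped in this comment
# (AVX512, CDI, DQI, BWI, VLX, PKU). The mapping is an assumption.
SERVER_ONLY_FLAGS = ("avx512f", "avx512cd", "avx512dq", "avx512bw", "avx512vl", "pku")


def reported_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def server_flags_present(cpuinfo_text):
    """List which of SERVER_ONLY_FLAGS appear in the text, in a fixed order."""
    flags = reported_flags(cpuinfo_text)
    return [name for name in SERVER_ONLY_FLAGS if name in flags]


# Illustrative input only; not taken from the reporter's system.
SAMPLE = "model name\t: example client CPU\nflags\t\t: fpu sse2 avx avx2 bmi2 fma\n"

if __name__ == "__main__":
    print(server_flags_present(SAMPLE))
```

On a Linux host the same function can be fed the contents of `/proc/cpuinfo`. An empty list there, combined with a JIT that still emits those instructions, would point at feature misdetection rather than at broken code generation.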

Sroland-vmware (sroland-vmware) wrote:

(In reply to Timo Aaltonen from comment #8)
> Actually it wasn't avx512, that was the first one I tried :) It's enabled
> also on 3.7 and that version works fine. Only one that was added in 3.8 is
> PKU, but dropping just that didn't help.
>
> I did try dropping all non-client features (AVX512, CDI, DQI, BWI, VLX, PKU)
> and that worked. Maybe one of CDI/DQI/BWI/VLX is somewhat broken on 3.8?

Which is why I said "or any subvariant" ;-).
ERI, CDI, PFI, DQI, BWI, VLX are all avx512 variants (omg naming???), though skylake in the llvm 3.8 list doesn't support ERI and PFI. I'm not sure, but dropping avx512 manually while an enhanced variant still gets enabled probably won't do anything. I don't think PKU would matter (but no guarantee...). I suppose we should explicitly disable all of them via mattrs too (not that it's a battle we can win, there will be some extensions at some point...).
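The idea of explicitly disabling every variant can be sketched as building a `+feature` / `-feature` attribute list. This is a rough illustration under assumptions, not the code in lp_bld_misc.cpp: the feature spellings (`avx512f`, `avx512cd`, `avx512er`, `avx512pf`, `avx512dq`, `avx512bw`, `avx512vl`) are assumed to match LLVM's X86 feature names, and `build_mattrs` is a hypothetical helper.

```python
# Assumed LLVM X86 feature names for AVX-512 and its variants
# (F, CDI, ERI, PFI, DQI, BWI, VLX). Spellings are an assumption.
AVX512_FEATURES = (
    "avx512f", "avx512cd", "avx512er", "avx512pf",
    "avx512dq", "avx512bw", "avx512vl",
)


def build_mattrs(enabled, disabled=AVX512_FEATURES):
    """Build an mattr-style list: '+name' for enabled features, '-name' for disabled.

    A feature named in both lists is treated as disabled, so a variant cannot
    sneak back in through the enabled list.
    """
    disabled_set = set(disabled)
    attrs = ["+" + name for name in enabled if name not in disabled_set]
    attrs += ["-" + name for name in disabled]
    return attrs


if __name__ == "__main__":
    # Joined with commas, this has the shape accepted by options such as llc's -mattr.
    print(",".join(build_mattrs(["sse4.1", "avx", "avx2"])))
```

Disabling the whole family at once avoids the situation described above, where the base feature is removed but an enhanced variant is still turned on.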

Timo Aaltonen (tjaalton) wrote:

Oh, I didn't know they were subvariants :)

I've dropped them from our llvm-3.8, for now at least.

Timo Aaltonen (tjaalton) wrote :

Fixed by an llvm update in xenial; in yakkety it was fixed in mesa.

Changed in mesa (Ubuntu):
status: Triaged → Fix Released