Mir/Unity8/USC crashes/freezes on nouveau (nv50) in pushbuf_kref() especially with multiple monitors, webbrowser-app or system settings

Bug #1553328 reported by Alexander Langanke on 2016-03-04
This bug affects 15 people
Affects Status Importance Assigned to Milestone
Canonical System Image
Critical
Unassigned
Mir
Triaged
High
Unassigned
Nouveau Xorg driver
In Progress
Medium
Unity System Compositor
High
Unassigned
libdrm (Ubuntu)
High
Unassigned
Nominated for Xenial by Alberto Salvia Novella
mesa (Ubuntu)
High
Unassigned
Nominated for Xenial by Alberto Salvia Novella
mir (Ubuntu)
High
Unassigned
Nominated for Xenial by Alberto Salvia Novella
qtmir (Ubuntu)
High
Gerry Boland
Nominated for Xenial by Alberto Salvia Novella
qtubuntu (Ubuntu)
High
Gerry Boland
Nominated for Xenial by Alberto Salvia Novella

Bug Description

Unity8 froze up while I was trying to open system settings.

ProblemType: Crash
DistroRelease: Ubuntu 16.04
Package: unity8 8.11+16.04.20160216.1-0ubuntu1
ProcVersionSignature: Ubuntu 4.4.0-9.24-generic 4.4.3
Uname: Linux 4.4.0-9-generic x86_64
ApportVersion: 2.20-0ubuntu3
Architecture: amd64
Date: Fri Mar 4 19:12:54 2016
ExecutablePath: /usr/bin/unity8
InstallationDate: Installed on 2015-05-10 (299 days ago)
InstallationMedia: Ubuntu 15.04 "Vivid Vervet" - Release amd64 (20150422)
ProcCmdline: unity8
SegvAnalysis:
 Segfault happened at: 0x7f58d568706c: mov 0x8(%rsi),%edx
 PC (0x7f58d568706c) ok
 source "0x8(%rsi)" (0x00000008) not located in a known VMA region (needed readable region)!
 destination "%edx" ok
 Stack memory exhausted (SP below stack segment)
SegvReason: reading NULL VMA
Signal: 11
SourcePackage: unity8
StacktraceTop:
 ?? () from /usr/lib/x86_64-linux-gnu/libdrm_nouveau.so.2
 ?? () from /usr/lib/x86_64-linux-gnu/libdrm_nouveau.so.2
 ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
 ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
 ?? () from /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1
Title: unity8 crashed with SIGSEGV
UpgradeStatus: Upgraded to xenial on 2015-11-07 (118 days ago)
UserGroups: adm autopilot cdrom dip lpadmin plugdev sambashare sudo


Created attachment 118838
kernel log

I've encountered an easily reproducible segfault in the Firefox OS emulator while hacking on said operating system. The Firefox OS emulator [1] is a fork of the Android emulator, which is in turn a fork of qemu. In both cases the graphics part is untouched, so it might be possible to reproduce the same issue in qemu, though I haven't had the time to try it.

Here's my full STR:

1) Build the Firefox OS emulator using the emulator-x86-kk target device ( git clone https://github.com/mozilla-b2g/B2G.git ; cd B2G ; ./config.sh emulator-x86-kk ; ./build.sh )
2) Launch it from the tree using the run-emulator.sh script
3) Once Firefox OS has started, quickly click on any application and keep clicking on buttons / input boxes / etc. The segfault will normally happen within a matter of seconds

I've reproduced the bug on both Fedora 22 and Gentoo, so it doesn't look distro-specific. These are the version numbers taken from my Gentoo installation:

xf86-video-nouveau 1.0.11
libdrm 2.4.59
mesa 10.3.7
xorg-server 1.16.4
kernel 4.0.5

I've captured a stack trace of the segfault with gdb:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xc3dfeb40 (LWP 9387)]
0xf689a323 in pushbuf_kref () from /usr/lib32/libdrm_nouveau.so.2
(gdb) bt
#0 0xf689a323 in pushbuf_kref () from /usr/lib32/libdrm_nouveau.so.2
#1 0xf689ab9f in pushbuf_validate () from /usr/lib32/libdrm_nouveau.so.2
#2 0xf6ce47e8 in nv50_state_validate () from /usr/lib32/dri/nouveau_dri.so
#3 0xf6cf0a49 in nv50_draw_vbo () from /usr/lib32/dri/nouveau_dri.so
#4 0xf6b3846d in cso_draw_vbo () from /usr/lib32/dri/nouveau_dri.so
#5 0xf6a5f29e in st_draw_vbo () from /usr/lib32/dri/nouveau_dri.so
#6 0xf6a30cd3 in vbo_draw_arrays () from /usr/lib32/dri/nouveau_dri.so
#7 0xf6a30f37 in vbo_exec_DrawArrays () from /usr/lib32/dri/nouveau_dri.so
#8 0xf72ca52b in glDrawArrays (mode=4, first=0, count=6) at sdk/emulator/opengl/host/libs/Translator/GLES_V2/GLESv2Imp.cpp:576
#9 0xf74b9965 in gl2_decoder_context_t::decode (this=0xc3dfdfd4, buf=0xc47ff008, len=5452, stream=0xc6400768)
    at out/host/linux-x86/obj/STATIC_LIBRARIES/libGLESv2_dec_intermediates/gl2_dec.cpp:565
#10 0xf74b662c in RenderThread::Main (this=0xc6400788) at sdk/emulator/opengl/host/libs/libOpenglRender/RenderThread.cpp:128
#11 0xf74cdc3d in osUtils::Thread::thread_main (p_arg=0xc6400788) at sdk/emulator/opengl/shared/OpenglOsUtils/osThreadUnix.cpp:83
#12 0xf7f9711f in start_thread () from /lib32/libpthread.so.0
#13 0xf7d5f79e in clone () from /lib32/libc.so.6

I'm attaching the kernel log and the X log. Those may be "polluted" by other stuff as my machine has been running for some time since I've hit the bug. I'll try to provide cleaner ones right after I hit the bug. If more detailed information is needed (e.g. a backtrace with finer-grained debug information, etc...) I can provide it given some time to gather it.

[1] https://developer.mozilla.org/en-US/docs/Mozilla/Firefox_OS/Using_the_B2G_emulators

Created attachment 118839
X log

Does this still happen with recent mesa (and libdrm)?

I can test it. Which versions of mesa and libdrm do you suggest? The portage tree seems to have all recent versions.

I've started testing with different versions of mesa and libdrm. On my first try I could still repro with mesa 10.4.6 and libdrm 2.4.59.

Still repros on mesa 10.4.6 and libdrm 2.4.65.

FYI, latest Mesa release is 11.0.3. 10.4 was branched out in December 2014 (though it did receive some additional fixes up to March 2015), so you might want to try at least 10.6.x, or even better, 11.0.x.

After some fiddling with the various dependencies I'm now testing on mesa 11.0.3. I haven't hit the bug just yet but I want to run the emulator for a while to be sure it's not just luck on my part.

I can still reproduce on mesa 11.0.3 / libdrm 2.4.65, though it takes longer to trigger the bug (a few minutes of usage). This is the backtrace when using these versions of mesa and libdrm; it's somewhat different from the previous one, but the bottom frames look the same:

#0 0xf67fc403 in pushbuf_kref () from /usr/lib32/libdrm_nouveau.so.2
#1 0xf67fcc7f in pushbuf_validate () from /usr/lib32/libdrm_nouveau.so.2
#2 0xf6c80f9b in nv50_flush () from /usr/lib32/dri/nouveau_dri.so
#3 0xf69c93a4 in st_flush () from /usr/lib32/dri/nouveau_dri.so
#4 0xf69c93f4 in st_glFlush () from /usr/lib32/dri/nouveau_dri.so
#5 0xf6879b76 in _mesa_flush () from /usr/lib32/dri/nouveau_dri.so
#6 0xf6879f19 in _mesa_make_current () from /usr/lib32/dri/nouveau_dri.so
#7 0xf69fab96 in st_api_make_current () from /usr/lib32/dri/nouveau_dri.so
#8 0xf6ac28f6 in dri_unbind_context () from /usr/lib32/dri/nouveau_dri.so
#9 0xf6ac2337 in driUnbindContext () from /usr/lib32/dri/nouveau_dri.so
#10 0xf73ff3b6 in dri2_unbind_context () from /usr/lib32/libGL.so.1
#11 0xf73d89fb in glXMakeCurrentReadSGI () from /usr/lib32/libGL.so.1
#12 0xf747a89c in EglOS::makeCurrent (dpy=0x8c13608, read=0x0, draw=0x0, ctx=0x0) at sdk/emulator/opengl/host/libs/Translator/EGL/EglX11Api.cpp:263
#13 0xf747eb11 in eglMakeCurrent (display=0x8c20e50, draw=0x0, read=0x0, context=0x0) at sdk/emulator/opengl/host/libs/Translator/EGL/EglImp.cpp:684
#14 0xf74b0787 in FrameBuffer::unbind_locked (this=0x8c20d68) at sdk/emulator/opengl/host/libs/libOpenglRender/FrameBuffer.cpp:790
#15 0xf74ad7f6 in ColorBuffer::create (p_width=320, p_height=266, p_internalFormat=6408) at sdk/emulator/opengl/host/libs/libOpenglRender/ColorBuffer.cpp:108
#16 0xf74b181f in FrameBuffer::createColorBuffer (this=0x8c20d68, p_width=320, p_height=266, p_internalFormat=6408)
    at sdk/emulator/opengl/host/libs/libOpenglRender/FrameBuffer.cpp:489
#17 0xf74b7c2d in rcCreateColorBuffer (width=320, height=266, internalFormat=6408) at sdk/emulator/opengl/host/libs/libOpenglRender/RenderControl.cpp:215
#18 0xf74b9754 in renderControl_decoder_context_t::decode (this=0xc6301d5c, buf=0xacee6008, len=20, stream=0xc6300990)
    at out/host/linux-x86/obj/STATIC_LIBRARIES/lib_renderControl_dec_intermediates/renderControl_dec.cpp:245
#19 0xf74b8655 in RenderThread::Main (this=0xc6301d30) at sdk/emulator/opengl/host/libs/libOpenglRender/RenderThread.cpp:138
#20 0xf74cfc3d in osUtils::Thread::thread_main (p_arg=0xc6301d30) at sdk/emulator/opengl/shared/OpenglOsUtils/osThreadUnix.cpp:83
#21 0xf7f9811f in start_thread () from /lib32/libpthread.so.0
#22 0xf7d6079e in clone () from /lib32/libc.so.6

I'm attaching a new dmesg/Xorg.0.log pair taken just after I hit the bug; they'll probably be easier to parse than the previous ones. In particular, the dmesg output has some nouveau-related error messages at the bottom.

Created attachment 118859
kernel log

Created attachment 118860
X log

[35345.800105] nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 6 [emulator-x86[1371]] get 0x00200f9878 put 0x00200f98d0 ib_get 0x000000ed ib_put 0x000000f6 state 0x80000000 (err: INVALID_CMD) push 0x00406040

This is a long-standing and utterly undiagnosed issue. Somehow an invalid command manages to make it onto the pushbuf. [A command as in a FIFO command, like "BEGIN_NV04" style, not like the actual data being fed to the engine.]

There has never been a reproducible way of triggering it. Only seems to happen on tesla GPUs. Either something *very* subtle is going on in mesa, or we're driving the hardware wrong, or... who knows.

Search around for 406040 -- tons of bugs open about this.
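For anyone comparing reports that mention 406040, a minimal sketch of pulling the fields out of a DMA_PUSHER line so they can be compared across logs. The regular expression only targets the line format quoted in this report; other kernel versions may print the fields differently, and the helper name `parse_dma_pusher` is made up for illustration.

```python
import re

# Matches the PFIFO DMA_PUSHER line as it appears in this report.
DMA_PUSHER_RE = re.compile(
    r"DMA_PUSHER - ch (?P<ch>\d+) \[(?P<proc>[^\[\]]+)\[(?P<pid>\d+)\]\] "
    r"get (?P<get>0x[0-9a-f]+) put (?P<put>0x[0-9a-f]+) "
    r"ib_get (?P<ib_get>0x[0-9a-f]+) ib_put (?P<ib_put>0x[0-9a-f]+) "
    r"state (?P<state>0x[0-9a-f]+) \(err: (?P<err>\w+)\) "
    r"push (?P<push>0x[0-9a-f]+)"
)

HEX_FIELDS = ("get", "put", "ib_get", "ib_put", "state", "push")


def parse_dma_pusher(line):
    """Return the fields of a DMA_PUSHER log line as a dict, or None."""
    match = DMA_PUSHER_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for name in HEX_FIELDS:
        fields[name] = int(fields[name], 16)
    fields["ch"] = int(fields["ch"])
    fields["pid"] = int(fields["pid"])
    return fields
```

Filtering a set of kernel logs on `fields["err"] == "INVALID_CMD"` and grouping by `fields["push"]` would show whether the same offending word keeps turning up.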

Thanks for the info. I can reproduce this bug fairly easily so I can help diagnosing it, would a trace of the OpenGL calls help? This often happens after only a few seconds so it's not going to be a huge one.

BTW I'm not running a GPU from the Tesla family, this is a GeForce 9600GT/NV94.

(In reply to Gabriele Svelto from comment #13)
> Thanks for the info. I can reproduce this bug fairly easily so I can help
> diagnosing it, would a trace of the OpenGL calls help? This often happens
> after only a few seconds so it's not going to be a huge one.
>
> BTW I'm not running a GPU from the Tesla family, this is a GeForce
> 9600GT/NV94.

NV94 is a tesla. Tesla is G80:MCP89 (aka nv50 family).

If you can make an apitrace that repros it for you consistently on replay that'd be incredibly useful. However I suspect there's more to it and a plain apitrace won't be able to repro the issue.

(In reply to Ilia Mirkin from comment #14)
> NV94 is a tesla. Tesla is G80:MCP89 (aka nv50 family).

Right, my bad, I got confused with the actual Tesla card.

> If you can make an apitrace that repros it for you consistently on replay
> that'd be incredibly useful. However I suspect there's more to it and a
> plain apitrace won't be able to repro the issue.

I gave it a spin today and, as you suspected, the trace I gathered doesn't crash when replayed. The bottom of the trace contains an incomplete glFlush() call followed by a glDrawArrays() which is also incomplete. I'll try grabbing more traces in the hope of finding one that reproduces the crash.

Alternatively, if you have the appropriate hardware to reproduce, I can walk you through how to build this stuff and reproduce it yourself (it's a procedure that might take a few hours though, the full Firefox OS emulator build is *large*).

In your kernel log I see

[30903.432610] nouveau E[steam[21937]] fail set_domain
[30903.432614] nouveau E[steam[21937]] validating bo list
[30903.432618] nouveau E[steam[21937]] validate: -22

I believe that error should be fixed by the kernel patch at the end of bug #92504. However I don't think that's your main issue. Still, worth giving it a shot.

Separately, some very subtle fence-related issues should be fixed in the soon-to-be-released Mesa 11.0.4 (or mesa's master branch). Again, don't think they're the root cause of your problems, but... wouldn't hurt to try.

Thanks for the tip. From the comments in that bug I should be able to test that patch in the 4.3+ kernels. I'm rolling my own build so I can try that out as soon as I have some spare time. I'll also try mesa 11.0.4 when it becomes available in portage (which should be shortly after release).

I'd also like to do another experiment: traces gathered with apitrace always end with the seg-faulting command and always work fine when replayed, possibly because the last command needs to be fully executed to cause the problem. I'll try gathering a much longer trace using the software renderer and then try to replay that one with apitrace to see if I can trigger the bug. Alternatively this might be a race condition of some sort (I can see in the trace that there are at least 4 threads calling GL commands); possibly re-executing the trace a lot of times might trigger it sooner or later.
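The "replay it a lot of times" idea can be sketched as a small loop. This is only an illustration under assumptions: the helper name `replay_until_failure` is made up, and it treats any non-zero exit status (including death by signal, which Python reports as a negative return code on POSIX) as a hit.

```python
import subprocess
import sys


def replay_until_failure(cmd, max_runs=100):
    """Run cmd repeatedly and return the 1-based number of the first run
    that exits with a non-zero status, or None if all runs succeed."""
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            return run
    return None
```

With a captured trace this could be called as `replay_until_failure(["apitrace", "replay", "emulator.trace"])`, where the trace file name is hypothetical. A result of None after many runs would not rule out a race; it would only show that this particular trace did not hit it.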

I'll report back if I find anything useful.

Quick update, I'm on mesa 11.0.4 now and the issue is still present but it manifests itself differently: when I hit the issue I can still see the error messages being dumped into the kernel log but the application isn't seg-faulting anymore. I'm not sure if this is good or bad; it does help with my use-case as things keep running afterwards but it might make this bug harder to track.

Disregard my previous comment, I've just hit another segfault.


I'm able to reproduce a crash with a very similar stack on Fedora 22, using the glium Rust OpenGL bindings: https://github.com/tomaka/glium

Here's the stack trace I get:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff02f66fc in pushbuf_kref () from /lib64/libdrm_nouveau.so.2
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.6-14.fc22.x86_64 elfutils-libelf-0.163-4.fc22.x86_64 elfutils-libs-0.163-4.fc22.x86_64 expat-2.1.0-10.fc22.x86_64 libattr-2.4.47-10.fc22.x86_64 libcap-2.24-7.fc22.x86_64 libdrm-2.4.61-3.fc22.x86_64 libedit-3.1-12.20150325cvs.fc22.x86_64 libffi-3.1-7.fc22.x86_64 libgcc-5.1.1-4.fc22.x86_64 libpciaccess-0.13.3-0.3.fc22.x86_64 libselinux-2.3-10.fc22.x86_64 libstdc++-5.1.1-4.fc22.x86_64 libwayland-client-1.7.0-1.fc22.x86_64 libwayland-server-1.7.0-1.fc22.x86_64 libX11-1.6.3-1.fc22.x86_64 libX11-devel-1.6.3-1.fc22.x86_64 libXau-1.0.8-4.fc22.x86_64 libxcb-1.11-8.fc22.x86_64 libXcursor-devel-1.1.14-4.fc22.x86_64 libXdamage-1.1.4-6.fc22.x86_64 libXext-1.3.3-2.fc22.x86_64 libXfixes-5.0.1-4.fc22.x86_64 libXi-devel-1.7.4-2.fc22.x86_64 libXrender-0.9.9-1.fc22.x86_64 libxshmfence-1.2-1.fc22.x86_64 libXxf86vm-devel-1.1.4-1.fc22.x86_64 llvm-libs-3.5.0-9.fc22.x86_64 mesa-dri-drivers-10.6.9-1.20151008.fc22.x86_64 mesa-libEGL-10.6.9-1.20151008.fc22.x86_64 mesa-libgbm-10.6.9-1.20151008.fc22.x86_64 mesa-libGL-10.6.9-1.20151008.fc22.x86_64 mesa-libglapi-10.6.9-1.20151008.fc22.x86_64 ncurses-libs-5.9-18.20150214.fc22.x86_64 pcre-8.37-5.fc22.x86_64 systemd-libs-219-25.fc22.x86_64 xz-libs-5.2.0-2.fc22.x86_64 zlib-1.2.8-7.fc22.x86_64
(gdb) where
#0 0x00007ffff02f66fc in pushbuf_kref () from /lib64/libdrm_nouveau.so.2
#1 0x00007ffff02f6db9 in pushbuf_validate () from /lib64/libdrm_nouveau.so.2
#2 0x00007ffff0c28ecd in nvc0_state_validate () from /usr/lib64/dri/nouveau_dri.so
#3 0x00007ffff0c33670 in nvc0_draw_vbo () from /usr/lib64/dri/nouveau_dri.so
#4 0x00007ffff08ec867 in st_draw_vbo () from /usr/lib64/dri/nouveau_dri.so
#5 0x00007ffff08be81c in vbo_draw_arrays () from /usr/lib64/dri/nouveau_dri.so
#6 0x00005555555b21b7 in tutorial_02::gl::Gl::DrawArrays (self=0x7ffff6868010, mode=4, first=0, count=3) at target/debug/build/glium-e8f4c0a69cec8d2d/out/gl_bindings.rs:11032
#7 0x00005555555acee9 in tutorial_02::ops::draw::draw<glium::uniforms::uniforms::EmptyUniforms,&glium::vertex::buffer::VertexBuffer<tutorial_02::main::Vertex>> (context=0x7ffff6868010, framebuffer=..., vertex_buffers=0x7fffffffd548, indices=..., program=0x7fffffffd038, uniforms=0x5555557fc297 <const16333>, draw_parameters=0x7fffffffccb0, dimensions=...) at src/ops/draw.rs:320
#8 0x00005555555aa751 in tutorial_02::Frame.Surface::draw<&glium::vertex::buffer::VertexBuffer<tutorial_02::main::Vertex>,&glium::index::NoIndices,glium::uniforms::uniforms::EmptyUniforms> (self=0x7fffffffce48, vertex_buffer=0x7fffffffd548, index_buffer=0x7fffffffd210, program=0x7fffffffd038, uniforms=0x5555557fc297 <const16333>, draw_parameters=0x7fffffffccb0) at src/lib.rs:1083
#9 0x0000555555576f84 in tutorial_02::main () at examples/tutorial-02.rs:48
#10 0x00005555557ccf85 in sys_common::unwind::try::try_fn::h13449604025847140769 ()
#11 0x00005555557ca969 ...


I have no idea what this code is supposed to be doing, but here's what I can infer:

The call to pushbuf_kref crashes because the `struct nouveau_bo *bo` argument is NULL, and cli_push_get tries to use it. The caller of pushbuf_kref, pushbuf_validate, is iterating over a list of brefs; the current bref's bo field is NULL.

This bref was created by nouveau_bufctx_refn, which was passed a NULL `bo` argument. Its caller is nvc0_add_resident, which was passed a `struct nv04_resource *` whose `bo` field is NULL.

This nv04_resource was created by a call to nouveau_buffer_create in which buffer->domain is never set to anything other than 0. Looking at nouveau_buffer_allocate, it seems like a domain of zero is a legitimate value; the last branch of the if-else chain asserts that this is the case. Since that path doesn't set buf->bo, it seems it's legitimate for buf->bo to be NULL.

pushbuf_kref seems adamant that bo should be non-NULL; both cli_push_get and cli_kref_get require it. At this point I'm lost: should nouveau_bufctx_refn never be passed a NULL bo? Should such a bufref never make it onto the list that pushbuf_validate sees? I'm not sure.
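The chain described above can be modelled in a few lines. This is a toy illustration of the reasoning, not the libdrm or Mesa code: the class and function names are stand-ins, and the guarded variant is just one possible policy, not the upstream fix.

```python
class Bo:
    """Stand-in for struct nouveau_bo; only a handle is modelled."""

    def __init__(self, handle):
        self.handle = handle


class Resource:
    """Stand-in for struct nv04_resource: with domain 0 no bo is allocated."""

    def __init__(self, domain, handle=None):
        self.domain = domain
        self.bo = Bo(handle) if domain != 0 else None


def bufctx_refn(bufctx, bo):
    # As in the analysis: the reference is queued without checking bo.
    bufctx.append(bo)


def validate(bufctx):
    # Mirrors the crash: every queued bo is dereferenced unconditionally.
    return [bo.handle for bo in bufctx]


def validate_guarded(bufctx):
    # Hypothetical policy: reject bo-less references loudly instead of
    # dereferencing them.
    for index, bo in enumerate(bufctx):
        if bo is None:
            raise ValueError(f"bufctx entry {index} has no bo")
    return [bo.handle for bo in bufctx]
```

Queuing `Resource(domain=0).bo` and then calling `validate` fails on the attribute access, which is the Python analogue of the NULL dereference in pushbuf_kref. The open question from the analysis remains where the check belongs: at the point the reference is created, or at validation time.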

Here's the stack trace at the call to nouveau_bufctx_refn:

#0 nouveau_bufctx_refn (bctx=0x555555b3eea0, bin=bin@entry=1, bo=0x0, flags=256) at bufctx.c:126
#1 0x00007ffff0c33154 in nvc0_add_resident (flags=256, res=0x555555bf4800, bin=1, bufctx=<optimized out>) at nvc0/nvc0_winsys.h:29
#2 nvc0_validate_vertex_buffers_shared (nvc0=0x555555b3cf30) at nvc0/nvc0_vbo.c:407
#3 nvc0_vertex_arrays_validate (nvc0=0x555555b3cf30) at nvc0/nvc0_vbo.c:504
#4 0x00007ffff0c28e8e in nvc0_state_validate (nvc0=nvc0@entry=0x555555b3cf30, mask=mask@entry=4294967295, words=words@entry=8) at nvc0/nvc0_state_validate.c:651
#5 0x00007ffff0c33670 in nvc0_draw_vbo (pipe=0x555555b3cf30, info=0x7fffffffb9f0) at nvc0/nvc0_vbo.c:893
#6 0x00007ffff08ec867 in st_draw_vbo (ctx=<optimized out>, prims=0x7fffffffbab0, nr_prims=1, ib=0x0, index_bounds_valid=<optimized out>, min_index=0, max_index=2, tfb_vertcount=0x0, indirect=0x0) at state_tracker/st_draw.c:286
#7 0x00007ffff08be81c in vbo_draw_arrays (ctx=0x7ffff7f22010, mode=4, start=0, count=3, numInstances=1, baseInstance=0) at vbo/vbo_exec_array.c:645
#8 0x00005555555b21b7 in tutorial_02::gl::Gl::DrawArrays (self=0x7ffff6868010, mode=4, first=0, count=3) at target/debug/build/glium-e8f4c0a69cec8d2d/out/gl_bindings.rs:11032
#9 0x00005555555acee9 in tutorial_02::ops::draw::draw<glium::uniforms::uniforms::EmptyUniforms,&glium::vertex::buffer::VertexBuffer<tutorial_02::main::Vertex>> (context=0x7ffff6868010, framebuffer=..., vertex_buffers=0x7fffffffdc08, indices=..., program=0x7fffffffd6f8, uniforms=0x5555557fc297 <const16333>, draw_parameters=0x7fffffffd370, dimensions=...) at src/ops/draw.rs:320
#10 0x00005555555aa751 in tutorial_02::Frame.Surface::draw<&glium::vertex::buffer::VertexBuffer<tutorial_02::main::Vertex>,&glium::index::NoIndices,glium::uniforms::uniforms::EmptyUniforms> (self=0x7fffffffd508, vertex_buffer=0x7fffffffdc08, index_buffer=0x7fffffffd8d0, program=0x7fffffffd6f8, uniforms=0x5555557fc297 <const16333>, draw_parameters=0x7fffffffd370) at src/lib.rs:1083
#11...


Steps to reproduce:

1) Install Rust from www.rust-lang.org, which provides the 'cargo' command.
2) git clone https://github.com/tomaka/glium.git
3) cd glium
4) cargo build
5) cargo build --example tutorial-02
6) ./target/debug/examples/tutorial-02

(In reply to Jim Blandy from comment #21)
> I have no idea what this code is supposed to be doing, but here's what I can
> infer:
>
> The call to pushbuf_kref crashes because the `struct nouveau_bo *bo`
> argument is NULL, and cli_push_get tries to use it. The caller of
> pushbuf_kref, pushbuf_validate, is iterating over a list of brefs; the
> current bref's bo field is NULL.
>
> This bref was created by nouveau_bufctx_refn, which was passed a NULL `bo`
> argument. Its caller is nvc0_add_resident, which was passed a `struct
> nv04_resource *` whose `bo` field is NULL.
>
> This nv04_resource was created by a call to nouveau_buffer_create in which
> buffer->domain is never set to anything other than 0. Looking at
> nouveau_buffer_allocate, it seems like a domain of zero is a legitimate
> value; the last branch of the if-else chain asserts that this is the case.
> Since that path doesn't set buf->bo, it seems it's legitimate for buf->bo to
> be NULL.
>
> pushbuf_kref seems adamant that bo should be non-NULL; both cli_push_get and
> cli_kref_get require it. At this point I'm lost: should nouveau_bufctx_refn
> never be passed a NULL bo? Should such a bufref never make it onto the list
> that pushbuf_validate sees? I'm not sure.
>
> Here's the stack trace at the call to nouveau_bufctx_refn:
>
> #0 nouveau_bufctx_refn (bctx=0x555555b3eea0, bin=bin@entry=1, bo=0x0,
> flags=256) at bufctx.c:126
> #1 0x00007ffff0c33154 in nvc0_add_resident (flags=256, res=0x555555bf4800,
> bin=1, bufctx=<optimized out>) at nvc0/nvc0_winsys.h:29
> #2 nvc0_validate_vertex_buffers_shared (nvc0=0x555555b3cf30) at
> nvc0/nvc0_vbo.c:407

Whoa, great analysis! And makes a *ton* more sense than my thought, which was that the GPU hung and we ran out of GEM handles making pushbufs.

So this is one of those idiotic bo-less resources. Ugh. Will check if your repro makes it happen for me too.

Hmmm... looks like rust isn't quite ready for prime-time -- there is no 'cargo' gentoo ebuild. (Apparently cargo can't be built hermetically?) So I can't repro.

I'm a little worried about how that buffer object came to be. It suggests we're missing some PIPE_BIND_* flags somewhere, or something is creating a buffer without any bind flags set. Can you look at the resource's "bind" property and see what it is?

I believe this is now possible with ARB_direct_state_access. While I add support for this and add some asserts so that this doesn't happen in the future, would either of you mind retesting with

MESA_EXTENSION_OVERRIDE=-GL_ARB_direct_state_access

which should prevent those endpoints from appearing? I somewhat doubt that this is the b2g issue though, that seems to be using GLES2 which doesn't have anything like this.

(In reply to Ilia Mirkin from comment #24)
> Hmmm... looks like rust isn't quite ready for prime-time -- there is no
> 'cargo' gentoo ebuild. (Apparently cargo can't be built hermetically?) So I
> can't repro.

You really should try the tarballs on rust-lang.org. It's not as nice as a package, but it's straightforward enough for a one-shot activity like this.

(In reply to Ilia Mirkin from comment #25)
> I believe this is now possible with ARB_direct_state_access. While I add
> support for this and add some asserts so that this doesn't happen in the
> future, would either of you mind retesting with
>
> MESA_EXTENSION_OVERRIDE=-GL_ARB_direct_state_access
>
> which should prevent those endpoints from appearing? I somewhat doubt that
> this is the b2g issue though, that seems to be using GLES2 which doesn't
> have anything like this.

Thanks for looking into this! I still get segfaults. The entry in extension_table for GL_ARB_direct_state_access is:

   { "GL_ARB_draw_buffers", o(dummy_true), GL, 2002 },

so I don't think that env var setting will do anything except stick the name in cant_disable_extensions. I infer from the error message in get_extension_override that it's a permanently enabled extension.
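To make the behaviour described here concrete, a simplified model of how a `+name` / `-name` override string interacts with permanently enabled extensions. This is not Mesa's implementation: the function name, the `always_on` set and the return shape are assumptions made for illustration.

```python
def apply_extension_override(override, enabled, always_on):
    """Apply a MESA_EXTENSION_OVERRIDE-style string to a set of extensions.

    Returns (resulting_set, cant_disable), where cant_disable lists the
    extensions the string tried to switch off but which are always on.
    """
    result = set(enabled)
    cant_disable = []
    for token in override.split():
        if token.startswith("-"):
            name = token[1:]
            if name in always_on:
                # A permanently enabled extension stays enabled; the
                # request is only recorded.
                cant_disable.append(name)
            else:
                result.discard(name)
        else:
            result.add(token.lstrip("+"))
    return result, cant_disable
```

In this model, passing `"-GL_ARB_direct_state_access"` with that name in `always_on` leaves the extension enabled and only records it in the second return value, which matches the segfaults persisting with the override set.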

(OT: I think in many ways Rust is ready for prime time, but the project has only begun to try to take into account distributions' needs. Check this out: https://internals.rust-lang.org/t/perfecting-rust-packaging/2623)

Created attachment 119323
deal with bind == 0

Oh blast, you're right. Can't disable GL_ARB_direct_state_access, since it's o(dummy_true). Try this mesa patch, which should fix it (but might break things on nv30... need to figure that out).

(In reply to Jim Blandy from comment #26)
> Thanks for looking into this! I still get segfaults. The entry in
> extension_table for GL_ARB_direct_state_access is:
>
> { "GL_ARB_draw_buffers", o(dummy_true), GL, 2002 },

Err, I meant, it's the line above that:

   { "GL_ARB_direct_state_access", o(dummy_true), GLC, 2014 },

(In reply to Ilia Mirkin from comment #27)
> Created attachment 119323 [details]
> deal with bind == 0
>
> Oh blast, you're right. Can't disable GL_ARB_direct_state_access, since it's
> o(dummy_true). Try this mesa patch, which should fix it (but might break
> things on nv30... need to figure that out).

Okay, cool. Building Mesa from source isn't something I can dive into right now, but I'm hoping I'll have more time for this project in the near future. (I tried just whacking the domain value into *buffer with GDB and seeing if the code could proceed, but optimized code foiled me.)

(In reply to Jim Blandy from comment #29)
> (In reply to Ilia Mirkin from comment #27)
> > Created attachment 119323 [details]
> > deal with bind == 0
> >
> > Oh blast, you're right. Can't disable GL_ARB_direct_state_access, since it's
> > o(dummy_true). Try this mesa patch, which should fix it (but might break
> > things on nv30... need to figure that out).
>
> Okay, cool. Building Mesa from source isn't something I can dive into right
> now, but I'm hoping I'll have more time for this project in the near future.
> (I tried just whacking the domain value into *buffer with GDB and seeing if
> the code could proceed, but optimized code foiled me.)

You can change your application to pretend that GL_ARB_direct_state_access isn't enabled, which should prevent it from using glCreateBuffers() and the glNamedBuffer* calls.

Should be easy enough to modify src/context/extensions.rs by e.g. introducing a typo into the GL_ARB_direct_state_access string. Or doing a fixup after-the-fact.

[As an aside, in src/buffer/alloc.rs, you appear to have at least one instance that only checks for GL4.5 and not the ext as well... oops? I'd personally recommend never checking for explicit GL versions and only looking at the exts. You can also set the ext bools based on the GL versions.

    if ctxt.version >= &Version(Api::Gl, 4, 5) {
        ctxt.gl.NamedBufferSubData(self.id, offset_bytes as ...

]
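The advice in the aside (derive the extension booleans once, from both the extension list and the GL version, then only ever consult the booleans) might look roughly like this. It is a sketch in Python rather than glium's Rust; the `capabilities` helper and its `deny` parameter are hypothetical. The only outside fact relied on is that direct state access is core in OpenGL 4.5.

```python
def capabilities(version, extensions, deny=frozenset()):
    """Compute feature flags once, from the GL version and extension list.

    version is a (major, minor) tuple; deny lists flags to force off,
    e.g. to work around a driver bug.
    """
    caps = {
        "direct_state_access": (
            version >= (4, 5)
            or "GL_ARB_direct_state_access" in extensions
        ),
    }
    for name in deny:
        if name in caps:
            caps[name] = False
    return caps
```

Call sites then test `caps["direct_state_access"]` instead of comparing versions directly, so a single `deny` entry is enough to keep the application away from glCreateBuffers() and the glNamedBuffer* calls.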

I've pushed out a couple of changes which might improve the situation. Please give your software a shot and see if it stops crashing. This should definitely resolve the glium issues... Gabriele, I'd ideally like to hear from you whether they fix anything on your end.

Specifically these two commits:

http://cgit.freedesktop.org/mesa/mesa/commit/?id=079f713754a9e5d7802b655d54320bd37f24fbfa
http://cgit.freedesktop.org/mesa/mesa/commit/?id=ad5f6b03e793b9390e3b9f3eca68bd43f9d809eb

I'll try rebuilding mesa using the mesa-9999 packages which should get me the current master and will report back ASAP.


OK, I've just tried to reproduce the bug with the latest master build and while I haven't managed to crash the emulator yet I've found this output. This was in the kernel log:

[ 1958.357837] nouveau E[emulator-x86[5663]] multiple instances of buffer 5 on validation list
[ 1958.357841] nouveau E[emulator-x86[5663]] validate_init
[ 1958.357843] nouveau E[emulator-x86[5663]] validate: -22

This is from the emulator output itself:

nouveau: kernel rejected pushbuf: Invalid argument
nouveau: ch0: krec 0 pushes 1 bufs 12 relocs 0
nouveau: ch0: buf 00000000 00000005 00000004 00000004 00000000
nouveau: ch0: buf 00000001 00000005 00000004 00000004 00000000
nouveau: ch0: buf 00000002 00000019 00000002 00000000 00000002
nouveau: ch0: buf 00000003 00000048 00000002 00000002 00000000
nouveau: ch0: buf 00000004 00000104 00000002 00000002 00000000
nouveau: ch0: buf 00000005 00000107 00000002 00000002 00000000
nouveau: ch0: buf 00000006 0000010b 00000002 00000002 00000000
nouveau: ch0: buf 00000007 00000065 00000002 00000002 00000000
nouveau: ch0: buf 00000008 000000f8 00000002 00000002 00000000
nouveau: ch0: buf 00000009 00000108 00000002 00000002 00000000
nouveau: ch0: buf 0000000a 0000002f 00000002 00000002 00000000
nouveau: ch0: buf 0000000b 00000032 00000002 00000002 00000000
nouveau: ch0: psh 00000001 0000071e24 0000073838
nouveau: 0x00046f00
nouveau: 0x0000007c
nouveau: 0x41506f04
nouveau: 0x3bcccccd
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x80000000
nouveau: 0xbb888889
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0xbf800000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x43800000
nouveau: 0x43800000
nouveau: 0x42800000
nouveau: 0x43800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x42a00000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x3e800000
nouveau: 0x3f800000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00000000
nouveau: 0x00046f00
nouveau: 0x0000...
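The kernel message about "multiple instances of buffer 5" can be cross-checked against the `buf` lines of the dump above. A minimal sketch, assuming the first hex column of each `buf` line is the entry's index and the second is the buffer handle (an assumption about the dump layout, not something stated in this report); the helper name is made up.

```python
import re
from collections import defaultdict

# "nouveau: ch0: buf <index> <handle> ..." as printed in the dump above.
BUF_RE = re.compile(r"nouveau: ch\d+: buf ([0-9a-f]{8}) ([0-9a-f]{8})")


def duplicate_handles(dump):
    """Map each handle listed more than once to the indices that use it."""
    seen = defaultdict(list)
    for line in dump.splitlines():
        match = BUF_RE.search(line)
        if match:
            index, handle = (int(group, 16) for group in match.groups())
            seen[handle].append(index)
    return {h: idx for h, idx in seen.items() if len(idx) > 1}
```

Under that assumption, entries 0 and 1 of the dump both carry handle 5, which lines up with the kernel's complaint and the -22 (EINVAL) result.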


I've been running the emulator for a while and it hasn't crashed yet but I've found yet more warning/error messages in my kernel log and they seem related to it:

[ 3053.470749] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.470769] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.470789] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 1 MP 0: INVALID_OPCODE at 07fb80 warp 0, opcode 00000000 00000000
[ 3053.470805] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 1 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.470830] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 2 MP 0: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.470846] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 2 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.470850] nouveau E[ PGRAPH][0000:01:00.0] ch 5 [0x001f768000 emulator-x86[5663]] subc 3 class 0x8297 mthd 0x0f04 data 0x3f800000
[ 3053.470903] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.470920] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.470942] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 1 MP 0: INVALID_OPCODE at 07fb80 warp 0, opcode 00000000 00000000
[ 3053.470958] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 1 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.470980] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 2 MP 0: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.471000] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 2 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.471005] nouveau E[ PGRAPH][0000:01:00.0] ch 5 [0x001f768000 emulator-x86[5663]] subc 3 class 0x8297 mthd 0x0f04 data 0x00000000
[ 3053.471060] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.471077] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.471098] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 1 MP 0: INVALID_OPCODE at 07fb80 warp 0, opcode 00000000 00000000
[ 3053.471115] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 1 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.471136] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 2 MP 0: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.471153] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 2 MP 1: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.471158] nouveau E[ PGRAPH][0000:01:00.0] ch 5 [0x001f768000 emulator-x86[5663]] subc 3 class 0x8297 mthd 0x0f04 data 0x00000000
[ 3053.471213] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 07ff00 warp 1, opcode 00000000 00000000
[ 3053.471231] nouveau E[ PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07ff00 wa...

I've updated mesa again, and running from the current master I hit the segfault again. This time, though, this is the only message that shows up in the kernel log:

[ 4684.326211] emulator-x86[19034]: segfault at 4 ip 00000000f5f49403 sp 00000000ac6fc660 error 4 in libdrm_nouveau.so.2.0.0[f5f46000+7000]

I'm wondering if running a master mesa with a non-master libdrm is supported at all.

(In reply to Gabriele Svelto from comment #35)
> I've updated mesa again and running from the current master I encounter the
> segfault again. This time though this is the only message that pops up in
> the kernel log:
>
> [ 4684.326211] emulator-x86[19034]: segfault at 4 ip 00000000f5f49403 sp
> 00000000ac6fc660 error 4 in libdrm_nouveau.so.2.0.0[f5f46000+7000]
>
> I'm wondering if running a master mesa with a non-master libdrm is supported
> at all.

If it builds against a given libdrm{,-nouveau} version it should work. Bugs, on the other hand, are something that occasionally exist.

Which version (sha) is the above libdrm-nouveau? Can you disassemble at the given offset?
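
For reference, the offset asked for here can be derived from the kernel log line itself: it is the faulting `ip` minus the mapping base shown in square brackets. A minimal sketch of that arithmetic follows. `module_offset` is a hypothetical helper name, and the sketch assumes the bracketed mapping starts at file offset 0 of the library, which may not hold for every build.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: turn a faulting instruction pointer into an offset
// inside the mapped library, using the "[<base>+<size>]" part of a kernel
// log line such as "... ip <ip> ... in lib.so[<base>+<size>]".
// Assumption: the mapping begins at file offset 0 of the library.
std::uint64_t module_offset(std::uint64_t ip, std::uint64_t base, std::uint64_t size)
{
    // The ip must fall inside the reported mapping for the result to mean anything.
    assert(ip >= base && ip - base < size);
    return ip - base;
}
```

With the values from comment #35 (ip 0xf5f49403, base 0xf5f46000, size 0x7000) this gives 0x3403, which is the address to look up in a disassembly of the same libdrm_nouveau.so.2.0.0 binary.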

(In reply to Gabriele Svelto from comment #35)
> I've updated mesa again and running from the current master I encounter the
> segfault again. This time though this is the only message that pops up in
> the kernel log:
>
> [ 4684.326211] emulator-x86[19034]: segfault at 4 ip 00000000f5f49403 sp
> 00000000ac6fc660 error 4 in libdrm_nouveau.so.2.0.0[f5f46000+7000]
>
> I'm wondering if running a master mesa with a non-master libdrm is supported
> at all.

Most likely the emulator is doing GL stuff from different threads, which doesn't work with nouveau. Sorry :(

Yes, the emulator seems to be calling GL commands from multiple threads, or at least the traces I got seemed to have different thread identifiers for different commands.

(In reply to Gabriele Svelto from comment #38)
> Yes, the emulator seems to be calling GL commands from multiple threads, or
> at least the traces I got seemed to have different thread identifiers for
> different commands.

Unfortunately nouveau doesn't protect against concurrency, and when multiple threads call into GL simultaneously it will fall flat on its face.

Fixing this is on my todo list... but my todo list is long, and this is a tricky task to do so as not to pessimize the 99.99% use-case of single-threaded GL usage.
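
Until the driver handles this, a client can avoid the concurrent case by funnelling all of its GL work through one thread. The following is only a sketch of that idea under stated assumptions, not code from nouveau, Mir or Qt: `SingleThreadExecutor` and `post` are hypothetical names, and the posted closures stand in for whatever GL calls the application would otherwise issue from several threads.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical sketch: every task posted from any thread is executed on one
// dedicated worker thread, so the GL driver only ever sees a single caller.
class SingleThreadExecutor {
public:
    SingleThreadExecutor() : worker_([this] { run(); }) {}

    // Runs any tasks still queued, then stops the worker.
    ~SingleThreadExecutor()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Safe to call from any thread; the task itself runs on the worker.
    void post(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return;  // shutdown requested and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // executed outside the lock, always on the worker thread
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};
```

In a real client the GL context would also have to be made current on the worker thread only; that part is omitted here.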

StacktraceTop:
 pushbuf_kref () from /tmp/apport_sandbox_ZyMo9z/usr/lib/x86_64-linux-gnu/libdrm_nouveau.so.2
 pushbuf_validate () from /tmp/apport_sandbox_ZyMo9z/usr/lib/x86_64-linux-gnu/libdrm_nouveau.so.2
 nv50_flush () from /tmp/apport_sandbox_ZyMo9z/usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
 st_glFlush () from /tmp/apport_sandbox_ZyMo9z/usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
 dri2_make_current () from /tmp/apport_sandbox_ZyMo9z/usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1

Changed in unity8 (Ubuntu):
importance: Undecided → Medium
tags: removed: need-amd64-retrace

Given the stack trace, this seems more like a crash in the nouveau driver.

Changed in unity8 (Ubuntu):
status: New → Incomplete
information type: Private → Public
summary: - unity8 crashed with SIGSEGV
+ unity8 crashed with SIGSEGV on nouveau, in eglMakeCurrent() ...
+ nv50_flush() ... pushbuf_kref()
Changed in mir (Ubuntu):
status: New → Invalid
Changed in mir:
status: New → Invalid
affects: unity8 (Ubuntu) → qtmir (Ubuntu)
David Planella (dpm) on 2016-04-08
tags: added: unity8-desktop

We should sanity-check the correctness of the eglMakeCurrent call in Mir src/server/graphics/nested/display_buffer.cpp:62

But other than that, it seems to be purely a nouveau bug.

Daniel van Vugt (vanvugt) wrote :

Done. There doesn't appear to be any way that Mir itself can screw up that call.

It's possible qtmir might be messing up with an incorrect 'this' pointer to the Mir DisplayBuffer. But I think it's much more likely a nouveau driver bug. If it weren't, we'd be seeing similar crashes with intel and radeon, but we don't.

Changed in libdrm (Ubuntu):
importance: Undecided → High
Daniel van Vugt (vanvugt) wrote :

Needs retesting. The symptoms might have changed: bug 1620934

Confirmed by duplicate/similar bug 1623507

summary: - unity8 crashed with SIGSEGV on nouveau, in eglMakeCurrent() ...
- nv50_flush() ... pushbuf_kref()
+ Mir crashes on nouveau (nv50) in pushbuf_kref()
Changed in qtmir (Ubuntu):
status: Incomplete → Invalid
Changed in mesa (Ubuntu):
importance: Undecided → High
Changed in libdrm (Ubuntu):
status: New → Confirmed
Changed in mesa (Ubuntu):
status: New → Confirmed
Changed in canonical-devices-system-image:
status: New → Confirmed
importance: Undecided → High
tags: added: nouveau
Daniel van Vugt (vanvugt) wrote :

Upstream says nouveau is known to not support multi-threaded rendering and will crash:
   https://bugs.freedesktop.org/show_bug.cgi?id=92438
So that completely explains yesterday's crash in bug 1623507.

Hopefully that vaguely explains this one too.

summary: - Mir crashes on nouveau (nv50) in pushbuf_kref()
+ Mir crashes on nouveau (nv50) in pushbuf_kref() especially with multiple
+ monitors
Changed in canonical-devices-system-image:
importance: High → Critical
Changed in libdrm (Ubuntu):
importance: High → Critical
Changed in mesa (Ubuntu):
importance: High → Critical
Changed in mesa (Ubuntu):
status: Confirmed → Invalid
Changed in mir (Ubuntu):
importance: Undecided → Critical
Changed in qtmir (Ubuntu):
importance: Medium → Critical
Changed in libdrm (Ubuntu):
status: Confirmed → Triaged

I guess even with a single display, multi-threaded apps like the web browser could still trigger this nouveau crash or similar by rendering from multiple threads.

There's a comment here suggesting the same crash might also happen in some processes like webbrowser-app even under Unity7:
  https://bugs.launchpad.net/ubuntu/+source/unity8-desktop-session/+bug/1595238/comments/15
which further supports the theory.

Based on upstream's comment https://bugs.freedesktop.org/show_bug.cgi?id=92438#c39 I would expect nouveau to continue to crash under both multi-monitor and other multi-threaded rendering conditions (like webbrowser-app?).

summary: - Mir crashes on nouveau (nv50) in pushbuf_kref() especially with multiple
- monitors
+ Mir/Unity8 crashes on nouveau (nv50) in pushbuf_kref() especially with
+ multiple monitors
summary: Mir/Unity8 crashes on nouveau (nv50) in pushbuf_kref() especially with
- multiple monitors
+ multiple monitors or opening the web browser app
summary: Mir/Unity8 crashes on nouveau (nv50) in pushbuf_kref() especially with
- multiple monitors or opening the web browser app
+ multiple monitors, webbrowser-app or system settings
summary: - Mir/Unity8 crashes on nouveau (nv50) in pushbuf_kref() especially with
- multiple monitors, webbrowser-app or system settings
+ Mir/Unity8 crashes/freezes on nouveau (nv50) in pushbuf_kref()
+ especially with multiple monitors, webbrowser-app or system settings

Looks like this is related to broken multi-threading in nouveau

I am getting the same with libdrm-2.4.74 and mesa-13.0.2: Konqueror with QtWebEngine-5.7.1 crashes when the html5test web page is opened:
libdrm-2.4.74/nouveau/pushbuf.c:727: nouveau_pushbuf_data: Assertion "kref" failed

[New Thread 0x7fff655dc700 (LWP 21049)]
konqueror: /var/tmp/portage/x11-libs/libdrm-2.4.74/work/libdrm-2.4.74/nouveau/pushbuf.c:727: nouveau_pushbuf_data: Předpoklad „kref“ nesplněn.

Thread 1 "konqueror" received signal SIGABRT, Aborted.
0x00007ffff7767228 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x00007ffff7767228 in raise () from /lib64/libc.so.6
#1 0x00007ffff77686aa in abort () from /lib64/libc.so.6
#2 0x00007ffff7760167 in ?? () from /lib64/libc.so.6
#3 0x00007ffff7760212 in __assert_fail () from /lib64/libc.so.6
#4 0x00007fffe0619e54 in nouveau_pushbuf_data (push=push@entry=0x6d0d80, bo=0x6a58b0, offset=331932, length=496)
    at /var/tmp/portage/x11-libs/libdrm-2.4.74/work/libdrm-2.4.74/nouveau/pushbuf.c:727
#5 0x00007fffe0619d9b in nouveau_pushbuf_data (push=push@entry=0x6d0d80, bo=bo@entry=0x0, offset=offset@entry=0, length=length@entry=0)
    at /var/tmp/portage/x11-libs/libdrm-2.4.74/work/libdrm-2.4.74/nouveau/pushbuf.c:719
#6 0x00007fffe0619ee9 in pushbuf_submit (push=push@entry=0x6d0d80, chan=<optimized out>, chan=<optimized out>)
    at /var/tmp/portage/x11-libs/libdrm-2.4.74/work/libdrm-2.4.74/nouveau/pushbuf.c:330
#7 0x00007fffe061a18f in pushbuf_flush (push=push@entry=0x6d0d80) at /var/tmp/portage/x11-libs/libdrm-2.4.74/work/libdrm-2.4.74/nouveau/pushbuf.c:405
#8 0x00007fffe061ad50 in nouveau_pushbuf_kick (push=0x6d0d80, chan=<optimized out>) at /var/tmp/portage/x11-libs/libdrm-2.4.74/work/libdrm-2.4.74/nouveau/pushbuf.c:779
#9 0x00007fffe0f54d06 in PUSH_KICK (push=<optimized out>) at /var/tmp/portage/media-libs/mesa-13.0.2/work/mesa-13.0.2/src/gallium/drivers/nouveau/nouveau_winsys.h:59
#10 nv50_flush (pipe=0x2733370, fence=<optimized out>, flags=<optimized out>)
    at /var/tmp/portage/media-libs/mesa-13.0.2/work/mesa-13.0.2/src/gallium/drivers/nouveau/nv50/nv50_context.c:40
#11 0x00007fffe0c2bc1b in st_finish (st=st@entry=0x276c570) at /var/tmp/portage/media-libs/mesa-13.0.2/work/mesa-13.0.2/src/mesa/state_tracker/st_cb_flush.c:98
#12 0x00007fffe0c2bc80 in st_glFinish (ctx=<optimized out>) at /var/tmp/portage/media-libs/mesa-13.0.2/work/mesa-13.0.2/src/mesa/state_tracker/st_cb_flush.c:136
#13 0x00007ffff39916a9 in QOpenGLWidgetPrivate::beginCompose (this=0x2667c50) at kernel/qopenglwidget.cpp:727
#14 0x00007ffff395f629 in QWidgetPrivate::sendComposeStatus (w=<optimized out>, end=end@entry=false) at kernel/qwidget.cpp:12227
#15 0x00007ffff395f5e7 in QWidgetPrivate::sendComposeStatus (w=<optimized out>, end=end@entry=false) at kernel/qwidget.cpp:12233
#16 0x00007ffff395f5e7 in QWidgetPrivate::sendComposeStatus (w=<optimized out>, end=end@entry=false) at kernel/qwidget.cpp:12233
#17 0x00007ffff395f5e7 in QWidgetPrivate::sendComposeStatus (w=<optimized out>, end=end@entry=false) at kernel/qwidget.cpp:12233
#18 0x00007ffff395f5e7 in QWidgetPrivate::sendComposeStatus (w=<optimized out>, end=end@entry=false) at kernel/qwidget.cpp:12233
#19 0x00007ffff395f5e7 in QWidgetPrivate::sendC...

kevin gunn (kgunn72) on 2017-03-06
Changed in canonical-devices-system-image:
assignee: nobody → Stephen M. Webb (bregma)
milestone: none → u8c-1

I'm not sure this is useful to assign to that milestone. The bug and the fix are in the nouveau driver.

Although each GL client could work around it by moving all their GL rendering to a single thread, we really shouldn't have to do that just to support nouveau.

Daniel van Vugt (vanvugt) wrote :

See above comment

Changed in canonical-devices-system-image:
status: Confirmed → Incomplete
Gerry Boland (gerboland) wrote :

I have Qt configured (in QtUbuntu/QtMir) to use multi-threaded GL rendering, so we're probably hitting Nouveau's limitations here.

As a workaround, I can add code to QtUbuntu/QtMir to use single-threaded GL on Nouveau.

If this is easily reproduced, can you try

initctl set-env --global QSG_RENDER_LOOP=basic

and see if everything is more stable? If so, the workaround should do the trick.

Changed in qtmir (Ubuntu):
status: Invalid → Incomplete
Gerry Boland (gerboland) wrote :

Actually, it might be easier and more reliable just to edit /etc/environment, add "QSG_RENDER_LOOP=basic" and restart.

Daniel van Vugt (vanvugt) wrote :

Regardless of whether you have a single GPU or multiple GPUs, using OpenGL is a matter of "send it commands quickly and then forget (the GPU will complete them later)". So I'm not sure why QSG would bother with multi-threading GL at all. Similarly Mir doesn't really need to do the multi-threaded compositing it does, but most days it doesn't hurt us.

summary: - Mir/Unity8 crashes/freezes on nouveau (nv50) in pushbuf_kref()
+ Mir/Unity8/USC crashes/freezes on nouveau (nv50) in pushbuf_kref()
especially with multiple monitors, webbrowser-app or system settings
summary: - Mir/Unity8/USC crashes/freezes on nouveau (nv50) in pushbuf_kref()
- especially with multiple monitors, webbrowser-app or system settings
+ nouveau (nv50) crashes/freezes in pushbuf_kref()
summary: - nouveau (nv50) crashes/freezes in pushbuf_kref()
+ Mir/Unity8/USC crashes/freezes on nouveau (nv50) in pushbuf_kref()
+ especially with multiple monitors, webbrowser-app or system settings
Daniel van Vugt (vanvugt) wrote :

Per comment #17 it might actually be easiest to build a workaround in Mir for now... Either disable secondary compositor threads, or build a SingleThreadedCompositor to replace MultiThreadedCompositor.

In my spare time I'm working on an idea where post() could be made non-blocking. So that would eliminate the need for Mir's compositor to be threaded, and nouveau should then work (as well as it ever did for X).

Changed in mir:
status: Invalid → Confirmed
importance: Undecided → Medium
tags: added: multimonitor
Changed in qtmir (Ubuntu):
importance: Critical → Medium
status: Incomplete → Confirmed
Daniel van Vugt (vanvugt) wrote :

Even easier workaround for Mir/USC: Just default to clone mode instead of side-by-side, which is the problem in today's duplicate bug 1672793. Mir's legacy clone mode will at least ensure there is only one compositor thread, so nouveau should be stable then.

Changed in unity-system-compositor:
importance: Undecided → Medium
status: New → Confirmed
Changed in canonical-devices-system-image:
status: Incomplete → Triaged
Changed in canonical-devices-system-image:
milestone: u8c-1 → u8c-2
Daniel van Vugt (vanvugt) wrote :

I made an attempt at a workaround for nouveau crashes today (and discovered more nouveau bugs).

I can confirm with mir-demos that forcing the compositor into single-threaded mode makes it stable. The only problem is the unity-system-compositor option for doing this gets ignored (Unity8 overrides the display config to suit itself when it sees a second display). So you can't apply the workaround yourself.

So yes, medium term we could work around some of the nouveau stability issues by hacking Mir/USC/Unity8 to only use single threaded rendering. But that requires code changes in multiple places.

A short-term workaround that should do the trick is:
  1. Unplug all but one monitor; and
  2. Add to /etc/environment: QSG_RENDER_LOOP=basic

Sadly I can't even test that much myself, because of bug 1677125.

Gerry Boland (gerboland) wrote :

I've an ancient NVidia box at home, so I can try it out. I'm attaching patches for qtubuntu/qtmir to force Qt to use single-threaded GL on nouveau.
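
The attached patches are not reproduced in this report. As an illustration only, one way such a check could look is to pick the render loop from the GL vendor string. The helper names below are hypothetical, and the sketch assumes the driver identifies itself with a vendor string containing "nouveau" (as returned by something like glGetString(GL_VENDOR)); the real patches may detect the driver differently.

```cpp
#include <string>

// Hypothetical helper: true when the GL vendor string suggests the nouveau
// driver. Assumption: the string contains the lowercase word "nouveau".
bool needs_single_threaded_gl(const std::string& gl_vendor)
{
    return gl_vendor.find("nouveau") != std::string::npos;
}

// Hypothetical helper: value to use for QSG_RENDER_LOOP given the vendor
// string. "basic" and "threaded" are existing Qt Quick render loop names.
const char* render_loop_for(const std::string& gl_vendor)
{
    return needs_single_threaded_gl(gl_vendor) ? "basic" : "threaded";
}
```

Keeping the decision in a pure function like this makes it easy to test without a GPU; only the caller needs a live GL context to obtain the vendor string.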

Changed in qtubuntu (Ubuntu):
status: New → In Progress
assignee: nobody → Gerry Boland (gerboland)
Changed in qtmir (Ubuntu):
assignee: nobody → Gerry Boland (gerboland)
status: Confirmed → In Progress
Changed in qtubuntu (Ubuntu):
importance: Undecided → High
Changed in qtmir (Ubuntu):
importance: Medium → High
Changed in mir (Ubuntu):
importance: Critical → High
status: Invalid → Triaged
Changed in unity-system-compositor:
importance: Medium → High
status: Confirmed → Triaged
Changed in mir:
importance: Medium → High
status: Confirmed → Triaged

The nouveau driver isn't thread-safe and the Qt renderer uses threaded GL by default.

We don't know if Mir's default renderer would also exhibit the same issue when running multimonitor. (The Mir renderer uses a thread per monitor, but this is a simpler usage pattern than Qt and may not manifest problems.)

Changed in mir:
status: Triaged → Incomplete
importance: High → Undecided
Changed in mir (Ubuntu):
status: Triaged → Incomplete
importance: High → Undecided
Changed in canonical-devices-system-image:
assignee: Stephen M. Webb (bregma) → nobody
Daniel van Vugt (vanvugt) wrote :

Alan, that question was already answered in comment #20. Single threaded rendering works.

Changed in mir:
status: Incomplete → Triaged
importance: Undecided → High
Changed in mir (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → High

Daniel, you are saying comment #20 states that Mir's thread-per-monitor rendering manifests the problem? (I don't have any nouveau based kit to test on.)

Daniel van Vugt (vanvugt) wrote :

Well, yes. The problem description before that covers the fact that Mir's multi-threaded compositor manifests the problem (assuming you have multiple monitors). And comment #20 states that avoiding multiple threads (a single headed or cloned config) avoids the instability.

I don't think the problem description is clear that both the Mir and Unity8 compositors are affected. (It can be read as applying to the stack including Unity8.) Thanks for clarifying.

Daniel van Vugt (vanvugt) wrote :

Actually, the snapshotting/screencasting thread could also trigger it. If that's still a thing...?

Changed in mesa (Ubuntu):
status: Invalid → Triaged
importance: Critical → High
Changed in libdrm (Ubuntu):
importance: Critical → High
Changed in nouveau:
importance: Unknown → Medium
status: Unknown → In Progress