[regression] Mir EGL on intel graphics kills nested server

Bug #1699484 reported by Alan Griffiths
This bug affects 1 person
Affects  Status        Importance  Assigned to     Milestone
Mir      Fix Released  Medium      Alan Griffiths  0.27.0

Bug Description

Simpler scenario to reproduce:

1. Take a Zesty system with Intel graphics and Mir trunk

2. Apply lp:~alan-griffiths/mir/dont-handle-nested-input-on-rpc-thread
   (To avoid the issue seen in comment #1)

3. Start a host server running mesa-kms (CNR using mesa-x11)
   $ sudo bin/mir_demo_server --vt 4 --arw-file --window-manager system-compositor

4. Start a nested server with an EGL client
   $ bin/mir_demo_server --no-file --host /tmp/mir_socket --launch bin/mir_demo_client_egltriangle

5. Resize the window (Alt-Middle drag) until the display freezes (doesn't take long)

Expect: the display doesn't freeze
Actual: the nested server (step 4) has exited with "intel_do_flush_locked failed: No such file or directory" (but the client is still running).

=== original description ===

This is on zesty, with Mir 0.27, MirAL 1.4 and rebuilds of QtMir and USC (in silo 2818 at the time of report)

https://code.launchpad.net/~mir-team/mir/0.27/+merge/325738
https://code.launchpad.net/~alan-griffiths/miral/1.4/+merge/325737
https://code.launchpad.net/~alan-griffiths/qtmir/mir-0.27-compat/+merge/325749
https://code.launchpad.net/~alan-griffiths/unity-system-compositor/mir-0.27-compat/+merge/325751

with a "live" USB. Steps to reproduce:

1. boot off the USB
2. sudo apt-add-repository ppa:ci-train-ppa-service/2818
3. sudo apt update
4. sudo apt upgrade -y
5. sudo service lightdm restart
6. Log out
7. select U8 session
8. sign in as ubuntu/<blank>
9. Search "ter" & down arrow to select Terminal
Lockup

The lockup happens in Unity8's renderer thread, once it calls
eglSwapBuffers after adding the client surface to the scene.
eglSwapBuffers prints a familiar error:

intel_do_flush_locked failed: No such file or directory

and calls __GI_exit() on the renderer thread, which causes Qt's threads
to block in an awkward way.

So something has changed in Mir 0.27 that Mesa/i915 is unhappy with.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

This may not be Unity8 specific. I got a similar lockup in miral-shell.

1. Use mir_demo_server as a host:
   $ sudo mir_demo_server --arw-file --vt 4 --window-manager system-compositor

2. Run miral-shell as a guest:
   $ miral-app --host /tmp/mir_socket

3. Start egltriangle & eglplasma and start switching, resizing, maximizing etc. After a while it seems the "RPC thread" starts making blocking calls to the client API.

(gdb) info threads
  Id Target Id Frame
* 1 Thread 0x7f1811eafc80 (LWP 4525) "miral-shell" 0x00007f181036e18d in poll () at ../sysdeps/unix/syscall-template.S:84
  2 Thread 0x7f180a305700 (LWP 4526) "RPC Thread" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  3 Thread 0x7f1803fff700 (LWP 4532) "Mir/Snapshot" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  4 Thread 0x7f18037fe700 (LWP 4533) "Mir/Comp" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  5 Thread 0x7f1802ffd700 (LWP 4534) "Mir/Input Reade" 0x00007f181036e18d in poll () at ../sysdeps/unix/syscall-template.S:84
  6 Thread 0x7f18027fc700 (LWP 4535) "Mir/IPC" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  7 Thread 0x7f1801ffb700 (LWP 4536) "Mir/IPC" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  8 Thread 0x7f18017fa700 (LWP 4537) "RPC Thread" 0x00007f181036e18d in poll () at ../sysdeps/unix/syscall-template.S:84
  9 Thread 0x7f1800ff9700 (LWP 4538) "miral-shell" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
(gdb) t 2
[Switching to thread 2 (Thread 0x7f180a305700 (LWP 4526))]
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) bt
#0 0x00007f180f842510 in pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f181090580c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f1810e3c18b in wait () at /usr/include/c++/6/condition_variable:99
#3 0x00007f1810e3c18b in wait_for_result (this=0x7f180a303e00) at ./src/client/mir_render_surface_api.cpp:43
#4 0x00007f1810e3c18b in mir_connection_create_render_surface_sync (connection=<optimised out>, width=<optimised out>, height=<optimised out>) at ./src/client/mir_render_surface_api.cpp:126
#5 0x00007f180ffc4f67 in set_cursor_image (image=..., this=0x55c1dbfb8540)
    at ./src/server/graphics/nested/mir_client_host_connection.cpp:177
#6 0x00007f180ffc4f67 in set_cursor_image (this=0x55c1dbd70940, image=...)
    at ./src/server/graphics/nested/mir_client_host_connection.cpp:491
#7 0x00007f180ffcc076 in mir::input::CursorController::update_cursor_image_locked(std::unique_lock<std::mutex>&) [clone .constprop.589] (this=this@entry=0x55c1dbe82f10, lock=...) at ./src/server/input/cursor_controller.cpp:248
#8 0x00007f180ff248...

Changed in mir:
assignee: nobody → Alan Griffiths (alan-griffiths)
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

This isn't Unity8 specific.

But comment #1 is a compounding problem: if I avoid the "RPC thread" deadlock[1] then, using only the mir-demos, I can get the "intel_do_flush_locked failed: No such file or directory" error.

[1] lp:~alan-griffiths/mir/dont-handle-nested-input-on-rpc-thread
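
For illustration only: the backtrace in comment #1 shows the "RPC Thread" waiting inside mir_connection_create_render_surface_sync(). If that same thread is also the one that has to deliver the reply, the wait can never complete. One general way around this kind of problem is to hand the blocking work to another thread. The sketch below shows that idea under stated assumptions: WorkQueue is a hypothetical helper, not a Mir class, and this is not the code in the branch above.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not part of Mir): runs posted tasks on its own thread,
// so a dispatch thread can hand over work that might block and return at once.
class WorkQueue
{
public:
    WorkQueue() : worker_([this] { run(); }) {}

    ~WorkQueue()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // tasks already posted are drained before the join returns
    }

    void post(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;)
        {
            cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
            if (tasks_.empty())
                return;  // only reached once done_ is set and the queue is empty

            auto task = std::move(tasks_.front());
            tasks_.pop_front();

            lock.unlock();  // never hold the queue lock while running a task
            task();
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

A handler running on the dispatch thread would call queue.post(...) with the blocking call inside the task and return immediately, leaving the dispatch thread free to process the reply.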

summary: - [regression] Mir EGL on intel graphics lock up Unity8
+ [regression] Mir EGL on intel graphics kills nested server
Changed in mir:
status: Confirmed → In Progress
Changed in mir:
milestone: none → 0.27.0
description: updated
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Interestingly, with the 0.27 release silo (2806) on Artful (which doesn't yet contain the "RPC thread" deadlock fix) I can't reproduce the "freeze" behaviour.

OTOH it still doesn't work correctly: when the client window is made smaller it becomes badly corrupted, but is restored when made larger.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

...
intel_do_flush_locked failed: No such file or directory
[Switching to Thread 0x7fffed046700 (LWP 3809)]

Thread 5 "Mir/Comp" hit Breakpoint 1, __GI_exit (status=1) at exit.c:105
105 exit.c: No such file or directory.
(gdb) info threads
...
* 5 Thread 0x7fffed046700 (LWP 3809) "Mir/Comp" __GI_exit (status=1) at exit.c:105
...
(gdb) bt
#0 0x00007ffff70892b0 in __GI_exit (status=1) at exit.c:105
#1 0x00007fffeebcd831 in () at /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#2 0x00007fffeebdc97d in () at /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#3 0x00007ffff42ce749 in () at /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1
#4 0x00007ffff42bdc9f in eglSwapBuffers () at /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1
#5 0x00007ffff5df2757 in mir::graphics::nested::detail::DisplayBuffer::swap_buffers() (this=0x555555cb6210) at /home/alan/display_server/mir3/src/server/graphics/nested/display_buffer.cpp:101
#6 0x00007ffff5e1f2a5 in mir::renderer::gl::CurrentRenderTarget::swap_buffers() (this=0x7fffe00008c8) at /home/alan/display_server/mir3/src/renderers/gl/renderer.cpp:75
#7 0x00007ffff5e1ff41 in mir::renderer::gl::Renderer::render(std::vector<std::shared_ptr<mir::graphics::Renderable>, std::allocator<std::shared_ptr<mir::graphics::Renderable> > > const&) const (this=0x7fffe00008c0, renderables=std::vector of length 2, capacity 2 = {...})
    at /home/alan/display_server/mir3/src/renderers/gl/renderer.cpp:215
#8 0x00007ffff5ca94a2 in mir::compositor::DefaultDisplayBufferCompositor::composite(std::vector<std::shared_ptr<mir::compositor::SceneElement>, std::allocator<std::shared_ptr<mir::compositor::SceneElement> > >&&) (this=0x7fffe046d9a0, scene_elements=<unknown type in /home/alan/display_server/mir3/cmake-build-debug/bin/../lib/libmirserver.so.44, CU 0x62223c, DIE 0x629e56>)
    at /home/alan/display_server/mir3/src/server/compositor/default_display_buffer_compositor.cpp:84

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Stack trace with added symbols

#0 0x00007ffff70892b0 in __GI_exit (status=1) at exit.c:105
#1 0x00007fffeebcd831 in () at /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#2 0x00007fffeebdc97d in () at /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#3 0x00007ffff42ce749 in dri2_swap_buffers (drv=<optimised out>, disp=<optimised out>, draw=0x555555cc31c0) at ../../../src/egl/drivers/dri2/platform_mir.c:466
#4 0x00007ffff42bdc9f in eglSwapBuffers (dpy=0x555555a239e0, surface=<optimised out>)
    at ../../../src/egl/main/eglapi.c:1213
#5 0x00007ffff5df26d6 in mir::graphics::nested::detail::DisplayBuffer::swap_buffers() (this=0x555555cb6560) at /home/alan/display_server/mir3/src/server/graphics/nested/display_buffer.cpp:100
#6 0x00007ffff5e1f225 in mir::renderer::gl::CurrentRenderTarget::swap_buffers() (this=0x7fffe40008c8) at /home/alan/display_server/mir3/src/renderers/gl/renderer.cpp:75
#7 0x00007ffff5e1fec1 in mir::renderer::gl::Renderer::render(std::vector<std::shared_ptr<mir::graphics::Renderable>, std::allocator<std::shared_ptr<mir::graphics::Renderable> > > const&) const (this=0x7fffe40008c0, renderables=std::vector of length 2, capacity 2 = {...})
    at /home/alan/display_server/mir3/src/renderers/gl/renderer.cpp:215
#8 0x00007ffff5ca9462 in mir::compositor::DefaultDisplayBufferCompositor::composite(std::vector<std::shared_ptr<mir::compositor::SceneElement>, std::allocator<std::shared_ptr<mir::compositor::SceneElement> > >&&) (this=0x7fffe4058290, scene_elements=<unknown type in /home/alan/display_server/mir3/cmake-build-debug/bin/../lib/libmirserver.so.44, CU 0x62223c, DIE 0x629e56>)
    at /home/alan/display_server/mir3/src/server/compositor/default_display_buffer_compositor.cpp:84
#9 0x00007ffff5cb0c6a in mir::compositor::CompositingFunctor::operator()() (this=0x555555ddfa40) at /home/alan/display_server/mir3/src/server/compositor/multi_threaded_compositor.cpp:141

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

What mesa version does your zesty have?

I wonder if zesty needs a fix for Mir clients like artful did recently...

mesa (17.1.2-2ubuntu2) artful; urgency=medium

  * egl-platform-mir.patch
  * egl-platform-rs.patch
    - Fix configure.ac so that the Mir EGL platforms are actually
      built. (LP: #1526658) (LP: #1696797)

 -- Christopher James Halse Rogers <email address hidden> Fri, 16 Jun 2017 17:51:50 +1000

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

1. The recent patch to artful fixes an unrelated bug.
2. As mentioned in comment #3 artful also breaks (differently)
3. The Mesa version in zesty is 17.0.3-1ubuntu1

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Having built my own version of mesa, I can see where the problem happens:

    do_flush_locked() in src/mesa/drivers/dri/i965/intel_batchbuffer.c

The path being followed is to call:

   if (brw->has_llc) {
      drm_intel_bo_unmap(batch->bo);

followed by:

   if (!brw->screen->no_hw) {
      ...
      if (brw->hw_ctx == NULL || batch->ring != RENDER_RING) {
         ...
      } else {
         ret = drm_intel_gem_bo_context_exec(batch->bo, brw->hw_ctx,
                                                 4 * USED_BATCH(*batch), flags);

It is the latter call that fails. Sadly, I don't know the expected semantics of these calls and my Google-fu fails me.
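
The message text itself is a clue. "No such file or directory" is the standard strerror() text for ENOENT. A minimal sketch, assuming the driver formats the failure from the negated return value (libdrm calls conventionally return a negative errno); describe_flush_failure is a hypothetical helper for illustration, not a Mesa function:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turn a libdrm-style return value (negative errno)
// into a message shaped like the one seen in the log.
std::string describe_flush_failure(int ret)
{
    return std::string("intel_do_flush_locked failed: ") + std::strerror(-ret);
}
```

Under that assumption the failing call returned -ENOENT, which has nothing to do with files here. What the kernel failed to look up is not established by this trace.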

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

INTEL_DEBUG=aub,buf gives the following log:

bo_unreference final: 9 (batchbuffer)
bo_create: buf 27 (batchbuffer) 32768b
drm_intel_gem_bo_purge_vma_cache: cached=5, open=9, limit=-1
bo_map: 27 (batchbuffer) -> 0x7fa3ab145000
Resized to 457x443
Resized to 454x441
bo_unreference final: 34 (prime)
drm_intel_gem_bo_purge_vma_cache: cached=6, open=8, limit=-1
drm_intel_gem_bo_purge_vma_cache: cached=7, open=7, limit=-1
 0: 10 (pipe_control workaround)
 1: 11 (program cache)
 2: 37 (prime)
 3: 33 (streamed data)
 4: 36 (miptree)
 5: 28 (batchbuffer)@0x00000000 0000001c -> 10 (pipe_control workaround)@0x00000000 7dc53000 + 0x00000004
 5: 28 (batchbuffer)@0x00000000 00000030 -> 28 (batchbuffer)@0x00000000 6e011000 + 0x00000001
 5: 28 (batchbuffer)@0x00000000 00000034 -> 28 (batchbuffer)@0x00000000 6e011000 + 0x00000001
 5: 28 (batchbuffer)@0x00000000 0000003c -> 11 (program cache)@0x00000000 0e232000 + 0x00000001
 5: 28 (batchbuffer)@0x00000000 000000a4 -> 28 (batchbuffer)@0x00000000 6e011000 + 0x00007fc0
 5: 28 (batchbuffer)@0x00000000 000000a8 -> 28 (batchbuffer)@0x00000000 6e011000 + 0x00007fe3
 5: 28 (batchbuffer)@0x00000000 000000b4 -> 28 (batchbuffer)@0x00000000 6e011000 + 0x00007fa0
 5: 28 (batchbuffer)@0x00000000 000000b8 -> 28 (batchbuffer)@0x00000000 6e011000 + 0x00007fbf
 5: 28 (batchbuffer)@0x00000000 00007e84 -> 37 (prime)@0x00000000 00000000 + 0x00000000
 5: 28 (batchbuffer)@0x00000000 00000290 -> 10 (pipe_control workaround)@0x00000000 7dc53000 + 0x00000004
 5: 28 (batchbuffer)@0x00000000 000002cc -> 10 (pipe_control workaround)@0x00000000 7dc53000 + 0x00000004
 5: 28 (batchbuffer)@0x00000000 00007c44 -> 37 (prime)@0x00000000 00000000 + 0x00000000
 5: 28 (batchbuffer)@0x00000000 00007c24 -> 1 (image)@0x00000000 00000000 + 0x00000000
 5: 28 (batchbuffer)@0x00000000 000004e0 -> 33 (streamed data)@0x00000000 00000000 + 0x00000000
 5: 28 (batchbuffer)@0x00000000 000004e4 -> 33 (streamed data)@0x00000000 00000000 + 0x00000051
 5: 28 (batchbuffer)@0x00000000 00000534 -> 10 (pipe_control workaround)@0x00000000 7dc53000 + 0x00000004
 5: 28 (batchbuffer)@0x00000000 00007a24 -> 36 (miptree)@0x00000000 6e001000 + 0x00000000
 5: 28 (batchbuffer)@0x00000000 0000069c -> 33 (streamed data)@0x00000000 00000000 + 0x00000050
 5: 28 (batchbuffer)@0x00000000 000006a0 -> 33 (streamed data)@0x00000000 00000000 + 0x000000a1
intel_do_flush_locked failed: No such file or directory

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

OK, the problem was introduced in...

revno: 4182 [merge]
author: Kevin DuBois <email address hidden>, Alan Griffiths <email address hidden>
committer: Tarmac
branch nick: development-branch
timestamp: Wed 2017-05-31 17:56:19 +0000
message:
  prepare the nested platform for providing separate display and rendering hooks. Now the nested platform loads the appropriate /rendering/ platform instead of the 'guest' platform.

*BUT NOTE*

This also made EGL on Mir on Mir impossible, so you need to merge -c 4195 before you can actually run the test scenario.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

...and the problem can seemingly be avoided by hacking src/server/graphics/nested/platform.cpp to ignore the size of the buffer:

    bool passthrough_candidate(mir::geometry::Size /*size*/, mg::BufferUsage usage)
    {
        return connection->supports_passthrough(usage) /*&&
            (size.width >= mir::geometry::Width{480}) && (size.height >= mir::geometry::Height{480})*/;
    }

(And no, I don't know what those magic numbers are for.)
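
For reference, a self-contained sketch of the predicate without the hack. Size and the host_supports_passthrough flag are stand-ins (assumptions) for mir::geometry::Size and connection->supports_passthrough(usage); only the 480 threshold comes from the code above.

```cpp
// Stand-in for mir::geometry::Size (assumption, for illustration only).
struct Size
{
    int width;
    int height;
};

// Shape of the unhacked check: passthrough is only considered when the host
// supports it and the buffer is at least 480 pixels in both dimensions.
bool passthrough_candidate(Size size, bool host_supports_passthrough)
{
    return host_supports_passthrough && size.width >= 480 && size.height >= 480;
}
```

With this shape, a window shrunk to 457x443 (the size logged in comment #9 shortly before the failure) stops being a passthrough candidate, so resizing across the threshold changes which allocation path a buffer takes. That is consistent with the hack avoiding the problem, but it does not by itself explain the failure.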

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

One final observation before EOD:

if I hack mgn::NestedBufferPlatform::create_buffer_allocator() to only ever use rendering_platform->create_buffer_allocator(); then on starting an EGL client the server immediately hits the "intel_do_flush_locked failed: No such file or directory" error.

So I suspect that the mgm::BufferAllocator simply doesn't work in this configuration.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Re comment #11, ignoring the size of the buffer and always allowing passthrough sounds like a good idea. At least superficially. That should always be faster and use fewer resources.

So we might want to find out why that 480 threshold exists at all. I would hazard a guess it is 480 because the logic did/does relate to fullscreen passthrough only at some stage. So if you're only handling fullscreen surfaces it makes sense to not bother with the overhead of allocating smaller surfaces as scanout-able. However if the logic was on its way to being generic and not for fullscreen only then I really don't know why the 480 threshold should still be there.

Revision history for this message
Mir CI Bot (mir-ci-bot) wrote :

Fix committed into lp:mir at revision None, scheduled for release in mir, milestone 0.27.0

Changed in mir:
status: In Progress → Fix Committed
Changed in mir:
status: Fix Committed → Fix Released