Severe graphical corruption (mostly horizontal streaks/lines) running software clients (including Xmir) on android
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Pocket Desktop | Fix Released | Critical | kevin gunn | |
| Mir | Fix Released | Critical | Kevin DuBois | 0.18.0 |
| mir (Ubuntu) | Fix Released | Undecided | Unassigned | |
| xorg-server (Ubuntu) | Invalid | Critical | Unassigned | |
Bug Description
I'm seeing severe graphical corruption running mir_demo_
Start a few instances of the client (or just one flicker and some other clients) and you'll see the mir_demo_
I don't think it's an overlays issue. The problem occurs even without overlays.
Related branches
- Mir development team: Pending requested 2015-11-13
- Diff: 393 lines (+125/-9), 19 files modified
debian/changelog (+7/-2)
include/renderers/gl/mir/renderer/gl/texture_source.h (+1/-0)
src/client/buffer_stream.cpp (+3/-2)
src/common/CMakeLists.txt (+1/-0)
src/common/graphics/android/android_native_buffer.cpp (+46/-0)
src/gl/recently_used_cache.cpp (+9/-0)
src/include/common/mir/graphics/android/android_native_buffer.h (+25/-0)
src/include/common/mir/graphics/android/native_buffer.h (+3/-0)
src/platforms/android/server/buffer.cpp (+7/-3)
src/platforms/android/server/buffer.h (+1/-0)
src/platforms/android/server/ipc_operations.cpp (+2/-0)
src/server/compositor/multi_threaded_compositor.cpp (+5/-1)
tests/acceptance-tests/test_latency.cpp (+1/-1)
tests/include/mir/test/doubles/mock_android_native_buffer.h (+3/-0)
tests/include/mir/test/doubles/stub_android_native_buffer.h (+3/-0)
tests/mir_test_doubles/mock_egl.cpp (+2/-0)
tests/unit-tests/client/android/test_gralloc_registrar.cpp (+2/-0)
tests/unit-tests/graphics/android/test_android_alloc_adaptor.cpp (+2/-0)
tests/unit-tests/graphics/android/test_native_buffer.cpp (+2/-0)
- PS Jenkins bot: Approve (continuous-integration) on 2015-11-20
- Alberto Aguirre: Approve on 2015-11-18
- Diff: 69 lines (+34/-3), 3 files modified
include/client/mir_toolkit/mir_buffer_stream.h (+4/-2)
src/client/buffer_stream.cpp (+2/-1)
tests/unit-tests/client/test_client_buffer_stream.cpp (+28/-0)
| tags: | added: nexus4 |
| Daniel van Vugt (vanvugt) wrote : Re: Severe graphical corruption running software clients on android | #1 |
| summary: |
- Severe graphical corruption running mir_demo_client_flicker on mako
+ Severe graphical corruption running software clients on android |
| tags: |
added: android xmir removed: mako nexus4 |
| Changed in xorg-server (Ubuntu): | |
| status: | New → Triaged |
| importance: | Undecided → High |
| Changed in mir: | |
| importance: | Medium → High |
| summary: |
- Severe graphical corruption running software clients on android
+ Severe graphical corruption running software clients (including Xmir) on android |
| Daniel van Vugt (vanvugt) wrote : Re: Severe graphical corruption running software clients (including Xmir) on android | #2 |
I recall Alberto fixed a corruption bug just like this using magic HWC flags on the affected buffer. Not sure where that change went or if we still have it.
| Daniel van Vugt (vanvugt) wrote : | #3 |
Alberto's previous fix is here:
int mga::AndroidAll
{
switch (usage)
{
case mga::BufferUsag
return (GRALLOC_
case mga::BufferUsag
return (GRALLOC_
case mga::BufferUsag
return (GRALLOC_
default:
return -1;
}
}
| Daniel van Vugt (vanvugt) wrote : | #4 |
Is it safe to mix HW and SW flags?
| Alberto Aguirre (albaguirre) wrote : | #5 |
@Daniel,
Yeah, those flags are used by the implementations to know when they need to do cache coherency operations (flush, invalidate, etc).
In android, software buffers CAN be used as backing to textures for opengl and can be passed directly to Hardware Composer HAL (HW Composer), hence they are marked as such.
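The point about mixing flags can be sketched as a bitmask check. This is only an illustration: the flag names and values below are hypothetical stand-ins rather than the real gralloc constants, and `needs_flush_on_unlock` is an assumed helper, not a function from Mir or Android. It shows why a software-rendered buffer carries both SW and HW bits: the combination is what tells an implementation that CPU writes have to be made visible to a hardware reader.

```cpp
#include <cstdint>

// Illustrative stand-ins for gralloc usage bits. The names and values are
// hypothetical and are not copied from the Android headers.
enum Usage : std::uint32_t
{
    usage_sw_write    = 1u << 0,  // CPU writes pixels into the buffer
    usage_hw_texture  = 1u << 1,  // GPU samples the buffer as a texture
    usage_hw_composer = 1u << 2   // display hardware reads the buffer directly
};

// A software-rendered buffer is written by the CPU but read by hardware,
// so it carries both kinds of bits at once.
constexpr std::uint32_t software_buffer_usage()
{
    return usage_sw_write | usage_hw_texture | usage_hw_composer;
}

// An implementation can inspect the mask to decide whether CPU writes must be
// flushed before a hardware reader touches the memory.
constexpr bool needs_flush_on_unlock(std::uint32_t usage)
{
    return (usage & usage_sw_write) != 0 &&
           (usage & (usage_hw_texture | usage_hw_composer)) != 0;
}
```

A buffer marked with only one side of the pair (CPU-only or hardware-only) would not need the coherency step in this model, which is why dropping the SW bits from a software buffer would hide the need for a flush.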
| Daniel van Vugt (vanvugt) wrote : | #6 |
I'm fairly confident the problem is entirely in Mir. Not in Xmir. Especially given it's easy to reproduce with mir-demos.
| Changed in xorg-server (Ubuntu): | |
| status: | Triaged → Invalid |
| Daniel van Vugt (vanvugt) wrote : | #7 |
OK, I think this might be the problem. The client API is releasing its shared pointer to the software buffer region prematurely (before mir_buffer_
void mir_buffer_
MirBufferStream *buffer_stream,
MirGraphics
try
{
mcl:
auto secured_region = bs->secure_
region_
region_
region_
region_
region_
}
catch (std::exception const& ex)
{
MIR_
}
| Changed in mir: | |
| assignee: | nobody → Daniel van Vugt (vanvugt) |
| milestone: | none → 0.17.0 |
| status: | Triaged → In Progress |
| Daniel van Vugt (vanvugt) wrote : | #8 |
Hmm, no. BufferStream:
| Daniel van Vugt (vanvugt) wrote : | #9 |
Should we be concerned about this?...
src/platforms/
//TODO: we should make use of the android egl fence extension here to update the fence.
// if the extension is not available, we should pass out a token that the user
// will have to keep until the completion of the gl draw
| Changed in mir: | |
| assignee: | Daniel van Vugt (vanvugt) → nobody |
| milestone: | 0.17.0 → none |
| status: | In Progress → Triaged |
| summary: |
- Severe graphical corruption running software clients (including Xmir) on android
+ Severe graphical corruption (mostly horizontal streaks/lines) running software clients (including Xmir) on android |
| Daniel van Vugt (vanvugt) wrote : | #11 |
In case anyone is thinking it, switching Xmir to glamor to work around this bug is not a good idea. We default to glamor in Xmir on desktop and it's still too buggy (work in progress). The full list of issues for Xmir+glamor is here: https:/
| Changed in mir: | |
| assignee: | nobody → Kevin DuBois (kdub) |
| milestone: | none → 0.18.0 |
| status: | Triaged → In Progress |
| Kevin DuBois (kdub) wrote : | #12 |
re:
Should we be concerned about this?...
no, these comments are a bit stale
| Kevin DuBois (kdub) wrote : | #13 |
The problem looks like a cpu cache problem.
We call gralloc's unlock function, and digging through the gralloc code, it looks like the internal native buffer flag that should trigger a cache flush is behaving properly. Now seeing what I can find out in the kernel....
| Changed in mir: | |
| importance: | High → Critical |
| Changed in xorg-server (Ubuntu): | |
| importance: | High → Critical |
| Changed in canonical-pocket-desktop: | |
| assignee: | nobody → kevin gunn (kgunn72) |
| importance: | Undecided → Critical |
| status: | New → In Progress |
| Kevin DuBois (kdub) wrote : | #14 |
I've been able to trace the calls down into the kernel, and it does look like the kernel is flushing the cache at the appropriate time before sending the buffer back to the client.
| kevin gunn (kgunn72) wrote : | #15 |
Just discovered something while working on bug 1513815
Basically, I was running PD on an N4 and set this env variable
SAL_USE_
which according to bjoern "forces LO to ignore the gtk plugin and uses the old and trusty X11-only backend" ...figure this is true for all Xapps?
anyhow, i launched FF in full screen mode...no corruption, then connected mouse...i see corruption...moused to full screen, no corruption. but there's a black bar when no corruption, but the contents are all shown...
see video as it's easier to understand
https:/
| Kevin DuBois (kdub) wrote : | #16 |
It's seeming like the GPU-facing cache is the one that needs invalidation from the GL/gralloc driver, but it is not being flushed. This is a CPU/GPU coordination issue, and the GPU-based scenarios all work through the same cache and do not experience this problem. If we use overlays, this is not an issue, as the MDP/CPU cache sync seems to work. (Unfortunately, overlays are not supported on external displays.)
| Kevin DuBois (kdub) wrote : | #17 |
A (bad) workaround is to map the buffer server-side, and then use glTexImage2D (instead of glEGLImageTargetTexture2DOES).
| Kevin DuBois (kdub) wrote : | #18 |
External indication that adreno drivers have an issue flushing their texture caches: https:/
| Kevin DuBois (kdub) wrote : | #19 |
glFinish and other flushing-commands (eg, eglClientWaitSy
| tags: | added: nexus4 nexus7 |
| tags: | removed: nexus4 nexus7 |
| Kevin DuBois (kdub) wrote : | #20 |
The corruption appears on krillin too, so probably not a specific chipset problem.
| Kevin DuBois (kdub) wrote : | #21 |
After today's investigation, I think the cachelines that are appearing as corruption are not from lines that should have been invalidated, but are the flushed lines from the next frame. I.e., we're releasing the buffer too early, and the flushed lines from the backbuffer render are appearing as corruption in the previous frame. I've been able to use the sync extensions from EGL_KHR_fence_sync to stabilize CPU rendering and make sure that OES_EGL_
It might take a bit more work to come up with the proper patch that makes both android and mesa happy though.
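The "released too early" theory can be modelled without any GPU at all. The sketch below is a single-threaded analogy, not the patch from this bug: `DeferredGpu`, `wait_for_idle` and `composite` are hypothetical names, and the lazy work queue stands in for a GPU whose texture reads complete some time after they are issued. A real fix would wait on a sync object from EGL_KHR_fence_sync where this model calls `wait_for_idle()`.

```cpp
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, single-threaded model of a GPU that executes work lazily.
// Submitted reads do not run until wait_for_idle() is called, which plays
// the role of waiting on a fence.
class DeferredGpu
{
public:
    void submit(std::function<void()> work) { pending.push_back(std::move(work)); }

    void wait_for_idle()
    {
        while (!pending.empty())
        {
            auto work = std::move(pending.front());
            pending.pop_front();
            work();
        }
    }

private:
    std::deque<std::function<void()>> pending;
};

// Composites frame 1 from a shared buffer while the client renders frame 2
// into the same buffer. Returns the pixels the "GPU" actually sampled.
std::vector<int> composite(bool wait_before_release)
{
    std::vector<int> buffer(4, 1);   // frame 1: all ones
    std::vector<int> sampled;
    DeferredGpu gpu;

    gpu.submit([&] { sampled = buffer; });   // texture read is queued, not done

    if (wait_before_release)
        gpu.wait_for_idle();         // fence: the read finishes before release

    buffer.assign(4, 2);             // buffer released; client draws frame 2
    gpu.wait_for_idle();             // any still-pending read runs only now
    return sampled;
}
```

In this model `composite(false)` samples frame 2's pixels while compositing frame 1, and `composite(true)` samples frame 1 intact. The real corruption was partial (streaks of lines) rather than a whole wrong frame, but the ordering problem is the same.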
| Kevin DuBois (kdub) wrote : | #22 |
my spike branch that averts the problem is here, based on 0.17.1:
lp:~kdub/mir/0.17.1-fix-1406725
This branch doesn't have unit tests, and needs a bit more work on the new internal interfaces, so I'll probably have to rework the branch to get it to land in lp:mir.
| Kevin DuBois (kdub) wrote : | #23 |
Testing the EGLSyncFence extensions with xmir still has some corruption. Using these extensions stabilizes the mir_demo_
| Kevin DuBois (kdub) wrote : | #24 |
So, we have two issues in this bug.
1) The majority of the xmir corruption was caused by calling mir_buffer_
2) The mir_demo_
A quick fix for both issues is in here: lp:~kdub/mir/0.17.1-fix-1406725. I'll repropose tests+fixes to lp:mir, but this should get xmir going in the meantime.
2) is the more extensive fix, so I'll split that bug out into a different report
| PS Jenkins bot (ps-jenkins) wrote : | #25 |
Fix committed into lp:mir at revision None, scheduled for release in mir, milestone 0.18.0
| Changed in mir: | |
| status: | In Progress → Fix Committed |
| Changed in canonical-pocket-desktop: | |
| status: | In Progress → Fix Committed |
| Launchpad Janitor (janitor) wrote : | #26 |
This bug was fixed in the package mir - 0.18.0+
---------------
mir (0.18.0+
[ Kevin DuBois ]
* New upstream release 0.18.0 (https:/
- ABI summary: Only servers need rebuilding;
. Mirclient ABI unchanged at 9
. Mirserver ABI bumped to 36
. Mircommon ABI unchanged at 5
. Mirplatform ABI unchanged at 11
. Mirprotobuf ABI unchanged at 3
. Mirplatformgraphics ABI bumped to 7
. Mirclientplatform ABI unchanged at 3
. Mirinputplatform ABI added. Current version is 4
- Enhancements:
. Use libinput by default, and remove the android input stack
. Add x11 input probing
. Add alternative buffer swapping mechanism internally, available with
--nbuffers 0
. Automatic searching and selection of input platforms
. Better support for themed cursors
. Add demo client that uses multiple buffer streams in one surface
. Improve fingerpaint demo to use touch pressure
. Allow for configuring cursor acceleration, scroll speed and left or
right handed mice
. Allow for setting a base display configuration via client api
. Various nested server multimonitor fixes and stability improvements
. Remove DepthId from the SurfaceStack
- Bug fixes:
. Unit test failures in Display.* on Android (LP: #1519276)
. Build failure due to missing dependency of client rpc code on mir
protobuf (LP: #1518372)
. Test failure in
(LP: #1517990)
. CI test failures in various NesterServer tests (LP: #1517781)
. FTBFS with -DMIR_PLATFORM=
. Nesting Mir servers with assorted display configs causes lockup
(LP: #1516670)
. [testsfail] RaiseSurfaces.
(LP: #1515931)
. CI test failures in GLMark2Test (LP: #1515660)
. Shells that inject user input events need to agree with the system
compositor on the clock to use (LP: #1515515)
. mircookie-dev is missing nettle-dev dependency (LP: #1514391)
. Segmentation fault on server shutdown with mesa-kms (LP: #1513901)
. mircookie requires nettle but libmircookie-dev doesn't depend on it
(LP: #1513792)
. libmircookie1 package does not list libnettle as dependency
(LP: #1513225)
. display configuration not reset when application exits (LP: #1511798)
. unplugging external monitor causes nested server to throttle client
(LP: #1511723)
. 1/2 screen on external monitor (LP: #1511538)
. unity-system-
(LP: #1511095)
. [regression] arm64/powerpc cross compile doesn't build any more
(LP: #1510778)
. mir_connection_
RTLD_LAZY (LP: #1510218)
. [multimonitor] nested server surface positioning incorrect
(LP: #1506846)
. unity-system-
...
| Changed in mir (Ubuntu): | |
| status: | New → Fix Released |
| Changed in mir: | |
| status: | Fix Committed → Fix Released |
| Changed in canonical-pocket-desktop: | |
| status: | Fix Committed → Fix Released |

This bug is now hurting Xmir on phones.