Mir

Severe graphical corruption (mostly horizontal streaks/lines) running software clients (including Xmir) on android

Bug #1406725 reported by Daniel van Vugt
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Canonical Pocket Desktop
Fix Released
Critical
kevin gunn
Mir
Fix Released
Critical
Kevin DuBois
mir (Ubuntu)
Fix Released
Undecided
Unassigned
xorg-server (Ubuntu)
Invalid
Critical
Unassigned

Bug Description

I'm seeing severe graphical corruption running mir_demo_client_flicker on mako.

Start a few instances of the client (or just one flicker and some other clients) and you'll see the mir_demo_client_flicker window contains significant corruption.

I don't think it's an overlays issue. The problem occurs even without overlays.

Tags: xmir android

Related branches

tags: added: nexus4
Revision history for this message
Daniel van Vugt (vanvugt) wrote : Re: Severe graphical corruption running software clients on android

This bug is now hurting Xmir on phones.

summary: - Severe graphical corruption running mir_demo_client_flicker on mako
+ Severe graphical corruption running software clients on android
tags: added: android xmir
removed: mako nexus4
Changed in xorg-server (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in mir:
importance: Medium → High
summary: - Severe graphical corruption running software clients on android
+ Severe graphical corruption running software clients (including Xmir) on
+ android
Revision history for this message
Daniel van Vugt (vanvugt) wrote : Re: Severe graphical corruption running software clients (including Xmir) on android

I recall Alberto fixed a corruption bug just like this using magic HWC flags on the affected buffer. Not sure where that change went or if we still have it.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Alberto's previous fix is here:

int mga::AndroidAllocAdaptor::convert_to_android_usage(BufferUsage usage)
{
    switch (usage)
    {
    case mga::BufferUsage::use_hardware:
        return (GRALLOC_USAGE_HW_TEXTURE | GRALLOC_USAGE_HW_RENDER);
    case mga::BufferUsage::use_framebuffer_gles:
        return (GRALLOC_USAGE_HW_RENDER | GRALLOC_USAGE_HW_COMPOSER | GRALLOC_USAGE_HW_FB);
    case mga::BufferUsage::use_software:
        return (GRALLOC_USAGE_SW_WRITE_OFTEN | GRALLOC_USAGE_SW_READ_OFTEN | GRALLOC_USAGE_HW_COMPOSER | GRALLOC_USAGE_HW_TEXTURE);
    default:
        return -1;
    }
}

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Is it safe to mix HW and SW flags?

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

@Daniel,

Yeah, those flags are used by the implementations to know when they need to do cache coherency operations (flush, invalidate, etc).

In android, software buffers CAN be used as backing to textures for opengl and can be passed directly to Hardware Composer HAL (HW Composer), hence they are marked as such.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I'm fairly confident the problem is entirely in Mir. Not in Xmir. Especially given it's easy to reproduce with mir-demos.

Changed in xorg-server (Ubuntu):
status: Triaged → Invalid
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, I think this might be the problem. The client API is releasing its shared pointer the the software buffer region prematurely (before mir_buffer_stream_get_graphics_region even returns). This means we are calling Android's gralloc->unlock() way before the client has even started filling the buffer!

void mir_buffer_stream_get_graphics_region(
    MirBufferStream *buffer_stream,
    MirGraphicsRegion *region_out)
try
{
    mcl::ClientBufferStream *bs = reinterpret_cast<mcl::ClientBufferStream*>(buffer_stream);

    auto secured_region = bs->secure_for_cpu_write();
    region_out->width = secured_region->width.as_uint32_t();
    region_out->height = secured_region->height.as_uint32_t();
    region_out->stride = secured_region->stride.as_uint32_t();
    region_out->pixel_format = secured_region->format;
    region_out->vaddr = secured_region->vaddr.get();
}
catch (std::exception const& ex)
{
    MIR_LOG_UNCAUGHT_EXCEPTION(ex);
}

Changed in mir:
assignee: nobody → Daniel van Vugt (vanvugt)
milestone: none → 0.17.0
status: Triaged → In Progress
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Hmm, no. BufferStream::secured_region seems to hold it for us.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Should we be concerned about this?...

src/platforms/android/server/buffer.cpp:
    //TODO: we should make use of the android egl fence extension here to update the fence.
    // if the extension is not available, we should pass out a token that the user
    // will have to keep until the completion of the gl draw

Changed in mir:
assignee: Daniel van Vugt (vanvugt) → nobody
milestone: 0.17.0 → none
status: In Progress → Triaged
summary: - Severe graphical corruption running software clients (including Xmir) on
- android
+ Severe graphical corruption (mostly horizontal streaks/lines) running
+ software clients (including Xmir) on android
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

In case anyone is thinking it, switching Xmir to glamor to work around this bug is not a good idea. We default to glamor in Xmir on desktop and it's still too buggy (work in progress). The full list of issues for Xmir+glamor is here: https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bugs?field.tag=xmir

Kevin DuBois (kdub)
Changed in mir:
assignee: nobody → Kevin DuBois (kdub)
milestone: none → 0.18.0
status: Triaged → In Progress
Revision history for this message
Kevin DuBois (kdub) wrote :

re:
Should we be concerned about this?...
no, these comments are a bit stale

Revision history for this message
Kevin DuBois (kdub) wrote :

The problem looks like a cpu cache problem.
We call gralloc's unlock function, and digging through the gralloc code, it looks like the internal native buffer flag that should trigger a cache flush is behaving properly. Now seeing what I can find out in the kernel....

kevin gunn (kgunn72)
Changed in mir:
importance: High → Critical
Changed in xorg-server (Ubuntu):
importance: High → Critical
Changed in canonical-pocket-desktop:
assignee: nobody → kevin gunn (kgunn72)
importance: Undecided → Critical
status: New → In Progress
Revision history for this message
Kevin DuBois (kdub) wrote :

I've been able to trace the calls down into the kernel, and it does look like the kernel is flushing the cache at the appropriate time before sending the buffer back to the client.

Revision history for this message
kevin gunn (kgunn72) wrote :

Just discovered something while working on bug 1513815

Basically, i was running pd on N4 set this env variable
SAL_USE_VCLPLUGIN=gen
which according to bjoern "forces LO to ignore the gtk plugin and uses the old and trusty X11-only backend" ...figure this is true for all Xapps?
anyhow, i launched FF in full screen mode...no corruption, then connected mouse...i see corruption...moused to full screen, no corruption. but there's a black bar when no corruption, but the contents are all shown...
see video as it's easier to understand
https://www.youtube.com/watch?v=PHbyv25Ekx0

Revision history for this message
Kevin DuBois (kdub) wrote :

Its seeming like the gpu-facing cache is the one that needs invalidation from the GL/gralloc driver, but is not being flushed. This is a CPU/GPU coordination issue, and the GPU-based scenarios all work through the same cache and do not experience this problem. If we use overlays, this is not an issue, as the MDP/CPU cache sync seems to work. (Unfortunately, overlays are not supported on external displays)

Revision history for this message
Kevin DuBois (kdub) wrote :

A (bad) workaround is to map the buffer server-side, and then use glTexImage2D (instead of glTargetTexture2D)

Revision history for this message
Kevin DuBois (kdub) wrote :

External indication that adreno drivers have an issue flushing their texture caches: https://codereview.chromium.org1241433003/

Revision history for this message
Kevin DuBois (kdub) wrote :

glFinish and other flushing-commands (eg, eglClientWaitSyncKHR with the flush bit) seem to be other alternatives that work, but those aren't very good alternatives either.

Kevin DuBois (kdub)
tags: added: nexus4 nexus7
tags: removed: nexus4 nexus7
Revision history for this message
Kevin DuBois (kdub) wrote :

The corruption appears on krillin too, so probably not a specific chipset problem.

Revision history for this message
Kevin DuBois (kdub) wrote :

After today's investigation, I think the cachelines that are appearing as corruption are not from an lines that should have been invalidated, but are the flushed lines from the next frame. IE, we're releasing the buffer too early, and the flushed lines from the backbuffer render are appearing as corruption in the previous frame. I've been able to use the sync extensions from EGL_KHR_fence_sync to stabilize cpu rendering and make sure that OES_EGL_image_external is synchronized. Will continue testing before saying its figured out, but seems to be improved.

It might take a bit more work to come up with the proper patch that makes both android and mesa happy though.

Revision history for this message
Kevin DuBois (kdub) wrote :

my spike branch that averts the problem is here, based on 0.17.1:
lp:~kdub/mir/0.17.1-fix-1406725
This branch doesnt have unit tests, and needs a bit more work on the new internal interfaces, so I'll probably have to rework the branch to get it to land in lp:mir.

Revision history for this message
Kevin DuBois (kdub) wrote :

Testing the EGLSyncFence extensions with xmir still has some corruption. Using these extensions stabilizes the mir_demo_client_flicker rendering. So, its looking like there's two problems going on here.

Revision history for this message
Kevin DuBois (kdub) wrote :

So, we have two issues in this bug.

1) The majority of the xmir corruption was caused by calling mir_buffer_stream_get_graphics_region() when xmir wanted to know the buffer size. On android, this would call down into gralloc, and invalidate/flush the cache when called. Its reasonable to just call into gralloc on once per swapbuffers.

2) The mir_demo_client_flicker corruption was caused by releasing the buffer resource a bit too early on the server side. This could occur in xmir as well, but seems to be rare for xmir's use case. This issue is corrected by using eglSyncFenceKHR extensions to ensure that the texture is uploaded to the gpu before releasing the buffer back to the server. (This is the more extensive

A quick fix for both issues is in here: lp:~kdub/mir/0.17.1-fix-1406725. I'll repropose tests+fixes to lp:mir, but this should get xmir going in the meantime.

2) is the more extensive fix, so I'll split that bug out into a different report

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

Fix committed into lp:mir at revision None, scheduled for release in mir, milestone 0.18.0

Changed in mir:
status: In Progress → Fix Committed
kevin gunn (kgunn72)
Changed in canonical-pocket-desktop:
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (6.3 KiB)

This bug was fixed in the package mir - 0.18.0+16.04.20151216.1-0ubuntu1

---------------
mir (0.18.0+16.04.20151216.1-0ubuntu1) xenial; urgency=medium

  [ Kevin DuBois ]
  * New upstream release 0.18.0 (https://launchpad.net/mir/+milestone/0.18.0)
    - ABI summary: Only servers need rebuilding;
      . Mirclient ABI unchanged at 9
      . Mirserver ABI bumped to 36
      . Mircommon ABI unchanged at 5
      . Mirplatform ABI unchanged at 11
      . Mirprotobuf ABI unchanged at 3
      . Mirplatformgraphics ABI bumped to 7
      . Mirclientplatform ABI unchanged at 3
      . Mirinputplatform ABI added. Current version is 4
    - Enhancements:
      . Use libinput by default, and remove the android input stack
      . Add x11 input probing
      . Add alternative buffer swapping mechanism internally, available with
        --nbuffers 0
      . Automatic searching and selection of input platforms
      . Better support for themed cursors
      . Add demo client that uses multiple buffer streams in one surface
      . Improve fingerpaint demo to use touch pressure
      . Allow for configuring cursor acceleration, scroll speed and left or
        right handed mice
      . Allow for setting a base display configuration via client api
      . Various nested server multimonitor fixes and stability improvements
      . Remove DepthId from the SurfaceStack
    - Bug fixes:
      . Unit test failures in Display.* on Android (LP: #1519276)
      . Build failure due to missing dependency of client rpc code on mir
        protobuf (LP: #1518372)
      . Test failure in
        NestedServer.display_configuration_reset_when_application_exits
        (LP: #1517990)
      . CI test failures in various NesterServer tests (LP: #1517781)
      . FTBFS with -DMIR_PLATFORM=android (LP: #1517532)
      . Nesting Mir servers with assorted display configs causes lockup
        (LP: #1516670)
      . [testsfail] RaiseSurfaces.motion_events_dont_prevent_raise
        (LP: #1515931)
      . CI test failures in GLMark2Test (LP: #1515660)
      . Shells that inject user input events need to agree with the system
        compositor on the clock to use (LP: #1515515)
      . mircookie-dev is missing nettle-dev dependency (LP: #1514391)
      . Segmentation fault on server shutdown with mesa-kms (LP: #1513901)
      . mircookie requires nettle but libmircookie-dev doesn't depend on it
        (LP: #1513792)
      . libmircookie1 package does not list libnettle as dependency
        (LP: #1513225)
      . display configuration not reset when application exits (LP: #1511798)
      . unplugging external monitor causes nested server to throttle client
        (LP: #1511723)
      . 1/2 screen on external monitor (LP: #1511538)
      . unity-system-compositor crash, no interaction on windowed mode
        (LP: #1511095)
      . [regression] arm64/powerpc cross compile doesn't build any more
        (LP: #1510778)
      . mir_connection_get_egl_pixel_format() crashes if libEGL is loaded
        RTLD_LAZY (LP: #1510218)
      . [multimonitor] nested server surface positioning incorrect
        (LP: #1506846)
      . unity-system-compositor fails to build against lp:mir r3027
   ...

Read more...

Changed in mir (Ubuntu):
status: New → Fix Released
Kevin DuBois (kdub)
Changed in mir:
status: Fix Committed → Fix Released
Changed in canonical-pocket-desktop:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.