Mir

Backward frame jumps on mali (Nexus 10 & krillin)

Bug #1270245 reported by Kevin DuBois
This bug affects 3 people
Affects        Status         Importance   Assigned to    Milestone
Mir            Fix Released   High         Kevin DuBois
mir (Ubuntu)   Fix Released   High         Unassigned

Bug Description

Alexandros, Andreas, and I have each noticed frame skips on the Nexus 10. The frames appear to be displayed in a 1, 3, 2 order, particularly when unlocking the device.

Daniel van Vugt (vanvugt) wrote :

If the issue is in Mir then we should be able to reproduce it with mir_demo_server_shell, which has sleep/resume support also using the Android power button.

Daniel van Vugt (vanvugt) wrote :

Confirmed. I can see jitter when scrolling around Unity. It's running:
0.1.3+14.04.20140108-0ubuntu1

Changed in mir:
status: New → Confirmed
milestone: none → 0.1.5
Daniel van Vugt (vanvugt) wrote :

Reproduced on Nexus 10 with:
  mir_demo_server_shell
  mir_demo_client_egltriangle

It seems to happen when I drag the surface around (3 fingers).

Changed in mir:
status: Confirmed → Triaged
Daniel van Vugt (vanvugt) wrote :

The bug is even more pronounced with mir_demo_client_scroll. But it seems to go away if I disable the resize() calls in demo-shell. This seems to suggest the bug is related to resize().

I wonder if Unity8 is calling resize() often?

Daniel van Vugt (vanvugt) wrote :

Hmm, not just during resize. This will do it too:
    mir_demo_client_progressbar 100

Daniel van Vugt (vanvugt) wrote :

No issue on Nexus 4. Do we have any Nexus 10-only code paths?

Kevin DuBois (kdub) wrote :

Most drivers return their buffers to us using ANativeWindow::queueBuffer(), which means, "Here's the buffer back, it's been rendered to".

There is another hook, cancelBuffer(), that I've seen this class of drivers using, which means, "Here's the buffer back, but its contents are indeterminate".

We currently treat those two hooks as the same (which is a TODO). I'll see if I can correlate the driver calling that hook to the times when the jump is seen.
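
To make the distinction between the two hooks concrete, here is a minimal sketch of what treating them differently could look like. The names `ReturnKind`, `ReturnedBuffer`, `should_composite` and `presentable` are hypothetical and are not Mir or Android APIs; the only thing taken from the comment above is the meaning of each hook.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record of how the driver handed a buffer back.
enum class ReturnKind
{
    queued,     // came back via queueBuffer(): contents were rendered
    cancelled   // came back via cancelBuffer(): contents are indeterminate
};

struct ReturnedBuffer
{
    std::uintptr_t handle;  // opaque buffer identity, for illustration only
    ReturnKind kind;
};

// Only buffers returned through queueBuffer() have defined contents,
// so only those are candidates for being shown on screen.
inline bool should_composite(ReturnedBuffer const& buffer)
{
    return buffer.kind == ReturnKind::queued;
}

// Filters a sequence of returned buffers down to the ones worth presenting,
// preserving the order in which they were returned.
inline std::vector<std::uintptr_t> presentable(std::vector<ReturnedBuffer> const& returned)
{
    std::vector<std::uintptr_t> handles;
    for (auto const& buffer : returned)
        if (should_composite(buffer))
            handles.push_back(buffer.handle);
    return handles;
}
```

In this sketch a cancelled buffer would simply go back to the free pool without ever being composited; whether that matches the intended resolution of the TODO is an assumption.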

kevin gunn (kgunn72)
Changed in mir:
importance: Medium → Critical
Daniel van Vugt (vanvugt) wrote :

kdub: Bug 1270685 (which is also N10-specific) seems to have a possibly related call stack.

Daniel van Vugt (vanvugt) wrote :

kdub:
cancelBuffer is called once on client start-up. But not thereafter. It's certainly not being called when the glitches occur. So that's not the problem :(

Feel free to verify for yourself...

Daniel van Vugt (vanvugt) wrote :

The driver behaviour for N10 appears to be quite different to N4 or N7. The N10 is the only device that uses the fenced version of dequeueBuffer on the client, although the fence_fd is always -1 (no fence waiting). That might be significant(?)...

Nexus 4:
Server:
dequeueBufferAndWait 0x18fcae0
queueBuffer 0x18fcae0 46
dequeueBufferAndWait 0x18fccb8
queueBuffer 0x18fccb8 46
Client:
dequeueBufferAndWait 0x41f00d88
queueBuffer 0x41f00d88 8
dequeueBufferAndWait 0x41f00ba8
queueBuffer 0x41f00ba8 8

Nexus 7:
Server:
dequeueBufferAndWait 0x13db488
queueBuffer 0x13db488 -1
dequeueBufferAndWait 0x13db6f8
queueBuffer 0x13db6f8 -1
Client:
dequeueBufferAndWait 0x41100e90
queueBuffer 0x41100e90 -1
dequeueBufferAndWait 0x41100780
queueBuffer 0x41100780 -1

Nexus 10:
Server:
queueBuffer 0x17a6318 63
dequeueBuffer 0x17a6500 63
queueBuffer 0x17a6500 65
dequeueBuffer 0x17a6318 65
Client:
queueBuffer 0xb5e013e8 23
dequeueBuffer 0xb5e00ed8 -1
queueBuffer 0xb5e00ed8 23
dequeueBuffer 0xb5e015d8 -1

Daniel van Vugt (vanvugt) wrote :

It occurred to me that visible frame skips might not be a buffer ordering problem. It could just be the textures we're compositing that are wrong...

I can't tell if this is a fix or a workaround, but adding glFinish() after:
       egl_extensions->glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);
seems to solve the problem, as well as slow rendering down noticeably, sometimes :(

Changed in mir:
milestone: 0.1.5 → 0.1.6
Kevin DuBois (kdub) wrote :

As for adding glFinish(), I'm not sure what to make of that; it could just be that the slowdown from that command alters the timing pattern that's giving us problems. I don't see anything obvious in the trace; we should have fencing around the display posting implemented. A fence of -1 means the buffer is available for use immediately.

If we want to eliminate the fencing from the list of potential problem areas, we could look for tearing. Fencing problems generally result in tearing, so if we don't see tearing, maybe we could focus on making sure we have the right texture uploaded.

Kevin DuBois (kdub) wrote :

As for adding glFinish(), I'm not sure what to make of that; it could just be that the slowdown from that command alters the timing pattern that's giving us problems.

A fence of -1 means that the buffer is immediately usable, so that should be handled in the mir code. I don't see anything immediately obvious in the trace, that seems pretty normal.

If we're failing to wait on a fence, then I'd expect some tearing during the animation problem. So if we see tearing during the problem, then we could focus on looking at the fence handling, if not, we could focus on the texture/buffers we are using.

Kevin DuBois (kdub) wrote :

Sorry for the two differently phrased comments; I was having Launchpad problems. They are intended to have the same meaning.

Daniel van Vugt (vanvugt) wrote :

There's an interesting mention of the N10 driver being quite different here:
   https://code.launchpad.net/~kdub/mir/fix-1270685/+merge/205053

It could be related also.

tags: added: performance
Changed in mir:
milestone: 0.1.6 → 0.1.7
Changed in mir:
milestone: 0.1.7 → 0.1.8
Changed in mir:
milestone: 0.1.8 → 0.1.9
Alberto Aguirre (albaguirre) wrote :

I can't seem to duplicate this.

I'm using an R275 image (4.4.2 drivers) and the latest mir/devel branch, with:

mir_demo_server_shell
mir_demo_client_progressbar 100, or mir_demo_client_egltriangle, or mir_demo_client_scroll

I then modified mir_demo_server_shell to allow positioning a surface with one finger and moved the window around, but I don't see any 1, 3, 2 patterns.

Daniel van Vugt (vanvugt) wrote :

Usually the simplest test case is:
  1. Run mir_demo_client_egltriangle.
  2. Drag it around and resize with 3 fingers.

You will see the skipping happens mostly during the resizing.

Alberto Aguirre (albaguirre) wrote :

Ah ok, I see the frame skipping but no 1, 3, 2 patterns.

Changed in mir:
assignee: nobody → Alberto Aguirre (albaguirre)
status: Triaged → In Progress
Daniel van Vugt (vanvugt) wrote :

lool mentioned recently that he seems to think the issue is something related to "S3CFB_WIN_CONFIG":
https://android.googlesource.com/platform/hardware/samsung_slsi/exynos5/+/a9b5ce6b6c2276691325d25c1c3040dc6b701aed%5E1/libhwc/hwc.cpp

Alberto Aguirre (albaguirre) wrote :

I finally see the 1, 3, 2 pattern, which happens if you hold the surface with 3 fingers with very little movement.

It does seem related to the fence handling since if I wait for the fences inside mga::MirNativeWindow::queueBuffer and mga::MirNativeWindow::dequeueBuffer there's no more frame jumping or 1,3,2 patterns.

Changed in mir (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
Alberto Aguirre (albaguirre) wrote :

I added some code that just reallocates a buffer periodically in SwitchingBundle (basically simulating a resize but without any size changes), and it seems that introducing a new buffer is what causes the Nexus 10 driver to hiccup.

Using the eglcounter example with a 120 FPS camera, I see the following pattern every time a newly allocated buffer is returned to the client:

1, 2, 3, 1, 2, 3, 7 - instead of 1,2,3,4,5,6,7
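
A sequence like the one above, read off the recorded counter values, can be checked mechanically for backward jumps. The helper below is a small sketch with a hypothetical name (`backward_jumps`); it reports every position where the displayed frame number is lower than the one shown before it.

```cpp
#include <cstddef>
#include <vector>

// Returns the indices at which the observed frame number goes backwards,
// i.e. frames[i] < frames[i - 1]. Repeated or skipped frame numbers are
// not reported, only genuine out-of-order frames.
inline std::vector<std::size_t> backward_jumps(std::vector<int> const& frames)
{
    std::vector<std::size_t> jumps;
    for (std::size_t i = 1; i < frames.size(); ++i)
        if (frames[i] < frames[i - 1])
            jumps.push_back(i);
    return jumps;
}
```

For the observed sequence 1, 2, 3, 1, 2, 3, 7 this flags index 3 (the second "1"), while the expected sequence 1, 2, 3, 4, 5, 6, 7 produces no entries.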

Also it seems to happen only when rendering is close to display rate - if I lower the rendering rate of the client app, I don't see the issue.

Another curious fact is that if I keep reallocating buffers ALL the time, I will only see the out-of-order sequence the first time a new buffer is introduced then it just chugs along without any frame order issues after that.

This seems to allude to an issue with Mali (it does not guarantee flush order across contexts, which is what we would like to have in our case):
https://chromium.googlesource.com/chromium/src/gpu/+/master%5E/config/gpu_driver_bug_list_json.cc

Since I'm convinced this is a driver bug and there's no sane workaround I'm declaring this as "won't fix".

Changed in mir:
status: In Progress → Won't Fix
milestone: 0.1.9 → none
Daniel van Vugt (vanvugt) wrote :

The resizing case is just the easiest manual test case for the mir demos. I'm not sure resizing is involved at all when we see the same jumping during unity8 touch usage on Nexus 10.

Also, even if Unity8 was resizing more than it needs to, Mir will ignore duplicate resize requests at multiple levels. So the original bug would have nothing to do with resizing or buffer allocation. Because buffers won't get reallocated during Unity8 touch usage. Unless there's some non-obvious behaviour in Unity8 I haven't considered... ?

Alberto Aguirre (albaguirre) wrote :

Daniel, I think you're referring to this bug then:
https://bugs.launchpad.net/unity8/+bug/1282200

Which has already been addressed in unity.

Daniel van Vugt (vanvugt) wrote :

This bug can't be a duplicate of Unity8 bug 1282200 because this bug has simple steps to reproduce using Mir demos alone (no Unity).

Changed in mir (Ubuntu):
status: Triaged → Won't Fix
Alberto Aguirre (albaguirre) wrote :

Based on the original description it is indeed a duplicate.

What the demos. Re

Alberto Aguirre (albaguirre) wrote :

The demos show an entirely different bug. We should separate the two.

Daniel van Vugt (vanvugt) wrote :

Yep, I separated them an hour ago.

summary: - frame jumps on the nexus 10
+ frame jumps on mali (Nexus 10 & krillin)
Changed in mir:
status: Won't Fix → Confirmed
Changed in mir (Ubuntu):
status: Won't Fix → Confirmed
Changed in mir:
importance: Critical → High
Changed in mir (Ubuntu):
importance: Critical → High
Daniel van Vugt (vanvugt) wrote : Re: frame jumps on mali (Nexus 10 & krillin)

It's not immediately obvious to most people I guess. Doesn't happen much and you have to look very closely, but the bug seems to exist on krillin too.

Changed in mir:
assignee: Alberto Aguirre (albaguirre) → nobody
milestone: none → 0.8.0
Alexandros Frantzis (afrantzis) wrote :

Yes, this is easily reproducible on krillin with the instructions mentioned previously (mir_demo_server_shell + progressbar or egltriangle).

Daniel van Vugt (vanvugt) wrote :

Based on the above comments, maybe this is the same Mali bug that Google had to work around like in...
https://codereview.chromium.org/10636006

tags: added: krillin mali
summary: - frame jumps on mali (Nexus 10 & krillin)
+ Backward frame jumps on mali (Nexus 10 & krillin)
Daniel van Vugt (vanvugt) wrote :

See attached another (dangerous) hack that appears to solve the problem on krillin at least. Although it's plausibly not really a fix, but just slowing us down to the point of working around the problem.

Changed in mir:
milestone: 0.8.0 → 0.9.0
Kevin DuBois (kdub) wrote :

I haven't been able to reproduce this today. My working theory as to why is that either:
1) a new driver version has the skipping issue fixed, or
2) the threading changes to MultiThreadedCompositor in lp:mir rev1897 to fix lp: #1362841 have hidden the problem and it's still latent.

I guess the proper state for this bug should be incomplete then.

Changed in mir:
status: Confirmed → Incomplete
Changed in mir (Ubuntu):
status: Confirmed → Incomplete
Kevin DuBois (kdub) wrote :

Another thing to note about this bug is that it would only happen when we were using OpenGLES to render to the framebuffer. If we eliminated EGL and OpenGLES by using HWC overlay composition, the bug was also not seen, so this further supports the theory that the EGL/OpenGLES driver is where the problem is happening.

Daniel van Vugt (vanvugt) wrote :

Confirmed bug still present on krillin with the latest image and latest Mir code.

Changed in mir:
status: Incomplete → Triaged
status: Triaged → Confirmed
Changed in mir (Ubuntu):
status: Incomplete → Confirmed
Alberto Aguirre (albaguirre) wrote :

On krillin you can still see "frame jumps" or out-of-order frames with:

sudo stop lightdm
sudo mir_demo_server_basic --disable-overlays=true --launch-client mir_demo_client_egltriangle

Alberto Aguirre (albaguirre) wrote :

See backwards frame jumps at 0:04 and 0:24 marks in the attached video.

Kevin DuBois (kdub) wrote :

I was also able to reproduce the frame jump issue today, so my theories in comment #32 don't have a bearing on the bug.

Alberto Aguirre (albaguirre) wrote :

Not much info in the kernel log but here it is.

Alberto Aguirre (albaguirre) wrote :

Logcat log for krillin

Daniel van Vugt (vanvugt) wrote :

It's slightly suspicious that I can reproduce identical-looking glitches (even on desktop) by releasing compositor buffers prematurely. The same might be happening by accident in our android/fencing logic.

Changed in mir:
milestone: 0.9.0 → 0.10.0
Changed in mir:
milestone: 0.10.0 → 0.11.0
Changed in mir:
milestone: 0.11.0 → 0.12.0
Kevin DuBois (kdub)
Changed in mir:
milestone: 0.12.0 → 0.13.0
Changed in mir:
milestone: 0.13.0 → 0.14.0
tags: added: android
Changed in mir:
milestone: 0.14.0 → 0.15.0
Changed in mir:
milestone: 0.15.0 → 0.16.0
Daniel van Vugt (vanvugt) wrote :

If all we're doing is moving the milestone on this, we may as well un-target it. Less noise.

Changed in mir:
milestone: 0.16.0 → none
Kevin DuBois (kdub) wrote :

Started investigation into this bug again as part of investigation into https://bugs.launchpad.net/mir/+bug/1517205

Changed in mir:
assignee: nobody → Kevin DuBois (kdub)
milestone: none → 0.18.0
status: Confirmed → In Progress
Kevin DuBois (kdub) wrote :

I think the root cause here is that the Mali-based HWC implementations are not clearing the fences set in the FRAMEBUFFER_TARGET layer after a GL swap. I will come up with a quirk to compensate.
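
As a rough illustration of the kind of quirk described above, the sketch below resets a leftover acquire fence on the FRAMEBUFFER_TARGET layer so it cannot carry over into the next frame. Everything here is hypothetical: `Layer`, `LayerType` and `clear_stale_fb_target_fences` are stand-ins and not the real HWC structures or the code that landed in Mir, and the question of who closes the descriptor is deliberately left out.

```cpp
#include <vector>

// Stand-in for the fields of interest in an HWC layer; not the real HWC type.
enum class LayerType
{
    overlay,
    framebuffer_target
};

struct Layer
{
    LayerType type;
    int acquire_fence_fd;  // -1 means "no fence"
};

// Hypothetical quirk: after a GL swap and post, forget any acquire fence the
// driver left set on the FRAMEBUFFER_TARGET layer. Returns how many layers
// were reset. Does nothing when the quirk is disabled.
inline int clear_stale_fb_target_fences(std::vector<Layer>& layers, bool quirk_enabled)
{
    if (!quirk_enabled)
        return 0;

    int cleared = 0;
    for (auto& layer : layers)
    {
        if (layer.type == LayerType::framebuffer_target && layer.acquire_fence_fd != -1)
        {
            layer.acquire_fence_fd = -1;
            ++cleared;
        }
    }
    return cleared;
}
```

Gating the behaviour behind a flag keeps it limited to the affected Mali-based devices, on the assumption that other drivers already clear the fence themselves.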

Kevin DuBois (kdub)
Changed in mir:
milestone: 0.18.0 → 0.19.0
Kevin DuBois (kdub) wrote :

Retargeted for release in 0.18 (landed while waiting for other blocking bugs).

Changed in mir:
milestone: 0.19.0 → 0.18.0
Kevin DuBois (kdub)
Changed in mir:
status: In Progress → Fix Committed
tags: removed: performance
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mir - 0.18.0+16.04.20151216.1-0ubuntu1

---------------
mir (0.18.0+16.04.20151216.1-0ubuntu1) xenial; urgency=medium

  [ Kevin DuBois ]
  * New upstream release 0.18.0 (https://launchpad.net/mir/+milestone/0.18.0)
    - ABI summary: Only servers need rebuilding;
      . Mirclient ABI unchanged at 9
      . Mirserver ABI bumped to 36
      . Mircommon ABI unchanged at 5
      . Mirplatform ABI unchanged at 11
      . Mirprotobuf ABI unchanged at 3
      . Mirplatformgraphics ABI bumped to 7
      . Mirclientplatform ABI unchanged at 3
      . Mirinputplatform ABI added. Current version is 4
    - Enhancements:
      . Use libinput by default, and remove the android input stack
      . Add x11 input probing
      . Add alternative buffer swapping mechanism internally, available with
        --nbuffers 0
      . Automatic searching and selection of input platforms
      . Better support for themed cursors
      . Add demo client that uses multiple buffer streams in one surface
      . Improve fingerpaint demo to use touch pressure
      . Allow for configuring cursor acceleration, scroll speed and left or
        right handed mice
      . Allow for setting a base display configuration via client api
      . Various nested server multimonitor fixes and stability improvements
      . Remove DepthId from the SurfaceStack
    - Bug fixes:
      . Unit test failures in Display.* on Android (LP: #1519276)
      . Build failure due to missing dependency of client rpc code on mir
        protobuf (LP: #1518372)
      . Test failure in
        NestedServer.display_configuration_reset_when_application_exits
        (LP: #1517990)
      . CI test failures in various NesterServer tests (LP: #1517781)
      . FTBFS with -DMIR_PLATFORM=android (LP: #1517532)
      . Nesting Mir servers with assorted display configs causes lockup
        (LP: #1516670)
      . [testsfail] RaiseSurfaces.motion_events_dont_prevent_raise
        (LP: #1515931)
      . CI test failures in GLMark2Test (LP: #1515660)
      . Shells that inject user input events need to agree with the system
        compositor on the clock to use (LP: #1515515)
      . mircookie-dev is missing nettle-dev dependency (LP: #1514391)
      . Segmentation fault on server shutdown with mesa-kms (LP: #1513901)
      . mircookie requires nettle but libmircookie-dev doesn't depend on it
        (LP: #1513792)
      . libmircookie1 package does not list libnettle as dependency
        (LP: #1513225)
      . display configuration not reset when application exits (LP: #1511798)
      . unplugging external monitor causes nested server to throttle client
        (LP: #1511723)
      . 1/2 screen on external monitor (LP: #1511538)
      . unity-system-compositor crash, no interaction on windowed mode
        (LP: #1511095)
      . [regression] arm64/powerpc cross compile doesn't build any more
        (LP: #1510778)
      . mir_connection_get_egl_pixel_format() crashes if libEGL is loaded
        RTLD_LAZY (LP: #1510218)
      . [multimonitor] nested server surface positioning incorrect
        (LP: #1506846)
      . unity-system-compositor fails to build against lp:mir r3027
   ...


Changed in mir (Ubuntu):
status: Confirmed → Fix Released
Kevin DuBois (kdub)
Changed in mir:
status: Fix Committed → Fix Released