Mir server with QtUbuntu client can cause system crash or Android GPU driver stall (Qualcomm)

Bug #1566747 reported by Nick Dedekind
This bug affects 1 person
Affects            Status   Importance  Assigned to  Milestone
Mir                Expired  High        Unassigned
mir (Ubuntu)       Expired  High        Unassigned
qtubuntu (Ubuntu)  Invalid  High        Unassigned

Bug Description

While investigating new surface occlusion work in unity8, I noticed that resizing a client multiple times can either crash the system or leave a kworker process spinning at 100% CPU while it attempts to reset the GPU.
This only seems to occur on Qualcomm devices.

To Reproduce:
I've written an extension for the demo server which can be found @ https://code.launchpad.net/~nick-dedekind/mir/occlusion-test-failures

Server:
mir_demo_server --test-resize [--test-resize-timeout=200]

Client:
QT_QPA_PLATFORM=ubuntumirclient qmlscene Test.qml

-----------------------------
[Test.qml]
import QtQuick 2.0

Item {
   width: 700
   height: 1000
}
-----------------------------

Tags: nexus4 nexus7
Nick Dedekind (nick-dedekind) wrote :

From dmesg

[ 231.829742] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = C03AD000 pid = 3839
[ 231.829833] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| context = 0 FSR = 4001000A FSYNR0 = 9000008 FSYNR1 = 443542(read fault)
[ 231.829986] kgsl kgsl-3d0: ---- premature free ----
[ 231.830108] kgsl kgsl-3d0: [C0000000-C03C0000] (egl_surface) was already freed by pid 3839
[ 231.830230] kgsl kgsl-3d0: ---- nearby memory ----
[ 231.830322] kgsl kgsl-3d0: [C0303000 - C03AC000] (+guard) (pid = 3839) (egl_surface)
[ 231.830444] kgsl kgsl-3d0: <- fault @ C03AD000
[ 231.830505] kgsl kgsl-3d0: [C0400000 - C0702000] (+guard) (pid = 3839) (egl_surface)
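
As a quick sanity check on the log above, the sketch below parses the address ranges and tests where the fault address falls. It is an illustration only: the helper names are hypothetical, and treating the printed ranges as half-open [start, end) is an assumption about how kgsl reports them.

```python
import re

# Matches "[C0000000-C03C0000]" and "[C0303000 - C03AC000]". Kernel
# timestamps such as "[ 231.830108]" do not match (space after "[").
RANGE_RE = re.compile(r"\[([0-9A-Fa-f]+)\s*-\s*([0-9A-Fa-f]+)\]")


def parse_range(line):
    """Return (start, end) parsed from a kgsl range line, or None."""
    m = RANGE_RE.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


def contains(rng, addr):
    """Assumption: ranges are half-open, [start, end)."""
    start, end = rng
    return start <= addr < end


FAULT = 0xC03AD000  # from the "GPU PAGE FAULT: addr = C03AD000" line
freed = parse_range(
    "[ 231.830108] kgsl kgsl-3d0: [C0000000-C03C0000] (egl_surface) "
    "was already freed by pid 3839"
)
nearby = parse_range(
    "[ 231.830322] kgsl kgsl-3d0: [C0303000 - C03AC000] (+guard) "
    "(pid = 3839) (egl_surface)"
)

print(contains(freed, FAULT))   # is the fault inside the freed range?
print(contains(nearby, FAULT))  # is it inside the nearest listed range?
print(hex(FAULT - nearby[1]))   # distance past the end of that range
```

With the values from the log, the fault address falls inside the range reported as already freed, and 0x1000 bytes past the end of the nearest range listed under "nearby memory".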

Nick Dedekind (nick-dedekind) wrote :

cpu backtrace for 100% lockup

[ 1995.808514] [<c08f0240>] (__irq_svc+0x40/0x70) from [<c03e7b70>] (adreno_isidle+0x38/0xb8)
[ 1995.808545] [<c03e7b70>] (adreno_isidle+0x38/0xb8) from [<c03e8a34>] (adreno_idle+0x74/0xa4)
[ 1995.808545] [<c03e8a34>] (adreno_idle+0x74/0xa4) from [<c03d63dc>] (_ringbuffer_start_common+0x2a8/0x2c8)
[ 1995.808575] [<c03d63dc>] (_ringbuffer_start_common+0x2a8/0x2c8) from [<c03d64a8>] (adreno_ringbuffer_start+0x54/0x68)
[ 1995.808606] [<c03d64a8>] (adreno_ringbuffer_start+0x54/0x68) from [<c03e9ccc>] (adreno_soft_reset+0x130/0x158)
[ 1995.808606] [<c03e9ccc>] (adreno_soft_reset+0x130/0x158) from [<c03e9d1c>] (adreno_reset+0x28/0x118)
[ 1995.808637] [<c03e9d1c>] (adreno_reset+0x28/0x118) from [<c03d8c4c>] (dispatcher_do_fault+0x9cc/0xb48)
[ 1995.808667] [<c03d8c4c>] (dispatcher_do_fault+0x9cc/0xb48) from [<c03d91f8>] (adreno_dispatcher_work+0xb8/0x460)
[ 1995.808667] [<c03d91f8>] (adreno_dispatcher_work+0xb8/0x460) from [<c00904e8>] (process_one_work+0x280/0x414)
[ 1995.808698] [<c00904e8>] (process_one_work+0x280/0x414) from [<c0090d40>] (worker_thread+0x1a4/0x2c4)
[ 1995.808728] [<c0090d40>] (worker_thread+0x1a4/0x2c4) from [<c0095b00>] (kthread+0x94/0xa0)
[ 1995.808728] [<c0095b00>] (kthread+0x94/0xa0) from [<c000eefc>] (kernel_thread_exit+0x0/0x8)

Nick Dedekind (nick-dedekind) wrote :

logcat (this can take a while to happen)

W/Adreno-GSL( 5764): <gsl_ldd_control:405>: ioctl fd 13 code 0xc02c093d (unknown) failed: errno 35 Resource deadlock avoided
W/Adreno-GSL( 5764): <log_gpu_snapshot:320>: panel.gpuSnapshotPath is not set.not generating user snapshot
W/Adreno-ES20( 5764): <core_glFlushInternal:37>: GL_OUT_OF_MEMORY
W/Adreno-GSL( 5764): <gsl_ldd_control:405>: ioctl fd 13 code 0xc02c093d (unknown) failed: errno 35 Resource deadlock avoided
W/Adreno-GSL( 5764): <log_gpu_snapshot:320>: panel.gpuSnapshotPath is not set.not generating user snapshot
W/Adreno-EGL( 5764): <qeglDrvAPI_eglMakeCurrent:2900>: EGL_CONTEXT_LOST
E/libEGL ( 5764): eglMakeCurrent:775 error 300e (EGL_CONTEXT_LOST)

Daniel van Vugt (vanvugt) wrote :

Did this problem start just today? We turned on the new buffer scheme today in lp:mir ...

Daniel van Vugt (vanvugt) wrote :

Also, sorry, I can't reliably test your branch:

Text conflict in CMakeLists.txt
Text conflict in debian/changelog
Conflict adding file include/test/mir/test/doubles/mock_input_device_hub.h. Moved existing file to include/test/mir/test/doubles/mock_input_device_hub.h.moved.
Text conflict in src/client/input/input_event.cpp
Text conflict in src/platforms/android/server/hal_component_factory.cpp
Text conflict in src/platforms/android/server/hal_component_factory.h
Text conflict in src/platforms/android/server/platform.cpp
Text conflict in src/platforms/evdev/libinput_device.cpp
Text conflict in src/server/input/default_configuration.cpp
Text conflict in src/server/input/default_input_device_hub.cpp
10 conflicts encountered.

Changed in mir:
status: New → Incomplete
Daniel van Vugt (vanvugt) wrote :

OK, ignoring the conflicts I've now built and tested your branch on desktop at least. No errors there.

But of course I wouldn't expect errors on desktop. The logs you've provided suggest the bug is in the android platform, or kernel driver.

Changed in mir:
status: Incomplete → New
tags: added: android
summary: - Mir server with QtUbuntu client can cause system crash or GPU driver
- stall
+ Mir server with QtUbuntu client can cause system crash or Android GPU
+ driver stall
Gerry Boland (gerboland) wrote : Re: Mir server with QtUbuntu client can cause system crash or Android GPU driver stall

Notes:
1. This is reproduced only on Adreno-based devices, so N4/N7.
2. We think this was already present in Mir 0.20 and older.

QtUbuntu has some clunky code to try to manage resizing correctly:

When the Mir resize event comes in, there appears to be no guarantee that the next buffer the client acquires is at the new size. So instead we check the size of the buffer we get after every eglSwapBuffers call to see whether it has changed. If we have received a resize event, we keep swapping buffers until we get one with the newly resized dimensions.

This appears to work ok for the existing resizing stuff we do (mainly on orientation change, we resize surface by swapping width & height). But somehow more recent work has exposed this issue.
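
The swap-until-resized approach described above can be sketched roughly as follows. This is a simulation, not QtUbuntu's actual code: `FakeSurface`, `swap_until_resized` and the `max_swaps` cap are hypothetical names introduced for illustration, and `swap_buffers` merely stands in for an eglSwapBuffers call followed by a buffer size query.

```python
from collections import deque


class FakeSurface:
    """Hypothetical stand-in for a surface that still has stale buffers queued."""

    def __init__(self, stale_sizes, new_size):
        self._stale = deque(stale_sizes)  # buffers allocated before the resize
        self._new_size = new_size         # size of buffers allocated afterwards
        self.swaps = 0

    def swap_buffers(self):
        """Simulate one swap and return the size of the buffer acquired next."""
        self.swaps += 1
        if self._stale:
            return self._stale.popleft()
        return self._new_size


def swap_until_resized(surface, target_size, max_swaps=8):
    """Keep swapping until a buffer of target_size arrives.

    The max_swaps cap is an assumption added here so the loop cannot
    spin forever; the comment above does not say whether one exists.
    """
    for _ in range(max_swaps):
        if surface.swap_buffers() == target_size:
            return True
    return False


# Two stale 700x1000 buffers are still in flight after a resize to 1000x700.
surface = FakeSurface([(700, 1000), (700, 1000)], (1000, 700))
print(swap_until_resized(surface, (1000, 700)), surface.swaps)
```

Each extra swap in this model pushes a frame at the old size through the pipeline, which is the window in which the client and the driver can disagree about the surface dimensions.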

summary: Mir server with QtUbuntu client can cause system crash or Android GPU
- driver stall
+ driver stall (Qualcomm)
description: updated
Nick Dedekind (nick-dedekind) wrote :

[ 231.830108] kgsl kgsl-3d0: [C0000000-C03C0000] (egl_surface) was already freed by pid 3839

I took a look through the android kernel gpu driver and it looks like something freed the shared memory. kgsl_ioctl_sharedmem_free stores a history of memory previously freed.
Unfortunately I couldn't determine what was calling the ioctl function, since the egl android drivers are proprietary...

Nick Dedekind (nick-dedekind) wrote :

I've noticed that after a few resizes, the swapBuffers call can end up taking some time (up to about 0.5 to 1 second).

This coincides with the messages from dmesg about "(egl_surface) was already freed". Every time we get a stall from swapBuffers, we get that message.

Nick Dedekind (nick-dedekind) wrote :

And for the life of me, I can't get it to happen for surfaces when switching orientations (which does a resize) with unity8.

Daniel van Vugt (vanvugt) wrote :

Historically we have had instability with the buffering code (and resizing) on the android platform. Although I thought it had improved in the past year to the point of no more such bugs (?)

As for resizing lag, yes it's a known issue that's been playing on my mind again this week too... --> bug 1288021

Kevin DuBois (kdub) wrote :

I was recently reminded of strange adreno behavior in this regard while poking around lp:~kdub/mir/notify-buffers-directly. This branch is for the "new buffer semantics" but the problem might be the same.

Adreno will query its ANativeWindow for the width of the surface, as opposed to just going off the information that's in the ANativeWindowBuffer. If the window size is out of sync with the last buffer (the two have to agree for this vendor, kind of a DRY problem in the driver's behavior), I think the GPU will overrun the memory for the buffer and cause the kernel to explode.

The (egl_surface) error is unexplained though.
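
To put a rough number on the suspected overrun, here is a deliberately simplified model of the mismatch described above. It assumes tightly packed rows and 4 bytes per pixel, neither of which is confirmed for the Adreno driver, and the function names are hypothetical.

```python
def buffer_bytes(width, height, bytes_per_pixel=4):
    """Bytes needed for a tightly packed width x height buffer (assumed layout)."""
    return width * height * bytes_per_pixel


def overrun_bytes(window_size, buffer_size, bytes_per_pixel=4):
    """Bytes written past the allocation if the driver trusts the window size.

    window_size: (width, height) as reported by the window.
    buffer_size: (width, height) the buffer was actually allocated with.
    """
    assumed = buffer_bytes(*window_size, bytes_per_pixel)
    allocated = buffer_bytes(*buffer_size, bytes_per_pixel)
    return max(0, assumed - allocated)


# Buffer still allocated at 700x1000 while the window already reports a
# hypothetical 800x1000 after a resize.
print(overrun_bytes((800, 1000), (700, 1000)))
```

In this model any window that is larger than the buffer it is paired with produces a non-zero overrun, while a window that is the same size or smaller produces none.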

Kevin DuBois (kdub)
tags: added: nexus4 nexus7
removed: android
Gerry Boland (gerboland) wrote :

Any progress on this issue? This is blocking a nice optimisation in unity8.

Kevin DuBois (kdub) wrote :

No progress so far. I have reason to believe that NBS averts the issue; will check once all the lights are on for that feature.

Gerry Boland (gerboland)
Changed in qtubuntu:
status: New → Confirmed
Changed in mir:
status: New → Confirmed
importance: Undecided → High
Changed in qtubuntu:
importance: Undecided → High
Changed in mir (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Kevin DuBois (kdub) wrote :

Might be worth trying again in 0.23.0.

Nick Dedekind (nick-dedekind) wrote :

I've done a quick test with 0.23 and it seems to work on my n7 now.
Will continue testing.

Nick Dedekind (nick-dedekind) wrote :

Confirmed fixed on n7 & n4. Marking as "Fix Released"

Changed in mir:
status: Confirmed → Fix Released
Changed in qtubuntu:
status: Confirmed → Invalid
Changed in mir (Ubuntu):
status: Confirmed → Fix Released
Cemil Azizoglu (cemil-azizoglu) wrote :

Unfortunately, we had to revert NBS in 0.23.1. You may want to recheck to make sure it hasn't come back.

Daniel van Vugt (vanvugt) wrote :

^ Needs retesting with Mir 0.24 or later (when the potentially relevant feature gets reintroduced)

Changed in mir:
status: Fix Released → Incomplete
Changed in mir (Ubuntu):
status: Fix Released → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for mir (Ubuntu) because there has been no activity for 60 days.]

Changed in mir (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for Mir because there has been no activity for 60 days.]

Changed in mir:
status: Incomplete → Expired
Michał Sawicz (saviq)
affects: qtubuntu → qtubuntu (Ubuntu)