[regression] Mir servers (since 0.9) randomly crash in malloc due to heap corruption

Bug #1401488 reported by Michał Sawicz on 2014-12-11
This bug affects 4 people
Affects | Status | Importance | Assigned to | Milestone
GLib | Expired | Medium | |
Mir | Fix Released | Critical | Daniel van Vugt |
0.9 | Won't Fix | Critical | Unassigned |
glib2.0 (Ubuntu) | | Critical | desrt |
mir (Ubuntu) | | Critical | Unassigned |

Bug Description

This happens randomly when using the phone

ProblemType: Crash
DistroRelease: Ubuntu 15.04
Package: unity-system-compositor 0.0.5+15.04.20141204-0ubuntu1
Uname: Linux 3.4.67 armv7l
ApportVersion: 2.14.7-0ubuntu10
Architecture: armhf
AssertionMessage: *** Error in `unity-system-compositor': corrupted double-linked list: 0xaa817808 ***
CrashCounter: 1
Date: Wed Dec 10 19:28:35 2014
ExecutablePath: /usr/sbin/unity-system-compositor
ExecutableTimestamp: 1417733344
GraphicsCard:

InstallationDate: Installed on 2014-12-11 (0 days ago)
InstallationMedia: Ubuntu Vivid Vervet (development branch) - armhf (20141211-020204)
ProcCmdline: unity-system-compositor --disable-overlays=false --spinner=/usr/bin/unity-system-compositor-spinner --file /run/mir_socket --from-dm-fd 9 --to-dm-fd 13 --vt 1
ProcCwd: /
ProcEnviron:

Signal: 6
SourcePackage: unity-system-compositor
StacktraceTop:
 __libc_message (do_abort=<optimized out>, fmt=0xb68e3628 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
 malloc_printerr (action=1, str=0xb68e366c "corrupted double-linked list", ptr=<optimized out>) at malloc.c:4996
 malloc_consolidate (av=av@entry=0xaa800010) at malloc.c:4165
 _int_malloc (av=av@entry=0xaa800010, bytes=bytes@entry=1264) at malloc.c:3423
 __GI___libc_malloc (bytes=1264) at malloc.c:2891
Title: unity-system-compositor assert failure: *** Error in `unity-system-compositor': corrupted double-linked list: 0xaa817808 ***
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

version.libdrm: libdrm2 2.4.58-2
version.lightdm: lightdm 1.13.0-0ubuntu2
version.mesa: libegl1-mesa-dev N/A


Michał Sawicz (saviq) wrote :
information type: Private → Public

StacktraceTop:
 __libc_message (do_abort=<optimized out>, fmt=0xb68e3628 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
 malloc_printerr (action=1, str=0xb68e366c "corrupted double-linked list", ptr=<optimized out>) at malloc.c:4996
 malloc_consolidate (av=av@entry=0xaa800010) at malloc.c:4165
 _int_malloc (av=av@entry=0xaa800010, bytes=bytes@entry=1264) at malloc.c:3423
 __GI___libc_malloc (bytes=1264) at malloc.c:2891

Changed in mir (Ubuntu):
importance: Undecided → Medium
tags: removed: need-armhf-retrace

It's crashing in an innocent "new" call, which means the heap corruption has occurred at some unknown location in the past :(

#8 0xb697808c in operator new (sz=sz@entry=1264) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:49
        p = <optimized out>
#9 0xb6b0733e in android::InputDispatcher::notifyMotion (this=0x1ed8800, args=0xabbfe428) at /build/buildd/mir-0.9.0+15.04.20141125/3rd_party/android-input/android/frameworks/base/services/input/InputDispatcher.cpp:2410
        newEntry = <optimized out>
        policyFlags = 1174405120
        needWake = <optimized out>

We'll need to reproduce the crash under valgrind (or env MALLOC_CHECK_=3) to get a better core dump that shows the source of the problem.
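As a rough illustration of why a checking tool helps here: the allocator only notices damage when it next walks its own bookkeeping, which can be long after the bad write. Checkers narrow that gap by validating memory much more often. The sketch below is a toy model of that idea only, not how valgrind or MALLOC_CHECK_ are implemented; `GuardedBuffer`, `kGuard` and `kPattern` are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for this sketch only.
constexpr std::size_t kGuard = 8;        // guard bytes on each side
constexpr std::uint8_t kPattern = 0xA5;  // known fill value for the guards

// Toy buffer that surrounds the user region with guard bytes so that an
// overrun can be detected on demand instead of at some later allocation.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t size)
        : storage_(size + 2 * kGuard, 0), size_(size) {
        for (std::size_t i = 0; i < kGuard; ++i) {
            storage_[i] = kPattern;                    // leading guard
            storage_[kGuard + size_ + i] = kPattern;   // trailing guard
        }
    }

    // Deliberately unchecked against size_, like a raw pointer write.
    // Precondition: offset < size_ + kGuard, so the write always stays
    // inside storage_ and the example itself has no undefined behaviour.
    void write(std::size_t offset, std::uint8_t value) {
        storage_[kGuard + offset] = value;
    }

    // Returns true if both guard regions still hold the fill pattern.
    bool intact() const {
        for (std::size_t i = 0; i < kGuard; ++i) {
            if (storage_[i] != kPattern) return false;
            if (storage_[kGuard + size_ + i] != kPattern) return false;
        }
        return true;
    }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t size_;
};
```

Calling `intact()` right after each suspicious operation points at the offending write directly, whereas the crash in this report only surfaced at an unrelated allocation.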

Changed in mir (Ubuntu):
status: New → Incomplete
Changed in mir:
status: New → Incomplete
summary: - unity-system-compositor assert failure: *** Error in `unity-system-
- compositor': corrupted double-linked list: 0xaa817808 ***
+ [regression] Mir servers (since 0.9) crashing in malloc due to heap
+ corruption

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in unity-system-compositor (Ubuntu):
status: New → Confirmed

Daniel van Vugt (vanvugt) wrote :

Test case:
  1. Start mir_proving_server
  2. Start lots of clients (e.g. 30+ mir_demo_client_egltriangle)
  3. Wait a while and the server will crash.

See also the duplicate bug for more details.

Changed in mir:
status: Incomplete → In Progress
importance: Undecided → Critical
assignee: nobody → Daniel van Vugt (vanvugt)
milestone: none → 0.10.0
Changed in mir (Ubuntu):
importance: Medium → Critical
status: Incomplete → Triaged
summary: - [regression] Mir servers (since 0.9) crashing in malloc due to heap
- corruption
+ [regression] Mir servers (since 0.9) randomly crash in malloc due to
+ heap corruption
no longer affects: unity-system-compositor (Ubuntu)

Daniel van Vugt (vanvugt) wrote :

The first, and probably the primary, error is this:

==18516== Invalid read of size 4
==18516== at 0x71F518A: g_source_iter_next (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4302.0)
==18516== by 0x71F7A7E: g_main_context_check (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4302.0)
==18516== by 0x71F80EF: g_main_context_iterate.isra.29 (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4302.0)
==18516== by 0x71F825B: g_main_context_iteration (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4302.0)
==18516== by 0x4EC337C: mir::GLibMainLoop::run() (glib_main_loop.cpp:126)
==18516== by 0x4E89EA0: mir::DisplayServer::run() (display_server.cpp:223)
==18516== by 0x4E850E2: mir::run_mir(mir::ServerConfiguration&, std::function<void (mir::DisplayServer&)>, std::function<void (int)> const&) (run_mir.cpp:113)
==18516== by 0x4E84CB7: mir::run_mir(mir::ServerConfiguration&, std::function<void (mir::DisplayServer&)>) (run_mir.cpp:68)
==18516== by 0x4702A0: main (demo_shell.cpp:158)
==18516== Address 0xa857918 is 24 bytes inside a block of size 296 free'd
==18516== at 0x4C2BE10: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18516== by 0x71F5011: g_source_unref_internal (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4302.0)
==18516== by 0x4EC9FCD: mir::detail::GSourceHandle::~GSourceHandle() (glib_main_loop_sources.cpp:94)
==18516== by 0x4EC2D3D: (anonymous namespace)::AlarmImpl::cancel() (glib_main_loop.cpp:48)
==18516== by 0x4FBD3A4: (anonymous namespace)::TimeoutFrameDroppingPolicy::swap_unblocked() (timeout_frame_dropping_policy_factory.cpp:74)

Daniel van Vugt (vanvugt) wrote :

The regression came from this, so the bug just slipped into the 0.9.0 release too:

------------------------------------------------------------
revno: 2072 [merge]
tags: br0.9, v0.9.0
author: Alexandros Frantzis <email address hidden>
committer: Tarmac
branch nick: development-branch
timestamp: Wed 2014-11-19 02:07:20 +0000
message:
  server: Use the GLibMainLoop implementation by default

  This MP also adds an option of using the AsioMainLoop implementation (--use-asio-main-loop or MIR_SERVER_USE_ASIO_MAIN_LOOP) for easier comparative testing. Fixes: https://bugs.launchpad.net/bugs/1392256.

  Approved by PS Jenkins bot, Cemil Azizoglu, Alan Griffiths, Kevin DuBois.
------------------------------------------------------------

Daniel van Vugt (vanvugt) wrote :

Upstream bugs seem to already exist for the issue:
   https://bugzilla.gnome.org/show_bug.cgi?id=720186
and maybe:
   https://bugzilla.gnome.org/show_bug.cgi?id=737677

Changed in glib2.0 (Ubuntu):
assignee: nobody → Ryan Lortie (desrt)
Changed in glib:
importance: Unknown → Medium
status: Unknown → New

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in glib2.0 (Ubuntu):
status: New → Confirmed
Changed in glib2.0 (Ubuntu):
importance: Undecided → Critical
status: Confirmed → Triaged

Daniel van Vugt (vanvugt) wrote :

Further to the test case in comment #8, you can reproduce the bug under valgrind too (hence comment #9). Beware, however, that valgrind only triggers the issue if you run _sufficiently_few_ clients that performance is still reasonable. So instead of 30+ egltriangles, under valgrind I use only 7.

PS Jenkins bot (ps-jenkins) wrote :

Fix committed into lp:mir at revision 2197, scheduled for release in mir, milestone 0.10.0

Changed in mir:
status: In Progress → Fix Committed

Daniel van Vugt (vanvugt) wrote :

desrt just reminded me (after I reminded him :) that my apparent explanation of the bug doesn't make sense.

We're losing a reference to the GSource and the claimed unsafe callback should actually be safe already (as its caller is designed to hold a ref).

Although the workaround that's landed seems to perform well, I'm still concerned that we don't understand /why/...
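To make the "caller holds a ref" argument concrete, here is a minimal sketch of the pattern using `std::shared_ptr` in place of GSource reference counting. It is an analogy only and does not reproduce GLib's or Mir's actual code; `Source`, `Loop` and their members are hypothetical names. The loop takes an extra reference on each source before dispatching, so a callback that cancels its own source cannot free it mid-dispatch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a main-loop source; not GLib's GSource.
struct Source {
    std::function<void()> callback;
    bool* destroyed_flag = nullptr;  // set to true when the object is freed
    ~Source() {
        if (destroyed_flag) *destroyed_flag = true;
    }
};

class Loop {
public:
    void add(std::shared_ptr<Source> s) { sources_.push_back(std::move(s)); }

    // Drops the loop's own reference, comparable to cancelling an alarm.
    void remove(const Source* s) {
        sources_.erase(
            std::remove_if(sources_.begin(), sources_.end(),
                           [s](const std::shared_ptr<Source>& p) {
                               return p.get() == s;
                           }),
            sources_.end());
    }

    // One iteration: copying the vector takes an extra reference on every
    // source, so each stays alive until dispatch has finished even if a
    // callback removes it from sources_ in the meantime.
    void iterate() {
        std::vector<std::shared_ptr<Source>> pending = sources_;
        for (auto& s : pending) {
            if (s->callback) s->callback();
        }
        // The extra references are released when 'pending' goes out of scope.
    }

    std::size_t size() const { return sources_.size(); }

private:
    std::vector<std::shared_ptr<Source>> sources_;
};
```

If the real code behaves like this, a cancel from inside a callback is harmless, which is why the use-after-free reported by valgrind suggests a reference is being dropped somewhere else.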

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mir - 0.10.0+15.04.20150107.2-0ubuntu1

---------------
mir (0.10.0+15.04.20150107.2-0ubuntu1) vivid; urgency=medium

  [ Daniel van Vugt ]
  * New upstream release 0.10.0 (https://launchpad.net/mir/+milestone/0.10.0)
    - Enhancements:
      . Added support for Android HWC 1.3 devices.
      . Plumbing/preparation to support external displays on Android devices.
      . Reduced build dependencies.
      . Client API: Added version macros.
      . Began work on automatic driver probing, to intelligently choose the
        best driver for you.
      . Demo shell (mir_proving_server): Added desktop zoom feature using
        Super + mouse wheel.
      . Demo renamed: mir_demo_server_shell -> mir_proving_server
      . Other demo servers merged into -> mir_demo_server
      . Wider support for display buffer pixel formats in the mesa driver, for
        wider hardware support.
      . Performance: On mesa/desktop at least; only hold compositor buffers
        for the duration of the render, instead of the duration of the frame.
        Following this change the compositor report can now finally report
        render time instead of frame time.
      . Mir now starts reliably when a TV is connected by HDMI, and up to
        4K resolution (2160p) is known to work.
      . Plenty more enhancements logged in the bugs list below.
    - ABI summary: Servers need rebuilding, but clients do not;
      . Mirclient ABI unchanged at 8
      . Mircommon ABI unchanged at 3
      . Mirplatform ABI bumped to 5
      . Mirserver ABI bumped to 28
    - Bug fixes:
      . [regression] Mir servers (since 0.9) randomly crash in malloc due to
        heap corruption (LP: #1401488)
      . USC - mouse cursor on AMD graphics is drawing incorrectly
        (LP: #1391975)
      . Mir fails to start when a TV is connected by HDMI
        [std::exception::what: Invalid or inconsistent display configuration]
        (LP: #1395405)
      . Input/event driven clients may freeze indefinitely (LP: #1396006)
      . Mir server crashes with "std::exception::what: Failed to get front
        buffer object" when trying to fullscreen a surface (LP: #1398296)
      . Switching windows with a Trusted Prompt Session active loses the
        trusted prompt session (LP: #1355173)
      . CI test failure in multiple tests (LP: #1401364)
      . dh_install: usr/bin/mir_demo_server exists in debian/tmp but is not
        installed to anywhere (LP: #1401365)
      . [regression] demo-shell: Instead of moving surfaces they now fly
        off-screen (LP: #1403702)
      . [regression] Binaries are no longer runnable on other machines (or in
        other directories) (LP: #1406073)
      . [i865] unity-system-compositor fails to start: Failed to choose ARGB
        EGL config (LP: #1212753)
      . Mir's compositor holds buffers (blocking clients) for the duration of
        the frame, even when not necessary. (LP: #1264934)
      . Screen goes blank (black) briefly during display config changes which
        don't affect the display mode (LP: #1274359)
      . [enhancement] There should be a quit signal sent to sessions instead
        of killing them dir...


Changed in mir (Ubuntu):
status: Triaged → Fix Released
Changed in mir:
status: Fix Committed → Fix Released
Changed in glib:
status: New → Expired
summary: - [regression] Mir servers (since 0.9) randomly crash in malloc due to
- heap corruption
+ Buy Fioricet Online without rx
description: updated

Daniel van Vugt (vanvugt) wrote :

Removed spam. And the above user doesn't seem to exist any more.

summary: - Buy Fioricet Online without rx
+ [regression] Mir servers (since 0.9) randomly crash in malloc due to
+ heap corruption
description: updated