[regression] Clients of nested Mir servers silently crash/exit instantly

Bug #1526209 reported by Daniel van Vugt
Affects        Status        Importance  Assigned to           Milestone
Mir            Fix Released  Critical    Mir development team
0.18           Fix Released  Critical    Mir development team
mesa (Ubuntu)  Invalid       Critical    Unassigned
mir (Ubuntu)   Invalid       Critical    Unassigned

Bug Description

Clients of nested Mir servers silently crash/exit instantly (on xenial)

This happens in Mir trunk lp:mir and lp:mir/0.18, but not in lp:mir/0.17

  sudo bin/mir_demo_server_minimal -f /tmp/outside &
  sudo bin/mir_proving_server -f /tmp/inside --host-socket=/tmp/outside &
  sudo bin/mir_demo_client_egltriangle -m /tmp/inside

And the client silently exits with return code 139.
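
For anyone wondering about the 139: shells report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV on Linux), which matches the traces that follow. A minimal sketch of that arithmetic, where `signal_from_status` is a hypothetical helper and not part of Mir:

```shell
# Hypothetical helper (not part of Mir): map a shell exit status
# to the number of the signal that killed the process, or 0 if none.
signal_from_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Shell convention: status = 128 + signal number
        echo $((status - 128))
    else
        # The process exited by itself rather than being killed by a signal
        echo 0
    fi
}

sig=$(signal_from_status 139)
# kill -l translates a signal number into its name
echo "exit status 139 -> signal $sig ($(kill -l "$sig"))"
```

In practice the status would come from `$?` immediately after running the client.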

Valgrind or gdb show the problem though:

==16515== Process terminating with default action of signal 11 (SIGSEGV)
==16515== Access not within mapped region at address 0x123460A8
==16515== at 0x6EA16E0: XGetXCBConnection (in /usr/lib/x86_64-linux-gnu/libX11-xcb.so.1.0.0)
==16515== by 0x517AC73: ??? (in /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1.0.0)
==16515== by 0x5174ADE: ??? (in /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1.0.0)
==16515== by 0x5174B98: ??? (in /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1.0.0)
==16515== by 0x5170B31: eglInitialize (in /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1.0.0)
==16515== by 0x4031A9: mir_eglapp_init (eglapp.c:350)
==16515== by 0x4024EC: main (egltriangle.c:85)

#0 0x00007ffff5b6e6e0 in XGetXCBConnection ()
   from /usr/lib/x86_64-linux-gnu/libX11-xcb.so.1
#1 0x00007ffff7893c74 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1
#2 0x00007ffff788dadf in ?? ()
   from /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1
#3 0x00007ffff788db99 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1
#4 0x00007ffff7889b32 in eglInitialize ()
   from /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1
#5 0x00000000004031aa in mir_eglapp_init (argc=3, argv=0x7fffffffe548,
    width=0x7fffffffdd48, height=0x7fffffffdd4c)
    at /home/dan/bzr/mir/0.18/examples/eglapp.c:350
#6 0x00000000004024ed in main (argc=3, argv=0x7fffffffe548)
    at /home/dan/bzr/mir/0.18/examples/egltriangle.c:85


summary: - Clients of nested Mir servers silently crash/exit instantly
+ [regression] Clients of nested Mir servers silently crash/exit instantly
Changed in mir:
importance: Undecided → Critical
milestone: none → 0.19.0
Changed in mesa (Ubuntu):
importance: Undecided → Critical
tags: added: regression
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Connecting a software client to the affected nested server actually made the nested server crash:

ERROR: /home/dan/bzr/mir/fixmirout/src/gl/recently_used_cache.cpp(44): Throw in function virtual std::shared_ptr<mir::gl::Texture> mir::gl::RecentlyUsedCache::load(const mir::graphics::Renderable&)
Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::logic_error> >
std::exception::what: Buffer does not support GL rendering

description: updated
Changed in mir (Ubuntu):
status: New → Incomplete
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Annoyingly, bisection is going to be slow because of a separate regression: bug 1526225.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Bisected. The problem originated in:

------------------------------------------------------------
revno: 3098 [merge]
author: Cemil Azizoglu <email address hidden>
committer: Tarmac
branch nick: development-branch
timestamp: Tue 2015-11-10 23:10:22 +0000
message:
  Tighten probing rules. Fixes: https://bugs.launchpad.net/bugs/1506707.

  Approved by PS Jenkins bot, Andreas Pokorny, Alan Griffiths, Alexandros Frantzis.
------------------------------------------------------------

which, coincidentally, is also the source of another regression, bug 1526225.

Changed in mesa (Ubuntu):
status: New → Invalid
Changed in mir:
status: New → Triaged
Changed in mir (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Confirmed, reverting r3098 fixes the problem.

tags: added: nested
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I swear I've also seen Mir 0.17.1 fail similarly (Unity8 on xenial desktop). I might be imagining things, or I might have been seeing some side-effect of bug 1506707 that perhaps also led to similar problems?...

Revision history for this message
Kevin DuBois (kdub) wrote :

seems to be mesa-specific

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Can't reproduce on wily

Revision history for this message
Kevin DuBois (kdub) wrote :

I also can see the nested server crash scenario when connecting mir_demo_client_egltriangle. I haven't seen the client crash yet

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Please clarify how to reproduce:

If run from a terminal window, we don't need any of the "sudo"s, as we simply load up the Mir-on-X stack and connect via $DISPLAY. In any case, sudo should only be required for the host Mir server when using kms (--arw-file will make the endpoint accessible to all).

If running from a VT or over ssh, don't we need to pass --vt to the servers? (Cf. lp:1515558)

Revision history for this message
Kevin DuBois (kdub) wrote :

hmm, and now after doing a reinstall and upgrade of all packages (on xenial), I can no longer reproduce the issue.

Revision history for this message
Andreas Pokorny (andreas-pokorny) wrote :

Yes, I can also remember seeing that stack trace when toying around with our reload hack to get mesa to find our symbols during EGL display detection. But I don't get that anymore on xenial with lp:mir or lp:mir/0.18.

Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

I could not reproduce on xenial. Both mir-on-x and mesa worked for me. Note that there is a known issue with the nested server not being able to detect the platform (lp:1515558) so a '--vt' switch is needed on the server for correct behaviour.

Revision history for this message
Kevin DuBois (kdub) wrote :

So my problem noted in #8 was just a mismatch of packages on my system. I cannot reproduce it with xenial + silo 021, nor with on-system builds of lp:mir, lp:mir/0.18, or lp:mir/0.17.

At standup today, it seemed that none of us could reproduce the problem, provided that we had up-to-date systems and were using the (unintuitive) --vt switch properly (which became a bit more unintuitive starting at r3098). Hopefully everyone else will comment below with further details of their testing.

The decision in the standup today was to proceed with the release.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Missing information to aid reproducing:

1. Are you running from a terminal, a VT or a ssh session?
1.1. AFAICS the commands as given in the instructions can only work from a terminal session, but Mir-on-X doesn't *need* "sudo", so what happens without "sudo"?

2. Are you running a local build? (the commands as given in the instructions suggest a local build)
2.1. If so, which options were passed to cmake (e.g. is -Duse_debflags=on set, or any other optimization)?
2.2. If so, are there any "old" modules in lib/server-modules/ or lib/client-modules/?
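
Most of the questions above can be answered in one go with something like the following, run from the build directory. This is only a sketch: `session_kind` is a hypothetical helper, the SSH_CONNECTION/DISPLAY checks are heuristics that assume a typical desktop or ssh setup, and the module directories are the ones named in 2.2.

```shell
# Hypothetical helper: guess what kind of session the commands run in.
# Assumes sshd sets SSH_CONNECTION and an X session sets DISPLAY.
session_kind() {
    if [ -n "${SSH_CONNECTION:-}" ]; then
        echo ssh
    elif [ -n "${DISPLAY:-}" ]; then
        echo x11-terminal
    else
        echo vt-or-other
    fi
}

echo "session: $(session_kind)"
echo "tty: $(tty 2>/dev/null || echo none)"
echo "uid: $(id -u)"

# List any modules left over in a local build tree (paths relative to
# the build directory, as in question 2.2).
for dir in lib/server-modules lib/client-modules; do
    if [ -d "$dir" ]; then
        ls -l "$dir"
    else
        echo "$dir: not present"
    fi
done
```

Pasting the output into the bug would cover 1 and 2.2; the cmake options in 2.1 still need to be reported by hand.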

Changed in mir:
status: Triaged → Incomplete
Changed in mir (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
Kevin DuBois (kdub) wrote :

Haven't been able to confirm this specific scenario today, but another issue (not a crash, rather incorrect platform loading) was noticed related to rev 3098. Seems best to back out that rev for lp:mir and lp:mir/0.18 release.

Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

duflu, could you also try:
1. with the '--vt' option?
2. on another machine with a different GPU?

Changed in mir:
status: Incomplete → In Progress
assignee: nobody → Mir development team (mir-team)
Changed in mir (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Fix committed to lp:mir/0.18 at revision 3182, scheduled for release in Mir 0.18.0

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

Fix committed into lp:mir at revision 3198, scheduled for release in mir, milestone 0.19.0

Changed in mir:
status: In Progress → Fix Committed
Revision history for this message
Andreas Pokorny (andreas-pokorny) wrote :

Regarding the segmentation fault, more details are in https://bugs.launchpad.net/mir/+bug/1526658

Kevin DuBois (kdub)
Changed in mir:
status: Fix Committed → Fix Released
Changed in mir:
status: Fix Released → Fix Committed
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Fix Released in Mir 0.18.0. Although it's good to mention this bug got different fixes between the 0.18 and 0.19 branches, it's probably still best to only mention it in the changelog for one of them.

Changed in mir:
milestone: 0.19.0 → none
status: Fix Committed → Fix Released
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Note the root cause of the problem is possibly bug 1526658 (soon to be Fix Released in Mir 0.19.0)
