[xmir] Unity doesn't start on ATI test machine (Mir fails to respond to drm_auth_magic request)

Bug #1204939 reported by Didier Roche-Tolomelli
This bug affects 2 people
Affects                           Status      Importance   Assigned to   Milestone
Mir                               Won't Fix   Medium       Unassigned
xserver-xorg-video-ati (Ubuntu)   Expired     Undecided    Unassigned

Bug Description

Sometimes when running on the i386 ATI jenkins box, XMir hangs when Compiz is started.

When it's hung, it's blocked in DRI2Authenticate, in mir_wait_for on the results of mir_drm_auth_magic.

Applying http://paste.ubuntu.com/5983202/ to Mir to debug this, we get:

Client output - http://paste.ubuntu.com/5981035/
Server output - http://paste.ubuntu.com/5981019/

So it appears that the second drm_auth_magic request is written to the socket but Mir never responds to it.

I have no idea why, though. When I attached gdb to unity-system-compositor in the hung case there didn't seem to be any deadlocked thread or the like; everything seemed to be waiting on input.

==== Original report ====

The log files are deceptive - the error comes from the failsafe X session which is started for some reason.

LightDM, unity-system-compositor, and XMir all come up fine and have no unexpected log output.

gnome-session.log, however, has
"""
I/O error : Bad file descriptor
/usr/share/compiz/place.xml:1: parser error : Document is empty

^
/usr/share/compiz/place.xml:1: parser error : Start tag expected, '<' not found

^
I/O error : Bad file descriptor
compizconfig - Info: Backend : gsettings
compizconfig - Info: Integration : true
compizconfig - Info: Profile : unity
compiz (core) - Info: Loading plugin: composite
compiz (core) - Info: Starting plugin: composite
compiz (core) - Info: Loading plugin: opengl

(polkit-gnome-authentication-agent-1:10828): polkit-gnome-1-WARNING **: Unable to determine the session we are in: No session for pid 10828
compiz (core) - Info: Unity is fully supported by your hardware.
compiz (core) - Info: Unity is not supported by your hardware. Enabling software rendering instead (slow).
compiz (core) - Info: Starting plugin: opengl
"""

description: updated
summary: - Mir doesn't start on ATI test machine
+ Mir doesn't start on ATI test machine [(EE) Screen(s) found, but none
+ have a usable configuration.]
summary: - Mir doesn't start on ATI test machine [(EE) Screen(s) found, but none
- have a usable configuration.]
+ Mir doesn't start on ATI test machine (RADEON(0): [drm] failed to set
+ drm interface version.)
Revision history for this message
Chris Halse Rogers (raof) wrote : Re: Mir doesn't start on ATI test machine (RADEON(0): [drm] failed to set drm interface version.)

I don't suppose there's a way to get the /var/log? I'd like to see the Xserver command line args.

summary: - Mir doesn't start on ATI test machine (RADEON(0): [drm] failed to set
- drm interface version.)
+ Unity doesn't start on ATI test machine (fails to find GL acceleration)
description: updated
kevin gunn (kgunn72)
Changed in mir:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Chris Halse Rogers (raof) wrote : Re: Unity doesn't start on ATI test machine (fails to find GL acceleration) (logind fails to track session?)

I'm not sure if I'm seeing the same problem, but I can reproduce this symptom with my ATI box. On my ATI box it seems the problem is that logind hasn't noticed that my Xserver session is active, so hasn't set up the ACLs on /dev/dri/*

(And, furthermore, if the ACLs aren't set up on /dev/dri/* at startup, gnome-session now appears to set LIBGL_ALWAYS_SOFTWARE=1, so even if you fix the permissions of /dev/dri/* you won't get hw accel)
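
A quick way to check the symptom described above is to test whether the current user can actually open the DRM device nodes, and whether the session has already fallen back to software rendering. This is only a minimal sketch: the helper name `inaccessible_nodes` is made up for illustration, and it assumes it is run as the session user inside the affected session.

```python
import glob
import os


def inaccessible_nodes(paths):
    """Return the paths the current user cannot open for reading and writing."""
    return [p for p in paths if not os.access(p, os.R_OK | os.W_OK)]


if __name__ == "__main__":
    # /dev/dri/* is where logind is expected to have set up the ACLs.
    nodes = sorted(glob.glob("/dev/dri/*"))
    for node in inaccessible_nodes(nodes):
        print("no read/write access:", node)
    # Per the comment above, gnome-session appears to set this variable
    # when the ACLs were missing at startup.
    print("LIBGL_ALWAYS_SOFTWARE =", os.environ.get("LIBGL_ALWAYS_SOFTWARE"))
```

If any node is listed, or the variable is set to 1, fixing the permissions afterwards will not be enough on its own; the session needs to be restarted.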

summary: Unity doesn't start on ATI test machine (fails to find GL acceleration)
+ (logind fails to track session?)
Revision history for this message
kevin gunn (kgunn72) wrote :

Archive file from /otto/saucy-i386-20130807-0916 this morning
basically fails & ends up in a locked state, subsequent relaunch of X results in a deadlock/watchdog timer pop...machine totally hosed.

We were able to grab the archive files off of it. attached are xorg.0.log, x-0.log, lightdm.log & unity-system-compositor.log
I think the xorg.0.log might not be correct...there was no .old (so maybe it was the old?)

lightdm seems interesting.

Revision history for this message
kevin gunn (kgunn72) wrote :

lightdm.log

Revision history for this message
kevin gunn (kgunn72) wrote :

unity-system-compositor.log

Revision history for this message
kevin gunn (kgunn72) wrote :

xorg-0.log

Revision history for this message
Chris Halse Rogers (raof) wrote :

Ok. Having logged in to this machine while it's exhibiting this bug, the results are... weird.

X is hung in DRI2Authenticate, in mir_wait_for on the drm_auth_magic_wait_handle. This has received == 0 and expected == 0, suggesting that we've not received a reply from unity-system-compositor.

summary: - Unity doesn't start on ATI test machine (fails to find GL acceleration)
- (logind fails to track session?)
+ Unity doesn't start on ATI test machine (hang in mir_wait_for())
Revision history for this message
kevin gunn (kgunn72) wrote : Re: Unity doesn't start on ATI test machine (hang in mir_wait_for())

data from yesterday, lexington ati machine working fine, raof's ati working fine, "otto" machine failing.

Specifically, on the "otto" machine the unity_support_test tool is stuck. It's launched by compiz and is most likely the first GL app.

Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote :

as per IRC discussion and multiple runs, seems like latest -ati or something in http://paste.ubuntu.com/5962760/ fixed it

Changed in mir:
status: Triaged → Fix Released
Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote :

reopening, still happens regularly even after trying another trick (disabling the unity_support_test call from the compiz initialization).

Changed in mir:
status: Fix Released → Triaged
Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote :

So, using RAOF's special ppa (ppa:raof/aubergine):

* /var/log/lightdm/unity-system-compositor.log:
http://paste.ubuntu.com/5981019/

* /var/log/lightdm/lightdm.log
http://paste.ubuntu.com/5981033/

* /var/log/lightdm/x-0.log:
http://paste.ubuntu.com/5981035/

* /var/log/Xorg.0.log:
http://paste.ubuntu.com/5981038/

compiz was stuck in the opengl plugin loading:
/home/ubuntu/.cache/upstart/gnome-session.log: compiz (core) - Info: Starting plugin: core
/home/ubuntu/.cache/upstart/gnome-session.log: compiz (core) - Info: Loading plugin: ccp
/home/ubuntu/.cache/upstart/gnome-session.log: compiz (core) - Info: Starting plugin: ccp
/home/ubuntu/.cache/upstart/gnome-session.log: compizconfig - Info: Backend : gsettings
/home/ubuntu/.cache/upstart/gnome-session.log: compizconfig - Info: Integration : true
/home/ubuntu/.cache/upstart/gnome-session.log: compizconfig - Info: Profile : unity
/home/ubuntu/.cache/upstart/gnome-session.log: compiz (core) - Info: Loading plugin: composite
/home/ubuntu/.cache/upstart/gnome-session.log: compiz (core) - Info: Starting plugin: composite
/home/ubuntu/.cache/upstart/gnome-session.log: compiz (core) - Info: Loading plugin: opengl
/home/ubuntu/.cache/upstart/gnome-session.log: compiz (core) - Info: Starting plugin: opengl
and it segfaulted, as Chris saw this morning (no stacktrace available though).

Revision history for this message
Chris Halse Rogers (raof) wrote :

And the patch to Mir to provide that output is: http://paste.ubuntu.com/5983202/

summary: - Unity doesn't start on ATI test machine (hang in mir_wait_for())
+ Unity doesn't start on ATI test machine (Mir fails to respond to
+ drm_auth_magic request)
description: updated
description: updated
Revision history for this message
Daniel van Vugt (vanvugt) wrote : Re: Unity doesn't start on ATI test machine (Mir fails to respond to drm_auth_magic request)

This feels suspiciously related to bug 1212516.

Revision history for this message
Chris Halse Rogers (raof) wrote :

On inspection, I don't think this is related to bug 1212516. That fails with a SEGV due to an attempt to dereference 0x0, and as far as I can tell only fails because of previous failures (running just the drm_auth tests works fine).

We're not SEGVing here.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

SIGSEGV or not, what is important is the root cause of:
[ FAILED ] BespokeDisplayServerTestFixture.client_drm_auth_magic_calls_platform (1922 ms)

which is too close to this bug to ignore. I think we should resolve bug 1212516 before assuming it's completely unrelated to this one.

Revision history for this message
Olli Ries (ories) wrote :

this bug is blocking the release of mir & friends into archive, please prioritize accordingly

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Alan has suggested bug 1212516 is just a test case bug and not really related to this one. However we're not sure, and if it is related then my findings today suggest there is a chance that the issue might have been solved in either r977 or r981.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Linked a branch that tvoss suspects could fix the issue.

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

From the latest test run using the selected branch (lp:~thomas-voss/mir/fix-fd-sharing):

* /var/log/lightdm/unity-system-compositor.log:
http://paste.ubuntu.com/6003055/

* /var/log/lightdm/lightdm.log
http://paste.ubuntu.com/6003056/

* /var/log/lightdm/x-0.log:
http://paste.ubuntu.com/6003059/

* /var/log/Xorg.0.log:
http://paste.ubuntu.com/6003061/

Revision history for this message
kevin gunn (kgunn72) wrote :

just adding to the comment above.
selected branch (lp:~thomas-voss/mir/fix-fd-sharing) tested, result was the same - hangs on compiz loading gl driver.
it's worth noting that by this time xmir/mir has already been started and should've loaded gl with no problem

Revision history for this message
kevin gunn (kgunn72) wrote :

also, just noticed that linux kernel bumped to 3.11 today for saucy (just moments ago)
would be good to ensure a test run with the latest kernel update

Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote :

http://paste.ubuntu.com/6009313/
ATI machine still blocked today with latest kernel and everything ^

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I think you can and should ignore the compiz hang. Assuming we do try to start compiz even when auth_magic has failed to return, then I would half expect compiz' opengl plugin to be broken or to hang. So focus on the original problem which is going to be the root cause:
[ 12148.287] (II) [xmir] auth_magic called for magic: 28
[ 12148.288] (II) [xmir] auth_magic finished for magic: 28
[ 12149.814] (II) [xmir] auth_magic called for magic: 29
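
To spot the same pattern in a longer x-0.log, something like the following could pair up the "called" and "finished" lines and report any magic that never got a reply. It is a sketch only: it assumes the log line format shown above (which comes from the debug patch, not stock XMir), and the function name `unanswered_magics` is made up for illustration.

```python
import re

# Patterns match the debug output quoted above.
CALLED = re.compile(r"\[xmir\] auth_magic called for magic: (\d+)")
FINISHED = re.compile(r"\[xmir\] auth_magic finished for magic: (\d+)")


def unanswered_magics(log_lines):
    """Return magic numbers that were requested but never reported as finished."""
    pending = []
    for line in log_lines:
        called = CALLED.search(line)
        if called:
            pending.append(int(called.group(1)))
            continue
        finished = FINISHED.search(line)
        if finished and int(finished.group(1)) in pending:
            pending.remove(int(finished.group(1)))
    return pending


SAMPLE = """\
[ 12148.287] (II) [xmir] auth_magic called for magic: 28
[ 12148.288] (II) [xmir] auth_magic finished for magic: 28
[ 12149.814] (II) [xmir] auth_magic called for magic: 29
"""

print(unanswered_magics(SAMPLE.splitlines()))  # prints [29]
```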

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

That said, if anyone can get stacks of the hung compiz threads, I would be very interested to see them.

Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote :

seems this issue is in the radeon driver.

Thomas and I looked at this particular issue for a couple of hours and came to the conclusion that it seems to appear only when switching back and forth between traditional Xorg and a Mir environment after some GPU load. The radeon driver does not seem to really support that, and performance drops until we reach a really bad situation where the machine is stuck.

glmark2 performance (with some loads run in parallel) drops quickly, even to 0 FPS in some cases, after switching (installing/removing unity-system-compositor and restarting the container or lightdm):
Xorg (running glmark2 tests) -> Mir (running glmark2 tests) -> Xorg (running glmark2 tests)

The driver stays loaded in the host the whole time.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Multiple drm_auth_magic calls suggests multiple screens. Is that true? What is a "screen" in this context anyway?

Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote :

@Daniel: I know there is at least one real monitor connected, people in Lexington will know if we have more than that.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

The scenario described in comment #25 is not really a common real world use case. I for one expect lightdm/X to not restart properly in switching between X and XMir. It's just never worked very reliably in doing so. And that's true for intel graphics too, in my experience.

So have we ever seen this bug occur in the more common (and more important) use case of booting straight into XMir?

Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote :

Hence the email I'm about to send, saying we are not blocking Mir's release to the distro.

Though, we are not 100% sure it's the real issue, just a high probability.
The ATI machine can't be used for this testing as of now; I'm keeping it free (in the same state, with the same packages and no upgrade) for upstream to look at.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Question: Are there any processes which stay resident across iterations switching between X and XMir? Like unity-system-compositor?

If so, can you please check the fd's and memory of such processes to see if we're leaking something? If the problem is in user space then we have much more chance of tracking it down in the short term...
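
For anyone doing that check, a minimal sketch of sampling the fd count and resident memory of a process through /proc could look like this. It assumes Linux, enough permission to read the target's /proc entries, and a pid you have looked up yourself; the helper names `fd_count` and `resident_kib` are made up for illustration.

```python
import os


def fd_count(pid):
    """Number of open file descriptors, counted from /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def resident_kib(pid):
    """Resident set size in kB from the VmRSS line of /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None  # e.g. kernel threads have no VmRSS line


if __name__ == "__main__":
    pid = os.getpid()  # replace with the pid of the process under suspicion
    print(pid, fd_count(pid), resident_kib(pid))
```

Sampling these once per X/XMir switch and comparing the values across iterations would show whether either number keeps growing.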

Revision history for this message
Didier Roche-Tolomelli (didrocks) wrote : Re: [Bug 1204939] Re: Unity doesn't start on ATI test machine (Mir fails to respond to drm_auth_magic request)

On 21/08/2013 14:18, Daniel van Vugt wrote:
> Question: Are there any processes which stay resident across iterations
> switching between X and XMir? Like unity-system-compositor?
>
> If so, can you please check the fd's and memory of such processes to see
> if we're leaking something? If the problem is in user space then we have
> much more chance of tracking it down in the short term...
>
Apart from the kernel and init, all the other processes are fresh.
You can have access to the machine (it's deprovisioned) to take a
closer look directly

Revision history for this message
kevin gunn (kgunn72) wrote : Re: Unity doesn't start on ATI test machine (Mir fails to respond to drm_auth_magic request)

just to add some additional information from this morning's debug not captured here.
it seems with the latest archive (Aug 21) that the problem exhibits itself sometimes on a first install - but upon reboot will not occur again.
the current thinking is that installing over and over is a corner case, and 1 reboot is at least a workaround.
changing to high...but leaving open for continued investigation

Changed in mir:
importance: Critical → High
Changed in mir:
assignee: nobody → Daniel van Vugt (vanvugt)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I've spent a day trying to figure out how Mir's socket logic could send a message which is never received. So far, I cannot find a theoretical reason. But it's very complicated and I don't properly understand it yet.

In the mean time, could someone please reproduce the bug, get it hung and send me instructions for accessing the machine? Alternatively, could someone please just post a dump of the call stacks from unity-system-compositor? I'm not yet convinced that it is indeed innocently idle...
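
One way to capture those call stacks non-interactively is to run gdb in batch mode against the hung process. The sketch below only wraps that command. It assumes gdb (and ideally debug symbols) is installed, that you have permission to ptrace the process, and that you pass in the pid of unity-system-compositor yourself; the function names are made up for illustration.

```python
import subprocess


def backtrace_command(pid):
    """Build a gdb invocation that prints backtraces of all threads and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_stacks(pid):
    """Run gdb against the given pid and return whatever it printed to stdout."""
    result = subprocess.run(
        backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero if the attach fails
    )
    return result.stdout
```

Calling `dump_stacks(pid)` while the session is hung and pasting the result here would show whether every thread really is idle and waiting on input.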

Changed in mir:
status: Triaged → Incomplete
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Incomplete. No idea what the status of this one is now.

Changed in xserver-xorg-video-ati (Ubuntu):
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for xserver-xorg-video-ati (Ubuntu) because there has been no activity for 60 days.]

Changed in xserver-xorg-video-ati (Ubuntu):
status: Incomplete → Expired
Changed in mir:
assignee: Daniel van Vugt (vanvugt) → nobody
kevin gunn (kgunn72)
tags: added: xmir
summary: - Unity doesn't start on ATI test machine (Mir fails to respond to
+ [xmir] Unity doesn't start on ATI test machine (Mir fails to respond to
drm_auth_magic request)
kevin gunn (kgunn72)
Changed in mir:
importance: High → Medium
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

XMir 1.0 (the old Xorg extension) is now deprecated and is not being maintained or fixed. It is replaced by the new 'Xmir' binary (package 'xmir') introduced in Ubuntu 15.10 wily.

Changed in mir:
status: Incomplete → Won't Fix