X starts on wrong tty because gdm starts before nvidia driver is ready

Bug #625239 reported by Bilal Akhtar on 2010-08-27
112
This bug affects 20 people
Affects Status Importance Assigned to Milestone
gdm (Ubuntu)
High
Unassigned
Lucid
Undecided
Unassigned
Maverick
High
Unassigned
nvidia-graphics-drivers (Ubuntu)
High
Unassigned
Lucid
Undecided
Unassigned
Maverick
High
Unassigned

Bug Description

This bug has been repurposed based on the comments beginning at #15. The original submitter's bug was a duplicate of bug #626723.

When booting certain systems with fast disks, gdm will start up and attempt to launch X before the nvidia kernel driver has fully initialized. This results in gdm restarting X again, without the 'vt7' argument, which may cause X's first-available-VT autoselection to pick a VT between 1 and 6 because the getty jobs have not all yet started. Ultimately this leads to X crashing when pressing certain keys (such as Enter or '2') because the tty has been set into a mode by getty that X is not prepared to handle.

This bug is currently only reported against lucid.

Bilal Akhtar (bilalakhtar) wrote :
Bryce Harrington (bryce) on 2010-08-27
affects: xorg (Ubuntu) → xserver-xorg-video-ati (Ubuntu)
Bryce Harrington (bryce) on 2010-08-27
Changed in xserver-xorg-video-ati (Ubuntu):
status: New → Confirmed
Bilal Akhtar (bilalakhtar) wrote :

This part of the Gdm 1 log is suspicious:

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x3b) [0x80e83bb]
1: /usr/bin/X (0x8048000+0x5da8d) [0x80a5a8d]
2: (vdso) (__kernel_rt_sigreturn+0x0) [0xc0840c]
3: /usr/bin/X (0x8048000+0x25d3e) [0x806dd3e]
4: /usr/bin/X (0x8048000+0x1a5ba) [0x80625ba]
5: /lib/libc.so.6 (__libc_start_main+0xe7) [0x3c8ce7]
6: /usr/bin/X (0x8048000+0x1a191) [0x8062191]

Caught signal 3 (Quit). Server aborting

Please consult the The X.Org Foundation support
  at http://wiki.x.org
 for help.
Please also check the log file at "/var/log/Xorg.0.log" for additional information.

I think this one should be triaged. Should I mark it so? If it has been confirmed and it has enough details to get fixed, and if it has no dups then it should be triaged.

summary: - Pressing enter kills X
+ Pressing enter key kills X
Changed in xserver-xorg-video-ati (Ubuntu):
importance: Undecided → Medium
Bilal Akhtar (bilalakhtar) wrote :

It appears bug #532047 is the correct one and this is a dup.

Bilal Akhtar (bilalakhtar) wrote :

I did an experiment. I renamed the plymouth-splash.conf file in the /etc/init folder to *.disabled and then re-started without plymouth. It worked!

This problem is affecting plymouth and not X.

summary: - Pressing enter key kills X
+ Pressing Enter crashes both X and plymouth in graphical mode
description: updated
Bryce Harrington (bryce) on 2010-08-29
tags: added: crash

This bug is still here from me in Maverick.
plymouth 0.8.2-2ubuntu3.
The crashed session is on VT7, new one on VT8.
Crashes on ENTER. Not tried "2".
Intel 4500 graphics. This is unrelated to ati.
Encrypted home, if it matters.

Andres (andres-erbsen) wrote :

I have intel graphics card and still have this bug, so it's unrelated to ati driver.

Changed in xserver-xorg-video-ati (Ubuntu):
status: Confirmed → Invalid
Robert Hall (reh07) on 2010-08-29
Changed in plymouth (Ubuntu):
status: New → Confirmed
Changed in plymouth (Ubuntu):
importance: Undecided → Medium
Andres (andres-erbsen) wrote :
Daniel Hahler (blueyed) on 2010-09-01
Changed in plymouth (Ubuntu):
status: Confirmed → Triaged
Steve Langasek (vorlon) on 2010-09-01
Changed in plymouth (Ubuntu):
status: Triaged → Confirmed
Bilal Akhtar (bilalakhtar) wrote :

I think its good to triage this one, I don't know why steve didn't. Should I?

Steve Langasek (vorlon) wrote :

"triaged" means "contains all the information necessary for a developer to start working on it". I don't know this to be the case - when I look at this bug, *I* don't see what appears to be the information needed to reproduce and debug this; if this were a widespread problem there would be many more open bug reports about it, so I think this is specific to your environment, in a way that will be difficult to pinpoint. So this bug is confirmed, but not triaged.

Bilal Akhtar (bilalakhtar) wrote :

I got the reason why this is happening. When X starts up, it sends a deactivate request to plymouth and not a quit request. So, plymouth keeps running, and it interprets the keypresses of enter and 2 to be of a quit request.

Bilal Akhtar (bilalakhtar) wrote :

Hence, I have made a patch on my sys (testing right now) which is more of a 'hack' rather than a fix, since it instructs plymouth to quit on deactivate requests as well. Should work well, since during the boot phase, its the only time it gets a deactivate request.

Bilal Akhtar (bilalakhtar) wrote :

Didn't work, give up

Robert Hall (reh07) wrote :

@Steve Langasek:

What log/trace are we missing?

Jamie Krug (jamie-thekrugs) wrote :
Download full text (3.4 KiB)

After spending hours pouring over Google results and bug reports like bug #529230 and bug #532047, I believe this is the bug I'm seeing on my Lucid (Ubuntu 10.04 64-bit on a new System76 Serval Pro--serp6) laptop. I have a decent amount of information to share, which corroborates this bug, and I've tried every workaround possible with no luck.

Honestly, bug #529230 appears a better fit, but it's listed as a duplicate of bug #532047, which supposedly released a fix. It was suggested that I comment on this bug. The slight difference between my experience and this bug's description is that I don't see a crash as soon as I press Enter, but rather later in my session it will crash when hitting Enter (after having pressed Enter many of times already).

Here is what I can confirm:

1. Upon first boot, I login to GNOME, and after normal use anywhere between 5 minutes and an hour, it will crash (always as I press Enter, to the best of my knowledge). I see a blank screen, and then the login screen again.

2. If I login again (after a crash, or if I just logout right away and login again), I never see another X reset until after a reboot.

3. After an initial boot up and first login, I'm always on tty1 or tty2, and I'll always get the X reset eventually (I'm certain I've witnessed this on both VT1 and VT2).

4. After logging in a second time (or if I just logout and login again after a boot), I'm always on VT8 and I never have an X reset.

5. On VT7 (Ctrl-Alt-F7), for both first and second logins, I always see an error like this:
(process:516) GLib-WARNING **: getpwuid_r(): failed due to unknown user id (0)

6. I see the following in auth.log, just before a crash (cat /var/log/auth.log | grep FAILED):
Sep 1 21:38:16 jkrug-serval login[6215]: FAILED LOGIN (1) on '/dev/tty1' FOR 'UNKNOWN', Authentication failure
Sep 1 21:38:19 jkrug-serval login[6215]: FAILED LOGIN (2) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:22 jkrug-serval login[6215]: FAILED LOGIN (3) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:25 jkrug-serval login[6215]: FAILED LOGIN (4) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:27 jkrug-serval login[6215]: FAILED LOGIN (5) on '/dev/tty1' FOR 'UNKNOWN', Error in service module

Here are the suggested workarounds that do NOT work for me:

1. I've tried adding APT source ppa:ubuntu-x-swat/x-updates and reinstalling my nVidia (current) driver, as suggested in this thread on the issue:
http://ubuntuforums.org/showthread.php?t=1545846

2. I tried changing /etc/default/grub back to standard distro settings instead of System76-shipped version (also suggested in above mentioned thread).

3. I also tried the "temporary workaround" in bug #532047 -- sudo mv /etc/init/plymouth-splash.conf /etc/init/plymouth-splash.conf.disabled -- no change. *Actually, I later noticed that on the day I tried this, I definitely had an X reset/crash, but noticed that there was no "UNKNOWN" grep from /var/log/auth.log on that date. So, slight change, or just an anomaly, but either way still a horribly annoying bug.

I've spent many hours on this and would love to see it go away. BTW, I have nVidia current driv...

Read more...

John S. Gruber (jsjgruber) wrote :

@Jamie:
If you don't mind my asking, what version of Plymouth are you running?

Jamie Krug (jamie-thekrugs) wrote :

@John:
Synaptic shows plymouth 0.8.2-2ubuntu2.

Steve Langasek (vorlon) wrote :

Bilal,

X sends no requests at all to plymouth. The display manager (gdm, kdm, etc) does this; and each of these DMs are patched in Ubuntu to send plymouth a quit request at the appropriate time. If you are not using the Ubuntu 10.04 packages for the display manager, it's possible this is the cause of the problem you're seeing.

plymouth is NOT supposed to quit on 'deactivate' requests, and any patch to make it do this would be rejected.

Anyway, if pressing enter causes X to exit, the problem is most specifically *not* a result of plymouth capturing and responding to these keypresses; it's almost certainly a result of X being started on the wrong VT, one which already has a getty running on it. What VT is X listed as running on in a process list when this happens? (You'll want to switch to another VT and log in on the console to check this.)

Jamie Krug (jamie-thekrugs) wrote :

Steve,

I have X running on VT1 or VT2 after every reboot, and I always eventually have X crash on me when pressing Enter (but never the first pressing of Enter).

On Fri, 2010-09-03 at 20:24 +0000, Steve Langasek wrote:
> Bilal,
>
> X sends no requests at all to plymouth. The display manager (gdm, kdm,
> etc) does this; and each of these DMs are patched in Ubuntu to send
> plymouth a quit request at the appropriate time. If you are not using
> the Ubuntu 10.04 packages for the display manager, it's possible this is
> the cause of the problem you're seeing.

You're right. This was the first time I tried to play with plymouth, and
I messed up stuff. Thankfully, I was able to revert back to the Ubuntu
packages without any further problem and now I am back to ground zero.

>
> plymouth is NOT supposed to quit on 'deactivate' requests, and any patch
> to make it do this would be rejected.
>
> Anyway, if pressing enter causes X to exit, the problem is most
> specifically *not* a result of plymouth capturing and responding to
> these keypresses; it's almost certainly a result of X being started on
> the wrong VT, one which already has a getty running on it. What VT is X
> listed as running on in a process list when this happens? (You'll want
> to switch to another VT and log in on the console to check this.)

I will check this soon.
>

Bilal Akhtar (bilalakhtar) wrote :

On Fri, 2010-09-03 at 20:24 +0000, Steve Langasek wrote:
> Bilal,
>
> X sends no requests at all to plymouth. The display manager (gdm, kdm,
> etc) does this; and each of these DMs are patched in Ubuntu to send
> plymouth a quit request at the appropriate time. If you are not using
> the Ubuntu 10.04 packages for the display manager

FYI I am using Maverick with all updates installed.

>
> plymouth is NOT supposed to quit on 'deactivate' requests, and any patch
> to make it do this would be rejected.
>
> Anyway, if pressing enter causes X to exit, the problem is most
> specifically *not* a result of plymouth capturing and responding to
> these keypresses; it's almost certainly a result of X being started on
> the wrong VT, one which already has a getty running on it. What VT is X
> listed as running on in a process list when this happens? (You'll want
> to switch to another VT and log in on the console to check this.)
>

Probably Jamie is having a different problem, since the workaround on bug #532047 worked very well for me.

dino99 (9d9) wrote :

what i've experienced:

- on lucid i386 + x-swat, with no encryption: no issue at all about this problem
- on maverick updated (i386 + x-swat + no encryption):

system boot fine then login seem ok ( no crash when validating password) but later when i need to enter password again ( for example when asked by synaptic or any other app/service), at that time, validating this password (by pressing "enter") drop me back to gdm login. Reentering password and validating it again work fine this time and dont crash till next reboot.

On my end, this issue is only with maverick and appear 2 weeks ago ( my installation is with gnome+compiz, will try without compiz and report back if any change)

dino99 (9d9) wrote :

what is strange: when i watch errors logged into auth.log, my password is declared each time as "incorrect" (im sure its good) after the first login

dino99 (9d9) wrote :

what happen after few tweaks:

- compiz package installed, system/prefs/appearance/visual effects: no visual effects
- system/prefs/compizconfig: disabled all "effects" & "extras" settings & session manager (into utils)

on next reboot: first login always fail, second is fine. Got these errors logged:

 polkitd(authority=local): Unregistered Authentication Agent for session /org/freedesktop/ConsoleKit/Session3 (system bus name :1.37, object path /org/gnome/PolicyKit1/AuthenticationAgent, locale fr_FR.utf8)
 gdm-session-worker[1562]: pam_succeed_if(gdm:auth): requirement "user ingroup nopasswdlogin" not met by user "oem"
 gdm-session-worker[1562]: pam_unix(gdm:auth): conversation failed
 gdm-session-worker[1562]: pam_unix(gdm:auth): auth could not identify password for [oem]
 gdm-session-worker[1562]: pam_winbind(gdm:auth): getting password (0x00000388)
 gdm-session-worker[1562]: pam_winbind(gdm:auth): Could not retrieve user's password
 gdm-session-worker[2716]: pam_succeed_if(gdm:auth): requirement "user ingroup nopasswdlogin" not met by user "oem"
 gdm-session-worker[2716]: pam_unix(gdm:session): session opened for user oem by (uid=0)
 gdm-session-worker[2716]: pam_ck_connector(gdm:session): nox11 mode, ignoring PAM_TTY :0

Bilal Akhtar (bilalakhtar) wrote :

Sep 5 10:56:23 bilal-laptop gdm-session-worker[2039]: pam_succeed_if(gdm:auth): requirement "user ingroup nopasswdlogin" not met by user "bilal"
Sep 5 10:56:28 bilal-laptop gdm-session-worker[2039]: pam_unix(gdm:session): session opened for user bilal by (uid=0)

It appears that GDM is putting up a false requirement for login , which is failing when we press enter.

dino99 (9d9) wrote :

new tweak: removing/purging everything about compiz, then rebooting

this time this issue is gone: login work as it might, reinstalling compiz make this issue back again: so as this happen on maverick and not with lucid (without compiz), it seems to me that something into pam/gdm/compiz settings process is conflicting with password built with letters+num

 my gdm login is roc53sa: so into gdm login, i first enable numlock then validate my password, and i've often seen numlockx been disabled when re-entering password for app or service. The problem exist with/without compiz and cant find what is disabling numlock: i only see it disabled when pressing "enter" to validate a password, so is it a keyboard setting issue or a utf8-FR or numlockx itself or gdm ?

John S. Gruber (jsjgruber) wrote :

I'm an Ubuntu amateur, so believe what Steve says over what I say. He's no amateur.

I don't think the message "requirement "user ingroup nopasswdlogin" not met by user "bilal" " is a problem. I see from the forum post http://ubuntuforums.org/showthread.php?t=819198 that this is a requirement that a userid like "guest" might meet so that no password is required for that particular userid. Userid's requiring passwords should not be in this group. If I'm right, and if the userid bilal and oem require passwords, that particular message in the log would be expected.

The reason I suggested in now closed duplicate bug #529230 that someone look in the auth file was to look for evidence that the getty program , responsible for getting people logged in on virtual consoles 1-6, was active seeing keystrokes coming from the xsession and getting confused. This only happens if x is started on tty1-tty6, as is happening to Jamie. See Steve's comment 18 above.

This probably isn't happening to Bilal. His xorg.0.log.old has the X crash and that log says it started on vt7. Andres says his crashing X is starting on vt7.

I haven't gotten my head around dino99's report, but I don't think most people who have had this problem have had a problem with authentication, only with interference with x, either from plymouth (which should have quit just before X started, if I'm correct), or from getty. In the case of getty, I think the interference only crashes X after getty thinks, mistakenly, that it has seen several login attempts and restarts. It messes with the virtual terminal when it does that and that does in X when you next press the 2 or enter key. That might be the reason for Jamie's delay before the crash.

John S. Gruber (jsjgruber) wrote :

@Jamie:
What are you seeing right before gdm brings up the login screen? In particular, are you watching the splash screen?

From the command "sudo ls -l /var/run/gdm" when was the firstserver.stamp file created?

So, I think there may be two separate issues here:
1) X started on the wrong tty, resulting in getty eventually timing out and resetting the tty input, manifests as “X crashes when pressing enter after 5 minutes or so”.
2) X started on the correct tty, but with incorrect input settings, manifests as “X crashes as soon as I press enter or 2 in gdm”. In this case, tty7 has some plymouth garbage text on it, including some “\▉”.

dino99 & Jamie Krug seem to have bug (1). Bilal & I have bug (2).

I think this bug should continue to be a discussion of bug (1), where X starts on the wrong tty and then gets killed after a timeout. Further work on bug (2) can continue on bug #626723.

summary: - Pressing Enter crashes both X and plymouth in graphical mode
+ X starts on wrong tty: pressing enter after 5 minutes crashes X
Bilal Akhtar (bilalakhtar) wrote :

Okay, I agree with what Chris is saying, it is correct that whenever I switched back to tty7 after the crash, I saw ^C or \▉ over there.

Changed in plymouth (Ubuntu):
importance: Medium → High
Jamie Krug (jamie-thekrugs) wrote :

@John:
I do see a splash screen. Everything is relatively quick, as I have an SSD in my new System76 laptop :D I see a blank screen with a blinking cursor, followed by a quick flicker to the splash screen, followed by another blank-screen/flicker, and then the login screen. I think there's a very quick flash of text either before or after the Ubuntu splash screen, which seems it might be the message I always see on tty7: "GLib-WARNING **: getpwuid_r(): failed due to unknown user id (0)".

As for the timestamp of the file /var/run/gdm/firstserver.stamp, it seems to be the timestamp of the most recent clean boot.

Thanks to all for the help. It sounds like I'm commenting on the right bug, which is exciting in and of itself. I have a friend who purchased a nearly identical System76 Serval laptop around the same time, and he's seeing exactly what I'm reporting as well. If there's any other logs or tests I can share to provide feedback, I'm eager to help out!

John S. Gruber (jsjgruber) wrote :
Download full text (3.1 KiB)

@Jamie:

Since I'm only another member of the community it's completely possible that I won't be able to see this through to a fix. A glance from a plymouth or gdm developer may mean more progress than I could make in days. If it's ok with you, though, maybe we can move this towards a successful triage state on our own so it will be easy for a developer to find the problem. I appreciate your willingness to try things. I may not recognize when we have enough info.

I think your reply was very interesting.

1. You have a system that flies. This bug may be a problem caused by a race condition. A faster system may have different results than a slower one. How many processors or cores does your system have? What speed? I ask this to look for correlations between your report and those by others.

2. You are configured to use splash and it is working.

3. The timestamp on your file is correct. Because you were occasionally starting on tty2, I had myself half convinced that it would have an old date and that gdm wasn't even trying to start on the same virtual terminal as plymouth. It may be that gdm thinks plymouth's active terminal is tty1 (or maybe even tty2).

4. Between the time you see the splash and you see the login you see a blink. There isn't supposed to be a blink. That may be where your virtual console is switching from tty7 to tty1 and X is sealing its fate by starting there (or may not).

Some basics first:

You said you are running plymouth version plymouth 0.8.2-2ubuntu2. What version of gdm and upstart are you running?

Next:

Could you try adding the boot option plymouth:debug in grub and on that boot recreate the problem? Could you then attach the plymouth debugging log, gdm log, and the log from the failing X. It will probably be in Xorg.0.log.old after you restart. The failing X log will contain a traceback and say something about a quit signal or another signal depending upon the key you pressed when it crashed.

@Dino:
If you are still with us. Would you mind terribly telling us, when it fails for you, what you see? What versions of plymouth, gdm, and upstart are you running? What is your computer like? Like Jamie, does the creation time of /var/run/gdm/firstserver.stamp match the login time (it should). I'm just trying to confirm there aren't two separate underlying reasons for starting X on tty1 or tty2.

Warning: If this is a race condition just adding plymouth:debug might change the timing and "fix" the problem by slowing plymouth down. The logs may still be helpful. I'm thinking that it's a race condition and that Dino's compiz experiment changed the timing. I certainly may be wrong.

The purpose of the logs is to make sure plymouth is running on tty7 as expected, to see if gdm is asking if plymouth is active on plymouth's virtual terminal, and to watch where X starts. The idea is to get an idea where things get directed to tty1. We won't need the auth logs. They simply helped establish the interference from the getty process. We already know if you are starting X on tty1 that we have this problem.

Whether you each have the time and inclination to continue on with me or not, thanks to you both for posting your experience...

Read more...

dino99 (9d9) wrote :

hi John,

here are your #33 requests:

my system:
home made desktop: p5wdh+q9550+8500gt+4go ddr2+2pata+1sata
os: maverick i386 with latest main server updates (35-20-generic-pae)
packages: plymouth 0.2.8.2-2ubuntu3, gdm 2.30.5-0ubuntu2, upstart 0.6.6-3, nvidia-current 256.53-0ubuntu2

note: compiz completly removed/purged

how it boot:

start as it might, gdm ask for password, i enable numlock as my password is built with num+letters, write the password and validate (enter): at that time gdm fail and it ask again for password, but numlock is again disabled (strange), i enabled numlock again and validate my password, now its working.

what is logged:

i've booted with plymouth:debug on the boot line.
/var/run/gdm/ : cant read it, the icon folder is greyed with an x, own by root:gdm (i've my user added to gdm group)

gdm-session-worker[1201]: GLib-GObject-CRITICAL: g_value_get_boolean: assertion `G_VALUE_HOLDS_BOOLEAN (value)' failed

dino99 (9d9) wrote :
dino99 (9d9) wrote :

some errors seen into debug.log

dino99 (9d9) wrote :

root@dub:/var/run/gdm# ls
auth-for-gdm-tIW1mo auth-for-oem-Z9n9W2 firstserver.stamp

firstserver.stamp is an empty file

John S. Gruber (jsjgruber) wrote :

merci, Dino

Jamie Krug (jamie-thekrugs) wrote :

@John: Thanks very much for helping this along.

I watched very closely when I booted this morning, and I'm certain that I'm seeing a flash of a text screen just after the Ubuntu splash, but before the GDM login screen, and I'm certain that it is showing this message near the top:
"GLib-WARNING **: getpwuid_r(): failed due to unknown user id (0)"

I can also confirm that this is the message, as always, at the top of the tty7 screen (I can press Ctrl-Alt-F7 before even logging in to confirm this). As always, I'm usually on tty2 (or tty1) after a boot, and tty8 after logging out and logging in again.

Here are my hardware specs:
System76 Serval Pro (SER-P6) laptop
Ubuntu 10.04 (Lucid Lynx) 64 Bit Linux
Core i7-840QM Processor ( 45nm, 8MB L3 Cache, 1.86GHz )
8 GB - DDR3 1333 MHz
Nvidia GeForce GTX 285M Graphics with 1GB GDDR3 Video Memory
160 GB Intel X25-M Solid State Drive

I can provide the logs you requested soon, but have a couple questions. To add the boot option plymouth:debug in grub, would I just append to the GRUB_CMDLINE_LINUX_DEFAULT parameter in my /etc/grub/default file and run update-grub? So, it would look like this:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash plymouth:debug"

I'm also uncertain as to where the plymouth debugging log will be located.

As for a gdm log, I see a number of different logs under /var/log/gdm/ -- which one should I provide?

dino99 (9d9) wrote :

hi Jamie,
have done it by editing the boot line (E) into grub menu ( added at end of quiet splash ) then boot (ctrl+x)

tags: added: patch
description: updated
summary: - X starts on wrong tty: pressing enter after 5 minutes crashes X
+ X starts on wrong tty: pressing enter after 5 minutes crashes X; gdm
+ restart of X involved
Steve Langasek (vorlon) on 2010-09-14
summary: - X starts on wrong tty: pressing enter after 5 minutes crashes X; gdm
- restart of X involved
+ X starts on wrong tty because gdm starts before nvidia driver is ready
Steve Langasek (vorlon) on 2010-09-14
description: updated
affects: plymouth (Ubuntu Maverick) → gdm (Ubuntu Maverick)
affects: xserver-xorg-video-ati (Ubuntu Maverick) → nvidia-graphics-drivers (Ubuntu Maverick)
Changed in nvidia-graphics-drivers (Ubuntu Maverick):
status: Invalid → Confirmed
description: updated
tags: added: lucid
removed: maverick
Changed in nvidia-graphics-drivers (Ubuntu Lucid):
status: New → Confirmed
Changed in gdm (Ubuntu Maverick):
status: Confirmed → New
Changed in nvidia-graphics-drivers (Ubuntu Maverick):
status: Confirmed → Incomplete
Changed in gdm (Ubuntu Maverick):
status: New → Incomplete
Changed in gdm (Ubuntu Lucid):
status: New → Confirmed
Changed in nvidia-graphics-drivers (Ubuntu Maverick):
importance: Medium → High
87 comments hidden view all 167 comments
Jamie Krug (jamie-thekrugs) wrote :

Steve,
I meant to confirm that the boot that produced the output file attached to comment #127 did indeed land me with X on vt2, and I did crash, as expected, after a few minutes, when pressing enter in Terminal.

Seth (bugs-sehe) wrote :

Here is my complete kernel.log since last reboot, just after downgrading to the regular gdm package for lucid (no workarounds).
You can do your own greps if you wish :)

-------------------------------------------------------------------------------
Now if someone is ever interested in more details, what follows is a prosaic account of my last session 'with the issue'. This is hardly new information just more detail. It also serves as evidence that the attached log file is relevant as the symptoms are seen.
-------------------------------------------------------------------------------

The boot went fairly quickly, and thus, expectedly, X and gdm landed on stty2. I then went for coffee - nothing happens. I subsequently read my mail, nothing happened, I harvested this log and started typing. That's when the signs start to appear!

I had the following watch running:

Every 2.0s: date; date -r /var/run/gdm/firstserver.stamp; stty -F /dev/tty2 Tue Sep 14 16:36:25 2010

   Tue Sep 14 16:36:25 CEST 2010
   Tue Sep 14 15:55:16 CEST 2010
   speed 38400 baud; line = 0;
   -brkint ixoff -imaxbel
   -isig -iexten -echo

After about 50 keystrokes in succession (approx!) the scroll lock light came on. I froze, because I knew this was the signal that the vt is about to crash X on reception of certain scan-codes. I was sly enough quietly slip into vt1 and do:

root@lucid:~# date; while true; do sleep .5; stty -F /dev/tty2 -isig; done
Tue Sep 14 16:28:05 CEST 2010

You can see how that would provide me with (racy) immunization against terminating the X session.
I'm still on the same terminal/X session now at 16:53. Every so often, the scrolllock light will toggle on or off (keys must be pressed for it to happen but there doesn't seem to be a pattern in timing, sometimes quite rapidly in succession).

The terminal 'flags' do change every now and then, but not in synch with the scrolllock light. Note again how I do not normally have a functioning scrolllock key/light (not even when X is on a separate vt7 without any other confusion).

Approx. halft the time the output of 'stty -F /dev/tty2 'is

    speed 38400 baud; line = 0;
    -brkint ixoff -imaxbel
    -isig -iexten -echo

Once in a while the output changes to the following for some time:

    speed 38400 baud; line = 0;
    min = 1; time = 0;
    -brkint -icrnl -imaxbel
    -opost -onlcr
    -isig -icanon -iexten -echo -echoe -echok -echoctl -echoke

Note that this is when I'm keeping -isig artificially 'low' by using the background loop running on vt1. I wouldn't be typing this otherwise :)
The symptoms went away when I issued 'stop tty2'
The time is 16:59 now

Jamie Krug (jamie-thekrugs) wrote :

Steve,

> You might also be able to confirm this is the problem by commenting out
> the "graphics-device-added fb0" part of the start condition.

Okay, testing with standard disto version of /etc/default/grub and the following in my /etc/init/gdm.conf:

start on (filesystem
          and started dbus
          and (drm-device-added card0 PRIMARY_DEVICE_FOR_DISPLAY=1
               or stopped udevtrigger))
stop on runlevel [016]

Boot seemed to be fairly quick and I did NOT notice the flicker I used to see after the splash screen. It was a pretty smooth transition to the GDM login screen, and I'm on tty7. I've also been working for 40 minutes now without a crash, which seems promising. Are there any logs you'd like me to report with this configuration/boot?

On 09/14/2010 05:09 PM, Jamie Krug wrote:
> Steve,
>
>
>> You might also be able to confirm this is the problem by commenting out
>> the "graphics-device-added fb0" part of the start condition.
>>
> Okay, testing with standard disto version of /etc/default/grub and the
> following in my /etc/init/gdm.conf:
>
> start on (filesystem
> and started dbus
> and (drm-device-added card0 PRIMARY_DEVICE_FOR_DISPLAY=1
> or stopped udevtrigger))
> stop on runlevel [016]
>
> Boot seemed to be fairly quick and I did NOT notice the flicker I used
> to see after the splash screen. It was a pretty smooth transition to the
> GDM login screen, and I'm on tty7. I've also been working for 40 minutes
> now without a crash, which seems promising. Are there any logs you'd
> like me to report with this configuration/boot?
>
>
So let me get this straight, you just removed the "graphics-device-added
fb0 PRIMARY_DEVICE_FOR_DISPLAY=1" disjunction?
I might try that too. This could be brittle, because all this could do
is reduce the time window for the race condition to happen. I will
however give that a shot so I can confirm whether it would help at my
system too.

Seth

Jamie Krug (jamie-thekrugs) wrote :

Seth,

> So let me get this straight, you just removed the "graphics-device-added
> fb0 PRIMARY_DEVICE_FOR_DISPLAY=1" disjunction?

Yes, that is exactly what I did, and nothing else.

Jamie Krug (jamie-thekrugs) wrote :

Steve, Seth,

The "graphics-device-added fb0 PRIMARY_DEVICE_FOR_DISPLAY=1" removal tweak is not a guaranteed workaround after all. I just booted up again and crashed within a minute of use. I was just waiting for a Chrome tab to open--no keyboard activity this time.

On Tue, Sep 14, 2010 at 07:58, Steve Langasek
<email address hidden>wrote:

> has anyone reproduced this nvidia+gdm crash on maverick? Marking
> incomplete for maverick pending confirmation.
>
On my system, this bug was fixed.

Steve Langasek (vorlon) wrote :

On Tue, Sep 14, 2010 at 08:21:09AM -0000, Seth wrote:
> PS. the description no longer reflects that people with ATI cards have
> reported this issue (have I misread the comments or did you miss that?)

The only comments in this report from someone seeing issues on ATI are from
the original submitter, who was experiencing the apache2 stty bug. Brian
claims in comment #93 that he's havng "the problem" on intel graphics, but
offers no concrete information in support of this.

All the evidence points to an nvidia-specific race condition.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Seth (bugs-sehe) wrote :

On 09/14/2010 06:47 PM, Steve Langasek wrote:
> On Tue, Sep 14, 2010 at 08:21:09AM -0000, Seth wrote:
>
>> PS. the description no longer reflects that people with ATI cards have
>> reported this issue (have I misread the comments or did you miss that?)
>>
> The only comments in this report from someone seeing issues on ATI are from
> the original submitter, who was experiencing the apache2 stty bug. Brian
> claims in comment #93 that he's havng "the problem" on intel graphics, but
> offers no concrete information in support of this.
>
> All the evidence points to an nvidia-specific race condition.
>
>

Ok thanks for straightening that out

Seth (bugs-sehe) wrote :

On 09/14/2010 05:49 PM, Jamie Krug wrote:
> no keyboard activity this time.
>
Erm... perhaps this is a different bug? Did you inspect the log(s) to
see what aborted X?

Jamie Krug (jamie-thekrugs) wrote :

I seem to have discovered a slightly new style of crashes, which I believe have only occurred when I have a modified /etc/init/gdm.conf file. Yesterday and today I've witnessed this new crash behavior with both:
* Seth's /etc/init/gdm.conf workaround from https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/532047/comments/91
* Steve's suggested test that I note in comment #130 (which didn't crash immediately, but did upon a second boot)

These are the characteristics of my "new" crash experience:
* Boots with X on tty7 and seems fine for a while.
* Crashes without keyboard cause (sometimes using mouse/click, one time doing nothing but waiting for IntelliJ IDEA to open).
* Crash to login screen, then X on tty8.
* After crash and relogin to tty8, usually another crash very quickly (again, not related to keyboard activity), which then moves X to tty9.
* After a crash, I'm unable to switch to any other tty--all show completely blank screen.

My original issue, which I can reproduce any time by simply keeping the original /etc/init/gdm.conf and /etc/default/grub in place, was unique to the above in the following ways:
* First boot always to tty1 or tty2 (usually tty2).
* Always crashed within an hour of working, when pressing Enter.
* Always on tty8 after second login, and never crashed again until a reboot.
* Both before and after a crash, I was still able to switch to other ttys with no problem.

The one workaround that seems to always work, to the best of my knowledge, is when I only add "vga=799" to the linux boot line, and change nothing else. I'm pretty sure I went days without a crash with Seth's gdm.conf tweak, so it was odd when it suddenly started crashing like crazy yesterday.

Seth (bugs-sehe) wrote :

On 09/14/2010 07:50 PM, Jamie Krug wrote:
> I seem to have discovered a slightly new style of crashes, which I believe have only occurred when I have a modified /etc/init/gdm.conf file.

May I suggest _not_ tweaking your gdm.conf in that particular way then?

> change nothing else. I'm pretty sure I went days without a crash with
> Seth's gdm.conf tweak, so it was odd when it suddenly started crashing
> like crazy yesterday.
>
>
Perhaps you did not notice, but I have posted a fix that works for me.
It is a patch against gdm and the package is available from
https://launchpad.net/~bugs-sehe/+archive/gdm625239

To test: simply

    sudo add-apt-repository "deb
    http://ppa.launchpad.net/bugs-sehe/gdm625239/ubuntu lucid main"
    sudo apt-get update
    sudo apt-get install --reinstall gdm

To revert:

    sudo sed -i '/bugs-sehe\/gdm625239/d' /etc/apt/sources.list
    sudo apt-get update
    sudo apt-get install --reinstall gdm

If apt-get baulks at the reinstall step (there seems to be a bug hidden
there), you might use synaptic 'Package/Force Version... (Ctrl+E)' menu
item to select the desired version. Of course you can always simply
'apt-get remove gdm; apt-get install ubuntu-desktop' but I personally
hate the big-gun approach :)

I'm very curious to know whether it'll fix your situation as well

Seth (bugs-sehe) wrote :

I just noticed there is a slightly cleaner approach to the source manipulation fu:

    sudo apt-add-repository ppa:bugs-sehe/gdm625239

and vice-versa:

    sudo rm -fv /etc/apt/sources.list.d/bugs-sehe-gdm625239-lucid.list

Jamie Krug (jamie-thekrugs) wrote :

Seth,

> May I suggest _not_ tweaking your gdm.conf in that particular way then?

Yes :) I did so only as a workaround, but obviously that was not stable. I also just discovered that the same crashes can now occur even with a untouched gdm.conf. Possibly a different bug, sure, but I hope not!?!

> Perhaps you did not notice, but I have posted a fix that works for me.
> It is a patch against gdm and the package is available from
> https://launchpad.net/~bugs-sehe/+archive/gdm625239

I had not noticed you mention a fix, but not realize it was made available in your PPA. Thank you very much for the detailed instructions--that's what I needed.

BTW, I just disabled my nVidia drivers, so I could at least post an update without crashing, and I noticed that my boot was notably MUCH faster. I'll be curious to see if I notice the same with your fix. Off to apply your patch. Thanks!

Seth (bugs-sehe) wrote :
  • force.sh Edit (350 bytes, application/x-sh; name="force.sh")

On 09/14/2010 05:19 PM, Seth wrote:
> This could be brittle, because all this could do
> is reduce the time window for the race condition to happen
I can confirm my suspicion to be right because after - granted: many-
some reboots I saw the log message again in successive reboots:

    /var/log/syslog:Sep 14 20:38:39 localhost gdm-binary[984]: WARNING:
    GdmDisplay: display lasted 0.695752 seconds
    /var/log/syslog:Sep 14 20:42:04 localhost gdm-binary[1060]: WARNING:
    GdmDisplay: display lasted 0.198205 seconds

This was running _with_ my fixed gdm package, but from the analysis we
know that in this case the X desktop would have ended up shared with a
getty. If you are interested, you can easily verify for yourself how the
"nvidia not ready" condition will generate this message by using the
attached reproduction script:

    #!/bin/bash
    stop gdm
    sleep 2
    pgrep -l X && pkill X
    rm -fv /var/run/gdm/firstserver.stamp
    while lsmod | grep -q nvidia
    do
        echo unloading removing nvidia module
        rmmod nvidia
        sleep 1
    done
    lsmod | grep nv | nl
    {
        modprobe nvidia-current &
        start gdm &
        wait
    }
    sleep 5; chvt 1
    grep "display lasted" /var/log/syslog | tail
    date +"Time is now: %D %T"

Such a script, unfortunately, is not suitable for testing upstart
mutations (because the other subsystems interacting are not being
restarted; only a boot will give the right order and timing for these...)

Seth (bugs-sehe) wrote :

On 09/14/2010 09:28 PM, Jamie Krug wrote:
> BTW, I just disabled my nVidia drivers, so I could at least post an
> update without crashing, and I noticed that my boot was notably MUCH
> faster. I'll be curious to see if I notice the same with your fix. Off
> to apply your patch. Thanks!
>
If you are desparate for a workaround, the following should work, just
append it to rc.local (before exit 0...)

    gdmtty="/dev/$(ps --no-heading -o tty -p $(pgrep X))"
    while true
    do
        sleep .1
        stty -F "$gdmtty" -isig
    done

As I've documented before, I could be running 'stably' with X on a tty
shared with getty for more than an hour, seeing all the other signs
(scroll lock light toggling slowly etc.)

Jamie Krug (jamie-thekrugs) wrote :

Seth, thanks for the additional workaround. I've applied your patch from ppa:bugs-sehe/gdm625239 and all has been well for about an hour. My boot time seemed a little slow again, after re-enabling my nVidia drivers, but I can live with that if I have a stable X session! I will report back if I see another crash, but so far your fix appears to be successful for me as well. Thanks.

Seth (bugs-sehe) wrote :

On 09/14/2010 10:16 PM, Jamie Krug wrote:
> Seth, thanks for the additional workaround. I've applied your patch from
> ppa:bugs-sehe/gdm625239 and all has been well for about an hour. My boot
> time seemed a little slow again, after re-enabling my nVidia drivers,
> but I can live with that if I have a stable X session! I will report
> back if I see another crash, but so far your fix appears to be
> successful for me as well. Thanks.
>

Well, I can't say my boot has slowed down. My last boots were about 8
seconds (with 'my' gdm fix, attached bootchart[1]) so it would seems
that any slowness will not be attributable to nvidia init timing.
However, if you were to tell me that your boot time used to be _less_
than that ... I'm all ears !

I'm spoiled either way!

[1] I attach an older one too for comparison; the comparison is not
completely fair because I've since loaded my workstation with apache2
mod-mono for my summer tech workshop :) this, and perhaps a few other
'gadgets' may have slowed down the boot a tiny bit.

Jamie Krug (jamie-thekrugs) wrote :

Seth,

I just installed bootchart, rebooted and had 2 quick crashes (with your patch installed). I only see black screen when switching tty, so unable to check out logs so far. I'll report back soon.

Jamie Krug (jamie-thekrugs) wrote :

Seth,

With just a couple bootcharts, it appears I take 4 seconds longer to boot with nVidia drivers enabled. Also, despite having an extremely fast machine, your boots were about 400% faster than mine! I've attached the two, in case they're of any interest.

Is there any reason installing bootchart would reintroduce this bug, or does this just mean that your fix simply doesn't work for me? I could remove bootchart and see... Thanks.

Seth (bugs-sehe) wrote :

@Steve:

I've taken the trouble to install maverick beta (64bit). I installed on the same iron, same disk configuration (2x SSD in raid0, lvm2 root). After doing the server install I installed ubuntu-desktop and bootchart.

My few boots took max 5.71s in bootchart (woah) and all my X desktops were 'first time right' (with the explicit vt7 param visible in ps -ef).

So that seems to imply that the problem does not manifest itself on maverick (yet). However, since we have such good understanding of the triggering events I decided to check with my force.sh (forced reproduction script from comment #142).

Surprisingly, not only is the new nvidia/xorg combination vulnerable to the same start up race, what's more: the timing has gotten faster by just enough that gdm-simple-slave (gdm-server.c) runs out of launching attempts. The end result in that scenario is _not pretty_: the upstart job ("task"?) is in status 'running' while there is no X session at all!

All in all, it appear that the potential that this race occurs on Maverick is present just like in Ubuntu. Timings are different, and I haven't yet spontaneously encountered the situation on boot. But on manual triggering the behaviour is far worse.

Attached is the output of running force.sh on maverick (completely uptodate as of now)

Steve Langasek (vorlon) wrote :

Hi Seth,

On Wed, Sep 15, 2010 at 12:25:13AM -0000, Seth wrote:
> So that seems to imply that the problem does not manifest itself on
> maverick (yet). However, since we have such good understanding of the
> triggering events I decided to check with my force.sh (forced
> reproduction script from comment #142).

> Surprisingly, not only is the new nvidia/xorg combination vulnerable to
> the same start up race,

That's not surprising to me at all. Your force.sh does not accurately model
the conditions on startup; instead of a trigger for gdm starting based on
udev events, you are forcing gdm to start in parallel to the modprobe of
nvidia-current, which is always going to be racy.

> what's more: the timing has gotten faster by just enough that
> gdm-simple-slave (gdm-server.c) runs out of launching attempts. The end
> result in that scenario is _not pretty_: the upstart job ("task"?) is in
> status 'running' while there is no X session at all!

Oh. The intended behavior in this situation is that the gdm process will
exit non-zero, so that the /etc/init/failsafe-x.conf job will start up and
present the user with appropriate options. It sounds like this is a
separate regression in maverick; would you mind filing a bug report against
the gdm package for this?

> All in all, it appear that the potential that this race occurs on
> Maverick is present just like in Ubuntu. Timings are different, and I
> haven't yet spontaneously encountered the situation on boot. But on
> manual triggering the behaviour is far worse.

force.sh doesn't establish this at all. There *could* still be a race -
successful boots don't rule this out - but if so, we need to find it based
on looking at /var/log/udev from maverick and/or by reproducing it with an
actual boot.

Actually, there might be an easier way. Could you boot with '--verbose'
added to the boot options, and attach the resulting /var/log/syslog? This
should show us what events upstart is seeing related to video cards, and
may be enough to confirm/deny that the requisite event is being sent in
response to the nvidia driver and not an unrelated framebuffer driver.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Steve Langasek (vorlon) wrote :

On Tue, Sep 14, 2010 at 02:23:31PM -0000, Jamie Krug wrote:
> Attaching /var/log/udev, /var/log/plymouth-debug.log and /proc/fb after boot with no GRUB_GFXMODE set and:
> GRUB_CMDLINE_LINUX_DEFAULT="vga=799 plymouth:debug"
> GRUB_CMDLINE_LINUX=""

Thanks. Could you now provide the same files when *not* booting with this
workaround?

One curious thing I notice in the plymouth log is that none of the renderers
load:

[./plugin.c] open_device:could not open '/dev/fb0': No such file or directory

Can you confirm that when booting this way, you get a text-only plymouth
splash screen?

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Steve Langasek (vorlon) wrote :

On Tue, Sep 14, 2010 at 02:24:22PM -0000, Jamie Krug wrote:
> > Seth, Jamie, can one of you also show the output of 'grep fb
> > /var/log/kern.log' on a system showing this problem?

> Attached, after boot with standard configuration (no workarounds or
> related custom config at all).

> ** Attachment added: "kern.log-grep-fb.out"
> https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers/+bug/625239/+attachment/1583163/+files/kern.log-grep-fb.out

Great - it happens that this captures the output both when running with
grub_gfxmode (efifb runs, vga16fb tryies to start and fails), and without
(vga16fb starts and claims fb0). In both cases, the only framebuffer we see
here is unrelated to the nvidia kernel module, which explains how gdm is
starting before nvidia is ready.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Steve Langasek (vorlon) wrote :

On Tue, Sep 14, 2010 at 07:38:49PM -0000, Seth wrote:
> > This could be brittle, because all this could do
> > is reduce the time window for the race condition to happen
> I can confirm my suspicion to be right because after - granted: many-
> some reboots I saw the log message again in successive reboots:

> /var/log/syslog:Sep 14 20:38:39 localhost gdm-binary[984]: WARNING:
> GdmDisplay: display lasted 0.695752 seconds
> /var/log/syslog:Sep 14 20:42:04 localhost gdm-binary[1060]: WARNING:
> GdmDisplay: display lasted 0.198205 seconds

Can you include the gdm/X logs from these sessions, showing why it reset so
quickly - just to be absolutely certain of what we're seeing?

> This was running _with_ my fixed gdm package, but from the analysis we
> know that in this case the X desktop would have ended up shared with a
> getty.

<nod>

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Martin Pitt (pitti) wrote :

This sounds a lot like bug 615549 which was fixed yesterday. Is this still an issue with gdm 2.30.5-0ubuntu4?

Steve Langasek (vorlon) wrote :

On Tue, Sep 14, 2010 at 03:49:25PM -0000, Jamie Krug wrote:
> The "graphics-device-added fb0 PRIMARY_DEVICE_FOR_DISPLAY=1" removal
> tweak is not a guaranteed workaround after all. I just booted up again
> and crashed within a minute of use. I was just waiting for a Chrome tab
> to open--no keyboard activity this time.

That sounds like a different sort of crash, then. Please show the gdm/X
logs from the crash in question - as well as for any previous X server start
attempts as part of this boot cycle.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Steve Langasek (vorlon) wrote :

On Wed, Sep 15, 2010 at 06:55:49AM -0000, Martin Pitt wrote:
> This sounds a lot like bug 615549 which was fixed yesterday.

Yep, sounds like exactly the same bug.

> Is this still an issue with gdm 2.30.5-0ubuntu4?

So far the only test on maverick from someone who's confirmed to see this
bug on lucid has failed to reproduce the problem there, even /without/ the
new gdm package. Nevertheless, I think there's enough information here to
mark this as a duplicate of 615549 and target that bug for SRU.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Seth (bugs-sehe) wrote :

On 09/15/2010 07:23 AM, Steve Langasek wrote:
> Can you include the gdm/X logs from these sessions, showing why it reset so
> quickly - just to be absolutely certain of what we're seeing?
>
Good call. I've reviewed all leftover Xorg logs and there were no (EE)
messages, particularly note related to nvidia. I have no reason to
believe that logs have gone missing, so I need to be more sure to make
that call. I will keep running with the modded upstart condition (which
incidentally matches the one from maverick, I believe?):

    start on (filesystem
              and started dbus
              and (drm-device-added card0 PRIMARY_DEVICE_FOR_DISPLAY=1
                   or stopped udevtrigger))
    stop on runlevel [016]

Let you know how I fare

Seth (bugs-sehe) wrote :

On 09/15/2010 03:02 AM, Steve Langasek wrote:
> Actually, there might be an easier way. Could you boot with '--verbose'
> added to the boot options, and attach the resulting /var/log/syslog?
Will do later tonight

Martin Pitt (pitti) wrote :

Seth [2010-09-15 7:53 -0000]:
> I will keep running with the modded upstart condition (which
> incidentally matches the one from maverick, I believe?):

Correct.

Seth (bugs-sehe) wrote :
Download full text (3.4 KiB)

On 09/15/2010 09:17 AM, Steve Langasek wrote:
> So far the only test on maverick from someone who's confirmed to see this
> bug on lucid has failed to reproduce the problem there, even /without/ the
> new gdm package. Nevertheless, I think there's enough information here to
> mark this as a duplicate of 615549 and target that bug for SRU.
>
Count me in. I just went the 'IKEA way' and added a reboot loop section
to my rc.local that logs all conceivable related info per boot[1]. The
used script has been attached. Per boot, a number of logfiles is
collected, various ps listings, package versions, upstart config is
saved in a tarball.

This resulted in 79 tarballs, which I intend to attach later in sofar
Martin/Steve have been asking for the info contained.

I tested 79 boots (_real_ boots) in all, in three batches:

--------------------------------------------------
(A) with plain vanilla lucid config and package: 40 boots. 30 of these
boots landed on tty2 rather than tty7:

    root@lucid:~/first# for a in gdmdebug_1*; do tar xOf $a tmp/ | grep
    usr/bin/X | head -1; done | cut -c31-34 | sort | uniq -c
         15 tty2
         25 tty7

All tty2 occurences coincide with multiple (1 _or_ 2) messages in the
logs about re-attempting to launch X:

    root@lucid:~/first# for a in gdmdebug_1*; do tar xOf $a
    var/log/daemon.log | grep 'display lasted' | wc -l; done | sort |
    uniq -c
         25 0
          6 1
          9 2

Interestingly, of the failing cases (75% of total boots), two-thirds
required 3 attempts, against one-third requiring two attempts to launch
X. Ah, well.

--------------------------------------------------
(B) with plain vanilla packages but maverick-style startup condition
(from bug 615549). 23 out of 23 land on tty7 just fine

    root@lucid:~/martinfix# for a in gdmdebug_1*; do tar xOf $a tmp/ |
    grep usr/bin/X | head -1; done | cut -c31-34 | sort | uniq -c
         23 tty7

--------------------------------------------------
 (C) in an effort to really torture test this when really reducing the
boot time to the absolute minimum ), i switched of as many services as I
could. Resulting boot time (1 sample) was around 7.20s. Obviously the
maverick fresh install was even faster than that.

Happily, 16 out of 16 boots landed X on tty7

    root@lucid:~/martin_minimal# for a in gdmdebug_1*; do tar xOf $a
    tmp/ | grep usr/bin/X | head -1; done | cut -c31-34 | sort | uniq -c
         16 tty7

All three batches were running the standard lucid package of gdm (manual
modifications done to gdm.conf in B and C):

    root@lucid:~# for a in */gdmdebug_*; do tar xOf $a | grep ^Version;
    done | uniq -c
         79 Version: 2.30.2.is.2.30.0-0ubuntu3

Hence, I would consider the fix in #615549 valid for this bug. Note that
it needs to be back-ported to lucid!
Finally, I think I will report the premature creation of
/var/run/gdm/firstserver.stamp as well as broken handling of gdm start
failure as bugs with gdm.

Thanks for all the thinking along, and especially Martin(?) for coming
up with the root cause.

[1] quote:

    #!/bin/bash
    #Not: set -e

...

    sleep 10
    chvt 1; echo "INFO: auto logging debug info"
   ...

Read more...

Seth (bugs-sehe) wrote :

On 09/15/2010 03:02 AM, Steve Langasek wrote:
> Actually, there might be an easier way. Could you boot with '--verbose'
> added to the boot options
Would that go into grub config? I'm not used to seeing --flags on the
kernel param line?

Jamie Krug (jamie-thekrugs) wrote :

Steve Langasek wrote in #150:
> Thanks. Could you now provide the same files when *not* booting with this
> workaround?

Yes, the attached archive contains two sets of logs, both without vga=799 or any other workaround in place. One has "quiet splash" appended to the boot line, which landed X on tty2 (as expected, and would likely crash eventually). The other has "plymouth:debug" appended to the boot line, which landed X on tty7 (and is generally more stable, rarely crashing--another near workaround). I've included udev, /proc/fb and any relevant X/gdm logs. HTH!

> One curious thing I notice in the plymouth log is that none of the renderers
> load:
>
> [./plugin.c] open_device:could not open '/dev/fb0': No such file or directory
>
> Can you confirm that when booting this way, you get a text-only plymouth
> splash screen?

Yes, with vga=799 I see a text-only splash.

Seth (bugs-sehe) wrote :

for info, I have deleted ppa:bugs-sehe/gdm625239 because I don't think it is relevant anymore. The patch is still available for reference in comment #109

Confirming on Maverick, but before changing the status, is there any other information you would like me to report?

Steve Langasek (vorlon) wrote :

On Wed, Sep 15, 2010 at 06:32:41PM -0000, Mohamed Amine IL Idrissi wrote:
> Confirming on Maverick, but before changing the status, is there any
> other information you would like me to report?

Confirming with what version of gdm? The latest version of gdm in maverick,
2.30.5-0ubuntu4, is believed to correct this issue for systems with the
nvidia binary drivers.

We would also want to see gdm/X logs showing that you're experiencing the
same crash, before considering this a confirmation.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Jamie Krug (jamie-thekrugs) wrote :

Forgive me if I'm missing something, but based on comments from Martin and Steve, and Seth's thorough testing reported in comment #159, my understanding is that I should simply wait for the bug #615549 fix to be back-ported to Lucid. Is this correct? Should I be updating bug #615549 for Lucid in some way, or will this bug continue as a way to get that back-port? I don't want to step on toes of developers or other who know better, but I didn't want to sit here doing nothing if I could help move this along :) Thanks.

John S. Gruber (jsjgruber) wrote :

@Jamie:
Looks like Martin is taking care of it in #615549. If I recall correctly, it will be (has been?) uploaded to the proposed queue and, after testing, it will be released to Lucid.

Seth (bugs-sehe) wrote :

 On 09/20/2010 03:39 PM, John S. Gruber wrote:
> *** This bug is a duplicate of bug 615549 ***
> https://bugs.launchpad.net/bugs/615549
>
> @Jamie:
> Looks like Martin is taking care of it in #615549. If I recall correctly, it will be (has been?) uploaded to the proposed queue and, after testing, it will be released to Lucid.
>
Yep and I vouch for it to work (having done extensive automated boot
testing on both lucid and maverick. On maverick my boot time can be as
low as 5s according to bootchart.)

Displaying first 40 and last 40 comments. View all 167 comments or add a comment.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers