Ubuntu

Active VT tracking can fail at startup

Reported by Chris Coulson on 2010-03-22
252
This bug affects 59 people
Affects Status Importance Assigned to Milestone
ConsoleKit
Confirmed
Medium
consolekit (Ubuntu)
High
Colin Watson
Lucid
High
Colin Watson
Maverick
High
Colin Watson

Bug Description

Impact: ConsoleKit sometimes fails to determine which VT is active, breaking many parts of the system. For example, any dialog that requires PolicyKit authentication cannot be unlocked.
Development branch: Fixed in https://launchpad.net/ubuntu/+source/consolekit/0.4.1-4ubuntu1 by retrying console opens if they return EIO.
Patch: https://bugs.launchpad.net/ubuntu/+source/consolekit/+bug/544139/+attachment/1728928/+files/consolekit_0.4.1-3ubuntu2.debdiff
TEST CASE: Unfortunately, this bug is not uniformly reproducible, and may take many reboot attempts or even not be reproducible at all on any given system. If it is reproducible, then you can either try unlocking an administration dialog such as System -> Administration -> Time and Date (which will fail with a broken version), or (quicker) check for the string "Error waiting for native console" in /var/log/daemon.log. A successful fix will always permit a user with administrative privileges to unlock administrative dialogs.
Regression potential: When it breaks (not necessarily every time), consolekit is effectively completely broken. The test case should be sufficient to ensure that it is working properly.

Original description follows (note that the discussion about why EINVAL was being returned does not correspond to the end result of investigating this bug, but is preserved here for the record):

Binary package hint: consolekit

A few times over the last couple of days, I've noticed some weird consolekit issues where it doesn't correctly determine which VT is active, causing a lot of things to break (eg, disk mounting, rebooting, suspending etc). The issue is solved by rebooting.

When it fails, I get a lot of messages in my daemon.log when consolekit starts:

WARNING: Error waiting for native console 5 activation: Invalid argument

This occurs because the following call fails with EINVAL:

ioctl (console_fd, VT_WAITACTIVE, num);

I discussed this with Scott on #ubuntu-desktop. To summarize, there is a window between GDM starting and the X server coming up where the ioctl that consolekit does on the VT's will fail. Unfortunately, consolekit starts around the time of this window. Here is the log:

<chrisccoulson> Keybuk - i mentioned a consolekit issue last week, and your name was mentioned there
<chrisccoulson> that might have been what you remember
<Keybuk> can you remember more about what you mentioned?
<chrisccoulson> Keybuk - a couple of times when I booted last week, consolekit was unable to determine what the active VT was
<chrisccoulson> and it was throwing out errors like this:
<chrisccoulson> WARNING: Error waiting for native console 5 activation: Invalid argument
<Keybuk> right
<Keybuk> but why is consolekit using that ioctl?
<Keybuk> that's only used when you switch VT
<chrisccoulson> Keybuk - it spawns a thread for each VT, which waits for it to become active
<chrisccoulson> so it can track where the active one is
<Keybuk> ok
<Keybuk> it'll fail with -EINVAL for a short period during boot
<Keybuk> does it correctly back-off from that, and restart the thread again later?
<Keybuk> (if it goes into an infinite loop, that's not good either)
<chrisccoulson> Keybuk - no, that's probably the issue really. once it has failed, it just gives up
<chrisccoulson> so, we probably need to fix consolekit then?
<Keybuk> yeah
<Keybuk> we caused X to have the same bug
<chrisccoulson> ah, ok. that makes sense. and that explains why i can't recreate it all the time
<Keybuk> you get -EINVAL from VT_WAITACTIVE in a very specific condition
<Keybuk> the current foreground VT is in KD_GRAPHICS mode, but also VT_AUTO
<Keybuk> ie. it's been left with painted graphics ... but no process running on it
<Keybuk> since it's in graphics mode, the kernel prohibits VT switches
<Keybuk> can you guess when that condition is true?
<chrisccoulson> do you know how long it's in that condition for?
<Keybuk> chrisccoulson: however long the X server takes to start ;-)
<Keybuk> couple of seconds usually
<chrisccoulson> oh, right. that seems obvious now :)
<chrisccoulson> Keybuk - so the window is quite large then (and I think consolekit is activated after GDM starts isn't it?)
<chrisccoulson> i think gdm is the first thing to use it anyway
<Keybuk> chrisccoulson: gdm activates it
<chrisccoulson> yeah, i thought so
<chrisccoulson> thanks
<Keybuk> which means it's activated "before X starts or while X is starting"
<Keybuk> ie. exactly in that window
<Keybuk> chrisccoulson: so, on the VT_WAITACTIVE+VT_AUTO thing ... you could kinda argue it's a kernel bug
<Keybuk> because the kernel bug should deal with that case on its own
<Keybuk> but the kernel guys will tell you that the whole VT_* stuff is a mess, and they'd rather leave it alone
<chrisccoulson> yeah, it might be easier to work around it in consolekit for now
<Keybuk> exactly

Changed in consolekit (Ubuntu):
importance: Undecided → High
status: New → Triaged
description: updated
Graham Crumb (gcrumb) wrote :

Given that I'm still seeing these symptoms on an up-to-date Lucid installation, I'm assuming there's been no movement on this.

Can anyone suggest a more elegant work-around than 'log in and out a few times until things work again'?

Jacek Kałucki (laborm) wrote :

I can't.
I noticed, that "Login screen configuration" can't be unlocked simultaneously.

Hendrik Knackstedt (hennekn) wrote :

Hey guys!
I'm also having this problem but it didn't appear for a long time. Now, after the recent kernel update, it happens almost every time I start my laptop and it won't go away even if I completely shut down. Before, I was able to get it to work if I shut down and restarted. Can anyone confirm this worsening after the kernel update?

Tell me, if you need any more information regarding this issue.

Hendrik Knackstedt (hennekn) wrote :

Isn't there any workaround for this bug? Is there a way to downgrade the kernel to the previous version? There is nothing working anymore: no sound, no power management, no screensaver, I can't plug in any USB devices because it cannot be mounted and I can only shutdown using the terminal.

Also I have this bug in my Ubuntu 10.04 update 32 bit, but only sometimes this occurs....Any workarounds???

Are you using the current nvidia driver, by any chance? This sounds exactly like the problem that I had with the 'current' version. I downgraded to v. 173 and the problems went away. My symptoms were not being able to mount drives, power management problems, shutting down issues, no sound (that I noticed - there may have been more).

FYI: My computer hasn't suffered the symptoms described in this bug
report for over 2 months, so for my part this bug is fixed.

Hendrik Knackstedt (hennekn) wrote :

@ Anthony Long: I have an ATI Radeon X1250 (RS690M chipset), so I don't have the nvidia driver installed.

This bug still exists on my computer and appears almost every time since the last kernel update! Only workaround right now is to completely shutdown, start again and hope that it works this time. Log in and out does not work for me, restart doesn't work either!

Maybe the importance should be set to critical since it makes the system unusable?

Damien (takahara) wrote :

Same problem here, on all our production machines. Like Chris I have the following messages in daemon.log:

WARNING: Error waiting for native console X activation: Invalid argument

Only for me X grows from 1 to 62, so I have 62 messages there. (Oddly, the two last ones are 62 *then* 61)

Symptoms for the user are (AFAICT):
- no sound
- no mounting of external drives
- impossibility to power down/suspend (in my case the power off and suspend options are not visible in the system menu)
- many (if not all) tasks that require admin privileges cannot be performed. For instance, the user-admin interface does not ask for the admin password; its controls are simply inoperative.

The bug is already visible in GDM, in that it does not show the suspend option in the power menu (lower right icon). GDM does show the power off option, but clicking it does nothing.

I second Hendrik on this, this bug should be set to critical:
- the a lack of functionality resulting from this bug is _very_ bad. Even "commercial" operating systems don't behave like that ;-)
- this happens every now and then, which is even more frustrating and hard to track down for a any user
- a stock 10.04 install has this problem.
- this has been reported multiple times already
- this bug is 3 months old and no fix seems to have been released (if there is one please let us know :-)

I don't know heck about consolekit but I will certainly be happy test any solution that will be proposed.

Let's squish this ugly bug!!!

Alex Bird (alex-alexbird) wrote :

Happens to me every single boot on Acer Aspire One (110, ZG5, 1GB RAM) since the last time I ran update manager (e.g. most recent updates right now 10 July 2010). I assume this did a kernel upgrade. I never experienced this problem before. I am unsure how to try an earlier kernel without the grub screen appearing as it does in desktop ubuntu.
I've been telling everyone how great UNE is, but now I'm about ready to throw it out of the window.

Sepero (cowpie2000) wrote :

A Possible Solution

This bug seems very related to these two:
https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/543506
https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/554172

This solution here is what worked for me. If it works for you, please help spread it to others.
http://ubuntuforums.org/showpost.php?p=9480172&postcount=20

Damien (takahara) wrote :

Sepero,

Thanks a lot for your input. I'm testing your solution right now. Currently I have 30 reboots with no problems, which is a good sign.

I first tried 2) and 3) only, but it failed after 10 reboots (no power button etc...) Trying with all 3 now and it seems to work.

However, I can now see this error appearing twice:

ureadahead-other main process (###) terminated with status 4

where ### is a proc number.

I will report later when I have performed more reboots/shutdowns/boots cycles.

Crossing fingers!!

Damien (takahara) wrote :

Two remarks:

1) you will have to reactivate your network connection(s) the first time you login after doing the 3 changes proposed by Sepero. Obviously purging ifupdown has something to do with that... ;-)

2) the ureadahead errors that I mentioned previously seem to happen just after X starts (between a screen flash and my nvidia logo)

I have now 40 reboots and... no problems! Could this be It??

Damien (takahara) wrote :

After 106 reboots the bug appeared.... twice! I really thought it was gone for good 'cause nothing happened until the 90th reboot. But apparently something is still lurking out there...

In any case it's a nice improvement, thanks for the hack!! ( ^^)/ At least now we have a usable system.
For "linux grade" stability we'll need a proper fix in consolekit I guess...

Mike Bianchi (mbianchi-foveal) wrote :

Damien, Sepero, and all ...

If you look at https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/543506/comments/83 you will see
that I also recommend
    Add "init='/sbin/init --verbose'" to the linux boot line in grub.

I believe this helps by forcing "/sbin/init" to serialize starting up /etc/init/*.conf scripts ever-so-slightly and I found
that it helped avoid runlevel "unknown" boots.

Damien, please give --verbose a try and report on how you do. The fact that your are keeping a successful boot count should be very useful in verifying if this helps.

Please note that these are "fixes" are all get-arounds. What we desperately need is for someone to figure out why the writes to the console are occasionally failing!

Hendrik Knackstedt (hennekn) wrote :

Ok, I'm gonna try the work-around suggested in the forum. bug 554172 does not look like it's the same I'm seeing though because restart doesn't fix the problem only a total shutdown and then start again sometimes fixes the problem temporarily.

@DD: I can see the power off and suspend buttons but every time I click them it just takes me back to the login screen. Does that mean it's a different issue?

Damien (takahara) wrote :

Mike,

I don't have the runlevel-unknown issue, at least not that I think of.
But anyway, I'll try your extra hack today and see if it helps. Problem is I may have to do a good 200 reboots to be sure it helps so it's going to take some time. Stay tuned...

Damien (takahara) wrote :

Hendrik, all,

('DD' and 'Damien' are the same person ;-)

Is the power-off behaviour your only problem? Can you still perform admin tasks and have sound when your problem occurs? If the answer is yes then it is likely a different issue, at least partly. But I think there's quite a few bugs related to the same problem; the cause may be the same.

BTW, is there any way to ditch the new 'fast' boot process of ubuntu? It's not really stable at the moment, to say the least. If we could go back to a classic sequential boot sequence it may just solve all these problems.

Hendrik Knackstedt (hennekn) wrote :

No it's not the only problem, I can't power-off or restart, sound does not work, cups is not started and some other services like apache2 are not started either. I haven't tried to perform admin tasks yet so I don't know about that. How can I edit my linux boot line? Using grub2, so there is no menu.lst!

Damien (takahara) wrote :

Hi all,

Here are some preliminary results...

1) After setting Mike's "verbose" option the errors came back on 11 reboots out of 51 (~20%)

2) So I removed the "verbose" option, updated and installed grub and rebooted a few times. Errors were still there (7/42=16%), although I got more at the begining of my test than at the end. Not exactly the result I expected!

I have now reflashed the hard drive with the original image I had and I'm going to estimate the number of bad boots from this un-hacked system. Then try Sepero's hack again. Maybe the 1/50 error rate I reported earlier was not accurate ??

Damien (takahara) wrote :

Hendrik,

For admin tasks, try to add a user through the GUI users-admin. I don't know about apache or cups, but I'm not surprised they fail too.

Before editing your bootloader, first try without that hack (e.g. only Sepero's 3 hacks). Then if it doesn't work for you this is the way to edit your bootloader:

1) sudo gedit /etc/default/grub

In there, change the line

GRUB_CMDLINE_LINUX=""

to

GRUB_CMDLINE_LINUX=" init='/sbin/init --verbose' "

save this and then,

2) sudo update-grub

3) sudo grub-install <your drive device, e.g. /dev/sda>

then reboot a few times to see how it works.

Damien (takahara) wrote :

The results are in for today: a standard linux distro has this problem 7% os the time (7 bad boots, 93 good boots)

This is not so far from the 2/100 obtained before. I'm not so sure the hacking does any effect after all...

I'll try the hacks tomorrow, but I don't have much time to get this bug fixed so I may try an install of Debian instead ;-(

Damien (takahara) wrote :

I have no better news at this point so I contacted the consolekit guys to see what they think.

Damien (takahara) wrote :

According to an answer on the consolekit list, "GDM shouldn't be activating ConsoleKit until after the X server is up".
This may thus be a GDM bug (not waiting for X "ready" state before launching consolekit)

I have thus added GDM as affected package.

Damien (takahara) wrote :

Ubuntu being based on Debian, I've tried "testing" (aka squeeze). The good news is that it works without a glitch, the bad one is that I am switching all our production systems to Debian.

FYI, here's a comparison of relevant packages versions on Squeeze and Lucid:

Debian Squeeze:
--------------------------
GDM: 2.30.2-4
Consolekit: 0.4.1-4
policykit-1: 0.96.2
Kernel: 2.6.32-5

Ubuntu Lucid:
--------------------------
GDM: 2.30.2.is.2.30.0-0ubuntu2 (definitely the worst version string I've ever seen)
Consolekit: 0.4.1-3ubuntu1
policykit-1: 0.96.2
Kernel: 2.6.32-23-generic

Tobias Johansson (tj-mensa) wrote :

I reinstalled ubuntu on a computer. Before I had this issue maybee 1/5 startupps. But now (even with the workarounds) i would say its more 7/10 times.. I just started to use linux 2 month ago but Im starting to wonder if switching was a good idea.

Damien (takahara) wrote :

Tobias,

If you have a spare partition, could you try Debian "testing"? Since you experience this bug almost systematically, it would be interesting to see if Debian solves your problem like it did for me.
http://www.debian.org/devel/debian-installer/

Also, which video drivers do you use? If you have an nvidia card, try this:
lsmod | grep nvidia

It may be that the nvidia proprietary drivers make X slower to boot, resulting in The Bug.
Another thing: is your root partition on an SSD?

Damien (takahara) wrote :

(Chris Coulson, are you there? Is there anyone knowledgeable listening to this thread anymore?)

Some more investigations today: if I build consolekit from the latest source and install that, then the error is SYSTEMATIC.
Reinstalling the ubuntu package (--reinstall) does not change a thing, the error persists. I also noted that every 2-3 boot the X server will not even start.

Is the ubuntu source tree for consolekit available somewhere? After looking at the sources a fix seems trivial but I can't build a .deb that would install cleanly.

Thanks...

OC (olivier-c) wrote :

I confirm this on an HP Compaq nc6000 laptop, Ubuntu 10.04 i386, fresh install, both as installed and after updates.
My rate of "bad boots" is roughly 50% (based on roughly 20 restarts).

I installed Maverick alpha3 as a dual-boot. After 20 restarts I cannot reproduce the problem under Maverick.

I attach a syslog of a "bad boot" and one of a "normal boot".

OC (olivier-c) wrote :

Tried Mike Bianchi's proposed solution in comment 81 of Bug #543506.
As a result, 2 bad boots in a series of 20. This is five times less than before.

Also, marking the present bug as duplicate of Bug #554172, please correct me if mistaken.

Damien (takahara) wrote :

OC,

These bugs look different: they seem to have runlevel problems. Some comments point to this bug though, like #15 of Bug #543506.

There certainly is quite a lot of confusion with all these boot bugs.
This guy nailed exactly how messy all this is:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554172/comments/140

I'm personally amazed that a bug for which there is an apparent solution (like this one) is not fixed after FOUR months.
Heck, there isn't even anyone assigned to it, even if its priority is high!

"On Launchpad, no one can hear you scream..."

skorqa (skorqa) wrote :

Hi all,
I'm also affected by this bug, or at least encounter the exact same symptoms as Damien described in #9
The machine is a laptop Acer Aspire 1692 and the kernel is 2.6.32.24-generic-pae
The distrib is an up to date Lucid 10.04, ( fresh install + updates)
I did not perform enough boots to extract a significant statistical data about failing / successful boots, but i m attaching a few logs
of good and bad boots. for some of them i set 'debug init='/sbin/init --verbose' ' kernel options.

Like Damien i'm not sure how much this bug is related to #554172 but an upstart fix has been committed in lucid-proposed, so i suggest some of us install that fix to see if it has any effects on this bug. I did it and i still did not notice the bug after a few boots (not enough to be significant). The logs of these one are attached as well.

If the upstart fix reveals not to work in this case, do you think that a temporary work around could be downgrading gdm / consolekit and kernel to debian squeeze version, or do you think it would break the distribution stability ?

Cheers

Fred

skorqa (skorqa) wrote :
skorqa (skorqa) wrote :
skorqa (skorqa) wrote :
skorqa (skorqa) wrote :
skorqa (skorqa) wrote :

well, unfortunately the bug reappears after ~20 reboots, even with the new upstart patch...

regards

f

Damien (takahara) wrote :

Thanks Skorka, now we are sure it's not a duplicate :)

The good thing about this bug is that I'm now testing many other distros (Debian, CentOS, Fedora...) for our multitouch system and I'm seeing lots of good things there... "♪♪ Always look on the bright side of life... ♪♪"

Chris Coulson (chrisccoulson) wrote :

Sorry, this seems to have dropped off the radar a bit. I will take a look at this when I get some time, but I'm busy with other things at the moment. Somebody else feel free to take it if they get an opportunity to work on it before I do

Changed in consolekit (Ubuntu Lucid):
assignee: nobody → Chris Coulson (chrisccoulson)
milestone: none → ubuntu-10.04.2
Damien (takahara) wrote :

Hi Chris,

Nice to see you're back! :-)

I have tried a few things in the sources of consolekit but unfortunately installing my new deb can make things worse "sometimes" (i.e. X refuses to start every other time or so, even if I make no changes in the sources). Maybe some config files are not properly cleaned up during the upgrade? Well, I don't know what's going on there but I'll continue to probe.

In the meantime if you have a specific fix to suggest (code?) or even a .deb to test I'll be happy to stress-test it a few hundred times to see if it solves something.

Cheers,

Damien

tags: added: patch
Damien (takahara) on 2010-08-31
Changed in consolekit (Ubuntu Lucid):
status: Triaged → In Progress
64 comments hidden view all 144 comments
Lennie (lbesselink) wrote :

Damien,

Let me know what you'll find, because then we'll get a good idea of how bad it really is.

The new ppa has a loop which has 30 times 0.2 seconds (I changed the usleep also this time).

So that's 6 seconds, that ought to be enough ?

If you have any other code/suggestions let me know.

Damien (takahara) wrote :

Lennie,

I finished rebooting "a few times" and here's a histogram of what I've got.

It basically falls into 3 categories: less than 100ms (majority of cases), about 600~1000ms and close to 2000ms. I could not get anything >2 seconds, which is a good sign.

Given the wide range I would suggest starting the usleep value at 20ms and multiplying it by 2 every time it fails, with the max delay of 1 sec (what do you mean I'm going to 11? ;) ). That way we keep a good reaction time without polling hundreds of times. So we would check at the following accumulated times: 0 (first retry), 20, 60, 140, 300, 620, 1260, 2260, 3260,..., 10260. If it fails after those 10 seconds then we should have a phatt warning in the logs. A debug message telling the total time of the waiting loop would be nice too (if the loop is triggered, of course)

moojix (moojix) wrote :

just FYI:
the latest ppa (0.4.1-3ubuntu1-lbesselink7) works for me.

grep "but failed" /var/log/syslog | wc -l
0

grep "console-kit" /var/log/syslog | grep -v "VT_WAITACTIVE"
Sep 1 11:49:26 st-002 console-kit-daemon[1368]: WARNING: did not fail
Sep 1 11:49:26 st-002 console-kit-daemon[1368]: WARNING: skipping: 7

Damien (takahara) wrote :

moojix,

How often did the bug appeared with the stock ubuntu packages? If it was not systematic you'll need to reboot a few (or a lot) of times to trigger the bug and see how it's handled by Lennie's PPA.

moojix (moojix) wrote :

Damien,

i had the symptoms (no automount of external drives and shutdown/restart not working as user) described above in this bug at
99% of my reboots (On my new box since 4 weeks nearly with every daily boot - it's a workstation, not a server). As I had other priorities on this new box (i870 with ssd) I had just ignored this UI shutdown-problem ( sudo shutdown -h 0 works well :) ).
As i am not familiar with consolekit and boot process, I have no clue for which string or which startup sequence I need to look with the stock ubuntu consolekit package.
I have not systematic tested with the stock Ubuntu package, what's the difference in the error-case and the good-case was.

If Lennie is interested in, he can build a ppa with debug logging, but without the re-try fix, so I can try to reproduce the bug on my box with a nearly original Lucid-consolekit package, but with more logging.

Additionally what could be helpful: I will reboot several times with the original stock Ubuntu consolekit package to be sure, that I can reproduce the bug on this box (to validate my 99% failure rate without any fix...).

moojix (moojix) wrote :

okay, I've used the original Lucid consolekit package (0.4.1-3ubuntu1) for some reboot tests.
I can reproduce the symptoms: 1. reboot: okay - 2. .. 5. reboot failed with the symptoms.

1. reboot syslog (ok)
egrep "console" /var/log/syslog_20100901_2 | grep -v activation
Sep 1 18:16:04 st-002 rsyslogd-2039: Could no open output file '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Sep 1 18:16:04 st-002 kernel: [ 0.000000] console [tty0] enabled

2. reboot syslog (failed)
egrep "console" /var/log/syslog_20100901_3 | grep -v activation
Sep 1 18:20:03 st-002 rsyslogd-2039: Could no open output file '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Sep 1 18:20:03 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 1 18:20:03 st-002 init: Failed to open system console: Input/output error
Sep 1 18:20:04 st-002 console-kit-daemon[1050]: WARNING: Could not determine active console

Good. Now I'll try to reproduce this timing problem with Lennies PPA.

moojix (moojix) wrote :

I've repeated the tests with Lennies ppa (0.4.1-3ubuntu1-lbesselink7)

In 9/10 reboots this is my syslog:
grep console /var/log/syslog | grep -v VT_WAITACTIVE
Sep 1 19:00:26 st-002 rsyslogd-2039: Could no open output file '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Sep 1 19:00:26 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 1 19:00:26 st-002 init: Failed to open system console: Input/output error
Sep 1 19:00:30 st-002 console-kit-daemon[1364]: WARNING: did not fail
Sep 1 19:00:30 st-002 console-kit-daemon[1364]: WARNING: skipping: 7

in 1/10 reboot this is my syslog:
grep console /var/log/syslog_20100901_9 | grep -v VT_WAITACTIVE
Sep 1 18:58:08 st-002 rsyslogd-2039: Could no open output file '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Sep 1 18:58:08 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 1 18:58:09 st-002 console-kit-daemon[1147]: WARNING: did not fail
Sep 1 18:58:09 st-002 console-kit-daemon[1147]: WARNING: skipping: 1
Sep 1 18:58:09 st-002 console-kit-daemon[1147]: WARNING: skipping: 7

In all 10/10 reboots I have no symptoms as I've got with the original stock Lucid package (missing external drives etc.).
I do not understand how "init: Failed to open system console: Input/output error" is related.

Lennie (lbesselink) wrote :

moojix do you have one that does not say the following ?: did not fail

moojix (moojix) wrote :

Lennie,
no, unfortunately zero. all say with your ppa "did not fail".

Lennie (lbesselink) wrote :

Then I don't think I actually solved it, just slowed it down enough so it didn't fail.

I'm already looking at a diff between 0.4.1-3ubuntu1 and 0.4.1-3ubuntu1-lbesselink7

This morning the strace I tried also looked good. I don't think it's the debug/warning logging to syslog.

Maybe it is the calls to the debug-functions ? I did already change them from warning to debug, so it would not try to log anything through syslog.

An other day, an other ppa.

Lennie (lbesselink) wrote :

After a whole evening running tests and no failures (and thus no way to test the fix), the new ppa gave this result after less then 10 reboots:

Sep 2 00:21:06 laptop console-kit-daemon[1111]: WARNING: tried but failed
Sep 2 00:21:06 laptop console-kit-daemon[1111]: WARNING: eventually choose to use 12
Sep 2 00:21:06 laptop console-kit-daemon[1111]: WARNING: first fd was 3
Sep 2 00:21:06 laptop console-kit-daemon[1111]: WARNING: retried 1 times

So the workaround/test works for me.

moojix (moojix) wrote :

Lennie, I made 10 reboot tests. Looks good to me:

8x:
Sep 2 09:16:04 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 2 09:16:04 st-002 init: Failed to open system console: Input/output error
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: tried but failed
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: eventually choose to use 12
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: first fd was 3
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: retried 1 times
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: skipping: 7

2x
Sep 2 09:17:40 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 2 09:17:41 st-002 console-kit-daemon[1161]: WARNING: did not fail
Sep 2 09:17:41 st-002 console-kit-daemon[1161]: WARNING: skipping: 1
Sep 2 09:17:41 st-002 console-kit-daemon[1161]: WARNING: skipping: 7

0.4.1-3ubuntu1-lbesselink9 works for me, too.

Lennie (lbesselink) wrote :

Good very good, so now I can get back to talking to upstream about this. How they want to put in a real fix.

Damien (takahara) wrote :

Thanks Lennie! (^-^)v

I just understood why the console #3 was popping up all the time. That brings two newbie questions:

- why is nothing done if all no console could be obtained? IMHO the first warning should be printed from ck_get_a_console_fd. In fact, shouldn't the retry loop be in ck_get_a_console_fd?

- why is the console fd search stopped at 3? We often have values above that (11, 12...) so would it work if we just extended the search to 12 (or whatever the limit is)?

Lennie (lbesselink) wrote :

Damien, all good questions, but I got a better response from upstream:

Atleast /dev/tty0 should work, find out why it isn't.

Damien (takahara) wrote :

Well, I just answered my second question by trying it. It doesn't work.

But I still think that ck_get_a_console_fd should return ERROR and not a bogus fd.
This is fine tuning though...

Lennie (lbesselink) wrote :

Concerning your last question, I think a default kernel might have something like 65536 possible values for that and you don't want to check them all. Probably it can even be raised with a kernel-compile.

Lennie (lbesselink) wrote :

Your 3rd question:

No1 likes that part of the code, I've got the feeling that part will go away.

Colin Watson (cjwatson) wrote :

With reference to comment 110 in particular, I wonder if this is a consequence of bug 554172.

Colin Watson (cjwatson) wrote :

How about version 0.4.1-4ubuntu1~ppa2 in https://launchpad.net/~cjwatson/+archive/ppa? It should build soon. This retries the open() of console fds on EIO.

Colin Watson (cjwatson) wrote :

Could anyone please test the packages in that PPA? If it works and I get positive feedback, I can probably still just about squeeze it into 10.10.

moojix (moojix) wrote :

Colin, sorry. At the moment I can test for Lucid only. Anyone else?

Lennie (lbesselink) wrote :

Ohh, hi Colin, sorry I missed your first 2 messages.

I have a few remarks:

1. won't the timing be different on Maverick ? You still need to be able to reproduce the problem first.

2. Would it not be a good idea to build a ppa for Lucid so people can test that ?

3. I've been really busy with other things. I'm sad to say. I have some code I wanted to test to find out what the cause of the problem is and get a fix upstream.

4. Maybe I should schedule some time for that tonight or tomorrow night. As it's only 12 days left.

Colin Watson (cjwatson) wrote :

I've uploaded a matching Lucid version.

Colin Watson (cjwatson) wrote :

Lennie, I suspect that if you read through bug 554172 it will be enlightening. It explains why opening /dev/console (/dev/tty0 will be the same) can sometimes fail with EIO.

Lennie (lbesselink) wrote :

It was quiet a long bug report, but it made sense to me.

I wasn't able to test, the laptop I use for testing was in use this evening.

moojix (moojix) wrote :

I rebooted my box (Lucid) 10 times.
Good, I cannot reproduce the symptoms described in #9 or #109.
zero errors. all okay. no fail or error in syslog.
The Lucid consolekit package 0.4.1-3ubuntu2~ppa1 from cjwatson (#124 and #128) works for me.

Colin Watson (cjwatson) wrote :

OK, I'll go ahead then, and we can revisit if further problems arise. Thanks, and particular thanks to Lennie for tireless debugging.

Colin Watson (cjwatson) wrote :

I'll backport this to Lucid after it's spent some time being tested in Maverick.

Changed in consolekit (Ubuntu Maverick):
assignee: nobody → Colin Watson (cjwatson)
milestone: none → ubuntu-10.10
Changed in consolekit (Ubuntu Lucid):
assignee: Chris Coulson (chrisccoulson) → Colin Watson (cjwatson)
Changed in consolekit (Ubuntu Maverick):
status: Triaged → In Progress
Lennie (lbesselink) wrote :

Thanks Colin.

I had some time and access to the laptop again.

So I created a small patch on top of your patch to prove it what happends.

I did 10 boots and 4 times the code was triggered, 6 times it did not trigger,
but it seems those were the times it did not need to be triggered.

So I think that proves this works.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package consolekit - 0.4.1-4ubuntu1

---------------
consolekit (0.4.1-4ubuntu1) maverick; urgency=low

  * 10-retry_console_open_eio.patch: Retry console opens if they return EIO,
    since this may happen while a tty is closing (see LP #554172, and thanks
    to Lennie and others for extensive debugging work; LP: #544139).
 -- Colin Watson <email address hidden> Thu, 30 Sep 2010 14:49:11 +0100

Changed in consolekit (Ubuntu Maverick):
status: In Progress → Fix Released
Colin Watson (cjwatson) wrote :

Thanks a lot, Lennie!

Colin Watson (cjwatson) on 2010-11-10
description: updated
1 comments hidden view all 144 comments

Accepted consolekit into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in consolekit (Ubuntu Lucid):
status: In Progress → Fix Committed
tags: added: verification-needed
Martin Pitt (pitti) wrote :

It seems nobody forwarded this upstream? *sigh*, I'll clean up after this and do that, so that we can get rid of that delta.

Martin Pitt (pitti) on 2010-11-20
affects: gdm → consolekit
Changed in consolekit:
importance: Undecided → Unknown
status: New → Unknown
Changed in consolekit:
status: Unknown → Confirmed
Martin Pitt (pitti) wrote :

Any testers for the lucid update? At least to verify that it still works?

Verification for Lucid.

I've verified that the package upgrades correctly from a default Lucid installation and that after the installation the system reboots, that X and the network are working. If there are specific verifications to do, let me known.

Marking as verification-done.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package consolekit - 0.4.1-3ubuntu2

---------------
consolekit (0.4.1-3ubuntu2) lucid-proposed; urgency=low

  * 10-retry_console_open_eio.patch: Retry console opens if they return EIO,
    since this may happen while a tty is closing (see LP #554172, and thanks
    to Lennie and others for extensive debugging work; LP: #544139).
 -- Colin Watson <email address hidden> Wed, 10 Nov 2010 12:18:29 +0000

Changed in consolekit (Ubuntu Lucid):
status: Fix Committed → Fix Released
Changed in consolekit:
importance: Unknown → Medium
Martin Pitt (pitti) wrote :

Upstream requests updating this patch:

----------------
Hmm, so I have discussed this with some folks, and it seems we do need
something like this as the kernel will not help us here.

However, I don't think the patch is acceptable: there are other reasons why EIO
might be returned. Since we cannot distuingish the error causes we need at
least a safe way out. My suggestion would be to retry opening 20 times and then
give up. In systemd I have settled to use 20 times with 50ms in between,
totalling to 1s in total.

Could you please update the patch? I'll then merge it.
-------------------

Didn't work. Installed thunar and i can mount filesystems as a normal user using thunar

Displaying first 40 and last 40 comments. View all 144 comments or add a comment.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Related questions

Remote bug watches

Bug watches keep track of this bug in other bug trackers.