System locks up at boot with 686 and 686-smp kernels

Bug #8341 reported by Jason Toffaletti
Affects: linux-source-2.6.15 (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Fabio Massimo Di Nitto
Milestone: (none)

Bug Description

Ever since I installed ubuntu, this machine hasn't gone a single day without
needing a reboot. I don't have any special kernel modules installed. The only
non-default thing I can think of is that I used reiserfs for the root partition
and ext2 for boot.

Usually the machine dies after I've gone home and I come in the next day to find
it is either completely frozen (can't even ctrl-alt-f1 to a term) or it's so
slow it literally takes a minute for a single keystroke to show up on the term
(I've never been able to use X11 after this has happened). The one time I was
able to run top, it took about 5 minutes to load and it showed negative numbers
(e.g. -0.1) for per-process CPU usage, and the load averages were all over 1.0 even
though 99% of CPU usage was idle time. Today this actually happened while I was
using the machine. I had firefox and gnome-terminal open using ssh. First I
noticed that my mouse (IBM USB mouse) was registering two clicks for every
single click. Then very shortly after the keyboard (IBM PS/2) stopped working
and I had to reboot using the power button on the case.

This is an IBM machine with a P4 CPU 2.60GHz w/ HT. It previously had a 78 day
uptime running fedora core 1 with a 2.4.2x-nplt smp kernel. So I have no reason
to suspect hardware, only the kernel. I'm willing to try patches and compile my
own kernel and do whatever else I can to help get this resolved.

Jones Lee (2529386) wrote :

I have the same problems running with HT enabled. I recompiled the kernel without
HT support and found no problems.

Matt Zimmerman (mdz) wrote :

Is anything logged to /var/log/kern.log? Magic sysrq doesn't seem to be enabled
on i386, though it is on powerpc and amd64; that might have been worth a try.

This will be difficult to track down without more specific information about
what is happening when the system hangs.

Is there any disk activity when it is in this state?
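
For readers following along, the two checks raised above (whether the kernel logged anything, and whether magic SysRq is compiled in) might be sketched in shell as below. This is only a sketch: the function name sysrq_built_in is hypothetical, and it assumes the kernel config is installed as /boot/config-<release>, which is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel config file $1 enables magic SysRq.
sysrq_built_in() {
    grep -q '^CONFIG_MAGIC_SYSRQ=y' "$1"
}

# Assumed location of the running kernel's config (not confirmed by the thread).
config="/boot/config-$(uname -r)"
if [ -r "$config" ]; then
    if sysrq_built_in "$config"; then
        echo "magic SysRq is compiled in"
    else
        echo "magic SysRq is not compiled in"
    fi
fi

# Show the most recent kernel messages, if the log exists and is readable.
if [ -r /var/log/kern.log ]; then
    tail -n 50 /var/log/kern.log
fi
```

After a hang and reboot, the tail of kern.log is the interesting part, since anything the kernel managed to write before locking up would be there.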

Herbert Xu (herbert-gondor) wrote :

(In reply to comment #0)
>
> This is an IBM machine with a P4 CPU 2.60GHz w/ HT. It previously had a 78 day
> uptime running fedora core 1 with a 2.4.2x-nplt smp kernel. So I have no reason
> to suspect hardware, only the kernel. I'm willing to try patches and compile my
> own kernel and do whatever else I can to help get this resolved.

Please attach your dmesg. Please also try booting with nosmp.

Thanks,

Jason Toffaletti (jason) wrote :

Created an attachment (id=178)
dmesg from 2.6.8.1-2-686-smp kernel boot

Herbert Xu (herbert-gondor) wrote :

(In reply to comment #4)
> Created an attachment (id=178)
> dmesg from 2.6.8.1-2-686-smp kernel boot

Thanks for the dmesg. Please try nosmp and tell me if the problem goes away.

When it occurs, can the system still write to disk? If so please hit
CTRL-SCROLLLOCK and SHIFT-SCROLLLOCK on the console and send me the resulting
messages.

Jason Toffaletti (jason) wrote :

I added the nosmp option to /boot/grub/menu.lst:
...
kernel /vmlinuz-2.6.8.1-2-686-smp root=/dev/hda3 ro quiet splash nosmp
initrd /initrd.img-2.6.8.1-2-686-smp
...
and the machine refused to boot, it halted on loading initrd, right after
decompressing the kernel image I believe.

I'm now trying the plain 686 kernel instead of 686-smp, and seeing if that makes
a difference.

Jason Toffaletti (jason) wrote :

The 686 kernel just froze up on me and did the same thing. I forgot to do
CTRL-SCROLLLOCK and SHIFT-SCROLLLOCK, but I'm sure it will happen again soon.
Another strange side effect is that when I do something that causes the pc
speaker to beep, the beep lasts about 30 seconds. It's very annoying :)

Herbert Xu (herbert-gondor) wrote :

(In reply to comment #7)
> The 686 kernel just froze up on me and did the same thing. I forgot to do
> CTRL-SCROLLLOCK and SHIFT-SCROLLLOCK, but I'm sure it will happen again soon.
> Another strange side effect is that when I do something that causes the pc
> speaker to beep, the beep lasts about 30 seconds. It's very annoying :)

Please boot with noacpi, nolapic, and noapic. Please post the dmesg again.

Herbert Xu (herbert-gondor) wrote :

(In reply to comment #8)
> Please boot with noacpi, nolapic, and noapic. Please post the dmesg again.

Please reply.

Jason Toffaletti (jason) wrote :

Created an attachment (id=226)
dmesg w/ noacpi, nolapic, and noapic

Here is my grub entry:

title Ubuntu, kernel 2.6.8.1-2-686-smp
root (hd0,0)
kernel /vmlinuz-2.6.8.1-2-686-smp root=/dev/hda3 ro quiet splash noacpi nolapic noapic
initrd /initrd.img-2.6.8.1-2-686-smp
savedefault
boot
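
One way to confirm that options from a grub entry like this actually reached the running kernel is to look at /proc/cmdline. The sketch below is illustrative only; has_boot_option is a hypothetical helper name, not something used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if option $1 appears as a whole word in the
# kernel command line $2 (defaults to the running kernel's command line).
has_boot_option() {
    opt=$1
    cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    case " $cmdline " in
        *" $opt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on each option discussed in this bug.
for opt in noacpi acpi=off nolapic noapic; do
    if has_boot_option "$opt"; then
        echo "$opt: present"
    else
        echo "$opt: absent"
    fi
done
```

Note that this only shows an option was passed. It does not show that the kernel recognised or acted on it, which is what the dmesg output is for.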

Herbert Xu (herbert-gondor) wrote :

(In reply to comment #10)
> Created an attachment (id=226)
> dmesg w/ noacpi, nolapic, and noapic

Sorry, can you please change noacpi to acpi=off and repeat?

Jason Toffaletti (jason) wrote :

Today is the first day since installing Ubuntu that I've arrived at work and my
machine didn't need a reboot. And I haven't had to reboot all day, so the
nolapic and noapic options seemed to help. I'm going to reboot right now with
noacpi changed to acpi=off, and I'll attach the dmesg. Just for good measure.

Jason Toffaletti (jason) wrote :

Created an attachment (id=243)
dmesg w/ noacpi changed to acpi=off

Herbert Xu (herbert-gondor) wrote :

(In reply to comment #12)
> Today is the first day since installing Ubuntu that I've arrived at work and my
> machine didn't need a reboot. And I haven't had to reboot all day, so the
> nolapic and noapic options seemed to help. I'm going to reboot right now with

Good. Can you please try 2.6.8.1-9 with no options and post the dmesg with it?
Thanks.

Jason Toffaletti (jason) wrote :

I tried -10 because it was the latest at the time and it still gave me problems,
so I had to reboot with the options to turn off apic and acpi. I didn't have a
chance to save the dmesg. I was super busy at work and I just needed my machine
to be usable.

Herbert Xu (herbert-gondor) wrote :

OK, when you get around to it please post the dmesg. Thanks.

Jason Toffaletti (jason) wrote :

Created an attachment (id=338)
dmesg for 2.6.8.1-11

Installed latest kernel, booted with default options. I'll let you know if this
freezes.

Jason Toffaletti (jason) wrote :

Came in this morning and machine was locked up.

Herbert Xu (herbert-gondor) wrote :

OK, can you please boot with nmi_watchdog=1 and set up a serial/network console
so that errors are logged even if you're in X? The other option would be to stay
in a text console and try to reproduce the hang. Perhaps you can leave it on
tty1 overnight or something.
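
As a rough illustration of the request above, the sketch below builds a kernel line that adds nmi_watchdog=1 and a serial console. Several things here are assumptions rather than facts from this thread: the helper name add_boot_options is hypothetical, the serial port is assumed to be ttyS0 at 115200 baud, and "quiet splash" is left off the example line so that messages stay visible.

```shell
#!/bin/sh
# Hypothetical helper: print kernel line $1 with each remaining argument
# appended, skipping options that are already present.
add_boot_options() {
    line=$1
    shift
    for opt in "$@"; do
        case " $line " in
            *" $opt "*) ;;                 # already there, leave as is
            *) line="$line $opt" ;;
        esac
    done
    printf '%s\n' "$line"
}

# Assumed port and speed; adjust to the machine's actual serial setup.
add_boot_options "kernel /vmlinuz-2.6.8.1-2-686-smp root=/dev/hda3 ro" \
    nmi_watchdog=1 console=tty0 console=ttyS0,115200n8
```

The printed line would go into /boot/grub/menu.lst in place of the existing kernel line, with a second machine capturing the serial output.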

Matt Zimmerman (mdz) wrote :

(In reply to comment #19)
> OK, can you please boot with nmi_watchdog=1 and set up a serial/network console
> so that errors are logged even if you're in X? The other option would be to stay
> in a text console and try to reproduce the hang. Perhaps you can leave it on
> tty1 overnight or something.

Can you try this please?

Keith Irwin (keith-keithirwin) wrote :

I'm having the same problems, so I'm adding myself to the CC list.

Jason Toffaletti (jason) wrote :

Created an attachment (id=716)
dmesg w/ nmi_watchdog=1

I booted with nmi_watchdog=1 on friday and left the machine on tty1 all
weekend. Nothing additional was logged to the console, but the machine
exhibited the same horribly slow performance, 60 second pc speaker beeps,
negative numbers for CPU usage in top, frozen mouse, and loss of keyboard key
repeat. I've attached the dmesg.

Keith Irwin (keith-keithirwin) wrote :

Running latest Hoary, seem to have the same problems.

When I boot the 686-smp kernel, the machine boots fine, but my keyboard no
longer exists.

The kernel is: linux-image-2.6.8.1-3-686-smp.

Adding "noacpi nolapic noapic" produced the same results.

Changing to "acpi=off nolapic noapic" --> same thing.

However, I can at least ssh in to the machine to look at stuff so it seems
everything else is working fine. Let me know if there's anything I can try.

Keith Irwin (keith-keithirwin) wrote :

Not sure if this matters, but I don't have any "kbd" variants loaded as modules.
(BTW, I'm booting in non graphical mode (GDM), if that matters.)

Keith Irwin (keith-keithirwin) wrote :

Lastly,

I think my "warty" install using this same package works just fine.

However, it's the "hoary" version I'm having trouble with. Odd, eh?

Matt Zimmerman (mdz) wrote :

(In reply to comment #23)
> Running latest Hoary, seem to have the same problems.
>
> When I boot the 686-smp kernel, the machine boots fine, but my keyboard no
> longer exists.

That doesn't sound like the same problem; didn't the original reporter describe
a hang, rather than a non-functional keyboard? And at semi-random times, rather
than immediately at boot?

> The kernel is: linux-image-2.6.8.1-3-686-smp.

Please send the output from "dpkg -l linux-image-2.6.8.1-3-686-smp" and "dmesg".

> Adding "noacpi nolapic noapic" produced the same results.
>
> Changing to "acpi=off nolapic noapic" --> same thing.
>
> However, I can at least ssh in to the machine to look at stuff so it seems
> everything else is working fine. Let me know if there's anything I can try.

Jason Toffaletti (jason) wrote :

This sounds like a completely different bug. When I boot everything is fine,
only after a few hours does the machine become unusable.

Keith Irwin (keith-keithirwin) wrote :

Created an attachment (id=736)
dmesg

dmesg for the keyboard/mouse lockup boot on the 686-smp kernel

Keith Irwin (keith-keithirwin) wrote :

Created an attachment (id=737)
dpkg -l linux-image-2.6.8.1-3-686-smp

dpkg output

Keith Irwin (keith-keithirwin) wrote :

Attachments above.

Well, the subject says, "kernel flawed" which is what I think I'm experiencing....

This morning I rebooted the box after it ran all night on the 386-up kernel, and
magically got the keyboard back. Nice! So then I ran "startx" and now I don't
have a mouse. The only way I can tell I have a keyboard is Ctrl-Alt-<right/left
arrow> seems to work.

From this I conclude that it *might* be the same problem.

Matt Zimmerman (mdz) wrote :

(In reply to comment #30)
> Attachments above.
>
> Well, the subject says, "kernel flawed" which is what I think I'm experiencing....

An unfortunately vague title which I fixed at the same time that I responded to
your comment. Please file your bug separately.

Keith Irwin (keith-keithirwin) wrote :

If a separate bug seems useful, I'm happy to do it. ;)

Chuck Short (zulcss) wrote :

Could you try with a more recent kernel? You can download a CD image to test if
that is what you wish to do.

Regards
chuck

Jason Toffaletti (jason) wrote :

I tried booting a hoary live cd, but it had all kinds of video problems that
made it unusable. The second time I tried to boot, it locked up. It seemed like
a different problem than what I opened this bug for though. Once hoary is a
little closer to release, I'll try out the 2.6.10 and 2.6.11 kernels from hoary.
This is my work machine, so I can't risk downtime hunting down bugs. So far
booting with acpi=off noapic nolapic has worked great, this machine had a 90 day
uptime before I rebooted to try the livecd. I realize that isn't the ideal
solution, but it works.

Matt Zimmerman (mdz) wrote :

(In reply to comment #34)
> I tried booting a hoary live cd, but it had all kinds of video problems that
> made it unusable. The second time I tried to boot, it locked up. It seemed like
> a different problem than what I opened this bug for though. Once hoary is a
> little closer to release, I'll try out the 2.6.10 and 2.6.11 kernels from hoary.
> This is my work machine, so I can't risk downtime hunting down bugs. So far
> booting with acpi=off noapic nolapic has worked great, this machine had a 90 day
> uptime before I rebooted to try the livecd. I realize that isn't the ideal
> solution, but it works.

Hoary doesn't have much time left before release. If you're having problems
with the video on the live CD, they need to be reported as bugs so that they can
be fixed. It's working well for us, so if it isn't working for you, we need
your help to diagnose it. Otherwise, Hoary may not work properly on your hardware.

Jason Toffaletti (jason) wrote :

I tried out the Hoary preview live cd and the video problems were fixed. I also
installed hoary on an extra HD this weekend and left it running after booting
the 686-smp kernel 2.6.10.3-5. When I got in to work this morning the machine
seemed fine, but as soon as I tried to launch firefox it displayed the same
super slow behavior I initially described when reporting this bug.

Jason Toffaletti (jason) wrote :

This problem might have been a buggy BIOS. I had IT check for a BIOS update
again and IBM posted one on 3/11/2005. I installed it, and I'll let you know if
the problems go away.

Matt Zimmerman (mdz) wrote :

2.6.10.3-5 is not a valid kernel version number; which version did you test?

dpkg-query -W --showformat '${Version}' linux-image-`uname -r`
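
Since several kernel-related packages carry different version numbers, it may help to print the package name next to each version. The following is a sketch only: kernel_flavour is a hypothetical helper, and the 'linux-restricted-modules*' pattern is an assumption about how the restricted-modules packages are named.

```shell
#!/bin/sh
# Hypothetical helper: strip "<version>-<abi>-" from a kernel release
# string, e.g. 2.6.10-4-686-smp -> 686-smp.
kernel_flavour() {
    printf '%s\n' "${1#*-*-}"
}

release=$(uname -r)
flavour=$(kernel_flavour "$release")

# Only query when dpkg-query is available (Debian/Ubuntu systems).
if command -v dpkg-query >/dev/null 2>&1; then
    # Image package, metapackage, and (assumed name pattern) restricted modules.
    dpkg-query -W --showformat '${Package} ${Version}\n' \
        "linux-image-$release" "linux-image-$flavour" \
        'linux-restricted-modules*' || true
fi
```

With the package name on every line, a version such as the metapackage's cannot be mistaken for the kernel image's.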

Jason Toffaletti (jason) wrote :

(In reply to comment #38)
> 2.6.10.3-5 is not a valid kernel version number; which version did you test?
>
> dpkg-query -W --showformat '${Version}' linux-image-`uname -r`

2.6.10.3-5 was the package version number. So far, updating the BIOS seems to
have worked. I'll know for sure after leaving the machine over night.

Matt Zimmerman (mdz) wrote :

(In reply to comment #39)
> (In reply to comment #38)
> > 2.6.10.3-5 is not a valid kernel version number; which version did you test?
> >
> > dpkg-query -W --showformat '${Version}' linux-image-`uname -r`
>
> 2.6.10.3-5 was the package version number. So far, updating the BIOS seems to
> have worked. I'll know for sure after leaving the machine over night.

If so, then I have no idea where you got that kernel, but it didn't come from
Ubuntu.

Please run the command above and send the output so that we know which kernel
you are using.

Jason Toffaletti (jason) wrote :

Sorry, that was the version number of the linux-image-686-smp meta package. The
version of the linux-image-2.6.10-4-686-smp package was 2.6.10-26, but when I
did the BIOS update I installed 2.6.10-27.

Matt Zimmerman (mdz) wrote :

(In reply to comment #41)
> Sorry, that was the version number of the linux-image-686-smp meta package. The
> version of the linux-image-2.6.10-4-686-smp package was 2.6.10-26, but when I
> did the BIOS update I installed 2.6.10-27.

There has never been a 2.6.10.3-5 of the metapackage, either. Are you certain
that you are using unmodified Ubuntu packages?

Fabio Massimo Di Nitto (fabbione) wrote :

(In reply to comment #42)
> (In reply to comment #41)
> > Sorry, that was the version number of the linux-image-686-smp meta package. The
> > version of the linux-image-2.6.10-4-686-smp package was 2.6.10-26, but when I
> > did the BIOS update I installed 2.6.10-27.
>
> There has never been a 2.6.10.3-5 of the metapackage, either. Are you certain
> that you are using unmodified Ubuntu packages?

Probably 2.6.10.3-5 is coming from linux-restricted-modules.

Fabio

Jason Toffaletti (jason) wrote :

Yes, it was the restricted modules package version, I was pretty tired yesterday
(long weekend, two cross country red eye flights). Anyway, the BIOS upgrade
seems to be working great so far. This is the first time the machine has ever
stayed usable for two days without passing "acpi=off, nolapic, noapic" on boot.
According to hal, the BIOS version was 2AKT32AUS and after the upgrade it is
2AKT48AUS. It is a Phoenix made BIOS, IBM machine. I'm not sure there is much
else to do with this bug and it can be closed as far as I'm concerned. I feel
bad that it had little to do with Ubuntu and you guys wasted time trying to help
me. Maybe the kernel can do a better job of detecting buggy acpi support or
maybe you guys can use hal to detect the BIOS version number and give a warning
to users in the future. Other than that this bug has probably been a waste of time.
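
The BIOS-version warning suggested above could look roughly like the sketch below. It is an illustration under assumptions: it reads the version from the DMI sysfs file rather than from hal, the helper bios_is_known_bad is hypothetical, and the "known bad" list holds only the one version implicated in this thread.

```shell
#!/bin/sh
# The only BIOS version this thread implicates; space-separated list.
KNOWN_BAD_BIOS="2AKT32AUS"

# Hypothetical helper: succeed if version $1 is in the known-bad list.
bios_is_known_bad() {
    for bad in $KNOWN_BAD_BIOS; do
        [ "$1" = "$bad" ] && return 0
    done
    return 1
}

# Assumed source: the DMI sysfs file, available on reasonably modern kernels.
bios_file=/sys/class/dmi/id/bios_version
if [ -r "$bios_file" ]; then
    version=$(cat "$bios_file")
    if bios_is_known_bad "$version"; then
        echo "warning: BIOS $version has been reported to cause lockups; consider updating"
    else
        echo "BIOS $version is not on the known-bad list"
    fi
fi
```

A real implementation would need a maintained list of affected versions, which this thread does not provide.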

Jason Toffaletti (jason) wrote :

It seems the BIOS upgrade has fixed this issue, you can close the bug as far as
I'm concerned.

Chuck Short (zulcss) wrote :

Closing
