Comment 6 for bug 586901

databubble (phil-linttell) wrote : Re: [lucid] intermittent full system freeze

I've removed the duplicate status of this report (previously marked as a duplicate of #585765). I've been living with this bug for over six months now (running with only one of two cores enabled), and believe that it is quite specific and distinct from the range of issues in #585765.

My system was stable under Karmic and the early builds of Lucid. Starting just before the release of the Lucid beta, and ever since, my system locks up (screen frozen, can't toggle numlock, system fan at high speed, can't ping) unless I boot with the kernel parameter "nolapic", which disables the second CPU core. Sometimes the system will lock up soon after logging in, sometimes hours later. I can ALWAYS trigger a lock-up within a couple of minutes by attempting to rip a DVD using Handbrake (a process which completes successfully when booting with nolapic). This system is generally more likely to freeze quickly under heavy load (CPU/disk).

Occasionally, the log will contain a message similar to the following immediately prior to the freeze:

Oct 13 13:52:44 family kernel: [ 37.408761] do_IRQ: 0.189 No irq handler for vector (irq -1)

The hardware has not changed, and I've verified with memtest that the system memory is fine. I also ruled out a BIOS change: I reverted to a BIOS from 2009 and the problem still happens.

I've verified this still occurs with the released Maverick, and that it still exists in upstream kernel 2.6.36-020636rc7-generic.

I've tested a whole variety of other kernel parameters as part of my investigation, none of which avert the system freezes (variously and individually noapic, acpi=off, clocksource=jiffies, clocksource=tsc, nolapic_timer, clocksource=acpi_pm, i8042.nopnp, nohz=off, acpi_irq_nobalance, pci=nomsi,noaer, acpi_enforce_resources=lax, pci=nocrs, pci=nommconf).

My only other clues are that very occasionally (with two cores enabled) I may see log messages to the effect:
BUG: Soft lockup detected on CPU

and I do regularly get log complaints along the lines of:
Clocksource tsc unstable (delta = -333085419 ns)
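
For anyone who wants to check their own logs for the same three messages, here is a minimal sketch in Python. The patterns are my own loose approximations of the lines quoted above, and the function names (scan, scan_file) and the log path in the usage note are just illustrative assumptions, not part of any existing tool.

```python
import re

# Signatures quoted in this report. The regular expressions are loose
# approximations (assumptions), not an exhaustive list of kernel messages.
PATTERNS = {
    "no_irq_handler": re.compile(r"do_IRQ: \d+\.\d+ No irq handler for vector"),
    "soft_lockup": re.compile(r"BUG: soft lockup", re.IGNORECASE),
    "tsc_unstable": re.compile(r"Clocksource tsc unstable"),
}


def scan(lines):
    """Return (signature name, line) pairs for every line that matches."""
    hits = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((name, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a log file, tolerating any undecodable bytes in it."""
    with open(path, errors="replace") as log:
        return scan(log)
```

Something like scan_file("/var/log/kern.log") would then list the matching lines, assuming that is where the kernel log lives on the system in question.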

I can't really think of anything else to try. It's easy for me to reproduce the lock-up, but the logs are sometimes empty and I don't get any kind of stack or register dump. I need help understanding how to isolate the issue further so that a kernel engineer can analyze it.

Thanks!