[maverick] intermittent full system freeze

Bug #586901 reported by databubble
This bug affects 10 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

This bug may be a duplicate of #569011 (which inexplicably now has a status of fix-released), however there are minor differences.

I have been experiencing occasional (2 or 3 times a day) system hangs under Lucid/10.04 for at least a couple of weeks, with the following symptoms: screen freezes, keyboard/mouse unresponsive (can't toggle keyboard LEDs), can't ssh in, system still responds to pings, can reboot with ALT-SysRq+REISUB. On reboot, I'm unable to locate any relevant error messages in logs from the time the system froze.
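
Whether ALT-SysRq+REISUB is honoured depends on the kernel.sysrq bitmask. The following is a minimal sketch of decoding that mask, assuming the bit values given in the kernel's sysrq documentation; the function names `reisub_keys_allowed` and `read_sysrq_mask` are illustrative and not part of any tool mentioned in this report.

```python
# Bit values of /proc/sys/kernel/sysrq, per the kernel's sysrq documentation.
SYSRQ_BITS = {
    "R": 4,    # unRaw: keyboard control
    "E": 64,   # tErm: signalling of processes
    "I": 64,   # kIll: signalling of processes
    "S": 16,   # Sync
    "U": 32,   # remount read-only
    "B": 128,  # reBoot
}


def reisub_keys_allowed(mask):
    """Return {key: bool} for each REISUB key under the given sysrq mask."""
    if mask == 1:  # exactly 1 means "all SysRq functions enabled"
        return {key: True for key in SYSRQ_BITS}
    return {key: bool(mask & bit) for key, bit in SYSRQ_BITS.items()}


def read_sysrq_mask(path="/proc/sys/kernel/sysrq"):
    """Read the current mask from procfs (standard location assumed)."""
    with open(path) as handle:
        return int(handle.read().strip())
```

For example, a mask of 176 (128 + 32 + 16) would allow S, U and B but not R, E or I.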

This is with kernel 2.6.32-22-generic, and I have also tested with upstream 2.6.34-020634-generic, with the same results. The same hardware was stable under jaunty and karmic.

I have nvidia-current drivers installed; however, the system can still freeze from a text console with X/desktop disabled (sudo stop kdm), so I don't believe it's related to the graphics driver.

The only pattern to the lock-ups I've noticed is that they usually occur during times of heavy CPU load, which usually also means heavy disk load: a large compile, ffmpeg video conversion, running virtualbox, dvd ripping, etc. My system is a dual-core 64-bit Athlon, and I boot from an md-mirrored EXT4 root filesystem.

I have virtualbox-3.2 installed, and while running virtualbox with a WinXP client is a good way to help encourage the freeze, I also experience freezes without virtualbox running. Still, I'll try removing it and seeing if I can reproduce without it.

I have not run a memtest recently, and will do so as soon as I have a window to take the system down, but the hardware has been stable without change for over a year with jaunty/karmic, and there seem to be many users experiencing a range of freezing problems on ubuntuforums:
http://ubuntuforums.org/showthread.php?t=1478787

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-22-generic 2.6.32-22.33
Regression: Yes
Reproducible: No
ProcVersionSignature: Ubuntu 2.6.32-22.33-generic 2.6.32.11+drm33.2
Uname: Linux 2.6.32-22-generic x86_64
NonfreeKernelModules: nvidia
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Architecture: amd64
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: NVidia [HDA NVidia], device 0: VT1708B Analog [VT1708B Analog]
   Subdevices: 2/2
   Subdevice #0: subdevice #0
   Subdevice #1: subdevice #1
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: phil 2441 F.... kmix
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Card0.Amixer.info:
 Card hw:0 'NVidia'/'HDA NVidia at 0xfce78000 irq 20'
   Mixer name : 'Nvidia MCP78 HDMI'
   Components : 'HDA:1106e721,10438345,00100100 HDA:10de0002,10de0101,00100000'
   Controls : 38
   Simple ctrls : 19
Date: Fri May 28 11:35:02 2010
EcryptfsInUse: Yes
Frequency: I don't know.
HibernationDevice: RESUME=UUID=b9ce8951-a8b2-46a2-9da5-305580e86e70
MachineType: System manufacturer System Product Name
ProcCmdLine: root=/dev/md0 ro iommu=noaperture,memaper=4 quiet splash crashkernel=384M-2G:64M,2G-:128M
ProcEnviron:
 LANGUAGE=en_CA:en
 PATH=(custom, user)
 LANG=en_CA.UTF-8
 SHELL=/bin/bash
RelatedPackageVersions: linux-firmware 1.34
RfKill:
 0: hci0: Bluetooth
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
dmi.bios.date: 03/19/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1502
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: M3N78-VM
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1502:bd03/19/2010:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnM3N78-VM:rvrRevX.0x:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

databubble (phil-linttell) wrote :
Jeremy Foshee (jeremyfoshee) wrote :

Hi databubble,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
tags: removed: needs-upstream-testing
databubble (phil-linttell) wrote :

As mentioned, I've tested using upstream kernel 2.6.34-020634-generic, from May 17. However, for good measure I'll test with the latest nightly, linux-image-2.6.34-999-generic_2.6.34-999.201005261208 (May 26) and verify that it hasn't been fixed in the last week.

databubble (phil-linttell) wrote :

BTW, I've also tried booting with "noacpi" on grub kernel command line, and separately pegging the CPU speed via:
echo "3000000" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
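
The command above touches only cpu0. On a dual-core machine each core normally has its own cpufreq directory, so a sketch along the following lines could apply the same minimum to every core. This assumes the usual sysfs layout under /sys/devices/system/cpu; the helper name `peg_min_freq` and its `cpu_root` parameter are hypothetical, and writing to the real sysfs files needs root.

```python
import glob
import os


def peg_min_freq(khz, cpu_root="/sys/devices/system/cpu"):
    """Write khz to scaling_min_freq for every cpuN directory under cpu_root.

    Returns the list of files written, in sorted order.
    """
    # cpu[0-9]* matches cpu0, cpu1, ... but not sibling dirs such as cpufreq/
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_min_freq")
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as handle:
            handle.write(str(khz))
        written.append(path)
    return written
```

Calling `peg_min_freq(3000000)` would then be the all-cores equivalent of the echo above, under those assumptions.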

databubble (phil-linttell) wrote :

Well, that was interesting. Using last night's kernel I got slightly different behaviour.

System locked up within minutes of logging in, before I could even start to put a significant load on it. Slightly different symptoms, however: the mouse never locked. I was able to mouse around as much as I wanted, but got no response from the desktop. Keyboard dead; not even SysRq-REISUB would elicit a response. System responding to network pings, but unable to ssh in. No disk accesses going on at all.

Odd that I could mouse about.

databubble (phil-linttell) wrote :

I've removed the duplicate status of this report (previously marked duplicate of #585765). I've been living with this bug for over six months now (running with only one of two cores enabled), and believe that it is quite specific and distinct from the range of issues in #585765.

My system was stable under Karmic and the early builds of Lucid. Just prior to the release of the Lucid beta, and ever since, my system will lock up (screen frozen, can't toggle numlock, system fan at high speed, can't ping) unless I boot with kernel parameter "nolapic", which disables the second CPU core. Sometimes the system will lock up soon after logging in, sometimes hours after. I can ALWAYS trigger a lock-up within a couple of minutes by attempting to rip a DVD using Handbrake (a process which completes successfully when booting with nolapic). The system is generally more likely to freeze quickly under heavy load (CPU/disk).

Occasionally, the log will contain a message similar to the following immediately prior to the freeze:

Oct 13 13:52:44 family kernel: [ 37.408761] do_IRQ: 0.189 No irq handler for vector (irq -1)
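
To make such lines easier to collect across boots, here is a minimal parsing sketch. It assumes the two numbers after "do_IRQ:" are the CPU number and the interrupt vector, which is how the x86 do_IRQ message is believed to be formatted; the names `DO_IRQ_RE` and `parse_unhandled_vector` are illustrative only.

```python
import re

# Matches "do_IRQ: <cpu>.<vector> No irq handler for vector" (assumed layout).
DO_IRQ_RE = re.compile(
    r"do_IRQ: (?P<cpu>\d+)\.(?P<vector>\d+) No irq handler for vector"
)


def parse_unhandled_vector(line):
    """Return (cpu, vector) from a do_IRQ complaint, or None if not present."""
    match = DO_IRQ_RE.search(line)
    if match is None:
        return None
    return int(match.group("cpu")), int(match.group("vector"))
```

Under that assumption, the line above would read as vector 189 arriving on CPU 0 with no handler registered.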

The hardware has not changed, and I've verified that the system memory is fine with memtest. I also verified that it wasn't a change in BIOS, as I reverted to a BIOS from 2009 to verify that the problem still happens.

I've verified this still occurs with the released maverick, and that it still exists in upstream kernel 2.6.36-020636rc7-generic.

I've tested a whole variety of other kernel parameters as part of my investigation, none of which avert the system freezes (variously and individually noapic, acpi=off, clocksource=jiffies, clocksource=tsc, nolapic_timer, clocksource=acpi_pm, i8042.nopnp, nohz=off, acpi_irq_nobalance, pci=nomsi,noaer, acpi_enforce_resources=lax, pci=nocrs, pci=nommconf).

My only other clues are that very occasionally (with two cores enabled) I may see log messages to the effect:
BUG: Soft lockup detected on CPU

and I do regularly get log complaints along the lines of:
Clocksource tsc unstable (delta = -333085419 ns)
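
For scale, the delta in that message is about -333 ms, roughly a third of a second. A small sketch for pulling these values out of a log and converting them to milliseconds follows; the regular expression is written against the message format shown above, and the names are illustrative.

```python
import re

# Matches "Clocksource <name> unstable (delta = <nanoseconds> ns)".
CLOCKSOURCE_RE = re.compile(
    r"Clocksource (?P<name>\w+) unstable \(delta = (?P<delta>-?\d+) ns\)"
)


def clocksource_deltas_ms(lines):
    """Yield (clocksource, delta in milliseconds) for each 'unstable' message."""
    for line in lines:
        match = CLOCKSOURCE_RE.search(line)
        if match:
            yield match.group("name"), int(match.group("delta")) / 1e6
```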

I can't really think of anything else to try. It's easy for me to reproduce the lock-up, but the logs are sometimes empty and I don't get any kind of stack or register dump. I need help to understand how to isolate the issue further so that a kernel engineer can analyze it.

Thanks!

summary: - [lucid] intermittent full system freeze
+ [maverick] intermittent full system freeze
Saivann Carignan (oxmosys) wrote :

I get the same behavior. During heavy operations (many computers making filesystem changes through NFS, controlling a remote computer through VNC, VirtualBox with Windows XP running) the computer freezes: the mouse and num lock stop working for seconds, and any file transfer with remote computers hangs, then resumes. Any Skype call gets noisy, or is cut off temporarily or permanently. My computer boots from an mdadm RAID-0, LUKS-encrypted ext4 filesystem. I have 12 GB of memory, tested with memtest, and the SMART state of my two hard drives is perfect. This problem did not happen with lucid, and it has been happening with maverick for the past two weeks.

Tiago (toazz) wrote :

Does linux-image-2.6.35-22-generic correct this?

databubble (phil-linttell) wrote :

Toal,

linux-image-2.6.35-22-generic does not correct it. Nor does any upstream kernel I've tested up to 2.6.36-020636rc7-generic correct it.

Thanks,
Phil

Tiago (toazz) wrote :

In my case, although the screen and mouse pointer freeze, sound keeps playing, background tasks keep running and SysRq keys still work. Virtual terminals (alt+ctrl+F#) are accessible after an alt+sysrq+re. Could this be an xorg issue instead of a kernel one?

Tiago (toazz) wrote :
databubble (phil-linttell) wrote :

Toal,

I don't believe my bug has anything to do with xorg.... I can trigger it by using handbrake (CLI) to rip a DVD with no X server running.

Just testing it again (using linux 2.6.35-22)...

1) booted without nolapic kernel parameter
2) switched to tty1
3) login and "sudo stop kdm"
4) begin ripping a DVD with handbrake CLI

After a couple of minutes of ripping, the application reports a segfault.

The shell prompt re-appears, but within a couple of seconds, the screen goes blank and no further response is to be had (no keyboard lights, no ping responses, etc.)

Again, this bug isn't about ripping DVDs; that's just my sure-fire way of triggering it. If I boot without nolapic (so both cores enabled), then, after a few minutes to a few hours, the system will lock solid during normal usage. The system is solid with nolapic, and was solid under karmic.

Tiago (toazz) wrote :

I managed to solve this just by reinstalling maverick from release CD. I was previously using an updated-from-lucid maverick installation.

databubble (phil-linttell) wrote :

Interesting that a re-install fixed it in your case. I currently run a 1GB md mirrored root filesystem (without a separate /home) and re-installing is a big project - I'll need an extra drive before I can attempt it.

I tried to boot Kubuntu live from a USB stick and see if I could trigger a failure. I'll need a larger memory stick before I can install the necessary software to test a "clean" install.

In the meantime, I'm hoping someone will eventually chime in with some suggestions as to how I can isolate the failure.

Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu development release http://cdimage.ubuntu.com/daily-live/current/ . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired