[Dell Precision M4500] offlining and then re-onlining CPUs makes the system unresponsive

Bug #907454 reported by Daniel Manrique
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Fix Committed
Importance: Medium
Assigned to: Unassigned

Bug Description

This is a fully-updated (from -proposed) 11.10 i386 system.

While testing the 3.0.0-15 kernel we came across this problem, where one of our tests, which just offlines and then re-onlines all the CPUs, caused the system to become sluggish, then unresponsive.

Steps to reproduce:
1- Offline all the CPUs by running echo 0 > /sys/devices/system/cpu/cpu?/online
2- Online all the CPUs by running echo 1 > /sys/devices/system/cpu/cpu?/online
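
The two steps above are shorthand: a shell redirect writes to one file at a time, so in practice the write has to be repeated for each CPU. A minimal sketch of that loop follows. It is not the checkbox cpu_offlining script mentioned in this report, and the CPU_SYSFS variable is a hypothetical override added here only so the loop can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Sketch only: write 0 or 1 to every per-CPU "online" file, one file per redirect.
# CPU_SYSFS is a hypothetical override (not part of the report) for dry runs
# against a scratch directory; by default the real sysfs path is used.

set_cpus() {
    # $1: 0 to offline, 1 to online
    root="${CPU_SYSFS:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/online; do
        # Skip an unmatched glob; CPUs that cannot be hot-unplugged
        # (often cpu0) have no "online" file and are never matched.
        [ -e "$f" ] || continue
        if echo "$1" > "$f"; then
            echo "wrote $1 to $f"
        else
            echo "could not write $1 to $f" >&2
        fi
    done
}

cycle_cpus() {
    set_cpus 0   # offline every hot-pluggable CPU
    set_cpus 1   # bring them all back online
}
```

Calling cycle_cpus as root performs both steps. On the affected kernels this would be expected to trigger the hang described below, so it is best tried only on a test machine.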

Expected behavior:
- System behaves normally
- Syslog has a record of the offlining and subsequent rebooting of all the CPUs

Actual behavior:
- Some of the CPU offlining commands don't exit or take a long time to exit
- System starts being sluggish, slows down to a crawl and eventually becomes pretty much unusable
- The gnome-terminal "greys out" and is completely unresponsive
- I can SSH into the system and do some stuff, but anything even moderately intensive just blocks (e.g. "sync" just never exits).
- Syslog shows messages about a starved canary thread and lots of kernel oopses. I'll attach kernel logs.

I tested several kernels to further diagnose this:

3.0.0-14 works FINE, CPU offlining/onlining leaves the system in a working state.
3.0.0-15 FAILS as described.
3.2.0-6.12 from Precise FAILS as described.
3.2.0-rc6 from mainline FAILS as described.

This system is used primarily for testing so please don't hesitate to ask for any further tests or information.

Besides the kernel logs, I'll also attach the test script we use for testing, though it's available in /usr/share/checkbox/scripts/cpu_offlining on any 11.10 system.

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: linux-image-3.0.0-15-generic 3.0.0-15.24
ProcVersionSignature: Ubuntu 3.0.0-15.24-generic 3.0.13
Uname: Linux 3.0.0-15-generic i686
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 1.23-0ubuntu4
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: STAC92xx Analog [STAC92xx Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: ubuntu 1808 F.... pulseaudio
 /dev/snd/controlC0: ubuntu 1808 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xd9660000 irq 48'
   Mixer name : 'IDT 92HD81B1C5'
   Components : 'HDA:111d76d5,1028040c,00100104 HDA:14f12c06,14f1000f,00100000'
   Controls : 15
   Simple ctrls : 10
Card1.Amixer.info:
 Card hw:1 'NVidia'/'HDA NVidia at 0xd3080000 irq 17'
   Mixer name : 'Nvidia GPU 0d HDMI/DP'
   Components : 'HDA:10de000d,10de0101,00100100'
   Controls : 16
   Simple ctrls : 4
Date: Wed Dec 21 12:33:32 2011
HibernationDevice: RESUME=UUID=0410f3bd-3218-46df-9452-8c0e6f2b0731
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release i386 (20111012)
MachineType: Dell Inc. Precision M4500
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.0.0-15-generic root=UUID=90bc737c-ff97-4da2-b2e5-2f48a4484ca0 ro quiet splash initcall_debug vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.0.0-15-generic N/A
 linux-backports-modules-3.0.0-15-generic N/A
 linux-firmware 1.60
SourcePackage: linux
StagingDrivers: mei
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 01/30/2010
dmi.bios.vendor: Dell Inc.
dmi.bios.version: X51
dmi.board.name: RAMDEL
dmi.board.vendor: Dell Inc.
dmi.board.version: 0001
dmi.chassis.type: 9
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrX51:bd01/30/2010:svnDellInc.:pnPrecisionM4500:pvr0001:rvnDellInc.:rnRAMDEL:rvr0001:cvnDellInc.:ct9:cvr:
dmi.product.name: Precision M4500
dmi.product.version: 0001
dmi.sys.vendor: Dell Inc.

Daniel Manrique (roadmr) wrote :

---
Ubuntu Bug Squad volunteer triager
http://wiki.ubuntu.com/BugSquad

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

@Daniel

Does this only happen on the Dell Precision M4500, or does it also happen on other machines?

Also, would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . If possible, please test the latest v3.2-rc6 kernel (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

tags: added: kernel-da-key regression-update
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: needs-upstream-testing
Daniel Manrique (roadmr) wrote :

Hi Joseph,

We tested about 40 systems for the 3.0.0-15 kernel. The M4500 was the only one that exhibited this problem.

I anticipated your request to test mainline 3.2.0-rc6 :) As I mentioned in the original report, I already tested with that kernel and found the same problem. I attached the logfile for that kernel (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/907454/+attachment/2642920/+files/3.2.0rc6-syslog). Based on your instructions I'll remove needs-upstream-testing.

Thanks!

tags: removed: needs-upstream-testing
tags: added: kernel-bug-exists-upstream
mtvoid (mtvoid) wrote :

I am seeing this problem too on my desktop PC (DP55WB with Core i5 750 processor), running 11.10. After the upgrade, the system no longer managed to resume from suspend, and would just freeze. I tried debugging the suspend by writing to /sys/power/pm_test: it worked up to the 'platform' level, but setting 'processors' caused the system to fail to recover from suspend, so I suspected an SMP-related issue. I then manually offlined a CPU, but trying to put it back online afterwards does not succeed and leads to lots of kernel errors.

The problem exists in the 3.0 kernel in oneiric, and I've even installed the 3.2-rc7 kernel from precise, but the issue remains. I'm attaching the relevant section from the syslog (The first two lines are produced after I offline CPU 3, the rest after I issue the command to bring it back online).
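
For anyone wanting to repeat the pm_test check described in this comment, a minimal sketch of one test cycle is shown below. It assumes a kernel built with CONFIG_PM_DEBUG, which is what provides /sys/power/pm_test. It is not the exact command sequence used above, and the POWER_SYSFS variable is a hypothetical override added only for dry runs.

```shell
#!/bin/sh
# Sketch only: run one suspend test cycle at a given pm_test level.
# POWER_SYSFS is a hypothetical override for dry runs against a scratch
# directory; the default is the real /sys/power.

pm_test_cycle() {
    # $1: pm_test level, e.g. freezer, devices, platform, processors, core
    root="${POWER_SYSFS:-/sys/power}"
    echo "$1" > "$root/pm_test" || return 1   # choose how far the suspend goes
    echo mem > "$root/state" || return 1      # start the test suspend-to-RAM
    echo none > "$root/pm_test"               # restore normal suspend behaviour
}
```

Running pm_test_cycle platform and then pm_test_cycle processors as root steps through the two levels mentioned above. If the machine hangs at a level, the final reset to none is never reached, so pm_test may need resetting after a reboot-free recovery.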

Brendan Donegan (brendan-donegan) wrote :

Daniel (or mtvoid)

The Oneiric kernel had a patch reverted and was updated to 3.0.0-15.25; I'm not sure if the two issues are related but it might be worth testing again just to make sure the problem is still there.

Thanks,

Daniel Manrique (roadmr) wrote :

Brendan,

I tested 3.0.0-15.25 and it works fine: the offlining script is able to offline and re-online all CPUs with no ill effects, slowdowns or nasty errors in dmesg. So it looks like the reverted patch was also causing this problem on this particular machine.

mtvoid (mtvoid) wrote :

Yes, that patch was the cause of the problem and is what was also breaking resume from suspend. After upgrading to 3.0.0-15.25, the problem's gone. I guess this bug can be closed now.

Changed in linux (Ubuntu):
status: Confirmed → Fix Committed