kernel oops in free_task

Bug #1214931 reported by Olli Ries
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

dist upgrade to Saucy on 8/21, running 3.11.0-3

[ 0.000000] Linux version 3.11.0-3-generic (buildd@roseapple) (gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-9ubuntu1) ) #7-Ubuntu SMP Tue Aug 20 15:21:58 UTC 2013 (Ubuntu 3.11.0-3.7-generic 3.11.0-rc6)

system works for a few minutes, then stalls completely

I was able to capture dmesg output (see the attached picture below) shortly before the hard freeze. The attached kern.log should have the same information, starting at 8:07:35

other attachments were created via ubuntu-bug linux prior to the 8:07:35 crash

system works fine on
root@minime:/var/log# uname -a
Linux minime 3.10.0-6-generic #17-Ubuntu SMP Fri Jul 26 18:29:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

ProblemType: Bug
DistroRelease: Ubuntu 13.10
Package: linux-image-3.11.0-3-generic 3.11.0-3.7
ProcVersionSignature: Ubuntu 3.11.0-3.7-generic 3.11.0-rc6
Uname: Linux 3.11.0-3-generic x86_64
ApportVersion: 2.12.1-0ubuntu2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: lightdm 1609 F.... pulseaudio
 /dev/snd/controlC0: lightdm 1609 F.... pulseaudio
Date: Wed Aug 21 08:00:05 2013
HibernationDevice: RESUME=UUID=8e954843-0e0e-415b-a54f-0c1185f5475e
InstallationDate: Installed on 2013-06-13 (69 days ago)
InstallationMedia: Ubuntu 13.04 "Raring Ringtail" - Release amd64 (20130424)
MachineType: LENOVO 4286CTO
MarkForUpload: True
ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.11.0-3-generic root=UUID=a532432e-5fb3-45bf-8fb2-5b3aa286e005 ro i915.i915_enable_rc6=1 i915.i915_enable_fbc=1 i915.lvds_downclock=1 quiet splash vt.handoff=7
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.11.0-3-generic N/A
 linux-backports-modules-3.11.0-3-generic N/A
 linux-firmware 1.113
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/07/2011
dmi.bios.vendor: LENOVO
dmi.bios.version: 8DET50WW (1.20 )
dmi.board.asset.tag: Not Available
dmi.board.name: 4286CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr8DET50WW(1.20):bd07/07/2011:svnLENOVO:pn4286CTO:pvrThinkPadX220:rvnLENOVO:rn4286CTO:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 4286CTO
dmi.product.version: ThinkPad X220
dmi.sys.vendor: LENOVO

Revision history for this message
Olli Ries (ories) wrote :
Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Olli,

We can perform a kernel bisect to identify the commit that introduced this bug. It will require testing about 7 - 10 test kernels. We first need to identify the first bad and last good kernel versions. The current Saucy kernel was based off of v3.11-rc6. Can you test v3.11-rc1 to see if that is the release candidate that first introduced this? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc1-saucy/

If v3.11-rc1 does not have the bug, we would want to test v3.11-rc2, then 3.11-rc3 and so on, until we hit the first bad kernel:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc2-saucy/
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc3-saucy/

If 3.11-rc1 does have the bug, we would want to test v3.10 final, just to confirm it does not have the bug:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.10-saucy/
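
For a rough sense of where an estimate like "7 - 10 test kernels" comes from: a bisect halves the range of candidate commits with each test, so the number of kernels to build grows with the base-2 logarithm of the number of commits in the range. A minimal Python sketch of that arithmetic follows; the commit counts are hypothetical values chosen for illustration, not the real size of any range in this bug.

```python
import math

def bisect_steps(num_commits: int) -> int:
    """Approximate number of test kernels needed to bisect num_commits commits."""
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    if num_commits == 1:
        return 0  # nothing left to narrow down
    # Each tested kernel rules out roughly half of the remaining candidates.
    return math.ceil(math.log2(num_commits))

# Hypothetical range sizes, for illustration only.
for n in (128, 500, 1024):
    print(n, "commits ->", bisect_steps(n), "test kernels")
```

Under this estimate, 7 to 10 test kernels would correspond to a range of roughly a hundred to a thousand commits.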

tags: added: performing-bisect
Changed in linux (Ubuntu):
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Revision history for this message
Chris J Arges (arges) wrote :

Looking at the log I see the following:
mce: [Hardware Error]: Machine check events logged
You should install mcelog, and see if it can retrieve more details about the hardware failure.

I also have an X220, and I can run with 3.11.0-3 without it oopsing within a few minutes. Perhaps this is something specific with your configuration or a hardware issue that caused the problem.

Revision history for this message
Olli Ries (ories) wrote :

I have done the following steps as per Leann's suggestion:

disable vboxdrv:
root@minime:~# grep vboxdrv /etc/modprobe.d/blacklist.conf
blacklist vboxdrv

remove i915 grub cmdline parameters:
/etc/default/grub
# GRUB_CMDLINE_LINUX="i915.i915_enable_rc6=1 i915.i915_enable_fbc=1 i915.lvds_downclock=1"
followed by a run of update-grub
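
To double-check after a reboot that the i915 parameters are really gone, the kernel command line can be inspected. The following is a minimal Python sketch; the helper name module_params is made up for this example, and the sample string is the ProcKernelCmdLine value from the bug description. On a live system the same function could be fed the contents of /proc/cmdline.

```python
def module_params(cmdline: str, module: str) -> dict:
    """Return {param: value} for 'module.param=value' tokens on a kernel command line."""
    params = {}
    prefix = module + "."
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params

# Command line copied from ProcKernelCmdLine in the bug description.
CMDLINE = ("BOOT_IMAGE=/boot/vmlinuz-3.11.0-3-generic "
           "root=UUID=a532432e-5fb3-45bf-8fb2-5b3aa286e005 ro "
           "i915.i915_enable_rc6=1 i915.i915_enable_fbc=1 "
           "i915.lvds_downclock=1 quiet splash vt.handoff=7")

print(module_params(CMDLINE, "i915"))
```

An empty dict after the change would indicate that no i915 parameters are being passed any more.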

Revision history for this message
Olli Ries (ories) wrote :

re comment #5 && note to self:
I am not sure that v3.11-rc1 is a good starting point, as the last working kernel I am running on the system is 3.10.0-6-generic.

starting with 3.10.9-031009 in the search for the last working one

Revision history for this message
Olli Ries (ories) wrote :

no issues with:
olli@minime:~$ uname -a
Linux minime 3.10.9-031009-generic #201308201935 SMP Tue Aug 20 23:36:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message
Olli Ries (ories) wrote :

no issues with:
olli@minime:~$ uname -a
Linux minime 3.11.0-031100rc1-generic #201307141935 SMP Sun Jul 14 23:36:57 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message
Olli Ries (ories) wrote :

no issues with:
olli@minime:~$ uname -a
Linux minime 3.11.0-031100rc4-generic #201308041735 SMP Sun Aug 4 21:36:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
(skipped rc3)

Revision history for this message
Olli Ries (ories) wrote :

culprit found:
Linux version 3.11.0-031100rc5-generic

last working version was rc4 as per comment #11

reproduction steps:
boot the kernel and start regular desktop usage (e.g. multiple chromium sessions, xchat, thunderbird); it crashes within seconds to minutes

from kern.log (see next attachment)
Aug 22 23:36:00 minime kernel: [ 0.000000] Linux version 3.11.0-031100rc5-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201308112135 SMP Mon Aug 12 01:35:49 UTC 2013
...
Aug 22 23:36:47 minime kernel: [ 62.273638] Oops: 0000 [#1] SMP
...

no mce in kern.log this time (afaics)

Revision history for this message
Olli Ries (ories) wrote :

2 relevant oopses within 1 session, at:
Aug 22 23:36:47 minime kernel: [ 62.273638] Oops: 0000 [#1] SMP
Aug 22 23:36:48 minime kernel: [ 62.498471] Oops: 0010 [#2] SMP
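
The hex number after "Oops:" is the trap's error code. Assuming both oopses are x86 page faults (an assumption; only the first one is shown as a NULL pointer dereference elsewhere in this report), the bits can be decoded as in this Python sketch. The function name and the label strings are made up for this example.

```python
# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]

def decode_oops_code(code_hex: str) -> list:
    """Decode the hex error code printed after 'Oops:' as a page-fault error code."""
    code = int(code_hex, 16)
    meanings = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            meanings.append(if_set)
        elif if_clear is not None:
            meanings.append(if_clear)
    return meanings

for code in ("0000", "0010"):
    print(code, "->", ", ".join(decode_oops_code(code)))
```

Read this way, 0000 is a kernel-mode read of an unmapped page, and 0010 is a kernel-mode instruction fetch from an unmapped page.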

Revision history for this message
Tim Gardner (timg-tpi) wrote :

Olli - try turning off your Bluetooth trackpad.

[ 61.035328] hidraw: raw HID events driver (C) Jiri Kosina
[ 61.091992] Bluetooth: HIDP (Human Interface Emulation) ver 1.2
[ 61.092004] Bluetooth: HIDP socket layer initialized
[ 61.167046] magicmouse 0005:05AC:030E.0001: unknown main item tag 0x0
[ 61.167132] input: ries’s Trackpad as /devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.4/1-1.4:1.0/bluetooth/hci0/hci0:11/input15
[ 61.167906] magicmouse 0005:05AC:030E.0001: input,hidraw0: BLUETOOTH HID v1.60 Mouse [ries’s Trackpad] on cc:af:78:ef:62:dd
[ 62.273554] BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
[ 62.273602] IP: [<ffffffff81579b01>] evdev_poll+0x31/0x70
[ 62.273628] PGD 0
[ 62.273638] Oops: 0000 [#1] SMP
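
For anyone triaging similar logs, the two key lines (faulting address and instruction pointer) can be pulled out mechanically. A minimal Python sketch, assuming the line format shown above; the function name summarize_oops and the dictionary keys are made up for this example.

```python
import re

NULL_DEREF = re.compile(r"NULL pointer dereference at ([0-9a-f]+)")
IP_LINE = re.compile(r"IP: \[<([0-9a-f]+)>\] ([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")

def summarize_oops(lines):
    """Extract the fault address and the faulting function from oops log lines."""
    summary = {}
    for line in lines:
        m = NULL_DEREF.search(line)
        if m:
            summary["fault_address"] = int(m.group(1), 16)
        m = IP_LINE.search(line)
        if m:
            summary["function"] = m.group(2)
            summary["offset"] = int(m.group(3), 16)         # offset into the function
            summary["function_size"] = int(m.group(4), 16)  # total size of the function
    return summary

# Lines copied from the dmesg excerpt above.
LOG = [
    "[ 62.273554] BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0",
    "[ 62.273602] IP: [<ffffffff81579b01>] evdev_poll+0x31/0x70",
]
print(summarize_oops(LOG))
```

A small non-zero fault address such as 0x3b0 usually suggests a field access through a NULL structure pointer, here inside evdev_poll, though that reading is an inference from the log and not something confirmed in this report.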

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between v3.11-rc4 and v3.11-rc5. The first test kernel is built up to commit:
201d3dfa4da10ac45b260320b94e2f2f0e10d687

The kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1214931/

Can you test this kernel and post back if it exhibits the bug or not?

Revision history for this message
Olli Ries (ories) wrote :

re comment #14:
with the trackpad disabled, rc5 has now been running for >1h

Revision history for this message
Olli Ries (ories) wrote :

connecting the trackpad on rc5 causes the system to freeze, kern.log to follow

Revision history for this message
Olli Ries (ories) wrote :

Aug 23 10:06:31 minime kernel: [ 4532.566825] magicmouse 0005:05AC:030E.0003: input,hidraw2: BLUETOOTH HID v1.60 Mouse [ries’s Trackpad] on cc:af:78:ef:62:dd
Aug 23 10:06:35 minime kernel: [ 4537.246692] ------------[ cut here ]------------
Aug 23 10:06:35 minime kernel: [ 4537.246711] WARNING: CPU: 3 PID: 3025 at /home/apw/COD/linux/net/core/filter.c:397 sk_run_filter+0x569/0x570()

there was also a dump on VT which I didn't get to capture, will try to repro & take picture later

Revision history for this message
Olli Ries (ories) wrote :

mce log that should have been created during the last run of rc5

Revision history for this message
Olli Ries (ories) wrote :

re comment #15:

linux-image-3.11.0-031100rc4-generic_3.11.0-031100rc4.201308231048_amd64.deb ended up locking up eventually.

I gave it 2 shots, and I think the first one locked up immediately after connecting the trackpad. Since I had not verified via uname -a that I was running the right kernel, I rebooted; on the second run I wasn't able to lock up the system immediately, only after some time.

I noticed a series of traces in VT1 running a dmesg (while the system was still responsive), let me know if you want the kern.log (which doesn't seem to have captured them, only shows ^@ entries)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, which is built up to commit:
201d3dfa4da10ac45b260320b94e2f2f0e10d687

The kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1214931/

Can you test this kernel and post back if it exhibits the bug or not?

c4afd7b95fff2f4964e630d0de90e8bc94ae37f1

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Olli,

bug 1218004 seems very similar. However, that bug seems to have been introduced in v3.11-rc1. If possible, can you confirm that you don't see your bug in v3.11-rc1, which can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc1-saucy/

Revision history for this message
Olli Ries (ories) wrote :

the kernel v3.11-rc1 from comment #22 freezes as soon as I connect the touchpad.

I am running xmir though in this configuration whereas in the past I was running plain X

Revision history for this message
Olli Ries (ories) wrote :

re comment #23:
revalidated w/o xmir on rc1 kernel from comment #22 - still happens

see attached kern.log, starting at Aug 30 17:16:25 minime kernel: [ 105.040721]

Revision history for this message
Olli Ries (ories) wrote :

re comment #23:
revalidated w/o xmir on rc1 kernel from comment #22 - still happens (this is the 3rd run of rc1)

see attached kern.log, starting at Aug 30 17:16:25 minime kernel: [ 105.040721]

the numerous Aug 30 17:13:XX traces appeared when running xmir and did not lock the system (2nd run on rc1)

grep "Call Trace" /var/log/kern.log | grep "Aug 30 17:13" | wc -l
94848
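
The same count can be broken down per minute, which makes it easier to see when the traces start and stop. A minimal Python sketch, assuming the usual syslog timestamp prefix ("Aug 30 17:13:59"); the sample lines are hypothetical and are not copied from the attached kern.log.

```python
from collections import Counter

def count_by_minute(lines, needle="Call Trace"):
    """Count lines containing needle, grouped by the 'Mon DD HH:MM' timestamp prefix."""
    counts = Counter()
    for line in lines:
        if needle in line:
            # Syslog lines start with e.g. "Aug 30 17:13:59"; the first
            # 12 characters are "Aug 30 17:13", i.e. minute resolution.
            counts[line[:12]] += 1
    return counts

# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "Aug 30 17:13:01 minime kernel: [ 1.000000] Call Trace:",
    "Aug 30 17:13:02 minime kernel: [ 2.000000] Call Trace:",
    "Aug 30 17:16:25 minime kernel: [ 3.000000] Call Trace:",
    "Aug 30 17:16:25 minime kernel: [ 3.000001] some other message",
]
for minute, n in sorted(count_by_minute(SAMPLE).items()):
    print(minute, n)
```

On the real file, the lines would come from something like open("/var/log/kern.log", errors="replace"), since the log may contain NUL bytes after a hard freeze.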

the first run of rc1 that led to a lock up was at Aug 30 17:10:03 and locked up at Aug 30 17:13:59 minime kernel: [ 557.720801]

Revision history for this message
Olli Ries (ories) wrote :

verifying comment #12 just to be sure:

the system does not lock up with rc4, but does lock up with rc5:
ii linux-image-3.11.0-031100rc4-generic 3.11.0-031100rc4.201308231539 amd64 Linux kernel image for version 3.11.0 on 64 bit x86 SMP

ii linux-image-3.11.0-031100rc5-generic 3.11.0-031100rc5.201308112135 amd64 Linux kernel image for version 3.11.0 on 64 bit x86 SMP

is the rc1 kernel from comment #10 (which was working) different from the rc1 kernel in comment #22 (which is locking the system up)?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the additional testing, Olli. The rc1 kernels in comments #10 and #22 should be the same builds. One way to tell would be to boot them and run uname -a, then compare the build stamps.
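
As a sketch of that comparison in Python: the release string and the build stamp (the field starting with "#") can be extracted from two uname -a outputs and compared. The function names are made up for this example; the sample strings are the rc1 and rc4 outputs posted earlier in this bug.

```python
import re

# "Linux <hostname> <release> #<build stamp> ..."
UNAME = re.compile(r"^Linux (\S+) (\S+) #(\S+) ")

def build_id(uname_line):
    """Return (release, build_stamp) from a line of 'uname -a' output."""
    m = UNAME.match(uname_line)
    if not m:
        raise ValueError("unrecognized uname -a output")
    _host, release, stamp = m.groups()
    return release, stamp

def same_build(a, b):
    """True if two 'uname -a' outputs report the same release and build stamp."""
    return build_id(a) == build_id(b)

# uname -a outputs copied from earlier comments in this bug.
RC1 = ("Linux minime 3.11.0-031100rc1-generic #201307141935 SMP "
       "Sun Jul 14 23:36:57 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux")
RC4 = ("Linux minime 3.11.0-031100rc4-generic #201308041735 SMP "
       "Sun Aug 4 21:36:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux")

print(build_id(RC1))
print(same_build(RC1, RC4))
```

If the two rc1 boots report the same release and stamp, they are almost certainly the same build.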

It sounds like we've concluded the following:

v3.10 final: good
v3.11 rc1: bad
v3.11-rc4: good
v3.11-rc5: bad

It appears this bug was introduced in 3.11-rc1, but was fixed in rc2 or rc3. Then it was introduced again, or a similar bug was introduced, in rc5.
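
A bisect assumes a single switch from good to bad across the range, so a "good" result that follows an earlier "bad" one deserves a second look: it can mean two separate regressions, or an intermittent bug that simply did not trigger during one test run. A minimal Python sketch of that sanity check, using the four results listed above; the function name is made up for this example.

```python
def first_inconsistency(results):
    """results: list of (version, 'good' or 'bad') in chronological order.

    Return (good_version, earlier_bad_version) for the first 'good' that
    follows a 'bad', or None if the results flip from good to bad only once.
    """
    first_bad = None
    for version, verdict in results:
        if verdict == "bad":
            if first_bad is None:
                first_bad = version
        elif first_bad is not None:
            return version, first_bad
    return None

# Results as summarized in this comment.
RESULTS = [
    ("v3.10", "good"),
    ("v3.11-rc1", "bad"),
    ("v3.11-rc4", "good"),
    ("v3.11-rc5", "bad"),
]
print(first_inconsistency(RESULTS))
```

Here the check flags v3.11-rc4 as the result to re-test before trusting a bisect between rc4 and rc5.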

Since we are bisecting between 3.10 and 3.11-rc1 in bug 1218004 , it might be best for us to focus on the bisect between rc4 and rc5 with your bug.

The v3.11 final kernel has now been released by Linus. It might be a good idea to test this kernel to confirm it still has this bug, or if it fixes it. It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-saucy/

Can you test 3.11 final when you have a chance? If it still has the bug, we can continue with the bisect.

Revision history for this message
Olli Ries (ories) wrote :

confirmed that v3.11 final still shows the same symptoms, i.e. crash / lock soon after I enable the trackpad.

Do you want me to rule out rc4 as working by exercising it for a longer period (~1d) just to make sure we don't chase ghosts?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Yeah, that would be great if you can confirm rc4 does not have the bug:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc4-saucy/

Revision history for this message
Olli Ries (ories) wrote :

I did a BIOS upgrade from 1.20 to 1.39, followed by a run of rc4, which crashed upon connecting the touchpad.

I have not dug in to see if it's related.

Revision history for this message
Olli Ries (ories) wrote :

Upon trying to reproduce on rc4, I was able to capture the attached kernel panic. The symptom could be triggered within a short amount of time.

re comment #27:
--->---
It sounds like we've concluded the following:

v3.10 final: good
v3.11 rc1: bad
v3.11-rc4: good
v3.11-rc5: bad
---<---

comment #20 reports rc4 locking up eventually, while I wasn't seeing an issue with rc4 in comments #11 & #26; I have now verified twice that rc4 crashes (comment #30). With that, I think we should disregard comments #11 & #26 and also mark rc4 as bad. As per comment #28, v3.11 final also shows the symptom.

Conclusion: regression between v3.10 final and v3.11-rc1; will try to double-check v3.10 final now

Revision history for this message
Olli Ries (ories) wrote :

v3.10 is still working:

olli@minime:~$ uname -a
Linux minime 3.10.9-031009-generic #201308201935 SMP Tue Aug 20 23:36:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
olli@minime:~$ uptime
 08:47:18 up 17:32, 2 users, load average: 1.98, 2.65, 1.73

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

We may have identified the commit that introduced this regression. The bisect performed in bug 1218004 identified commit a4a23f6 as the first bad commit.

I built a test kernel with commit a4a23f6 reverted. Can you test this kernel and see if it resolves the bug?

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1218004/

Thanks in advance.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Olli,

This should be fixed in the 3.11.0-12 kernel. Can you apply the latest updates and post back if this bug is resolved for you?

Thanks in advance!

Changed in linux (Ubuntu):
status: In Progress → Fix Released