Kernel crash in rb_next doing ohai loops
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Medium | Unassigned |
Natty | Won't Fix | Undecided | Unassigned |
Oneiric | Fix Released | Undecided | Unassigned |
Precise | Fix Released | Undecided | Unassigned |
Bug Description
SRU Justification:
Impact:
If tasks call setsid frequently (which places them into separate task groups), they may trigger a race that can cause an access violation in the scheduler code and crash the kernel. Kernels after v3.3 avoid the inconsistency and do not crash, although the race itself is still present.
Fix:
The attached patch resolves the race and should make its way upstream. Because of the potential for crashes, and after successful local verification, we propose applying it ahead of upstream stable.
Testcase:
1. apt-get install build-essential ruby-1.9.3 screen
2. gem install chef
3. in screen session: while true; do ohai; done
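The Impact text attributes the race to frequent setsid() calls. As a lower-overhead stress sketch, the Python below forks children that each call `os.setsid()` and exit immediately, so that (assuming each new session gets its own scheduler task group, as the Impact text describes) a task group is created and torn down on every iteration. Whether setsid churn alone is enough to hit the race is an assumption this report does not verify; the function names are hypothetical, and the ohai loop above remains the confirmed testcase.

```python
import os


def spawn_session_child() -> bool:
    """Fork one child that starts a new session via setsid() and exits at once.

    Returns True if the child exited with status 0.
    """
    pid = os.fork()
    if pid == 0:
        # Child: setsid() succeeds here because a freshly forked child is
        # never a process-group leader. Assumption: each new session also
        # gets its own scheduler task group.
        code = 1
        try:
            os.setsid()
            code = 0
        finally:
            os._exit(code)  # never fall back into the parent's code path
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def churn_sessions(iterations: int) -> int:
    """Run spawn_session_child() repeatedly; return the number of failed children."""
    failures = 0
    for _ in range(iterations):
        if not spawn_session_child():
            failures += 1
    return failures
```

For a soak test, call `churn_sessions()` with a large count from a screen session, mirroring the ohai loop. Given that the reporter's crash took up to two days to reappear, a short run that does not crash says little.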
---
We have a number of small and large instances running the release version of 12.04. The small instances have been completely stable. However, every large instance we have has crashed at seemingly random intervals. The crash is repeatable on individual systems, though not within a predictable time period. It appears to be triggered by our half-hourly run of Opscode's chef-client. We tried running the client in a tight loop to reproduce the crash quickly, but it still took two days to crash again.
This was affecting the 3.2.0-23-virtual kernel, so we updated to the 3.2.0-24-virtual kernel but still saw the same crash. The only information available in the system logs is:
[17605315.391128] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[17605315.391148] IP: [<ffffffff8130d
[17605315.391163] PGD 1d2fdc067 PUD 1d0e3c067 PMD 0
[17605315.391172] Oops: 0000 [#1] SMP
[17605315.391179] CPU 1
[17605315.391182] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables isofs acpiphp
[17605315.391209]
[17605315.391214] Pid: 28794, comm: chef-client Not tainted 3.2.0-23-virtual #36-Ubuntu
[17605315.391223] RIP: e030:[<
[17605315.391232] RSP: e02b:ffff8801d2
[17605315.391238] RAX: 0000000000000000 RBX: ffff8801d2eb5a00 RCX: 0000000000000000
[17605315.391244] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[17605315.391250] RBP: ffff8801d2659c48 R08: 0000000000000000 R09: 0000000000000000
[17605315.391255] R10: ffff8801dff866c0 R11: 0000000000000001 R12: 0000000000000000
[17605315.391263] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000033b9e28
[17605315.391274] FS: 00007fee8cc1070
[17605315.391281] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[17605315.391287] CR2: 0000000000000010 CR3: 00000001d2a0b000 CR4: 0000000000002660
[17605315.391294] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17605315.391301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17605315.391308] Process chef-client (pid: 28794, threadinfo ffff8801d2658000, task ffff8801d0870000)
[17605315.391315] Stack:
[17605315.391319] ffff8801d2659c48 ffffffff8104ece9 ffff8801d2eb5a00 ffff8801dffa26c0
[17605315.391331] ffff8801d2eb5200 0000000000000000 ffff8801d2659c78 ffffffff810544b8
[17605315.391343] ffff8801d2659c78 ffff8801dffa26c0 0000000000000001 ffff8801d08703a8
[17605315.391354] Call Trace:
[17605315.391364] [<ffffffff8104e
[17605315.391373] [<ffffffff81054
[17605315.391382] [<ffffffff81652
[17605315.391391] [<ffffffff81655
[17605315.391399] [<ffffffff81653
[17605315.391408] [<ffffffff8117e
[17605315.391417] [<ffffffff81089
[17605315.391425] [<ffffffff8117e
[17605315.391433] [<ffffffff81174
[17605315.391443] [<ffffffff8100a
[17605315.391451] [<ffffffff8100a
[17605315.391459] [<ffffffff81298
[17605315.391466] [<ffffffff81174
[17605315.391473] [<ffffffff81174
[17605315.391479] [<ffffffff81174
[17605315.391488] [<ffffffff8165d
[17605315.391494] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47
[17605315.391577] RIP [<ffffffff8130d
[17605315.391583] RSP <ffff8801d2659c18>
[17605315.391587] CR2: 0000000000000010
[17605315.391596] ---[ end trace 586cfae3c9e3e67e ]---
The stack trace is identical between the two kernels. I have been unable to find any reference to this on Ubuntu, Xen, or kernel forums or mailing lists, but it is repeatable even on freshly installed m1.large instances on EC2.
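One quick consistency check on an oops like this is to compare CR2, the address the faulting access tried to read, with the register dump. In the trace above, CR2 and RDI are both 0000000000000010, and the bytes at the marked instruction, `48 8b 17`, decode as `mov (%rdi),%rdx` on x86-64, so the fault is a load through a register holding 0x10 (consistent with reading a field at offset 0x10 through a NULL struct pointer). The hypothetical helper below pulls register values out of a pasted oops and lists which ones equal CR2; its regular expression is tuned to this log format and is an assumption, not a general oops parser.

```python
import re
from typing import Dict, List

# Matches "NAME: <8-16 hex digits>" pairs such as "RDI: 0000000000000010".
# Short segment values ("CS: e033") and truncated "RIP: e030:[<..." fields
# do not match and are skipped.
REGISTER_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,3}):\s*([0-9a-f]{8,16})\b")


def parse_registers(oops_text: str) -> Dict[str, int]:
    """Return register name -> value; the first occurrence of each name wins."""
    registers: Dict[str, int] = {}
    for name, value in REGISTER_RE.findall(oops_text):
        registers.setdefault(name, int(value, 16))
    return registers


def registers_equal_to_cr2(oops_text: str) -> List[str]:
    """List the other registers whose value equals the faulting address in CR2."""
    registers = parse_registers(oops_text)
    fault_address = registers["CR2"]
    return sorted(name for name, value in registers.items()
                  if value == fault_address and name != "CR2")
```

Fed the register lines from the log above, `registers_equal_to_cr2()` should return `["RDI"]`.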
ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-
ProcVersionSign
Uname: Linux 3.2.0-24-virtual x86_64
AcpiTables:
AlsaDevices:
total 0
crw-rw---T 1 root audio 116, 1 May 7 09:58 seq
crw-rw---T 1 root audio 116, 33 May 7 09:58 timer
AplayDevices: aplay: device_list:252: no soundcards found...
ApportVersion: 2.0.1-0ubuntu7
Architecture: amd64
ArecordDevices: arecord: device_list:252: no soundcards found...
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Tue May 15 15:23:54 2012
Ec2AMI: ami-fd1c2789
Ec2AMIManifest: ubuntu-
Ec2Availability
Ec2InstanceType: m1.large
Ec2Kernel: aki-62695816
Ec2Ramdisk: unavailable
IwConfig:
lo no wireless extensions.
eth0 no wireless extensions.
Lspci:
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
PciMultimedia:
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB:
ProcKernelCmdLine: root=LABEL=
PulseList:
Error: command ['pacmd', 'list'] failed with exit code 1: Home directory /home/mydrive not ours.
No PulseAudio daemon running, or not running as session daemon.
RelatedPackageV
linux-
linux-
linux-firmware 1.79
RfKill:
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
WifiSyslog:
Changed in linux (Ubuntu): | |
importance: | Undecided → Medium |
description: | updated |
description: | updated |
Changed in linux (Ubuntu Oneiric): | |
status: | New → Fix Committed |
Changed in linux (Ubuntu Precise): | |
status: | New → Fix Committed |
Changed in linux (Ubuntu Natty): | |
status: | New → In Progress |
assignee: | nobody → Stefan Bader (stefan-bader-canonical) |
tags: | added: verification-done-precise removed: verification-needed-precise |
tags: | added: verification-done-oneiric removed: verification-needed-oneiric |
tags: | added: patch |
Changed in linux (Ubuntu Natty): | |
status: | In Progress → Won't Fix |
assignee: | Stefan Bader (stefan-bader-canonical) → nobody |
Changed in linux (Ubuntu): | |
status: | Triaged → Fix Released |
This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:
apport-collect 999755
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.