Bug #614853 “kernel panic divide error: 0000 [#1] SMP” : Bugs : linux-ec2 package : Ubuntu

Revision history for this message

joe williams (joetify) wrote on 2010-08-07:

#1

Dependencies.txt (1.3 KiB, text/plain; charset="utf-8")

John Johansen (jjohansen) wrote on 2010-08-12:

#2

Joe,

Can you elaborate on which kernel and setup you were using when you saw this on physical hardware? I.e., were you running Lucid's generic kernel on physical hardware, or were you running the ec2 kernel under a Xen dom0, etc.?

joe williams (joetify) wrote on 2010-08-12:

#3

On the physical hardware I am running 2.6.32-24-generic, without any virtualization of any sort.

joe williams (joetify) wrote on 2010-08-12:

#4

I have been unable to collect a core using linux-crashdump on my physical machines; it doesn't seem to dump the core and reboot automatically. However, it does seem to load the crash kernel (kdump init script).

I did collect another stack trace from one of my EC2 machines:

[2498228.006101] divide error: 0000 [#1] SMP 
[2498228.006113] last sysfs file: /sys/devices/xen/vbd-16756/block/sdx4/stat
[2498228.006117] CPU 0 
[2498228.006120] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat raid0 ipt_REJECT ipt_LOG xt_limit xt_tcpudp ipt_addrtype xt_state ip6table_filter ip6_tables ipv6 md_mod nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables
[2498228.006157] Pid: 2128, comm: beam.smp Not tainted 2.6.32-305-ec2 #9-Ubuntu 
[2498228.006161] RIP: e030:[<ffffffff8102ceb4>]  [<ffffffff8102ceb4>] update_sd_lb_stats+0x3a4/0x4e0
[2498228.006172] RSP: e02b:ffff88044188f9f8  EFLAGS: 00010046
[2498228.006176] RAX: 0000000000000000 RBX: ffff88044188fbe4 RCX: 0000000000000001
[2498228.006179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[2498228.006183] RBP: ffff88044188fad8 R08: ffff88000184dbc8 R09: 0000000000000040
[2498228.006187] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffff
[2498228.006191] R13: 000000000000a380 R14: ffffffffffffffff R15: 0000000000000000
[2498228.006199] FS:  00007fd37084f710(0000) GS:ffff880001846000(0000) knlGS:0000000000000000
[2498228.006204] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[2498228.006207] CR2: 00007fd35e7bd000 CR3: 0000000440f9b000 CR4: 0000000000002620
[2498228.006211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2498228.006215] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[2498228.006220] Process beam.smp (pid: 2128, threadinfo ffff88044188e000, task ffff880440f982c0)
[2498228.006224] Stack:
[2498228.006226]  ffff8804407fec40 ffff88044188fa78 000088044188fb18 0000000000000000
[2498228.006231] <0> ffff88000184daa0 0000000000000000 0000000000000000 0000000000000008
[2498228.006238] <0> 000000000000a380 000000000000a380 ffff88000184dbb0 000000000000a380
[2498228.006245] Call Trace:
[2498228.006251]  [<ffffffff8103877d>] find_busiest_group+0x4d/0x460
[2498228.006258]  [<ffffffff814a1b53>] ? __wait_on_bit_lock+0x73/0xb0
[2498228.006262]  [<ffffffff81039294>] load_balance_newidle+0xa4/0x320
[2498228.006266]  [<ffffffff814a1623>] thread_return+0x3cc/0x429
[2498228.006272]  [<ffffffff8133b88a>] ? __up_read+0x9a/0xc0
[2498228.006277]  [<ffffffff81066607>] ? get_futex_value_locked+0x27/0x40
[2498228.006282]  [<ffffffff81066d8d>] futex_wait_queue_me+0xcd/0x110
[2498228.006286]  [<ffffffff81067878>] futex_wait+0x128/0x290
[2498228.006291]  [<ffffffff814a399d>] ? _spin_lock+0x2d/0x60
[2498228.006295]  [<ffffffff81066ee2>] ? futex_wake+0x112/0x130
[2498228.006299]  [<ffffffff81069bb9>] do_futex+0xc9/0x1b0
[2498228.006303]  [<ffffffff81069d16>] sys_futex+0x76/0x170
[2498228.006308]  [<ffffffff810edf78>] ? sys_pread64+0x88/0x90
[2498228.006315]  [<ffffffff81009ba8>] system_call_fastpath+0x16/0x1b
[2498228.006319]  [<ffffffff81009b40>] ? system_call+0x0/0x52
[2498228.006322] Code: 06 89 85 50 ff ff ff c7 85 54 ff ff ff 01 00 00 00 e9 cf fd ff ff 90 48 8b 95 70 ff ff ff 48 8b 45 a8 8b 72 08 48 c1 e0 0a 31 d2 <48> f7 f6 48 8b 75 b0 48 89 45 a0 31 c0 48 85 f6 74 0c 48 8b 45 
[2498228.006363] RIP  [<ffffffff8102ceb4>] update_sd_lb_stats+0x3a4/0x4e0
[2498228.006368]  RSP <ffff88044188f9f8>
[2498228.006372] ---[ end trace 18faee40e07dc443 ]---
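
For readers following the trace: in an x86 oops, the byte wrapped in angle brackets on the "Code:" line is the byte at RIP. The sketch below only splits that line around the marker; it does not disassemble anything. By my reading of the x86-64 encoding, the bytes `48 f7 f6` are `div %rsi`, and RSI is 0 in the register dump above, which would be consistent with the divide error, but that decoding should be confirmed with a real disassembler. The helper name `split_at_fault` is made up for this example.

```python
# "Code:" line copied from the trace above, without the timestamp prefix.
CODE_LINE = (
    "Code: 06 89 85 50 ff ff ff c7 85 54 ff ff ff 01 00 00 00 e9 cf fd ff ff 90 "
    "48 8b 95 70 ff ff ff 48 8b 45 a8 8b 72 08 48 c1 e0 0a 31 d2 <48> f7 f6 "
    "48 8b 75 b0 48 89 45 a0 31 c0 48 85 f6 74 0c 48 8b 45"
)


def split_at_fault(code_line):
    """Return (bytes_before_fault, bytes_from_fault) as lists of hex strings."""
    tokens = code_line.split()[1:]  # drop the leading "Code:" label
    # The faulting byte is the only token wrapped in <...>.
    idx = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    tokens[idx] = tokens[idx].strip("<>")
    return tokens[:idx], tokens[idx:]


before, at_fault = split_at_fault(CODE_LINE)
print("before fault:", " ".join(before[-6:]))
print("at fault:    ", " ".join(at_fault[:3]))
```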

John Johansen (jjohansen) wrote on 2010-08-13:

#5

Joe,

What kind of workloads are you running to trigger this?

Also, after you hit this bug again, could you run
apport-collect 614853

joe williams (joetify) wrote on 2010-08-13:

#6

All of these servers are doing high throughput database work, specifically CouchDB.

joe williams (joetify) wrote on 2010-08-14: Dependencies.txt

#7

Dependencies.txt (1.3 KiB, text/plain)

apport information

tags:	added: apport-collected
description:	updated

joe williams (joetify) wrote on 2010-08-16:

#8

In an attempt to figure out the issue I decided to change the IO scheduler thinking it might help considering the contents of the trace. I set one group of nodes to noop and another to deadline. I have seen panics on both groups of machines since doing so. From the (partial) traces I've gotten from those machines the noop trace looks quite a bit different while the deadline trace looks pretty similar with lots of bits regarding xfs.

Screenshots:
http://img.skitch.com/20100816-buaaqf6ggdfp6m8y41x4wfyhfy.jpg
http://img.skitch.com/20100816-dnwh8sijt8jnck5k18ewrdcdeu.jpg

Unfortunately the console in the remote management card cuts the top of the trace off.

joe williams (joetify) wrote on 2010-09-02:

#9

I have confirmed this happens with the deadline IO scheduler. Today an EC2 node of ours running deadline on all the disks got the same "divide error: 0000 [#1] SMP" panic.

joe williams (joetify) wrote on 2010-09-04:

#10

panic.log (8.2 KiB, text/plain)

Here's another panic screenshot (physical hardware) and console output (EC2).

http://img.skitch.com/20100904-bitg4476jipband75g38g5wjcb.jpg

joe williams (joetify) wrote on 2010-09-04:

#11

Not sure if it's related but I noticed the following on boot up of the EC2 machine:

Checking for running unattended-upgrades: [ 132.079264] BUG: soft lockup - CPU#0 stuck for 61s! [udevd:219]

[ 197.577155] BUG: soft lockup - CPU#0 stuck for 61s! [udevd:219]

[ 240.073502] INFO: task mount:609 blocked for more than 120 seconds.

[ 240.073513] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 240.073609] INFO: task sync:627 blocked for more than 120 seconds.

[ 240.073613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 263.074746] BUG: soft lockup - CPU#0 stuck for 61s! [udevd:219]

[ 328.573703] BUG: soft lockup - CPU#0 stuck for 61s! [udevd:219]

joe williams (joetify) wrote on 2010-09-04:

#12

Dependencies.txt (1.4 KiB, text/plain)

apport information

description:

updated

joe williams (joetify) wrote on 2010-09-04:

#13

uname (86 bytes, text/plain)

I ran "apport-collect 614853" on the aforementioned EC2 node and all it seemed to produce was the above dependency list. I have attached uname and lsmod should they be helpful.

joe williams (joetify) wrote on 2010-09-04:

#14

lsmod (1.0 KiB, text/plain)

joe williams (joetify) wrote on 2010-09-05:

#15

panic.log (3.2 KiB, text/plain)

Got it again ...

joe williams (joetify) wrote on 2010-09-05:

#16

Dependencies.txt (1.3 KiB, text/plain)

apport information

description:

updated

joe williams (joetify) wrote on 2010-09-05:

#17

Verified my disks are not CFQ, so it seems to affect all schedulers.

$ cat /sys/block/*/queue/scheduler | grep -v none
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
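
The scheduler files list every available scheduler and mark the active one with square brackets. A minimal sketch of pulling the active name out of one such line is below; the function name `active_scheduler` is made up for this example, and the sample string is copied from the output above.

```python
import re


def active_scheduler(line):
    """Return the bracketed (active) scheduler name, or None if nothing is bracketed."""
    match = re.search(r"\[([^\]]+)\]", line)
    return match.group(1) if match else None


sample = "noop anticipatory [deadline] cfq"
print(active_scheduler(sample))
```

On a live machine the same function could be fed the contents of each /sys/block/*/queue/scheduler file instead of the sample string.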

joe@der-dieb ~/Downloads/linux-2.6.32.21 $ ack "find_busiest_group" .
kernel/sched.c
3389:/********** Helpers for find_busiest_group ************************/
4011:/******* find_busiest_group() helpers end here *********************/
4014: * find_busiest_group - Returns the busiest group within the sched_domain
4039:find_busiest_group(struct sched_domain *sd, int this_cpu,
4182: group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
4206: * Attempt to move tasks. If find_busiest_group has found
4344: group = find_busiest_group(sd, this_cpu, &imbalance, CPU_NEWLY_IDLE,
4401: * find_busiest_group(). If there are no imbalance, then

joe williams (joetify) wrote on 2010-09-14:

#18

I doubt they are related, but I figured it was worth mentioning that last night we got soft lockups (not the divide-by-zero panics we've seen in the past) on a machine. Our hosting provider's KVM software didn't allow me to get the text, but I got some screenshots.

http://img.skitch.com/20100914-nkskuxfcucgrigj95bqqtbids1.jpg
http://img.skitch.com/20100914-xir2hce4rt1p83m9jyy9agr4dk.jpg
http://img.skitch.com/20100914-tx6nuuf86sp552u118m1uebcd.jpg

From the first function call in the trace it looks like it's in the meta information block cache. Maybe due to the spinlock or a bug in xfs?

joe@der-dieb ~/Downloads/linux-2.6.32.21 $ ack mb_cache_shrink_fn .
fs/mbcache.c
118:static int mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask);
121: .shrink = mb_cache_shrink_fn,
189: * mb_cache_shrink_fn() memory pressure callback
200:mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask)

joe williams (joetify) wrote on 2010-09-14:

#19

I believe I have found this bug reported in the kernel bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=16991

Anything that can be done to expedite a fix is appreciated.

Scott Moser (smoser) wrote on 2010-10-12:

#20

@Joe,
Do you think that this bug is a duplicate (or vice versa) of bug 651370 ?
The thing that makes me think it might be is that your console log and all linked images show massive timestamps in the kernel at the time of the failure. I.e., "3229228" is ~897 hours of uptime. Was your system up for anywhere near that long?

Maybe the timestamps are just an aftermath of the failure.
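
The arithmetic behind the "~897 hours" figure can be checked with a short sketch. It assumes the bracketed kernel timestamp counts seconds since boot; the function name `uptime_from_printk` is made up for this example.

```python
def uptime_from_printk(seconds):
    """Convert a printk timestamp (assumed to be seconds since boot) to (hours, days)."""
    hours = seconds / 3600
    days = hours / 24
    return hours, days


hours, days = uptime_from_printk(3229228)
print(f"{hours:.0f} hours, {days:.1f} days")
```

For the value quoted above this works out to roughly 897 hours, or a bit over 37 days.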

joe williams (joetify) wrote on 2010-11-08:

#21

@Scott I do not believe it's the same bug, see the discussion at https://bugzilla.kernel.org/show_bug.cgi?id=16991

I have gotten a patched kernel from Canonical support and applied it to some of my machines this morning; we'll see if it fixes the panics.

John Johansen (jjohansen) wrote on 2010-11-10:

#22

lp614853.patch (613 bytes, text/plain)

This is the patch from comment #17 backported to Lucid.

joe williams (joetify) wrote on 2010-11-10:

#23

I have been running this patch in production for a couple days and it seems solid thus far. I'm going to wait a few more days before I call it fixed though.

Scott Moser (smoser) wrote on 2010-11-11:

#24

ubuntu-kernels-sandbox/ubuntu-lucid-amd64-linux-image-2.6.32-310-ec2_2.6.32-310.190-lp614853-kernel.img.manifest.xml

I uploaded John's kernel to each region, from
http://kernel.ubuntu.com/~jj/linux-image-2.6.32-310-ec2_2.6.32-310.19~lp614853_amd64.deb

us-west-1 aki-3e23737b x86_64
us-east-1 aki-2433c44d x86_64
eu-west-1 aki-6c063318 x86_64
ap-southeast-1 aki-d8740a8a x86_64

Scott Moser (smoser) on 2010-11-15

Changed in linux-ec2 (Ubuntu):
importance:	Undecided → Medium
status:	New → Confirmed

Brian Murray (brian-murray) on 2010-11-16

tags:

added: patch

joe williams (joetify) wrote on 2010-11-19:

#25

This patch seems solid, the panics don't seem to happen any longer on my machines.

John Johansen (jjohansen) wrote on 2010-12-14:

#26

The patch posted above may be causing Bug #671001. The patch "fixes" this bug by simply checking for 0 before doing the division; it does not address the underlying issue causing group->cpu_power to be 0 in the first place. So instead of oopsing at the divide by zero, the kernel continues until the underlying problem causes a different bug to surface.

To be clear, this is just speculation that this might be the cause, and it has not been verified yet.
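
To make the distinction concrete, here is a small Python model of "check for 0 before dividing". It is not the actual patch, which is kernel C. The load formula, the `SCHED_LOAD_SCALE` value of 1024, and the choice to return 0 when cpu_power is 0 are all assumptions made for illustration, as are the function names.

```python
SCHED_LOAD_SCALE = 1 << 10  # assumed scale factor; check the kernel source for the real value


def avg_load_unguarded(group_load, cpu_power):
    """Model of the original computation: fails when cpu_power is 0."""
    return (group_load * SCHED_LOAD_SCALE) // cpu_power


def avg_load_guarded(group_load, cpu_power):
    """Model of the workaround: skip the division when cpu_power is 0.

    Returning 0 here is an assumption; the point is only that the crash is
    avoided while cpu_power == 0 itself goes unexplained.
    """
    if cpu_power == 0:
        return 0
    return (group_load * SCHED_LOAD_SCALE) // cpu_power


print(avg_load_guarded(2048, 1024))
print(avg_load_guarded(2048, 0))
```

The guarded version keeps running with a bogus cpu_power, which is exactly why a later, different failure remains possible.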

John Johansen (jjohansen) wrote on 2010-12-15:

#27

It has been reported that Bug #671001 was encountered before the test kernel with the above patch was run.

Scott Moser (smoser) wrote on 2011-01-03:

#28

There was more action on the linux bug (https://bugzilla.kernel.org/show_bug.cgi?id=16991#c17), and a paper-over patch sent upstream http://lkml.indiana.edu/hypermail/linux/kernel/1010.2/02058.html . The upstream post got the expected response (no... fix it right).

Rudolfs Osins (rudolfs) wrote on 2011-01-05:

#29

oops.txt (6.9 KiB, text/plain)

We had at least 4 crashes related to this bug (all within 2 months). Attached the messages of the latest two panics.

It's a DB server running postgres and a linux software raid10 setup for storage. On all occasions the machine had a higher load than normal ~20 - 30 (normally ~15), on the latest crash there was also a raid rebuild in the background.

Running on AWS
Instance: m2.2xlarge
Region: EU-West
Kernel-id: aki-4feec43b (2.6.32-309-ec2 kernel via pvgrub)

Linux version 2.6.32-309-ec2 (buildd@yellow) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) )
#18-Ubuntu SMP Mon Oct 18 21:00:50 UTC 2010 (Ubuntu 2.6.32-309.18-ec2 2.6.32.21+drm33.7)

Will try to upgrade to linux-image-2.6.32-311-ec2 as there are a lot of changes in the sched code, although I didn't find anything that would address this issue explicitly.

joe williams (joetify) wrote on 2011-01-17:

#30

Has this been merged into 10.04? If not, the "paper over" patch should really get included in my opinion and then be replaced when the correct fix is available. Myself and others have been running the custom kernel that includes the fix for a while now with success. I guess I am a bit more pragmatic than the LKML guys in that I have to make sure my machines stay up or my bills don't get paid.

Stefan Bader (smb) wrote on 2011-01-18:

#31

Not yet, but in the end maybe the pragmatic approach will have to do until there is something better. I tried to reproduce this with the other patch from the upstream bug (to possibly catch the value being set to zero) but have not been able to get anything. I have packages with those kernels at http://people.canonical.com/~smb/lp614853/ which can be used by booting with the pv-grub aki as described in https://lists.ubuntu.com/archives/ubuntu-cloud/2010-December/000466.html. It would help if those who are able to hit the bug could try to do so with that kernel, to see whether that adds more information for upstream.

Meanwhile I would try to get the paper-over patch accepted for SRU.

Stefan Bader (smb) wrote on 2011-01-21:

#32

SRU Justification:

Impact: When trying to find the busiest group for the scheduler, there are rare (but it seems more likely in EC2) cases where cpu_power is zero when the code tries to divide by that variable.

Fix: There is no real fix yet (and therefore neither patch is upstream), but users have tested the first patch, which works around the issue by avoiding the divide whenever cpu_power actually is zero.
The second patch is an optional companion to the first one which hopefully will yell when cpu_power is set to zero by accident. While it is neither a bug fix nor really needed I would like to add it, too. That way we could potentially catch the real bug in real usage (which seems to be the only way to get it after an extended period of time) and then revert both changes in future, when there is a fix.

Testcase: Not being able to reproduce in test. But this has been reported to happen after around a week of uptime on production servers.
(boot tested this approach to make sure this does not introduce obvious regressions by hitting the warning too often).
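
The idea of the companion patch, warning at the moment cpu_power is set to zero rather than when it is later used, can be modelled as below. This is a Python stand-in, not the kernel change; the class `SchedGroup`, its default value of 1024, and the warning text are hypothetical.

```python
import warnings


class SchedGroup:
    """Hypothetical stand-in for a scheduler group; only cpu_power is modelled."""

    def __init__(self, cpu_power=1024):
        self._cpu_power = cpu_power

    @property
    def cpu_power(self):
        return self._cpu_power

    @cpu_power.setter
    def cpu_power(self, value):
        # Report the suspicious write, but still store it so that
        # behaviour is otherwise unchanged.
        if value == 0:
            warnings.warn("cpu_power set to 0", RuntimeWarning, stacklevel=2)
        self._cpu_power = value


group = SchedGroup()
group.cpu_power = 512  # normal update, no warning
print(group.cpu_power)
```

Warning at the write site points at the code path that produced the zero, which the divide-time guard alone cannot do.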

Bug Watch Updater (bug-watch-updater) on 2011-01-24

Changed in linux:
status:	Unknown → Confirmed

Tim Gardner (timg-tpi) wrote on 2011-02-03:

#33

Patch is in 2.6.32-313.25

Changed in linux-ec2 (Ubuntu):
assignee:	nobody → Stefan Bader (stefan-bader-canonical)
status:	Confirmed → Fix Committed

Steve Conklin (sconklin) on 2011-02-04

tags:

added: verification-needed-lucid

Launchpad Janitor (janitor) wrote on 2011-03-02:

#34

This bug was fixed in the package linux-ec2 - 2.6.32-313.26

---------------
linux-ec2 (2.6.32-313.26) lucid-proposed; urgency=low

[ Brad Figg ]

* Release Tracking Bug
    - LP: #716657

[ Brad Figg ]

* Release Tracking Bug
    - LP: #712864

[ Brad Figg ]

* Rebased to 2.6.32-29.58

[ Ubuntu: 2.6.32-29.58 ]

* Release Tracking Bug
    - LP: #716551
  * net: fix rds_iovec page count overflow, CVE-2010-3865
    - LP: #709153
    - CVE-2010-3865
  * net: ax25: fix information leak to userland, CVE-2010-3875
    - LP: #710714
    - CVE-2010-3875
  * net: ax25: fix information leak to userland harder, CVE-2010-3875
    - LP: #710714
    - CVE-2010-3875
  * net: packet: fix information leak to userland, CVE-2010-3876
    - LP: #710714
    - CVE-2010-3876
  * net: tipc: fix information leak to userland, CVE-2010-3877
    - LP: #711291
    - CVE-2010-3877
  * inet_diag: Make sure we actually run the same bytecode we audited,
    CVE-2010-3880
    - LP: #711865
    - CVE-2010-3880

linux-ec2 (2.6.32-313.25) lucid-proposed; urgency=low

[ Brad Figg ]

* Tracking Bug
    - LP: #708890

[ Andrew Dickinson ]

* SAUCE: sched: Prevent divide by zero when cpu_power is 0
    - LP: #614853

[ Brad Figg ]

* Rebased to 2.6.32-29.57

[ Stefan Bader ]

* SAUCE: sched: Try tp catch cpu_power being set to 0
    - LP: #614853

[ Upstream Kernel Changes ]

* SRU: xen: events: do not unmask event channels on resume
    - LP: #681083

[ Ubuntu: 2.6.32-29.57 ]

* Tracking Bug
    - LP: #708864
  * [Config] Set CONFIG_NR_CPUS=256 for amd64 server
    - LP: #706058
  * Input: i8042 - introduce 'notimeout' blacklist for Dell Vostro V13
    - LP: #380126
  * tun: avoid BUG, dump packet on GSO errors
    - LP: #698883
  * TTY: Fix error return from tty_ldisc_open()
    - LP: #705045
  * x86, hotplug: Use mwait to offline a processor, fix the legacy case
    - LP: #705045
  * fuse: verify ioctl retries
    - LP: #705045
  * fuse: fix ioctl when server is 32bit
    - LP: #705045
  * ALSA: hda: Use model=lg quirk for LG P1 Express to enable playback and
    capture
    - LP: #595482, #705045
  * nohz: Fix printk_needs_cpu() return value on offline cpus
    - LP: #705045
  * nohz: Fix get_next_timer_interrupt() vs cpu hotplug
    - LP: #705045
  * nfsd: Fix possible BUG_ON firing in set_change_info
    - LP: #705045
  * NFS: Fix fcntl F_GETLK not reporting some conflicts
    - LP: #705045
  * sunrpc: prevent use-after-free on clearing XPT_BUSY
    - LP: #705045
  * hwmon: (adm1026) Allow 1 as a valid divider value
    - LP: #705045
  * hwmon: (adm1026) Fix setting fan_div
    - LP: #705045
  * amd64_edac: Fix interleaving check
    - LP: #705045
  * IB/uverbs: Handle large number of entries in poll CQ
    - LP: #705045
  * PM / Hibernate: Fix PM_POST_* notification with user-space suspend
    - LP: #705045
  * ACPICA: Fix Scope() op in module level code
    - LP: #705045
  * ACPI: EC: Add another dmi match entry for MSI hardware
    - LP: #705045
  * orinoco: fix TKIP countermeasure behaviour
    - LP: #705045
  * orinoco: clear countermeasure setting on commit
    - LP: #705045
  * x86, amd: Fix panic on AMD CPU family 0x15
    - LP: #705045
  * md: fix bug with re-adding of partially recovered device.
    - LP: #705045
  * tracing: Fix panic when lseek() called on "trace" opened for writing
    - LP: #705045
  * x86, gcc-4.6: Use gcc -m options when building vdso
    - LP: #705045
  * x86: Enable the intr-remap fault handling after local APIC setup
    - LP: #705045
  * x86, vt-d: Handle previous faults after enabling fault handling
    - LP: #705045
  * x86, vt-d: Fix the vt-d fault handling irq migration in the x2apic mode
    - LP: #705045
  * x86, vt-d: Quirk for masking vtd spec errors to platform error handling
    logic
    - LP: #705045
  * hvc_console: Fix race between hvc_close and hvc_remove
    - LP: #705045
  * hvc_console: Fix race between hvc_close and hvc_remove, again
    - LP: #705045
  * HID: hidraw: fix window in hidraw_release
    - LP: #705045
  * bfa: fix system crash when reading sysfs fc_host statistics
    - LP: #705045
  * net: release dst entry while cache-hot for GSO case too
    - LP: #705045
  * install_special_mapping skips security_file_mmap check.
    - LP: #705045
  * USB: misc: uss720.c: add another vendor/product ID
    - LP: #705045
  * USB: ftdi_sio: Add D.O.Tec PID
    - LP: #705045
  * USB: usb-storage: unusual_devs entry for the Samsung YP-CP3
    - LP: #705045
  * p54usb: add 5 more USBIDs
    - LP: #705045
  * p54usb: New USB ID for Gemtek WUBI-100GW
    - LP: #705045
  * sound: Prevent buffer overflow in OSS load_mixer_volumes
    - LP: #705045
  * mv_xor: fix race in tasklet function
    - LP: #705045
  * ima: fix add LSM rule bug
    - LP: #705045
  * ALSA: hda: Use LPIB for Dell Latitude 131L
    - LP: #530346, #705045
  * ALSA: hda: Use LPIB quirk for Dell Inspiron m101z/1120
    - LP: #705045
  * block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead
    - LP: #705045
  * sctp: Fix a race between ICMP protocol unreachable and connect()
    - LP: #705045
  * posix-cpu-timers: workaround to suppress the problems with mt exec
    - LP: #705045
  * Linux 2.6.32.28
    - LP: #705045
  * dell-laptop: Add another Dell laptop family to the DMI whitelist
    - LP: #693078
  * dell-laptop: Add another Dell laptop family to the DMI whitelist
    - LP: #693078
  * drm/ttm: Clear the ghost cpu_writers flag on
    ttm_buffer_object_transfer.
    - LP: #708769
  * drm/kms: remove spaces from connector names (v2)
    - LP: #708769
  * Linux 2.6.32.28+drm33.13
    - LP: #708769

[ Ubuntu: 2.6.32-28.56 ]

* Tracking Bug
    - LP: #705565
  * Just a build number increment for a new upload. There was an issue
    in the previous upload that prevented ARMEL from building. The
    issue has been resolved in the PPA and a new upload should produce
    the requisite images.

[ Ubuntu: 2.6.32-28.55 ]

* Another version bump because of abi check failure
  * Tracking Bug
    - LP: #699885

[ Ubuntu: 2.6.32-28.54 ]

* Another version bump because of upload failure

[ Ubuntu: 2.6.32-28.53 ]

* Another version bump because of upload failure
 -- Brad Figg <brad.figg@canonical.com>   Thu, 10 Feb 2011 11:03:57 -0800

Changed in linux-ec2 (Ubuntu):
status:	Fix Committed → Fix Released

Rudolfs Osins (rudolfs) wrote on 2011-06-17:

#35

2.6.32-314-ec2 kernel crash @ 17.06.2011 (27.8 KiB, text/plain)

I can confirm, that this bug is still happening in (see attached log):
Ubuntu 10.04.2 LTS, kernel 2.6.32-314-ec2

We're running a Postgres server on AWS with linux software raid10. After the crash we upgraded to:
Linux db6.i.bluereport.net 2.6.32-316-ec2 #31-Ubuntu SMP Wed May 18 14:10:36 UTC 2011 x86_64 GNU/Linux

Will report if it's still happening in 316!

Scott Moser (smoser) wrote on 2011-06-17:

#36

Rudolf,
your console log shows:
[ 0.000000] Linux version 2.6.32-312-ec2 (buildd@yellow) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #24-Ubuntu SMP Fri Jan 7 18:30:50 UTC 2011 (Ubuntu 2.6.32-312.24-ec2 2.6.32.27+drm33.12)

That definitely indicates that you've either collected the wrong console output, or you're not running the kernel you think you are. It does appear that pv-grub is loading the kernel, but that it's not a 2.6.32-314 kernel.
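
A quick way to avoid this kind of mix-up is to compare the release string in the boot banner against the release you expect. The sketch below parses the banner line quoted above; the helper names `booted_release` and `matches_expected` are made up for this example.

```python
import re

# Banner line copied from the console log quoted above (shortened after the build date).
BANNER = (
    "[ 0.000000] Linux version 2.6.32-312-ec2 (buildd@yellow) "
    "(gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) "
    "#24-Ubuntu SMP Fri Jan 7 18:30:50 UTC 2011"
)


def booted_release(console_line):
    """Extract the release string that follows 'Linux version', or None if absent."""
    match = re.search(r"Linux version (\S+)", console_line)
    return match.group(1) if match else None


def matches_expected(console_line, expected_prefix):
    """True if the booted release starts with the expected version prefix."""
    release = booted_release(console_line)
    return release is not None and release.startswith(expected_prefix)


print(booted_release(BANNER))
print(matches_expected(BANNER, "2.6.32-314"))
```

On a running machine, `uname -r` reports the release of the kernel that is actually booted, which may differ from the newest one installed until the next reboot.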

Rudolfs Osins (rudolfs) wrote on 2011-06-17:

#37

Scott,

you're right! I think what happened is, that we were running 312 and had a crash after which we rebooted the machine and installed the newest kernel (314 at that time). But we didn't reboot the machine after the upgrade, so 312 was still running.

Please ignore comment #35!
Let's see how 316 performs...

Teo Ruiz (teo) wrote on 2011-08-29:

#38

Crash for kernel 2.6.35-24. (4.7 KiB, text/plain)

Hi all.

This happened to me with 2.6.35-24-server, it is a MySQL (Percona, 5.1.54) machine running not so heavy load but slightly heavier IO. Please find attached the crash log.

The uptime of the server was ~219 days, which is relevant according to the original bug at the kernel.

Was the patch on this bug ported to newer kernels?

Thanks,

Stefan Bader (smb) wrote on 2011-08-30:

#39

No, as this problem was only observed on ec2 kernels, and there it also happened more quickly. There has been some upstream stable discussion about crashes after 219 days of uptime (in 2.6.32-based kernels). One of the patches mentioned

commit 305e6835e05513406fa12820e40e4a8ecb63743c
Author: Venkatesh Pallipadi <email address hidden>
Date: Mon Oct 4 17:03:21 2010 -0700

sched: Do not account irq time to current task

is now upstream, but is not in 2.6.38 kernels before Ubuntu-2.6.35-29.51. The other change does not seem to have been pushed forward yet.

Revision history for this message

Teo Ruiz (teo) wrote on 2011-08-30:

#40

Apparently a patch will be included in Debian to fix the 219 days issue, as per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636797

It's for 2.6.32, but would you think it could be ported to the 2.6.35 on Maverick? Should I file a different bug?

Thanks,

Revision history for this message

Stefan Bader (smb) wrote on 2011-08-30:

#41

This looks like the work-around used for the ec2 kernels. So it sounds like the same problem can in fact happen on real hardware (which was not really clear before). Three things prevented any action on later kernels: that uncertainty, the fact that the work-around clearly only papers over some other issue, and the absence of reports of this happening on other kernels.
I think this report is a good place for it; we just need another task for the "normal" kernel package. The real fix could probably be to not mark the sched_clock as stable, as was brought up in that upstream discussion, though obviously the 219-day delay makes it hard to verify.
But before that I would like to make sure the second part is actually needed. Your report for Maverick was using a 2.6.35-24 kernel, and the patch above came in much later (2.6.35-29; sorry, the 2.6.38 in my last comment was a mistype).

Revision history for this message

James Sellman (wd-jim-qp) wrote on 2011-09-02:

#42

I wound up opening a separate bug for the generic/server packages over at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/824304?comments=all ... There didn't seem to be a way for me to add those packages to this ticket (just other projects).

Tim Gardner (timg-tpi) on 2011-09-12

Changed in linux-ec2 (Ubuntu Lucid):
status:	New → Fix Released
Changed in linux-ec2 (Ubuntu):
status:	Fix Released → Invalid

Revision history for this message

Tim Gardner (timg-tpi) wrote on 2011-09-12:

#43

This bug will be next to impossible to verify given its 219 day cycle.

Changed in linux (Ubuntu Lucid):
assignee:	nobody → Tim Gardner (timg-tpi)
status:	New → Fix Committed
Changed in linux (Ubuntu):
status:	New → Invalid

Revision history for this message

James Sellman (wd-jim-qp) wrote on 2011-09-12:

#44

Thanks Tim. If we can at least keep the div by zero from happening and keep the kernel from dying, then if the underlying problem occurs again we can gather more information to determine what put it in that situation in the first place. In the meantime, at least we don't have to keep a planned reboot cycle to avoid unplanned oopses.
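
To illustrate what "keeping the div by zero from happening" means here, below is a minimal Python sketch of a guarded load calculation. It is not the actual SAUCE patch: the formula is modelled on how the scheduler's load-balancing code scales group load by `cpu_power`, the function names are invented for this example, and the fallback of substituting the nominal power when `cpu_power` is 0 is an assumption.

```python
SCHED_LOAD_SCALE = 1 << 10  # nominal power of one CPU (1024)


def avg_load_unguarded(group_load, cpu_power):
    """Scaled load per unit of CPU power; fails when cpu_power is 0."""
    # Integer division, as in kernel code. With cpu_power == 0 this raises
    # ZeroDivisionError, the Python analogue of the "divide error" oops.
    return (group_load * SCHED_LOAD_SCALE) // cpu_power


def avg_load_guarded(group_load, cpu_power):
    """Same calculation, but tolerate a cpu_power of 0."""
    # Assumption for illustration: fall back to the nominal power of one CPU.
    # The real work-around may choose a different fallback.
    if cpu_power == 0:
        cpu_power = SCHED_LOAD_SCALE
    return (group_load * SCHED_LOAD_SCALE) // cpu_power


print(avg_load_guarded(2048, 512))  # 4096
print(avg_load_guarded(2048, 0))    # 2048
```

As noted in the thread, a guard like this only avoids the crash; it does not explain how `cpu_power` came to be 0 in the first place.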

Revision history for this message

James Sellman (wd-jim-qp) wrote on 2011-09-20:

#45

Can I be pointed to the commit with the diff where the fix went into linux generic (server, etc.) and what package rev it will go into testing on?

Revision history for this message

Martijn Kint (martijn-true) wrote on 2011-09-28:

#46

Like James Sellman, I'm quite curious whether someone could point me to the changelog of the 2.6.35.xx kernel version where this was fixed, as I'm unable to find it. I just want to make sure that this issue is fixed or has a workaround so that we don't get this oops again. So far 20 KVM servers have been hit by this bug.

Revision history for this message

Stefan Bader (smb) wrote on 2011-09-28:

#47

This is currently only committed to the repository and will be included in the next proposed kernel update. There will be a message to this report, asking for verification when the package is prepared. Note this is 2.6.32. For 2.6.35 see comment #39: Ubuntu-2.6.35-29.51 had a fix that was said to fix some crashes. But the last confirmation of 2.6.35 crashing was using an older kernel. So for the moment there is nothing planned for that. First there needs to be some feedback that the latest kernel is still crashing that way.
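
A quick way to tell whether a given 2.6.35 kernel is older than Ubuntu-2.6.35-29.51 is to compare the numeric fields of the version strings. The sketch below is a simplification, not Debian's version comparison algorithm, and the function names are made up for this example. A release string without an upload number (as printed by `uname -r`) compares as older than the same ABI with an upload number, so an ABI-equal result is deliberately treated as "not confirmed".

```python
import re

# Version named in this thread as the first 2.6.35 upload carrying the patch.
FIXED_IN = "Ubuntu-2.6.35-29.51"


def parse_kernel_version(version):
    """Turn '2.6.35-24-server' or 'Ubuntu-2.6.35-29.51' into a tuple of ints."""
    # Keep only the leading numeric fields; a flavour suffix such as
    # '-server' is ignored.
    match = re.match(r"(?:Ubuntu-)?(\d+(?:[.-]\d+)*)", version)
    if match is None:
        raise ValueError("unrecognised kernel version: %r" % version)
    return tuple(int(part) for part in re.split(r"[.-]", match.group(1)))


def has_fix(running, fixed_in=FIXED_IN):
    """True if `running` is at least as new as `fixed_in`."""
    return parse_kernel_version(running) >= parse_kernel_version(fixed_in)


print(has_fix("2.6.35-24-server"))  # False: older than 2.6.35-29.51
print(has_fix("2.6.35-30-server"))  # True: a later ABI
```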

Revision history for this message

Martijn Kint (martijn-true) wrote on 2011-09-28:

#48

The latest KVM machine is running 2.6.35-30-server #59~lucid1-Ubuntu and has an uptime of 13 days, 22:03. I will report back in 206 days from now to see if that fix is working as intended. Is there any other workaround available? Like upgrading to another backports kernel, 2.6.38 perhaps?

Or can we trigger this bug in another way?
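
For planning reboots or follow-up checks, the time left until a machine reaches the 219-day mark discussed in this thread can be computed from its uptime. This is a small sketch with invented helper names; the 219-day figure is taken from the thread, not derived here, and `parse_proc_uptime` assumes the usual `/proc/uptime` layout where the first field is the uptime in seconds.

```python
THRESHOLD_DAYS = 219  # uptime at which crashes were reported in this thread
SECONDS_PER_DAY = 24 * 60 * 60


def parse_proc_uptime(text):
    """Return uptime in seconds from the contents of /proc/uptime."""
    return float(text.split()[0])


def days_until_threshold(uptime_seconds, threshold_days=THRESHOLD_DAYS):
    """Days left before uptime reaches the threshold (negative if passed)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


# Uptime quoted above: 13 days, 22:03.
uptime = 13 * SECONDS_PER_DAY + 22 * 3600 + 3 * 60
print(round(days_until_threshold(uptime), 1))  # 205.1
```

That is in line with the roughly 206 days mentioned above, which subtracts whole days only.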

Revision history for this message

Stefan Bader (smb) wrote on 2011-09-28:

#49

Whether using 2.6.35 or 2.6.38 would make no difference if the patch which is upstream helps. There was a patch claimed to cause the problem sooner in the upstream discussion but it did not seem to work for me when I tried it. So unfortunately I know of no way to speed up testing.

Revision history for this message

Martijn Kint (martijn-true) wrote on 2011-09-28:

#50

It seems like a regression to me or introduced by a new feature. We still have some karmic KVM hosts that are running 2.6.31-20-server kernel but they are definitely not affected by this issue.

Revision history for this message

Dju (mirror-crifo) wrote on 2011-09-29:

#51

Hi.
today, on my filer, running debian squeeze with kernel 2.6.32-5-amd64, i had the same bug in "find_busiest_group"
see the screen here :
http://pic.twitter.com/sAih9DlN
after rebooting, my server runs fine... but i'm afraid it can happen again :(

Revision history for this message

Martijn Kint (martijn-true) wrote on 2011-10-03:

#52

OK, we just got hit again by this bug. Is there still a need to attach the output from syslog to this bug?

Revision history for this message

Stefan Bader (smb) wrote on 2011-10-05:

#53

Yes, the scheduler code changed since 2.6.32 and so the syslog is valuable. Also, was this actually 200+ day uptime or quicker?

Revision history for this message

Martijn Kint (martijn-true) wrote on 2011-10-05:

#54

syslog output (36.6 KiB, text/plain)

It was 244 days actually. Syslog output attached.

Revision history for this message

Stefan Bader (smb) wrote on 2011-10-05:

#55

That syslog is from a 2.6.32 kernel (and a quite old one, 2.6.32-29.58). However the current 2.6.32-34.77 would not have the work-around patch yet. It is staged for the next round of updates. Was that the correct syslog (because crashes with 2.6.35 were mentioned)?

Revision history for this message

Dju (mirror-crifo) wrote on 2011-10-08:

#56

mine had been running for 212 days when it crashed
from what i can read on the internet, it _seems_ this bug happens when the server's uptime is 200+ days

Revision history for this message

Herton R. Krzesinski (herton) wrote on 2011-10-13:

#57

The patch is now in -proposed 2.6.32-35.78 kernel for Lucid (it is already included in current ec2 flavour, just main kernel for lucid didn't have it).

Just noted that on master, the debugging patch "UBUNTU: SAUCE: sched: Try tp catch cpu_power being set to 0" isn't included, not sure this was intended.

As this can take a long time to verify, probably it can be tagged verification-done-lucid, unless there is some way/testcase to make the crash happen earlier.

Anyone wanting to test the 2.6.32-35.78 kernel should enable -proposed for now; see https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.

Revision history for this message

Herton R. Krzesinski (herton) wrote on 2011-10-20:

#58

Since this bug is hard to verify (it looks like it would require more than 1 week with the pristine kernel), and the same patch has already been in lucid-ec2 for some time without issues, I'm marking this verified for the lucid update.

tags:

added: verification-done-lucid
removed: verification-needed-lucid

Revision history for this message

Michael S. Fischer (otterley) wrote on 2011-10-25:

#59

Any ETA on promoting the 2.6.32-35.78 kernel package from -proposed to -updates?

Revision history for this message

James Sellman (wd-jim-qp) wrote on 2011-11-08:

#60

It looks like everything has been qualified in the bug for the proposed package, and everyone has signed off on it, as of the 27th of October.

I wonder if there's anything else preventing it from being promoted?

Revision history for this message

James Sellman (wd-jim-qp) wrote on 2011-11-08:

#61

See:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/871899

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-11-08:

#62

This bug was fixed in the package linux - 2.6.32-35.78

---------------
linux (2.6.32-35.78) lucid-proposed; urgency=low

[ Herton R. Krzesinski ]

  * Release Tracking Bug
    - LP: #871899

[ Andrew Dickinson ]

  * SAUCE: sched: Prevent divide by zero when cpu_power is 0
    - LP: #614853

[ Stefan Bader ]

  * [Config] Force perf to use libiberty for demangling
    - LP: #783660

[ Tim Gardner ]

  * [Config] Simplify binary-udebs dependencies
    - LP: #832352
  * [Config] kernel preparation cannot be parallelized
    - LP: #832352
  * [Config] Linearize module/abi checks
    - LP: #832352
  * [Config] Linearize and simplify tree preparation rules
    - LP: #832352
  * [Config] Build kernel image in parallel with modules
    - LP: #832352
  * [Config] Set concurrency for kmake invocations
    - LP: #832352
  * [Config] Improve install-arch-headers speed
    - LP: #832352
  * [Config] Fix binary-perarch dependencies
    - LP: #832352
  * [Config] Removed stamp-flavours target
    - LP: #832352
  * [Config] Serialize binary indep targets
    - LP: #832352
  * [Config] Use build stamp directly
    - LP: #832352
  * [Config] Restore prepare-% target
    - LP: #832352
  * [Config] Fix binary-% build target
  * [Config] Fix install-headers target
    - LP: #832352
  * SAUCE: igb: Protect stats update
    - LP: #829566
  * SAUCE: rtl8192se spams log
    - LP: #859702

[ Upstream Kernel Changes ]

  * Add mount option to check uid of device being mounted = expect uid,
    CVE-2011-1833
    - LP: #732628
    - CVE-2011-1833
  * crypto: Move md5_transform to lib/md5.c
    - LP: #827462
  * net: Compute protocol sequence numbers and fragment IDs using MD5.
    - LP: #827462
  * ALSA: timer - Fix Oops at closing slave timer
    - LP: #827462
  * ALSA: snd-usb-caiaq: Fix keymap for RigKontrol3
    - LP: #827462
  * powerpc: Fix device tree claim code
    - LP: #827462
  * powerpc: pseries: Fix kexec on machines with more than 4TB of RAM
    - LP: #827462
  * Linux 2.6.32.45+drm33.19
    - LP: #827462
  * ipv6: make fragment identifications less predictable, CVE-2011-2699
    - LP: #827685
    - CVE-2011-2699
  * tunnels: fix netns vs proto registration ordering
    - LP: #823296
  * Fix broken backport for IPv6 tunnels in 2.6.32-longterm kernels.
  * USB: xhci: fix OS want to own HC
    - LP: #837669
  * USB: assign instead of equal in usbtmc.c
    - LP: #837669
  * USB: usb-storage: unusual_devs entry for ARM V2M motherboard.
    - LP: #837669
  * USB: Serial: Added device ID for Qualcomm Modem in Sagemcom's HiLo3G
    - LP: #837669
  * atm: br2864: sent packets truncated in VC routed mode
    - LP: #837669
  * hwmon: (ibmaem) add missing kfree
    - LP: #837669
  * ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc
    - LP: #837669
  * mm: fix wrong vmap address calculations with odd NR_CPUS values
    - LP: #837669
  * perf tools: do not look at ./config for configuration
    - LP: #837669
  * fs/partitions/efi.c: corrupted GUID partition tables can cause kernel
    oops
    - LP: #837669
  * befs: Validate length of long symbolic links.
    - LP: #837669
  * ALSA: snd_usb_caiaq: track submitted output urbs
    - LP: #8...

Ubuntu
linux-ec2 package

kernel panic divide error: 0000 [#1] SMP


Affects	Status	Importance	Assigned to
Linux	Fix Released	Unknown	linux-kernel-bugs #16991
linux (Ubuntu)	Invalid	Undecided	Unassigned
linux (Ubuntu Lucid)	Fix Released	Undecided	Tim Gardner
linux-ec2 (Ubuntu)	Invalid	Medium	Stefan Bader
linux-ec2 (Ubuntu Lucid)	Fix Released	Undecided	Unassigned
