Divide by zero in find_busiest_group/update_sg_lb_stats (on physical hardware)

Reported by James Sellman on 2011-08-11
This bug affects 3 people
Affects: Linux / Status: Confirmed / Importance: Unknown
Affects: linux (Ubuntu) / Importance: Undecided / Assigned to: Unassigned

Bug Description

This affects Lucid 10.04.2, but it seems to affect all 2.6.32 builds (and possibly others). The symptoms are very frustrating: divide-by-zero kernel panics or hangs after roughly 200 days of uptime.

Currently using:

Ubuntu 2.6.32-32.62-server 2.6.32.38+drm33.16 -- image package is linux-image-2.6.32-32-server 2.6.32-32.62

However, there appears to be no fix as of 2.6.32-32.77.

Patches for this bug have been submitted to the linux-ec2 package. However, the bug also appears to hit some types of physical hardware, and there has been no movement on getting these patches included in the main kernel packages. I am opening this bug because I don't think the problem will be addressed outside of ec2 as long as the impression is given that it is only an ec2 bug.

Please do not mark this bug as a dupe unless the ec2 ticket is promoted to encompass all kernel builds and patches are submitted to all relevant image packages. The ec2-related ticket does indicate that the bug is related to the mainline kernel build; however, it offloads patches to the main kernel tree (which, given that the bug has been open an entire year, may not be immediately forthcoming). It would be nice if the current Lucid packages (as well as those for other affected releases) received the same patch that the linux-ec2 package has received.

Thanks for any assistance possible from the kernel team. =)

kernel.org related bug:

https://bugzilla.kernel.org/show_bug.cgi?id=16991

Ubuntu linux-ec2-specific bug:

https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853

James Sellman (wd-jim-qp) wrote :

Quick hardware description of the systems that this is occurring on:

SuperMicro X8-series motherboards (various models) equipped with Xeon Westmere CPUs (various models).

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 824304

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: lucid
James Sellman (wd-jim-qp) wrote :

Due to the nature of the crash, logs cannot be obtained. The screenshots in the referenced ec2 bug are relevant, however, and the problem is already described in that bug. Because it was only reported against the ec2 package, fixes were not applied to the generic/server kernel packages.

I am setting the bug to Confirmed.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Teo Ruiz (teo) wrote :

One of my servers crashed with the same divide-by-zero bug after 219 days of uptime, as I confirmed in the ec2 bug. It's an I/O-heavy machine running a critical MySQL service.

I'm attaching the crash log I could get out of my messages log file.

There is apparently a patch getting into the Debian kernel that should solve this, although it's a workaround for the divide-by-zero and not a proper fix.
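
To illustrate the difference between a workaround and a proper fix, here is a minimal user-space sketch in C of guarding a division against a zero divisor. It is not the Debian patch and not kernel code: the struct, the field names, the `LOAD_SCALE` constant, and `avg_load_guarded` are all hypothetical stand-ins chosen for illustration. A guard like this only stops the crash; it does not explain why the divisor became zero in the first place.

```c
#include <stdio.h>

/* Hypothetical stand-in for a fixed-point load scale factor. */
#define LOAD_SCALE 1024UL

/* Hypothetical, simplified stand-in for per-group scheduler statistics. */
struct group_stats {
    unsigned long group_load; /* summed load of the group */
    unsigned long cpu_power;  /* divisor; assumed to be possibly zero */
};

/*
 * Compute a scaled average load without ever dividing by zero.
 * If cpu_power is zero, fall back to 1 so the division is defined.
 * This hides the symptom; the cause of the zero value is untouched.
 */
static unsigned long avg_load_guarded(const struct group_stats *g)
{
    unsigned long power = g->cpu_power;

    if (power == 0)
        power = 1;

    return (g->group_load * LOAD_SCALE) / power;
}
```

Calling `avg_load_guarded` with a `cpu_power` of zero returns a (large) value instead of trapping, which is why such a change is best described as a workaround rather than a fix.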

Teo Ruiz (teo) wrote :
Changed in linux:
status: Unknown → Confirmed