Ubuntu Server x64 Kernel Oops - Random services tainted

Bug #242804 reported by Dan Maranville on 2008-06-24
Affects: linux (Ubuntu); Importance: Undecided; Assigned to: Unassigned
Affects: Hardy; Importance: Medium; Assigned to: Andy Whitcroft

Bug Description

Binary package hint: linux-image-2.6.24-19-generic

Running our LTSP server, we have been constantly getting kernel oopses tied to different tainted services: Bind, SSH, IMAP, etc. It is mostly Bind9 in the logs; today was the first time we have seen IMAP. We thought it might have something to do with LTSP, but it is not looking that way: even if we have all the users log off, the load average still climbs over the days. For example:
19:28:04 up 4 days, 10:33, 2 users, load average: 3.00, 3.00, 3.00

That was our first clue that something was up (the load average).
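
A load average that sits flat at 3.00 with almost nobody logged in often points at tasks stuck in uninterruptible sleep (state D), since Linux counts those toward the load alongside runnable tasks. Defunct processes (state Z) do not add to the load, but they are worth listing at the same time. A minimal sketch for finding both from /proc, assuming the standard /proc/<pid>/stat layout; the function names are illustrative and were not part of our original diagnosis:

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def stuck_tasks(proc_root="/proc", states=("D", "Z")):
    """List processes in uninterruptible sleep (D) or defunct (Z)."""
    found = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while reading, or unexpected format
        if state in states:
            found.append((pid, comm, state))
    return found


if __name__ == "__main__":
    for pid, comm, state in stuck_tasks():
        print(pid, comm, state)
```

Three entries in state D that never go away would be consistent with a load average pinned at 3.00.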

So we began going through the logs and found that random services were crashing in syslog, and that the crashes directly correlated with kernel oopses in messages.

I know this is very general; I will upload anything requested.

I am attaching excerpts from the syslog and messages that directly correlate to one another.

Dan Maranville (likuidkewl) wrote :
Michael Blinn (mblinn-gmail) wrote :

I'm working on the same physical server as the original bug reporter. I also noticed this in kern.log at the end of the kernel messages from the reboot, which may or may not be relevant:

Jun 27 10:52:21 www kernel: [ 145.904725] mtrr: type mismatch for d0000000,1000000 old: write-back new: write-combining
Jun 27 10:52:38 www kernel: [ 163.650905] mtrr: type mismatch for d0000000,1000000 old: write-back new: write-combining
Jun 27 10:53:47 www kernel: [ 232.116870] mtrr: type mismatch for d0000000,1000000 old: write-back new: write-combining
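
To see how often a message such as this one repeats, the kern.log lines can be grouped by message text. A minimal sketch, assuming the syslog layout of the lines above (timestamp, host, `kernel:`, bracketed uptime, message); the regular expression and function names are illustrative:

```python
import re
from collections import Counter

# Matches lines such as:
# Jun 27 10:52:21 www kernel: [ 145.904725] mtrr: type mismatch ...
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<msg>.*)$"
)


def parse_kernel_line(line):
    """Return (stamp, host, uptime_seconds, message), or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return (
        match.group("stamp"),
        match.group("host"),
        float(match.group("uptime")),
        match.group("msg"),
    )


def count_messages(lines):
    """Count how many times each kernel message text appears."""
    counts = Counter()
    for line in lines:
        parsed = parse_kernel_line(line)
        if parsed is not None:
            counts[parsed[3]] += 1
    return counts


if __name__ == "__main__":
    with open("/var/log/kern.log") as fh:
        for message, count in count_messages(fh).most_common(10):
            print(count, message)
```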

I've also attached two additional Oopses, which occurred before the original bug report's Oops. Note the different affected processes.

These Oopses result in defunct processes which necessitate a reboot to restore services. Unfortunately, the reboot fails and we must physically power-cycle the box from the console - always fun on a production SCSI RAID server.

Any help is most appreciated.

Michael Blinn (mblinn-gmail) wrote :

Today we also updated the box to the latest Dell firmware and BIOS. Perhaps this will address the mtrr type mismatch, and therefore the general protection faults. Will update as the testing progresses.

Michael Blinn (mblinn-gmail) wrote :

Unfortunately we had two more crashes of ssh this morning. The second may have coincided with the killing of an rsync process that was using ssh.

Dan Maranville (likuidkewl) wrote :

Again with the SSH crashes today and yesterday.

Michael Blinn (mblinn-gmail) wrote :

Another crash today. This time in dovecot's imap daemon. kern.log attached.

root@www:/var/log# apt-cache policy linux-image
linux-image:
  Installed: (none)
  Candidate: 2.6.24.19.21
  Version table:
     2.6.24.19.21 0
        500 http://us.archive.ubuntu.com hardy-updates/main Packages
        500 http://security.ubuntu.com hardy-security/main Packages
     2.6.24.16.18 0
        500 http://us.archive.ubuntu.com hardy/main Packages

Michael Blinn (mblinn-gmail) wrote :

After more reading, I'm curious if this is related to the IPSec/IPComp zlib compression kernel bug reported here: http://lkml.org/lkml/2008/2/21/154

Though I have an updated kernel, I've disabled compression in my IPSec configuration and will post results.

root@www:/etc/ipsec.d# apt-cache policy linux-image-generic
linux-image-generic:
  Installed: 2.6.24.19.21
  Candidate: 2.6.24.19.21
  Version table:
 *** 2.6.24.19.21 0
        500 http://us.archive.ubuntu.com hardy-updates/main Packages
        500 http://security.ubuntu.com hardy-security/main Packages
        100 /var/lib/dpkg/status
     2.6.24.16.18 0
        500 http://us.archive.ubuntu.com hardy/main Packages
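
My earlier `apt-cache policy linux-image` listing showed `Installed: (none)` because that name is not an installed package on this box, while `linux-image-generic` is. A minimal sketch for pulling the Installed and Candidate fields out of such output, assuming the layout shown above; `parse_policy` is an illustrative name:

```python
def parse_policy(text):
    """Extract package name, installed and candidate versions from
    the text printed by `apt-cache policy <package>`."""
    result = {"package": None, "installed": None, "candidate": None}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Installed:"):
            value = line.split(":", 1)[1].strip()
            # apt prints "(none)" when the package is not installed
            result["installed"] = None if value == "(none)" else value
        elif line.startswith("Candidate:"):
            result["candidate"] = line.split(":", 1)[1].strip()
        elif result["package"] is None and line.endswith(":") and " " not in line:
            # first bare "name:" line is the package header
            result["package"] = line[:-1]
    return result


def needs_upgrade(policy):
    """True if the package is installed but older than the candidate."""
    return (
        policy["installed"] is not None
        and policy["installed"] != policy["candidate"]
    )
```

For the listing above, installed and candidate are both 2.6.24.19.21, so nothing newer was on offer from the archive at that point.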

Michael Blinn (mblinn-gmail) wrote :

Since disabling IPSec compression almost a month ago we have experienced 0 problems. I would conclude that the crashes we saw were related to the kernel zlib bug reported here: http://lkml.org/lkml/2008/2/21/154

I'm hoping we will see a patch against the 2.6.24 kernel branch so that we may re-enable compression.

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Michael Blinn (mblinn-gmail) wrote :

The last message in the kernel discussion regarding this bug (http://lkml.org/lkml/2008/2/28/280) says that they're queueing the fix for 2.6.24-stable - how this translates to Ubuntu's 2.6.24-?? versioning I do not know.

I have not seen this bugfix listed in any of the CVEs for Hardy Heron linux-* updates, but it sure would be nice, as the affected servers are our LTS production boxes.

At some point I will try to download an Intrepid LiveCD and test it, but what I'm most interested in is fixing the kernels we have (ideally without custom compilation!).

Thanks for the note, Michael. I'm pasting the upstream git commit id below. This patch is already available in the upcoming Intrepid release, which is set to come out in the next few days. Based on this I'm going to mark this Fix Released for Intrepid. Additionally, I'll open a Hardy nomination to see if we can get this patch backported for Hardy. Thanks.

ogasawara@yoji:~/linux-2.6$ git log 21e43188f272c7fd9efc84b8244c0b1dfccaa105

commit 21e43188f272c7fd9efc84b8244c0b1dfccaa105

Author: Herbert Xu <email address hidden>

Date: Thu Feb 28 11:23:17 2008 -0800

    [IPCOMP]: Disable BH on output when using shared tfm

Changed in linux:
status: New → Fix Released
assignee: nobody → ubuntu-kernel-team
importance: Undecided → Medium
milestone: none → ubuntu-8.04.2
status: New → Triaged

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Steve Langasek (vorlon) on 2009-01-23
Changed in linux:
milestone: ubuntu-8.04.2 → ubuntu-8.04.3
Andy Whitcroft (apw) wrote :

The fix indicated has indeed been pushed out via the stable tree. This change was included in the 2.6.24.4 stable update, which was pulled into the Hardy kernel under bug #301608 and released as Ubuntu-2.6.24-23.46. This version has already been released into all pockets. Therefore closing this as Fix Released.
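
For anyone wanting to check a Hardy machine against that release, a minimal sketch follows. It assumes version strings of the form `2.6.24-23.46` (upstream version, ABI number, upload number) and is not a full Debian version comparison; the function names are illustrative:

```python
import re

# First Hardy kernel release carrying the fix, per the comment above.
FIXED_IN = "2.6.24-23.46"


def version_key(version):
    """Turn '2.6.24-23.46' into (2, 6, 24, 23, 46) for comparison."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def has_ipcomp_fix(version, fixed_in=FIXED_IN):
    """True if `version` is at or beyond the release that carries the fix."""
    return version_key(version) >= version_key(fixed_in)


if __name__ == "__main__":
    import sys

    for candidate in sys.argv[1:]:
        verdict = "has fix" if has_ipcomp_fix(candidate) else "predates fix"
        print(candidate, verdict)
```

The version to pass in is the one reported for the installed linux-image-2.6.24-*-generic package; anything in the 2.6.24-19 series, as in the original report, predates the fix.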

Changed in linux (Ubuntu Hardy):
assignee: nobody → Andy Whitcroft (apw)
status: Triaged → In Progress
status: In Progress → Fix Released
Michael Blinn (mblinn-gmail) wrote :

Beautiful. Thank you to all who work so hard to keep LTS patches coming. Cheers!
