All our Lucid 2.6 and 3.0 kernels hang with heavy memory loads

Bug #1161202 reported by Marc Hasson
This bug affects 1 person
Affects: linux-lts-backport-oneiric (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

The purpose of this bug is to report and emphasize the large number of system hangs, each requiring a power cycle, on our deployment of servers running the 10.04 LTS (Lucid) release. The issue is essentially identical to the one reported almost 2 weeks ago for the 3.2 kernels on 12.04 LTS in bug #1154876 at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876.

The kernel version that fails for us in the field is a 2.6.38-16 Ubuntu kernel. Because machines in the field are recovered by power-cycling, we never received a crash dump for analysis, but we have been able to reproduce what appear to be identical symptoms on an in-house VMware testbed. The exact same failure as in the other bug also occurs in our testbed on Lucid using the very latest stock 3.0.0-32-generic kernel from the repository.

See the other bug for details of the scripts and loads used, and for a kdb session captured during the hang. I didn't reproduce all those attachments for this bug report, since everything for this version of the system would be similar. Essentially, all processes remain "stuck" in __alloc_pages_nodemask and never succeed in allocating memory. All CPUs are busy rerunning each process so it can try again, to no avail. On the 3.0 kernel the OOM logic is not invoked during the hang, even though plenty of OOMs had occurred in the time leading up to it. On the 2.6.38 kernels the picture is essentially the same, except that even during the hang we see the OOM select_bad_process() function called continually; it returns no OOM candidate because a previously selected one is still pending. The end result is identical: continual memory allocation failures, short sleeps, retries, and a system that becomes totally unresponsive to anything other than "pings". The serial console and every other CLI or GUI go completely dead. The only thing one can do is break in with kdb to investigate, as shown in the other bug.
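To make the 2.6.38 symptom easier to picture, here is a toy model of the retry loop described above. It is only a sketch of the behavior as reported, not kernel code: the task list, the `rss` numbers, and the `victim_can_exit` flag are all hypothetical, and the model assumes (without the report establishing why) that the pending OOM victim never gets to exit and release its memory.

```python
def select_bad_process(tasks, pending_victim):
    """Toy stand-in for the OOM selector: refuses to pick a new
    candidate while a previously selected victim is still pending."""
    if pending_victim is not None:
        return None
    return max(tasks, key=lambda t: t["rss"])


def simulate(rounds, victim_can_exit):
    """Each round, every task tries to allocate one page.

    Returns (failed_allocations, free_pages) after `rounds` rounds.
    """
    free_pages = 0  # assumption: memory is already exhausted
    tasks = [{"name": "a", "rss": 300}, {"name": "b", "rss": 200}]
    pending = None
    failures = 0
    for _ in range(rounds):
        for _task in list(tasks):
            if free_pages > 0:
                free_pages -= 1  # allocation succeeds
                continue
            failures += 1  # allocation fails; ask the OOM logic for help
            victim = select_bad_process(tasks, pending)
            if victim is not None:
                pending = victim
        # The victim only releases memory if it is actually able to exit.
        if pending is not None and victim_can_exit:
            tasks.remove(pending)
            free_pages += pending["rss"]
            pending = None
    return failures, free_pages


if __name__ == "__main__":
    print("victim stuck:", simulate(5, victim_can_exit=False))
    print("victim exits:", simulate(5, victim_can_exit=True))
```

With the victim stuck, every round ends with the same failures and no freed pages, which matches the "fail, sleep, try again" pattern seen in kdb. When the victim is allowed to exit, the failures stop after the first round.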

Even before the hangs occur, we see very heavy pgscank and pgscand numbers, as reported by the "sar" facility. On our production machines these can each reach millions of page scans per second, and this seems to happen even when several gigabytes of memory are available. The system hangs are invariably immediately preceded by exceptionally high levels of pgscank, and usually of pgscand as well.
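For anyone wanting to watch these rates without sar, the following is a minimal sketch that derives comparable numbers from /proc/vmstat. It assumes the counters behind pgscank and pgscand are the ones whose names start with pgscan_kswapd and pgscan_direct (split per zone on older kernels, single counters on newer ones) and sums them. The function names and the 1-second interval are illustrative choices, not part of any existing tool.

```python
import time


def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def scan_totals(counters):
    """Return (kswapd_scans, direct_scans) summed over all zones."""
    def total(prefix):
        return sum(
            value
            for name, value in counters.items()
            # pgscan_direct_throttle counts events, not pages; skip it.
            if name.startswith(prefix) and not name.endswith("_throttle")
        )
    return total("pgscan_kswapd"), total("pgscan_direct")


def scan_rates(before, after, interval_s):
    """Pages scanned per second by kswapd and by direct reclaim."""
    k0, d0 = scan_totals(before)
    k1, d1 = scan_totals(after)
    return (k1 - k0) / interval_s, (d1 - d0) / interval_s


def read_vmstat(path="/proc/vmstat"):
    with open(path) as f:
        return parse_vmstat(f.read())


if __name__ == "__main__":
    interval = 1.0  # seconds between samples (arbitrary choice)
    prev = read_vmstat()
    while True:
        time.sleep(interval)
        cur = read_vmstat()
        kswapd, direct = scan_rates(prev, cur, interval)
        print("pgscank/s=%.0f pgscand/s=%.0f" % (kswapd, direct))
        prev = cur
```

Logging this output to a file on a machine that later hangs would show whether the scan rates climb in the seconds before the hang, as the sar data suggests.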

We really need a remedy or some kind of workaround for this issue.

Requested system release info:

marc@direct-10-04:~$ lsb_release -rd
Description: Ubuntu 10.04.4 LTS
Release: 10.04

Requested package info:

marc@direct-10-04:~$ dpkg -l | fgrep linux-image-3.0.0
ii linux-image-3.0.0-32-generic 3.0.0-32.50~lucid1 Linux kernel image for version 3.0.0 on x86/

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-3.0.0-32-generic 3.0.0-32.50~lucid1
ProcVersionSignature: Ubuntu 3.0.0-32.50~lucid1-generic 3.0.65
Uname: Linux 3.0.0-32-generic x86_64
Architecture: amd64
Date: Wed Mar 27 20:45:58 2013
InstallationMedia: Ubuntu 10.04.3 LTS "Lucid Lynx" - Release amd64 (20110719.2)
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.utf8
 SHELL=/bin/bash
SourcePackage: linux-lts-backport-oneiric

dino99 (9d9) wrote :

That version is no longer supported, and a backport is not expected as it's not a 'security' problem.

Changed in linux-lts-backport-oneiric (Ubuntu):
status: New → Invalid