3.2.0-38 and earlier systems hang with heavy memory usage

Bug #1154876 reported by Marc Hasson
This bug affects 3 people
Affects: linux (Ubuntu) | Status: Confirmed | Importance: Medium | Assigned to: Unassigned | Milestone: (none)

Bug Description

Background

We've been experiencing mysterious hangs on our 2.6.38-16 Ubuntu 10.04
systems in the field. The systems have large amounts of memory and disk,
along with up to a couple dozen CPU threads. Our operations folks have
to power-cycle the machines to recover them; the systems do not panic. Our use
of "hang" means the system will no longer respond to any current shell
prompts, will not accept new logins, and may not even respond to pings.
It appears totally dead.

Using log files and the "sar" utility from the "sysstat" package, we
gradually put together the following clues to the hangs (an illustrative
sar invocation for these counters is sketched after the list):

  Numerous "INFO: task <task-name>:<pid> blocked for more than 120 seconds"
  High CPU usage suddenly on all CPUs heading into the hang, 92% or higher
  Very high kswapd page scan rates (pgscank/s) - up to 7 million per second
  Very high direct page scan rates (pgscand/s) - up to 3 million per second
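
For reference, an illustrative way to pull these counters with the sysstat
tools (not the exact commands used on our servers; "saXX" stands in for the
data file of the day of interest on a Debian/Ubuntu sysstat layout):

  # Paging statistics: pgscank/s, pgscand/s, pgfree/s, %vmeff, etc.
  sar -B 1 10                         # live, one-second samples
  sar -B -f /var/log/sysstat/saXX     # replay a recorded day

  # Overall CPU utilization heading into the hang:
  sar -u -f /var/log/sysstat/saXX

  # Hung-task warnings in the kernel log:
  grep "blocked for more than 120 seconds" /var/log/kern.log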

In addition to noting the above events just before the hangs, we have
some evidence that the high kswapd scans occur at other times for no
obvious reason, such as when there is a significant (25%) amount of
kbmemfree. We have also seen cases where application errors related to
a system's responsiveness sometimes correlated with either high pgscank/s
or pgscand/s that lasted for some number of sar records before the system
returned to normal running. The peaks of these transients aren't usually
as high as those we see leading up to a solid system hang/failure. And
sometimes these are not "transients" at all, but last for hours with no
apparent event related to the start or stop of this behavior!

So we decided to see if we could reproduce these symptoms on a VMware
testbed that we could easily examine with kdb and snapshot/dump.
Through a combination of tar, find, and cat commands launched from
a shell script we could recreate a system hang both on our 2.6.38-16
systems and on the various flavors of the 3.2 kernels, with the
one crashdump'ed here being the latest 3.2.0-38 at the time of testing.
The "sar" utility in our 2.6 testing confirmed behavior of the CPUs,
kswapd scans, and direct scans leading up to the testbed hangs similar
to what we see in the field failures of our servers.

Details on the shell scripts can be found in the file referenced below.
It's important to read the information below on how the crash dump was
taken before investigating it. Reproduction on a 2-CPU VM took 1.5-4
days for a 3.2 kernel, and usually considerably less for a 2.6 kernel.
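
The authoritative load script is the one in reproduction_info.txt; the
fragment below is only a rough, hypothetical illustration of the kind of
tar/find/cat load meant here (the source tree and loop structure are made up):

  #!/bin/bash
  # Hypothetical page-cache/memory pressure loop -- see reproduction_info.txt
  # for the real script. Runs a few I/O-heavy streams concurrently, forever.
  SRC=/usr    # any large directory tree (assumption)
  while true; do
      tar cf /dev/null "$SRC" &
      find "$SRC" -type f -exec cat {} + > /dev/null &
      wait
  done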

Hang/crashdump details:

In the crashdump the crash "dmesg" command will also show Call Traces that
occurred *after* kdb investigations started. It's important to note the
kernel timestamp that indicates the start of those kdb actions and only
examine entries prior to that for clues as to the hang proper:

[160512.756748] SysRq : DEBUG
[164936.052464] psmouse serio1: Wheel Mouse at isa0060/serio1/input0 lost synchronization, throwing 2 bytes away.
[164943.764441] psmouse serio1: resync failed, issuing reconnect request
[165296.817468] SysRq : DEBUG

Everything previous to the above "dmesg" output occurs prior to (or during)
the full system hang. The kdb session started over 12 hours after the
hang; the system was totally non-responsive at both its serial console
and GUI. We did not try a "ping" in this instance.
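
If the crash "dmesg" output is saved to a file, the pre-kdb portion can be
trimmed with something like the following (assuming the 160512.756748
"SysRq : DEBUG" line above marks the first kdb break-in; the filename is
just an example):

  # Keep only kernel log lines stamped before the first kdb break-in.
  awk -F'[][]' '$2 + 0 >= 160512.756748 { exit } { print }' dmesg.txt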

The "kdb actions" taken may be seen in an actual log of that session
recorded in console_kdb_session.txt. It shows where these 3.2 kernels
are spending their time when hung in our testbed ("spinning" in
__alloc_pages_slowpath by failing an allocation, sleeping, retrying).
We see the same behavior for the 2.6 kernels/tests as well except for
one difference described below. For the 3.2 dump included here all our
script/load processes, as well as system processes, are constantly failing
to allocate a page, sleeping briefly, and trying again. This occurs
across all CPUs (2 CPUs in this system/dump), which fits with what we
believe we see in our field machines for the 2.6 kernels.
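
For anyone watching this lead-up on a live system (before it goes completely
unresponsive), the same reclaim/stall activity is visible outside of sar as
well, for example:

  # Cumulative reclaim and direct-reclaim stall counters from the kernel:
  watch -n1 'grep -E "pgscan_kswapd|pgscan_direct|allocstall" /proc/vmstat'

  # OOM killer invocations so far in the kernel log:
  grep -c "invoked oom-killer" /var/log/kern.log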

For the 2.6 kernels the only difference we see is that there is typically
a call to the __alloc_pages_may_oom function, which in turn selects a
process to kill; but we see that there is already a "being killed by oom"
process at the hang, so no additional ones are selected. And we deadlock,
just as the comment in oom_kill.c's select_bad_process() says. In the
3.2 kernels we are now moving our systems to, we see in our testbed hang
that the code does not go down the __alloc_pages_may_oom path. Yet from
the logs we include, and the "dmesg" within crash, one can see that prior
to the hang OOM killing is invoked frequently. The key seems to be a
difference in the "did_some_progress" value returned when we are very
low on memory; it is always "1" in the 3.2 kernels on our testbed.

Though the kernel used here is 3.2.0-38-generic, we have also caused this
to occur with earlier 3.2 Ubuntu kernels. We have also reproduced the
failures with 2.6.38-8, 2.6.38-16, and 3.0 Ubuntu kernels.

Quick description of included attachments (assuming this bug tool lets me add them separately):
console_boot_output.txt - boot up messages until standard running state of OOMs
dmesg_of_boot.txt - dmesg file from boot, mostly duplicates start of the above
console_last_output.txt - last messages on serial console when system hung
console_kdb_session.txt - kdb session demo'ing where system is "spinning"
dump.201303072055 - sysrq-g dump, system was up around 2 days before hanging
reproduction_info.txt - Machine environment and script used in our testbed

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-38-generic 3.2.0-38.61
ProcVersionSignature: Ubuntu 3.2.0-38.61-generic 3.2.37
Uname: Linux 3.2.0-38-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 2.0.1-0ubuntu17.1
Architecture: amd64
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: AudioPCI [Ensoniq AudioPCI], device 0: ES1371/1 [ES1371 DAC2/ADC]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: marc 2591 F.... pulseaudio
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Card0.Amixer.info:
 Card hw:0 'AudioPCI'/'Ensoniq AudioPCI ENS1371 at 0x20c0, irq 18'
   Mixer name : 'Cirrus Logic CS4297A rev 3'
   Components : 'AC97a:43525913'
   Controls : 24
   Simple ctrls : 13
Date: Wed Mar 13 17:05:30 2013
HibernationDevice: RESUME=UUID=2342cd45-2970-47d7-bb6d-6801d361cb3e
InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release amd64 (20120425)
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
 Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
MachineType: VMware, Inc. VMware Virtual Platform
MarkForUpload: True
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 svgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-38-generic root=UUID=2db72c58-0ff6-48f6-87e4-55365ee344df ro crashkernel=384M-2G:64M,2G-:128M rootdelay=60 console=ttyS1,115200n8 kgdboc=kms,kbd,ttyS1,115200n8 splash
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-38-generic N/A
 linux-backports-modules-3.2.0-38-generic N/A
 linux-firmware 1.79.1
RfKill:

SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 06/02/2011
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: 6.00
dmi.board.name: 440BX Desktop Reference Platform
dmi.board.vendor: Intel Corporation
dmi.board.version: None
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 1
dmi.chassis.vendor: No Enclosure
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd06/02/2011:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:
dmi.product.name: VMware Virtual Platform
dmi.product.version: None
dmi.sys.vendor: VMware, Inc.

Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.9 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc2-raring/
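
For reference, installing such a mainline build typically looks something
like the following (illustrative only; the actual .deb filenames are the
ones published in the directory at [0]):

  # Download the generic amd64 linux-image and linux-image-extra .debs
  # from [0], then install them together and reboot into the new kernel.
  sudo dpkg -i linux-image-3.9.0-*-generic_*_amd64.deb \
              linux-image-extra-3.9.0-*-generic_*_amd64.deb
  sudo reboot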

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Marc Hasson (mhassonsuspect) wrote :

Sure Joseph, in progress. I have the 3.9 kernel you referenced now running my tests on my 12.04 system. It's so far behaving normally; it will likely take a few days to know whether there is any difference as far as the "hang" is concerned.

Just for the record, I had previously tested with linux-image-3.5.0-21-generic_3.5.0-21.32~precise1_amd64.deb, and the hang failure could still be seen with that kernel. I had not checked my records when I submitted this bug, so had forgotten. I could possibly have titled this bug "3.5.0-21 or earlier fail", but I was focused on using one of the regularly distributed kernels to test/reproduce the failure for you folks.

I had also tested with linux-image-3.8.0-0-generic_3.8.0-0.3_amd64.deb. On that system the hang did not occur, BUT for some reason my loading tests also did not appear to be "pushing" the system as hard. So I figured some mismatch between that kernel and Precise was the cause and that this 3.8 test was inconclusive.

Your 3.9 kernel seems to be allowing my tests to allocate as much memory and inflict as many memory overloading events (OOM killer) as the 3.5 and 3.2 kernels, so this test looks like we will be able to gather a datapoint on the issue, either way.

Revision history for this message
Marc Hasson (mhassonsuspect) wrote :

My testing on the 3.9 kernel has been underway since the note above; it has surpassed 11 days of running the loads from the attached scripts, and even higher loads. The previous 3.2 and 3.5 kernel testing never exceeded 4.5 days before hanging solidly, and usually lasted less.

So, the 3.9 kernel appears to be considerably more robust at the very least since I could not cause it to solidly hang as I could in my 2.6 and 3.2/3.5 kernel testbeds. So it would be good to see 3.9 backported to Precise for supported usage on our deployed 12.04 systems. And I will write another bug for the 2.6 systems that are suffering the most so that perhaps something can be done there as well. BUT.....

... I could not tag this bug either as "kernel-bug-exists-upstream" or "kernel-fixed-upstream", because while the "solid hang/failure" symptom *is* fixed in the upstream kernel, we *still* experienced the same hangs, just limited to 5-10 minutes per event, through at least the latter half of the 3.9 kernel testing. I had no way to measure these hangs other than my own observations at my testing consoles; my impression is that they occurred a couple of times a day. I first noticed them a few days into the test, and cannot say for sure whether they were there from the beginning or not.

5-10 minutes of outage from our servers would look the same to most network operations folks as a permanently solid hang; one can't have customers twiddling their thumbs for that long when engaged in transactions of some kind. I believe these "transient" hangs were also seen in my 3.2/3.5 testing, but I didn't time them since I was most concerned about the solid hang/failure.

When any of the kernels, including this 3.9 test kernel, hangs like this I can see that all CPUs are 100% busy, and I presume it's the same symptom I reported earlier, with all processes constantly being rescheduled to retry another page allocation, as shown in my kdb session attachment. But I did not break in with kdb to confirm that in this round of testing; I didn't want to risk disrupting the longer-term survival testing I was going for primarily. I can confirm that pings were still responded to during these hangs and that the serial console remained unresponsive for the 5-10 minutes of each hang.
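
One way such transient hangs could be timed from a second machine, rather
than by eyeball, is a simple probe loop along these lines (the target host
is of course just a placeholder):

  # Log a timestamp and whether the server answered, once per second, so the
  # length of any unresponsive window can be read back from the log later.
  while true; do
      printf '%s ' "$(date '+%F %T')"
      timeout 5 wget -qO /dev/null http://testbox.example/ && echo ok || echo NO-RESPONSE
      sleep 1
  done | tee responsiveness.log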

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Marc Hasson (mhassonsuspect) wrote :

So, it's been many weeks without any kind of acknowledgement of either my previous note in this bug from March or the 10.04 variant I filed as bug #1161202 for the 10.04 base.

Is there any way to get a response on anything further to do on these matters? You guys have the scripts/description and dumps; these issues are reproducible at will on 2 different LTS releases and still cause ongoing operational issues for us. The newest upstream kernel we tried, as reported in March, appears to be an improvement but is still unacceptable with its many minutes of going mute. In practical commerce terms, that's just as severe as permanently hanging from the user's viewpoint.

Is there anything more I can provide, test, or do?

tags: added: kernel-bug-exists-upstream
penalvch (penalvch)
description: updated
Revision history for this message
penalvch (penalvch) wrote :

Marc Hasson, could you please provide the full VMware product version you are utilizing?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Marc Hasson (mhassonsuspect) wrote :

Christopher, it looks like I actually have a reasonable record of the VMware version I was using for this reproduction, despite having regularly updated my VMware. The VMware installer has a log that shows that at the time of the reproduction/report here I was running the VMware vmplayer 4.0.4 x86_64 version, build #744019.

Since I was causing reproductions of this issue well before and after the dates in March that I reported it here, I'm quite certain that I've reproduced this issue across multiple versions of the vmplayer. And we've seen similar-appearing issues on our real servers. Hope this helps.

Thanks for looking into this! It's still an ongoing issue, especially with the 2.6 kernels in the other bug I wrote related to this one.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
penalvch (penalvch)
tags: added: kernel-bug-exists-upstream-v3.9-rc2
removed: kernel-bug-exists-upstream
Revision history for this message
penalvch (penalvch) wrote :

Marc Hasson, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.11-rc4

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Marc Hasson (mhassonsuspect) wrote : Re: [Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

Christopher, I did such a test back in March upon request, with no response
to my testing results then, nor *any* activity, until your recent notes.
Do we have any specific reason or bugfix to believe that this memory issue
has been addressed since then? Will I get at least a response this time to
my testing results as to next steps? I ask because this test can take
several days to perform and I will not be able to start it immediately; I
want to make sure it's worth our time to do in order to get the most
effective progress on this.

BTW, have you noticed the bug report below? It seems fairly similar to
what I've been seeing in different kernels/testing, and has a similar
reproduction method. In my testing there definitely is the OOM deadlock in
my 2.6 kernel testing, while the 3.2 kernel testing seemed to have a
slightly different "deadlock" in believing it had made page-freeing
progress when in fact it had not done so. My upstream kernel testing back
in March had not totally deadlocked, but would "freeze" for long periods of
time. Here's the kernel.org bug, with no response, that seemed somewhat
similar and could be an indication that my doing upstream testing would not
be all that useful:

https://bugzilla.kernel.org/show_bug.cgi?id=59901

Assuming a positive response for me to still proceed with the upstream
testing you requested, I will first have to reconfirm that my current
testbed can reproduce the issue with the latest 3.2 -51 Ubuntu kernel, and
then I will try the upstream kernel you referenced. So it will be a little
while before I can report on this testing, which took multiple days in the
past, and I cannot start it immediately. It sure would be nice if
you guys could reproduce this; I thought my info on this score would be
adequate for that.

  -- Marc --

Revision history for this message
Marc Hasson (mhassonsuspect) wrote :

Summary:

Tried 3.11rc7, very happy with how it behaved in our testing. Tried
this week's 3.12rc5, disappointed that a "step backwards" was taken
on that one for us. The difference for us was in the "low memory killer"
that was configured in the 3.11rc7 build but not the 3.12rc5 system.
Details below; as a consequence I'm tagging this bug with both "upstream
3.11rc7 fixes" and "upstream 3.12rc5 doesn't fix"!

Details:

I've now switched to a real hardware (Dell multicore) platform to make
sure no one has any doubts as to this kernel problem being an issue on
real hardware as well as my VM testbed. I can achieve the same hang
failure in the original bug description using either my 2GB VM or the
actual machine now.

I first reproduced the hang with a more recent 3.2.0-45 kernel on this
64-bit Dell hardware and then tried both the mainline 3.11rc7 and this
week's 3.12rc5 kernels from the URL supplied above by Christopher.

The good news is that I was unable to reproduce a problem using the
3.11rc7 kernel and the system was extremely well-behaved! That is,
despite running a very heavy load it remained responsive to new requests,
appeared to get more overall work accomplished compared to the 3.2 system
in the same time period, and had a minimum of kswapd scan rates in the
"sar" records. And no direct allocation failure scan rates at all.
Naturally, the system was SIGKILL'ing off selected processes periodically
but this is the price I'd expect for running the memory-overloading
test I have here and in my real-world environment. We much prefer
this behavior of individual processes being killed off, which can be
subsequently relaunched, rather than hanging or crashing the entire
system. Especially since it appeared that the SIGKILLs in my tests
were *always* directed at processes that were actively doing the memory
consuming work, so they were good choices.

I note that the processes SIGKILL'ed off in the above 3.11rc7 system
were dispatched to their death by the "low memory killer" logic in the
lowmemorykiller.c code. The standard kernel OOM killer rarely, if ever,
was invoked. The 3.11rc7 kernel appears to have been built with the
CONFIG_ANDROID_LOW_MEMORY_KILLER=y setting which caused that low memory
killer code to be statically linked into the kernel and register its low
memory shrinker callback function which issued the appropriate SIGKILLs
under overloaded conditions.
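
For what it's worth, whether a given build has this compiled in, and what
its thresholds are, can be checked with something like:

  # Was the Android low memory killer built into this kernel?
  grep CONFIG_ANDROID_LOW_MEMORY_KILLER /boot/config-$(uname -r)

  # Its runtime knobs, if present (minfree is a comma-separated list of
  # free-page thresholds):
  ls /sys/module/lowmemorykiller/parameters/
  cat /sys/module/lowmemorykiller/parameters/minfree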

The bad news is that the more recent 3.12rc5 kernel I tried did NOT
have the above CONFIG_ANDROID_LOW_MEMORY_KILLER=y setting and instead
relied upon just the kernel OOM killer. This 3.12rc5 system behaves
similarly to the 3.11rc7 system when I turned off its "low memory killer"
via a /sys/module minfree parameter (a sketch of that tweak follows the
list below). That is, the 3.12rc5 (or 3.11rc7 with the "low memory
killer" disabled) system experienced:

 1) Much longer user response times, with wide variance
    External wget queries went from 1-5 seconds with the "low memory
    killer" enabled during the overloading tests to 2 *minutes* without
    that facility!

 2) High kswapd scan rates of 0.5M-1M/second in the "sar" reports
    With the low memory killer, kswapd scan rates never exceeded a few K/sec...
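
The "turn off the low memory killer" tweak mentioned above amounted to
writing to its minfree parameter; roughly, something like the following
(the zeroed thresholds are an assumption about how it was switched off,
and the URL is only an example for the response-time comparison):

  # Effectively disable the low memory killer by zeroing its thresholds:
  echo 0,0,0,0,0,0 | sudo tee /sys/module/lowmemorykiller/parameters/minfree

  # Time an external request against the loaded server:
  time wget -qO /dev/null http://testbox.example/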


tags: added: kernel-bug-exists-upstream kernel-bug-exists-upstream-v3.12-rc5 kernel-fixed-upstream kernel-fixed-upstream-v3.11-rc7
Changed in linux (Ubuntu):
status: Incomplete → Confirmed