system swapping itself to death in raring for no good reason

Bug #1152736 reported by Steve Langasek on 2013-03-08
This bug affects 8 people
Affects          Status   Importance  Assigned to  Milestone
linux (Ubuntu)   Expired  High        Unassigned
  Raring         Expired  High        Unassigned
  Saucy          Expired  High        Unassigned

Bug Description

My laptop has 4GB of RAM and ~6GB of swap configured. After my most recent kernel upgrade in raring, I am noticing the system has started swapping itself to death; the desktop becomes completely unresponsive, and in some cases it becomes unresponsive even over SSH.

Looking remotely with SSH, I find that kswapd0 is using up nearly one full core. I have no idea *why* - I have vm.swappiness set to 30, and 'free' shows that over 1GB of RAM is still being used for buffers, so there really shouldn't be any memory pressure. Despite the fact that there's only ~400MB of swap used, which should certainly fit back into system memory, 'swapoff -a' fails with a 'Could not allocate memory' error. If I set vm.swappiness to 0, the swap usage decreases, but *very* slowly: after over a half hour, there's still over 400MB of swap used. And I don't have any idea what kswapd is doing at this point, but it's still very busy; and even after setting vm.swappiness=0, the system has managed a second time to get itself into an unresponsive state, with swap looking like the culprit.

dmesg shows nothing (which I will try to demonstrate by attaching logs from the machine in question, once it's responsive enough to let me run apport-collect).
---
ApportVersion: 2.9-0ubuntu2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: vorlon 29618 F.... pulseaudio
 /dev/snd/controlC0: vorlon 29618 F.... pulseaudio
DistroRelease: Ubuntu 13.04
HibernationDevice: RESUME=UUID=f6ab3c43-61b4-4af7-bf03-fa3b147a1de0
InstallationDate: Installed on 2010-09-24 (896 days ago)
InstallationMedia: Ubuntu 10.04.1 LTS "Lucid Lynx" - Release amd64 (20100816.1)
MachineType: LENOVO 3249CTO
MarkForUpload: True
Package: linux (not installed)
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.8.0-10-generic root=/dev/mapper/hostname-root ro quiet splash --verbose vt.handoff=7
ProcVersionSignature: Ubuntu 3.8.0-10.19-generic 3.8.2
RelatedPackageVersions:
 linux-restricted-modules-3.8.0-10-generic N/A
 linux-backports-modules-3.8.0-10-generic N/A
 linux-firmware 1.103
Tags: quantal
Uname: Linux 3.8.0-10-generic x86_64
UpgradeStatus: Upgraded to quantal on 2013-01-25 (42 days ago)
UserGroups: adm admin cdrom dialout libvirtd lpadmin mythtv plugdev sambashare src sudo
WifiSyslog:

dmi.bios.date: 08/23/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 6QET52WW (1.22 )
dmi.board.name: 3249CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr6QET52WW(1.22):bd08/23/2010:svnLENOVO:pn3249CTO:pvrThinkPadX201:rvnLENOVO:rn3249CTO:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 3249CTO
dmi.product.version: ThinkPad X201
dmi.sys.vendor: LENOVO

Steve Langasek (vorlon) on 2013-03-08
Changed in linux (Ubuntu):
importance: Undecided → Critical

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1152736

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: raring
Joseph Salisbury (jsalisbury) wrote :

Hi Steve,

Which kernel version is this? Does the issue go away if you boot back into the prior kernel? Do you have to perform any specific actions for the swapping to start, or does it happen right after boot?

tags: added: kernel-key

apport information

tags: added: apport-collected quantal
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Steve Langasek (vorlon) wrote :

The issue is intermittent, and I don't know what causes it aside from firefox getting handsy with the system memory. There seems to be *some* arbitrary threshold beyond which the behavior becomes nonlinear, but it's not anywhere I would expect it to be: after filing the bug I used SysRq to kill off the session without a full reboot, and at this point I still have the following in 'free':

$ free
             total       used       free     shared    buffers     cached
Mem:       3842476    3371792     470684          0      63260    1176384
-/+ buffers/cache:    2132148    1710328
Swap:      6291452     289828    6001624
$

This is with vm.swappiness=0 - and I still can't swapoff. So I don't know what my kernel is doing, but it's up to no good.
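
For readers unfamiliar with the old-style 'free' layout, the '-/+ buffers/cache' row is plain arithmetic on the 'Mem:' row. The sketch below reproduces it from the figures above; the function name is made up for illustration, and the column order is assumed to be the one shown (total, used, free, shared, buffers, cached). Note that "cached" may include pages the kernel cannot simply discard (tmpfs and shared memory, for instance), so the second number is best read as an upper bound.

```python
def buffers_cache_row(mem_line):
    """Derive the '-/+ buffers/cache' figures from the 'Mem:' row.

    Assumes old-style free output with columns:
    total used free shared buffers cached (all in kB).
    """
    fields = mem_line.split()
    total, used, free, shared, buffers, cached = (int(x) for x in fields[1:7])
    used_without_cache = used - buffers - cached   # what applications hold
    free_with_cache = free + buffers + cached      # free if all cache were dropped
    return used_without_cache, free_with_cache


mem = "Mem: 3842476 3371792 470684 0 63260 1176384"
print(buffers_cache_row(mem))  # (2132148, 1710328), matching the row above
```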

I haven't yet tested whether rebooting to a previous kernel helps; I'll do that shortly.

Steve Langasek (vorlon) wrote :

> I haven't yet tested whether rebooting to a previous kernel helps; I'll
> do that shortly.

I've now been running 3.8.0-9.18 for the past day. Current memory usage,
according to free:

$ free
             total       used       free     shared    buffers     cached
Mem:       3842480    3693604     148876          0       6464    1049600
-/+ buffers/cache:    2637540    1204940
Swap:      6291452     417872    5873580
$

So far the problem has not recurred. However, I would note that in the same
time frame I also installed a flashblocker add-on for my browser, so my
usage pattern may have changed.

I'll give things another day, then if the problem doesn't recur, boot back
into -10 to see if I can reproduce the bug again.

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Steve. I've added this to the kernel team hot list. I'll await your testing results.

Steve Langasek (vorlon) wrote :

I just saw this issue again on 3.8.0-9. Current memory usage (after the swap storm calmed down):

$ free
             total       used       free     shared    buffers     cached
Mem:       3842480    3735332     107148          0       4216    1133784
-/+ buffers/cache:    2597332    1245148
Swap:      6291452     682164    5609288
$

vm.swappiness=30. The flash blocker extension is still in place in my browser.

It seems like the kernel is somehow unhappy to let its page cache drop below 1.2GB in size, and tries really hard to prevent that by swapping other things around?

Joseph Salisbury (jsalisbury) wrote :

The latest mainline kernel is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc2-raring/

It would be good to know if this bug is also in the mainline kernel.

Joseph Salisbury (jsalisbury) wrote :

Hi Steve,

It would also be helpful to know if this is a regression in Raring. Did you happen to have prior releases installed on this machine? If so, did they also exhibit this bug? It might be worthwhile to test some prior kernels such as v3.5, v3.6 and v3.7.

v3.5: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.7.8-quantal/
v3.6: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6.11-raring/
v3.7: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.7.10-raring/

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Joseph Salisbury (jsalisbury) wrote :

I wonder if Firefox could have a memory leak, if this only happens when Firefox is running?

tags: added: kernel-da-key
removed: kernel-key
Steve Langasek (vorlon) wrote :

This machine was continuously upgraded since precise. The problem only occurred in raring; I did not see this behavior with older kernels.

It also has not recurred with the v3.9-rc2-raring kernel.

> I wonder if Firefox could have a memory leak, if this only
> happens when Firefox is running?

Firefox may or may not have a memory leak, but the issue here is the way in which the kernel is managing swap in a situation where there should still be far more than enough memory for everything. Reserving 1.2GB of system memory for page caches and going into a swap death when hitting that limit is not reasonable behavior for the kernel.

Do you want to step me through a kernel bisect, so we can try to find the commit that fixes it between 3.8.0 and 3.9-rc2?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Steve Langasek (vorlon) wrote :

Still using the v3.9~rc2-raring kernel, I noticed today that upon launching a Google hangout, kswapd0 popped up in top chewing up a lot of CPU. *However*, unlike with the 3.8 kernel from raring, the system *actually swapped things out* to free up more memory:

$ free
             total       used       free     shared    buffers     cached
Mem:       3842060    3713168     128892          0       1096     885656
-/+ buffers/cache:    2826416    1015644
Swap:      6291452    1418152    4873300
$

With the previous kernel, no matter how bad the swap storm became, the "cached" figure would never drop below 1.0GB.

So I think this bug is fixed in mainline, and would appreciate help tracking down the fix so we can make sure it's addressed for raring.

Joseph Salisbury (jsalisbury) wrote :

I can perform a "Reverse" kernel bisect to identify the commit that fixes this bug in the v3.9 kernel. We first need to identify the first kernel version that fixes this issue.

It looks like the bug is fixed in v3.9-rc2. The next step would be to test v3.9-rc1:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc1-raring/

If v3.9-rc1 has the bug, I will bisect between v3.9-rc1 and v3.9-rc2. If v3.9-rc1 does not have the bug, I will bisect between v3.8 final and v3.9-rc1.

The reverse bisect will require testing 8 to 12 kernels, but it should track down the exact commit that fixes the bug.
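
To illustrate where an estimate of that size comes from: each test kernel halves the remaining commit range, so 8 to 12 tests corresponds to a range of roughly 256 to 4096 commits. The following is a minimal sketch of a reverse bisect, not the procedure the kernel team actually scripts; the function names and the integer "commits" are hypothetical stand-ins. It also assumes each test gives a reliable answer, which is harder to guarantee when a bug is intermittent.

```python
import math


def bisect_steps(n_commits):
    """Upper bound on the kernels to test when halving a range of n_commits."""
    return math.ceil(math.log2(n_commits)) if n_commits > 1 else 0


def first_fixed(commits, is_fixed):
    """Return (commit, tests_run) for the first commit where is_fixed() is True.

    Assumes commits[0] still has the bug, commits[-1] is fixed, and every
    commit after the fix stays fixed.
    """
    lo, hi = 0, len(commits) - 1   # lo: bug present, hi: bug fixed
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                 # one test kernel built and booted
        if is_fixed(commits[mid]):
            hi = mid               # fix landed at or before mid
        else:
            lo = mid               # fix landed after mid
    return commits[hi], tests


# Hypothetical history of 1000 commits where commit 637 carries the fix.
history = list(range(1000))
print(first_fixed(history, lambda c: c >= 637))
print(bisect_steps(256), bisect_steps(4096))  # 8 12
```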

tags: added: performing-bisect
Steve Langasek (vorlon) wrote :

Unfortunately, after my last message I've since seen two swap death scenarios with the 3.9~rc2 kernel. So it appears this is not fixed upstream after all (or maybe only partially fixed, reducing the frequency), and a bisect is probably not useful here.

I will try downgrading to the quantal kernel to see if I get any better results there.

cro (cro) wrote :

This bug affects my laptop as well, regardless of swap or swappiness settings.

The laptop is an Asus UX32A, and has only ever had 13.04 installed (from the RC prior to release)

/swap is configured with 4G of space on the internal SSD rather than the HDD.

Under normal usage kswapd0 starts using 99%+ of CPU every couple of hours. `killall` on firefox or thunderbird usually fixes things (I generally have a lot of Firefox windows open), as does switching to another TTY and waiting, watching `top` and the HDD light as kswapd0 thrashes the disk.

While kswapd0 is using CPU, the UI starts to become unresponsive, with interactions delayed and the mouse movements becoming jerky.

I've configured this machine with and without swap, with high and low swappiness (it's currently 20, but I originally set it to 0) and nothing affects kswapd0 using up to 100% CPU and making the entire machine unresponsive.

In some instances it has become so bad and the disk thrashing so consistent and long-term that only a hard power-off works to recover control of the machine (pressing ctrl-alt-F2 for example will take in excess of 5 minutes to respond, and the password entry in that TTY will timeout before completing login, as will attempts to connect via SSH)

An example `free -h`
             total       used       free     shared    buffers     cached
Mem:          3.8G       3.6G       157M         0B       4.3M       1.1G
-/+ buffers/cache:       2.5G       1.3G
Swap:         3.9G       719M       3.2G

I've previously used another laptop also with 4GB of RAM, configured the same way (except with /swap on a spinning disk rather than an SSD) and I used that laptop with variations from 10.04 up to 12.10 with no swap issues at all.

My desktop (Mint 14), also with 4GB of RAM, does not suffer this swapping problem either.

General information:
3.8.0-19-generic #29-Ubuntu SM
desktop: Mate 1.6.0

Changed in linux (Ubuntu):
importance: Critical → High
cro (cro) wrote :

An update.

The swappiness is still killing my laptop. It seems to be explicitly related to having Firefox running, especially if I have multiple windows open, with lots of content.

Killing the Firefox process solves the swap problem and returns control of the desktop and mouse, and allows me to continue working.

Here's my 'free' status when kswapd0 is using >100% CPU. In this situation kswapd0 is using between 90 and 120% CPU according to `top`, and everything else is under 5%, including firefox.

cro@zen:~$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       3.3G       503M         0B        16M       1.4G
-/+ buffers/cache:       1.8G       1.9G
Swap:         3.9G       274M       3.6G

Or, because text formatting is bad: of 4GB swap, 3.6GB is free, and I have >500MB of free RAM.

vm.swappiness is set to 0.

This happens to me on average more than once an hour when I'm working (multiple windows/tabs open).

cro,

On Mon, May 20, 2013 at 11:21:00AM -0000, cro wrote:
> The swappiness is still killing my laptop. It seems to be explicitly
> related to having Firefox running, especially if I have multiple windows
> open, with lots of content.

> Killing the Firefox process solves the swap problem and returns control
> of the desktop and mouse, and allows me to continue working.

> Here's my 'free' status when kswapd0 is using >100% CPU. In this
> situation kswapd0 is using between 90 and 120% CPU according to `top`,
> and everything else is under 5%, including firefox.

Out of curiosity, are you using LVM snapshots on this system? (Trying to
figure out any commonalities that might explain the kernel refusing to give
up its cache)

cro (cro) wrote :

Nope, I'm not using LVM snapshots. I'm thinking it's something to do with embeds or media objects - further attempting to work today showed that whenever the flashplugin was loaded (in either firefox or chrome, showing a video in a page) it would swap to death until those processes were killed.

I'm doing more testing/tweaking myself, things like disabling the deadline scheduler setting for the SSD in rc.local to see if that has any effect.

Steve Langasek (vorlon) wrote :

On Mon, May 20, 2013 at 05:20:30PM -0000, cro wrote:
> Nope, I'm not using LVM snapshots. I'm thinking it's something to do
> with embeds or media objects

That would explain higher memory usage of firefox. It does not explain the
kernel failing to make proper use of the available memory. The cause of
*that* must lie somewhere in kernel space. Either this is a generic problem
with the current kernel and it's just that very few people are hitting it,
or it's a problem specific to some particular uses of the kernel. Some
ideas that come to mind are LVM snapshots, LUKS encryption, and virtual
machines.

cro (cro) wrote :

I made two changes to my machine yesterday in rc.local:

I commented out this:
# echo deadline >/sys/block/sdb/queue/scheduler

I added this:
rmmod rts5139 (this is the smart card reader poller)

I've not had a swap to death instance since that wasn't directly related to too many embedded objects on a Firefox page. There's been some swapping going on, and some general slowdowns, but nothing like the past few weeks.

It may be coincidence though.
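
To confirm which I/O scheduler is actually in effect after a change like this, the same sysfs file can be read back: it lists the available schedulers with the active one in square brackets (for example "noop deadline [cfq]"). A small sketch, with hypothetical function names and the device name "sdb" taken from the rc.local line above:

```python
import re


def active_scheduler(text):
    """Return the bracketed (active) entry from a queue/scheduler listing."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_scheduler(device="sdb"):
    """Read the active scheduler for a block device from sysfs (Linux only)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())


print(active_scheduler("noop deadline [cfq]"))   # cfq
print(active_scheduler("noop [deadline] cfq"))   # deadline
```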

Joseph Salisbury (jsalisbury) wrote :

The v3.10-rc2 kernel is now out. Would it be possible to test this kernel? This will tell us if it's going to be a problem in Saucy as well:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.10-rc2-saucy/

tags: added: kernel-stable-key
Steve Langasek (vorlon) wrote :

On Tue, May 21, 2013 at 04:56:54PM -0000, cro wrote:
> I made two changes to my machine yesterday in rc.local:

> I commented out this:
> # echo deadline >/sys/block/sdb/queue/scheduler

Ah, interesting. I also have the deadline scheduler configured here, so
that could be related.

I haven't hit the problem for a little while, but when I next do,
I'll see if this helps.

Steve Langasek (vorlon) wrote :

Using deadline vs. noop as the scheduler on my disk has no effect.

I've tuned everything I can think of in /proc/sys/vm, to no effect - swappiness, dirty_writeback, overcommit_ratio, dirty_background_ratio. I've tried 'echo 3 > /proc/sys/vm/drop_caches'; when I cat this file back, it stays at '3' - which seems to imply that it's received the instruction to drop the cache, but is failing to do so?!
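
A note on reading the value back: on kernels of this era, /proc/sys/vm/drop_caches appears simply to return the last value written, so seeing '3' does not by itself indicate success or failure (this is stated from general kernel knowledge, not from anything in this report). A more direct check is to compare /proc/meminfo before and after the write. The sketch below does that comparison on two captured snapshots; the function names and the sample numbers are invented for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {name: value} dict (kB or counts)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            values[name.strip()] = int(rest.split()[0])
    return values


def reclaimed_kb(before_text, after_text,
                 fields=("Buffers", "Cached", "SReclaimable")):
    """Return how many kB each field shrank between the two snapshots."""
    before = parse_meminfo(before_text)
    after = parse_meminfo(after_text)
    return {f: before[f] - after[f] for f in fields}


# Made-up snapshots, purely to show the shape of the result.
before = "Buffers: 1000 kB\nCached: 5000 kB\nSReclaimable: 300 kB"
after = "Buffers: 100 kB\nCached: 4200 kB\nSReclaimable: 250 kB"
print(reclaimed_kb(before, after))
```

In practice the two snapshots would be the contents of /proc/meminfo captured just before and just after 'echo 3 > /proc/sys/vm/drop_caches' (run as root, ideally after 'sync').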

The size of the un-freeable cache is dependent on what I have running. If I check from a console immediately after boot (with only the lightdm greeter running), the cache bottoms out at about 200MB in size. If I check after login without starting firefox, it's about 700MB. If I start firefox, it's about 1.4GB. I have no idea what is in those caches, but I cannot for the life of me convince the kernel to give them up for my apps to have more memory.

For reference, my system uses LVM and my root filesystem and /home partition are using LUKS. I wonder if the use of LUKS somehow means the kernel is creating a non-freeable cache in front of the encrypted disk. However, if it is, it's *very* buggy, as a check with lsof tells me that the cache is many times the size of *all* the files opened on those disks by *all* the processes on the system (total currently open size: 128MB).

Steve Langasek (vorlon) wrote :

/proc/meminfo of the cache that won't die:

$ cat /proc/meminfo
MemTotal: 3842228 kB
MemFree: 263420 kB
Buffers: 141864 kB
Cached: 1798700 kB
SwapCached: 0 kB
Active: 1744612 kB
Inactive: 1624940 kB
Active(anon): 1430188 kB
Inactive(anon): 1039588 kB
Active(file): 314424 kB
Inactive(file): 585352 kB
Unevictable: 32 kB
Mlocked: 32 kB
SwapTotal: 6291452 kB
SwapFree: 6291452 kB
Dirty: 64 kB
Writeback: 0 kB
AnonPages: 1428932 kB
Mapped: 220580 kB
Shmem: 1040808 kB
Slab: 105084 kB
SReclaimable: 60904 kB
SUnreclaim: 44180 kB
KernelStack: 5240 kB
PageTables: 41892 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 8212564 kB
Committed_AS: 5986084 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 548996 kB
VmallocChunk: 34359186700 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 84024 kB
DirectMap2M: 3901440 kB
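
One way to read these figures, offered as a sketch rather than a diagnosis: on Linux the "Cached" line includes "Shmem" (tmpfs and shared-memory pages). Those pages have no file on disk behind them, so they cannot be dropped like ordinary page cache, only swapped out. The snippet below splits "Cached" into the two parts using the values above; the function name and output keys are made up for illustration.

```python
def cache_breakdown(meminfo_text):
    """Split 'Cached' into shmem/tmpfs pages and file-backed page cache (kB)."""
    info = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            info[name.strip()] = int(rest.split()[0])
    shmem = info["Shmem"]
    cached = info["Cached"]
    return {
        "shmem_kb": shmem,
        "file_backed_kb": cached - shmem,
        "shmem_share": round(shmem / cached, 2),
    }


sample = """Cached: 1798700 kB
Shmem: 1040808 kB"""
print(cache_breakdown(sample))
# {'shmem_kb': 1040808, 'file_backed_kb': 757892, 'shmem_share': 0.58}
```

By this reading, a bit more than half of the cache in this snapshot is shared memory or tmpfs content rather than droppable file cache.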

And /proc/slabinfo:
$ sudo cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_ffffffff81ccdb40 526 598 312 26 2 : tunables 0 0 0 : slabdata 23 23 0
kvm_async_pf 0 0 144 28 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 15920 2 8 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_delegations 0 0 368 22 2 : tunables 0 0 0 : slabdata 0 0 0
nfs_direct_cache 0 0 208 19 1 : tunables 0 0 0 : slabdata 0 0 0
nfs_write_data 34 34 960 17 4 : tunables 0 0 0 : slabdata 2 2 0
nfs_inode_cache 32 32 1024 16 4 : tunables 0 0 0 : slabdata 2 2 0
rpc_inode_cache 100 100 640 25 4 : tunables 0 0 0 : slabdata 4 4 0
fscache_cookie_jar 153 153 80 51 1 : tunables 0 0 0 : slabdata 3 3 0
btrfs_delayed_data_ref 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
btrfs_delayed_ref_head 64 64 128 32 1 : tunables 0 0 0 : slabdata 2 2 0
btrfs_delayed_node 0 0 288 28 2 : tunables 0 0 0 : slabdata 0 0 0
btrfs_ordered_extent 0 0 368 22 2 : tunables 0 0 0 : slabdata 0 0 0
btrfs_extent_buffer 0 0 336 24 2 : tunables 0 0 0 : slabdata 0 0 0
btrfs_path 0 0 144 28 1 : tunables 0 0 0 : slabdata 0 0 0
btrfs_transaction 0 0 304 26 2 : tunables 0 0 0 : sl...

Luis Henriques (henrix) wrote :

Accidentally, I've found an LKML thread that may (or may not) be related to this bug:

http://thread.gmane.org/gmane.linux.kernel/1443124

Also, a possible duplicate of this bug: bug #1185172

Joseph Salisbury (jsalisbury) wrote :

I don't believe we ever performed a kernel bisect to identify the commit that introduced this regression. The discussion on LKML indicates v3.8-rc4 did not exhibit this issue, but v3.8-rc7 did. Would it be possible for you to test the v3.8-rc4 kernel[0] to see if the bug happens there? If it doesn't, test v3.8-rc5[1] then v3.8-rc6[2] if needed.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc4-raring/
[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc5-raring/
[2] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc6-raring/

Changed in linux (Ubuntu Raring):
status: New → Confirmed
importance: Undecided → High
Steve Langasek (vorlon) wrote :

FYI, I tried downgrading to 3.8-rc4-raring, and the problem was still reproducible.

Comparing notes with other folks who have similar setups and are not seeing this problem, my attention was drawn to the fact that I had /tmp on a tmpfs. I thought the tmpfs was probably not the problem, because I had tried size limiting it and was still seeing problems. However, after disabling use of tmpfs for /tmp six days ago (which probably no longer makes sense for my environment now that I have an SSD), the problem has not recurred, even though my usage pattern hasn't changed. I have seen instances where firefox has managed to be killed due to OOM, but I have not seen swap death with 80% of swap unused like I was seeing before.

So it seems likely that this is related to use of tmpfs.

cro (cro) wrote :

I disabled tmpfs when I started seeing this issue, however the swapping has continued, albeit with a lesser number of issues - until today, when the swapping issue was so bad that I performed a hard power down after my laptop had been unresponsive for 24 minutes (clock showed 14:32, I forced a power-down at 14:56).

I couldn't log into any other TTY (it would time out after 60 seconds, or not respond at all), so I couldn't access top or free or anything to try and debug this issue, or even force-kill firefox (which usually solves the problem).

I didn't have many windows open, although I was doing some web development at the time (perhaps 3 or 4 firefox windows, Eclipse, MySQL workbench).

I am also seeing instances where swap is using >3GB, cache is using >1GB, and there are no processes running (default desktop only - all other processes/applications stopped). The only way to force the system to stop using swap when there is nothing to swap is to do a hard restart.

Time to also tune background processes I think - the system uses more than 3GB of physical RAM when idling.

cro (cro) wrote :

Ack, and I meant to add:

So I don't think this is related to tmpfs, at least in my case.

tags: added: bios-outdated-6quj19us needs-upstream-testing
summary: - system swapping itself to death in raring for no good reason
+ [Lenovo ThinkPad X201 3249] system swapping itself to death in raring
+ for no good reason
Steve Langasek (vorlon) wrote :

There is nothing hardware-specific about this bug. Taking the model info back out of the title, to avoid confusion.

summary: - [Lenovo ThinkPad X201 3249] system swapping itself to death in raring
- for no good reason
+ system swapping itself to death in raring for no good reason
Joseph Salisbury (jsalisbury) wrote :

Can you see if this issue still exists in the 3.12-rc3 kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-rc3-saucy/

tags: removed: performing-bisect
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Changed in linux (Ubuntu Raring):
status: Confirmed → Incomplete
Changed in linux (Ubuntu Saucy):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Saucy) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Saucy):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Raring) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Raring):
status: Incomplete → Expired