BUG: Bad page state in process node pfn:8e9d9

Bug #1007082 reported by Ken on 2012-05-31
This bug affects 3 people
Affects: linux (Ubuntu)
Importance: Medium
Assigned to: Stefan Bader

Bug Description

Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-virtual x86_64)

Running Ubuntu 12.04 on a 64-bit EC2 instance. Occasionally the instance becomes unresponsive and requires a reboot. Here is the system log from the EC2 console:

[16357652.971938] BUG: Bad page state in process node pfn:8e9d9
[16357652.971947] page:ffffea00023a7640 count:0 mapcount:-127 mapping: (null) index:0x7f89dc026
[16357652.971954] page flags: 0x100000000000000()
[16357652.971960] Modules linked in: isofs acpiphp
[16357652.971970] Pid: 14135, comm: node Tainted: G B D 3.2.0-23-virtual #36-Ubuntu
[16357652.971976] Call Trace:
[16357652.971988] [<ffffffff8111c19f>] bad_page.part.61+0x9f/0xf0
[16357652.971994] [<ffffffff8111c208>] bad_page+0x18/0x30
[16357652.972000] [<ffffffff8111d3c5>] prep_new_page+0x1d5/0x1e0
[16357652.972008] [<ffffffff8100aa32>] ? check_events+0x12/0x20
[16357652.972017] [<ffffffff8113204f>] ? __inc_zone_state+0x5f/0x70
[16357652.972023] [<ffffffff8111d59f>] get_page_from_freelist+0x1cf/0x540
[16357652.972031] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[16357652.972038] [<ffffffff8111dba9>] __alloc_pages_nodemask+0x109/0x800
[16357652.972044] [<ffffffff81005001>] ? xen_mc_extend_args+0x111/0x150
[16357652.972051] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[16357652.972059] [<ffffffff8116b6c0>] ? __mem_cgroup_commit_charge+0x70/0xc0
[16357652.972066] [<ffffffff81006739>] ? pte_mfn_to_pfn+0x89/0xf0
[16357652.972075] [<ffffffff8115672a>] alloc_pages_vma+0x9a/0x150
[16357652.972081] [<ffffffff81136f5c>] do_anonymous_page.isra.38+0x7c/0x2f0
[16357652.972088] [<ffffffff8113abc1>] handle_pte_fault+0x1e1/0x200
[16357652.972094] [<ffffffff810067be>] ? xen_pmd_val+0xe/0x10
[16357652.972100] [<ffffffff81005209>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[16357652.972108] [<ffffffff8113af98>] handle_mm_fault+0x1f8/0x350
[16357652.972116] [<ffffffff81658ddb>] do_page_fault+0x14b/0x520
[16357652.972122] [<ffffffff81140e08>] ? do_mmap_pgoff+0x348/0x360
[16357652.972129] [<ffffffff81140f75>] ? sys_mmap_pgoff+0x155/0x230
[16357652.972135] [<ffffffff81655a35>] page_fault+0x25/0x30
[16357652.972141] BUG: Bad page state in process node pfn:593da
[16357652.972146] page:ffffea000164f680 count:0 mapcount:-127 mapping: (null) index:0x7f89dc027
[16357652.972208] page flags: 0x100000000000000()
[16357652.972213] Modules linked in: isofs acpiphp
[16357652.972221] Pid: 14135, comm: node Tainted: G B D 3.2.0-23-virtual #36-Ubuntu
[16357652.972227] Call Trace:
[16357652.972232] [<ffffffff8111c19f>] bad_page.part.61+0x9f/0xf0
[16357652.972238] [<ffffffff8111c208>] bad_page+0x18/0x30
[16357652.972244] [<ffffffff8111d3c5>] prep_new_page+0x1d5/0x1e0
[16357652.972251] [<ffffffff8100aa32>] ? check_events+0x12/0x20
[16357652.972257] [<ffffffff8113204f>] ? __inc_zone_state+0x5f/0x70
[16357652.972264] [<ffffffff8111d59f>] get_page_from_freelist+0x1cf/0x540
[16357652.972271] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[16357652.972278] [<ffffffff8111dba9>] __alloc_pages_nodemask+0x109/0x800
[16357652.972284] [<ffffffff81005001>] ? xen_mc_extend_args+0x111/0x150
[16357652.972291] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[16357652.972298] [<ffffffff8116b6c0>] ? __mem_cgroup_commit_charge+0x70/0xc0
[16357652.972305] [<ffffffff81006739>] ? pte_mfn_to_pfn+0x89/0xf0
[16357652.972311] [<ffffffff8115672a>] alloc_pages_vma+0x9a/0x150
[16357652.972318] [<ffffffff81136f5c>] do_anonymous_page.isra.38+0x7c/0x2f0
[16357652.972325] [<ffffffff8113abc1>] handle_pte_fault+0x1e1/0x200
[16357652.972331] [<ffffffff810067be>] ? xen_pmd_val+0xe/0x10
[16357652.972337] [<ffffffff81005209>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[16357652.972345] [<ffffffff8113af98>] handle_mm_fault+0x1f8/0x350
[16357652.972351] [<ffffffff81658ddb>] do_page_fault+0x14b/0x520
[16357652.972358] [<ffffffff81140e08>] ? do_mmap_pgoff+0x348/0x360
[16357652.972364] [<ffffffff81140f75>] ? sys_mmap_pgoff+0x155/0x230
[16357652.972371] [<ffffffff81655a35>] page_fault+0x25/0x30
[16357652.972377] BUG: Bad page state in process node pfn:58a7f
[16357652.972382] page:ffffea0001629fc0 count:0 mapcount:-127 mapping: (null) index:0x7f89dc028
[16357652.972389] page flags: 0x100000000000000()
---
AcpiTables:

AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 May 31 15:39 seq
 crw-rw---T 1 root audio 116, 33 May 31 15:39 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
DistroRelease: Ubuntu 12.04
Ec2AMI: ami-563b9d3f
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1a
Ec2InstanceType: m1.large
Ec2Kernel: aki-825ea7eb
Ec2Ramdisk: unavailable
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=LABEL=cloudimg-rootfs ro console=hvc0
ProcModules:
 acpiphp 24231 0 - Live 0x0000000000000000
 isofs 40257 0 - Live 0x0000000000000000
ProcVersionSignature: Ubuntu 3.2.0-23.36-virtual 3.2.14
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-23-virtual N/A
 linux-backports-modules-3.2.0-23-virtual N/A
 linux-firmware 1.79
RfKill: Error: [Errno 2] No such file or directory
Tags: precise ec2-images
Uname: Linux 3.2.0-23-virtual x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm admin audio cdrom dialout dip floppy netdev plugdev video

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1007082

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise
Ken (kenshi) wrote : BootDmesg.txt

apport information

tags: added: apport-collected ec2-images
description: updated
Ken (kenshi) wrote : CurrentDmesg.txt

apport information

Ken (kenshi) wrote : ProcCpuinfo.txt

apport information

Ken (kenshi) wrote : UdevDb.txt

apport information

Ken (kenshi) wrote : UdevLog.txt

apport information

Ken (kenshi) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

We have noted that there is a newer version of the kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

You can update to the latest kernel for this release by simply running the following commands in a terminal window:

    sudo apt-get update
    sudo apt-get install linux

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Ken (kenshi) wrote :

Upgraded kernel version, just got the same (or similar) issue.

Attached is the system log from the EC2 console.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Ken (kenshi) wrote :

Does this look like a Xen issue as well?

Do you need any other information?

Ken (kenshi) wrote :

Some additional information:

We're running apache2 under moderately high load (~50% CPU utilization) on m1.large instances, which have 2 virtual cores. Python 2.7 runs under mod_wsgi, along with a few node.js services behind a reverse proxy. There is some PHP running under apache as well.

The issue manifests itself once every couple of days.

Stefan Bader (smb) wrote :

Sorry, it took a while to clean out some other issues. I hope I can concentrate on this one now. But first it would be great to know whether this still happens with the recent kernels (3.2.0-26 or -27; -27 needs -proposed enabled). If yes, please add the page fault output again, to see whether it's still the same.

Ken (kenshi) wrote :

Installing 3.2.0-26 on the machines... I'll let it run for a few days and let you know if the problem persists.

Ken (kenshi) wrote :

Same issue with 3.2.0-26. Machine was under moderate load, about 50% utilization.

Attached is the trace.

Stefan Bader (smb) wrote :

Ken, thanks for testing and confirming. So while the overall issue remains the same (the system detects that a page from some cache pool is not properly initialized), the latest trace shows at least that the exact nature of the corruption may not be consistent. This will likely be hard to track down...

Ken (kenshi) wrote :

Got it - we're still seeing these issues about once a day, so let me know if I can help out in any way.

Stefan Bader (smb) wrote :

So there is one thing which may or may not be related. It would be good to gather information about the Xen version this happens on. This can be found in dmesg (grep Xen). While rebooting an instance keeps you on the same version, starting new instances (even within the same region) can cause them to run on a different Xen version.
Even if it is not related, seeing the issue on various Xen versions would at least give a bit more confidence that this is related to the guest kernel. And it would be a waste not to take the chance while this happens every few days.

It is very strange to see this happening only when pulling a page from the freelist. Either the pages get corrupted while on the list (the same tests seem to be done when a page is put back onto the list), or by some weird luck those pages end up on the list wrongly.
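
To check whether a given log really only shows the problem on the allocation side, the reports can be sorted by the code path that hit bad_page(). The following is a minimal sketch, not an official tool. It assumes that prep_new_page in the call trace marks the allocation path and free_pages_prepare marks the free path (both names are taken from traces posted in this report), and the helper name classify_reports is made up for illustration.

```python
import re

# Function names taken from the traces in this report (assumption: they are
# enough to tell the allocation path from the free path).
ALLOC_PATH = "prep_new_page"
FREE_PATH = "free_pages_prepare"

BUG_RE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")
# Matches one call-trace frame; group 1 is set for unreliable '?' frames.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x")


def classify_reports(log_text):
    """Return one (process, pfn, path) tuple per 'Bad page state' report."""
    reports = []
    current = None
    for line in log_text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            current = [bug.group(1), bug.group(2), "unknown"]
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Skip lines outside a report and frames marked with '?'.
        if current is None or frame is None or frame.group(1):
            continue
        if current[2] == "unknown":
            if frame.group(2) == ALLOC_PATH:
                current[2] = "alloc"
            elif frame.group(2) == FREE_PATH:
                current[2] = "free"
    return [tuple(report) for report in reports]
```

Feeding it the saved console log (for example the text of a kern.log file) gives a quick count of how many reports came from each side.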

Ken (kenshi) wrote :

Seems to be occurring on both versions of Xen that we get put on:

Xen version: 3.4.3-2.6.18 (preserve-AD)
Xen version: 3.0.3-rc5-8.1.14.f
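
When several instances are involved, the version lines can be collected from saved dmesg output instead of by hand. This is a rough sketch that assumes each dmesg dump contains a line with "Xen version:" as in the two examples above; the function name xen_versions is hypothetical.

```python
import re
from collections import Counter

# Assumes the hypervisor version appears after the literal "Xen version:".
XEN_RE = re.compile(r"Xen version:\s*(.+?)\s*$")


def xen_versions(dmesg_text):
    """Count how often each Xen version string appears in dmesg output."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XEN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Concatenating the dmesg output of all affected instances and passing it in shows at a glance which hypervisor versions the crashing guests ran on.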

Mark Thornton (mthornton-2) wrote :

We also see this problem on (real) machines running KVM. It may be related to this:

http://marc.info/?l=linux-mm&m=134129723504527&w=2

Stefan Bader (smb) wrote :

If it is related to the patch in the previous comment then this should be fixed when running an Ubuntu kernel 3.2.0-30.47 or higher (currently only in the proposed pocket -> https://launchpad.net/ubuntu/precise/+source/linux).

Ken (kenshi) wrote :

I will install the new kernel version on a few of the machines and test it out.

Thanks,

Ken (kenshi) wrote :

Sorry for the delay... the same issue is occurring, at about the same frequency.

Kernel version, from the pre-proposed PPA: 3.2.0-30.47pre201208200400

Stefan Bader (smb) wrote :

Thanks Ken. So at least we can say it is not related to the issue in comment #20. Interesting/weird stack trace, looks like an oops/panic message runs into a spinlock issue...

Stefan Bader (smb) wrote :

Hm, maybe there is rather a relation to bug #1011792... or it is now...

Mark Thornton (mthornton-2) wrote :

That does look like it has fixed one bug only to uncover a different bug.

Stefan Bader (smb) wrote :

Right, thinking about it, the original issue may actually be solved now (no more bad page state), and we are now running into bug #1011792 instead. At least this other bug now has a reproducer that does not require a production load.

Ken (kenshi) wrote :

Let me know if you want me to try anything else

Justin Dossey (jbd) wrote :

I'm also seeing this bug (almost exactly the original trace) on two physical servers since upgrading to 12.04 LTS. The same machines ran 10.04 LTS without any errors for over a year, and since I'm seeing the same BUG on both servers, I believe it to be related to the 3.2.0 kernel and not the hardware. Notably, the "bad_page.part.61+0x9f/0xf0" line exactly matches the original trace in this bug report.

Generally, the system stays up when this happens, but the baseline load average on the system increases because the apache2 process triggering the bug gets stuck. Stopping apache, kill -9ing all the apache processes which did not exit when stopping apache, and starting apache again brings the load back down to normal.

About every two weeks, the servers become completely unresponsive and must be reset.

Hope this helps find the issue. This bug has prevented us from upgrading any further systems until it is resolved, and we may even have to downgrade these computers to 10.04 until a solution becomes available.

The systems are completely up-to-date with 12.04.1 LTS.

Example from today:

[1309944.336646] BUG: Bad page state in process apache2 pfn:1334cc
[1309944.349965] page:ffffea0004cd3300 count:0 mapcount:0 mapping: (null) index:0x1a9c
[1309944.375260] page flags: 0x200000002001008(uptodate|private_2|0x2000000)
[1309944.388104] Modules linked in: ipt_REJECT xt_tcpudp xt_multiport iptable_filter ip_tables x_tables cachefiles nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext2 vesafb psmouse serio_raw joydev i5100_edac ioatdma dca edac_core mac_hid lp parport pata_it8213 usbhid floppy hid e1000e 3w_9xxx
[1309944.439976] Pid: 11497, comm: apache2 Tainted: G B D 3.2.0-32-generic #51-Ubuntu
[1309944.465951] Call Trace:
[1309944.478272] [<ffffffff8111ebff>] bad_page.part.61+0x9f/0xf0
[1309944.490451] [<ffffffff8111ec68>] bad_page+0x18/0x30
[1309944.502755] [<ffffffff8111f6ee>] free_pages_prepare+0x10e/0x120
[1309944.514555] [<ffffffff8111f859>] free_hot_cold_page+0x49/0x1a0
[1309944.526060] [<ffffffff81012728>] ? __switch_to+0x138/0x360
[1309944.537454] [<ffffffff8111fbd4>] __pagevec_free+0x54/0xd0
[1309944.548556] [<ffffffff816588dc>] ? __schedule+0x3cc/0x6f0
[1309944.559282] [<ffffffff81123c1c>] release_pages+0x24c/0x280
[1309944.569964] [<ffffffff8116f79a>] ? mem_cgroup_add_lru_list+0x1a/0x20
[1309944.580545] [<ffffffff81123da0>] ? pagevec_move_tail+0x40/0x40
[1309944.590917] [<ffffffff81123d2a>] pagevec_lru_move_fn+0xda/0xf0
[1309944.601225] [<ffffffff81123d57>] ____pagevec_lru_add+0x17/0x20
[1309944.611199] [<ffffffff81123fd8>] __lru_cache_add+0x68/0x90
[1309944.620860] [<ffffffff811676f7>] ? __unmap_and_move+0x107/0x270
[1309944.630305] [<ffffffff8112448d>] lru_cache_add_lru+0x2d/0x50
[1309944.639530] [<ffffffff8112a709>] putback_lru_page+0x69/0xe0
[1309944.648441] [<ffffffff811678f4>] unmap_and_move+0x94/0x150
[1309944.657237] [<ffffffff81167bae>] migrate_pages+0x9e/0x140
[1309944.665861] [<ffffffff8115b590>] ? isolate_freepages+0x210/0x210
[1309944.674300] [<ffffffff8115bd91>] compact_zone.part.14+0x121/0x270
[1309944.682777] [<ffffffff8115bfc7>] compact_zone+0x37/0x50
[1309944.691109] [<ffffff...

Changed in linux (Ubuntu):
assignee: nobody → Colin King (colin-king)
Colin Ian King (colin-king) wrote :

It's not really possible to determine much more from these bad page traces. We are getting different kinds of random corruption, for example different invalid page flags values and bad count values. So I think the best way forward is to install a debug kernel that I've built.
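
For anyone comparing their own logs, the count, mapcount and page flags values can be pulled out of the "page:" and "page flags:" lines with a few lines of Python. This is only a sketch based on the line layout seen in the traces in this report; the function name page_states is made up, and other kernel versions may print these lines differently.

```python
import re

# Layout assumed from the traces in this report, e.g.
#   page:ffffea00023a7640 count:0 mapcount:-127 mapping: (null) index:0x7f89dc026
#   page flags: 0x100000000000000()
PAGE_RE = re.compile(r"page:[0-9a-f]+\s+count:(-?\d+)\s+mapcount:(-?\d+)")
FLAGS_RE = re.compile(r"page flags:\s*(0x[0-9a-f]+)\((.*?)\)")


def page_states(log_text):
    """Return a dict of count, mapcount and flags for each reported page."""
    states = []
    pending = None
    for line in log_text.splitlines():
        page = PAGE_RE.search(line)
        if page:
            pending = {
                "count": int(page.group(1)),
                "mapcount": int(page.group(2)),
            }
            continue
        flags = FLAGS_RE.search(line)
        if flags and pending is not None:
            pending["flags"] = int(flags.group(1), 16)
            pending["flag_names"] = flags.group(2)
            states.append(pending)
            pending = None
    return states
```

Grouping the resulting dicts by mapcount or flags makes it easier to see whether two reports show the same kind of corruption.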

The debug kernel has VM debugging enabled which will slow the machine a little, but will add some more sanity checking and perhaps will give us some more information.

The kernel .debs are available here: http://kernel.ubuntu.com/~cking/lp-1007082/

We can take this one step further by trying to capture a kernel crash dump, which I will inspect to see if it provides any further information. Crash dump images strip out a lot of unused data, but can still be rather large (several hundred MB), and there is of course the risk of sharing data in the kernel that you don't want to upload to Launchpad, so the use of crashdump is up to you. Instructions on how to install, enable and trigger a crash dump image are here: https://wiki.ubuntu.com/Kernel/CrashdumpRecipe

You need to install linux-crashdump, reboot, check that crash kernel is loaded and then wait for the problem to manifest itself. Then trigger a crash and then I need to inspect the dump image saved in /var/crash - the notes for these steps are explained in the wiki page mentioned above.
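
As a convenience, the "check that crash kernel is loaded" step and the later look into /var/crash can be scripted. The sketch below assumes the kernel exposes the flag at /sys/kernel/kexec_crash_loaded (reading "1" when a crash kernel is loaded) and that dumps land in /var/crash as described above; both function names are made up, and the wiki page remains the reference for the actual procedure.

```python
from pathlib import Path


def crash_kernel_loaded(flag_path="/sys/kernel/kexec_crash_loaded"):
    """Return True if the sysfs flag says a crash kernel is loaded."""
    path = Path(flag_path)
    return path.exists() and path.read_text().strip() == "1"


def list_crash_files(crash_dir="/var/crash"):
    """Return the sorted names of entries in the crash dump directory."""
    directory = Path(crash_dir)
    if not directory.is_dir():
        return []
    return sorted(entry.name for entry in directory.iterdir())


if __name__ == "__main__":
    print("crash kernel loaded:", crash_kernel_loaded())
    print("files in /var/crash:", list_crash_files())
```

Running it once after the reboot and again after a crash shows whether the dump was actually written before the machine is reset.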

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Ken (kenshi) wrote :

Thanks Colin - been a little busy but I'll hopefully get both installed next week on one or two of the machines.

Justin Dossey (jbd) wrote :

I've installed the crashdump recipe on both of my crashing machines and will report back when I get a crash. After the next crash, I'll install the VM debug kernels and collect the next crash for this bug.

Colin Ian King (colin-king) wrote :

Thanks, let's see what kind of extra debug state we get.

Justin Dossey (jbd) wrote :

Got a crash today (from the non-VM debug kernel), but the crashdump was not written to /var/crash. Next time, I will look harder before resetting the system via IPMI.

This time, the last message written to kern.log before the crash was:

Nov 28 12:47:44 pproxy-04 kernel: [769422.996028] kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:3085!
Nov 28 12:47:44 pproxy-04 kernel: [769422.996028] invalid opcode: 0000 [#6] SMP
Nov 28 12:47:44 pproxy-04 kernel: [769422.996028] CPU 0
Nov 28 12:47:44 pproxy-04 kernel: [769422.996028] Modules linked in: ipt_REJECT xt_tcpudp xt_multiport iptable_filter ip_tables x_tables cachefiles nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext2 vesafb i5100_edac psmouse edac_core ioatdma dca serio_raw joydev mac_hid lp parport e1000e usbhid pata_it8213 hid floppy 3w_9xxx
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363]
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363] Pid: 26574, comm: kworker/u:7 Tainted: G B D 3.2.0-33-generic #52-Ubuntu Supermicro X7DCL/X7DCL
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363] RIP: 0010:[<ffffffff811a84bb>] [<ffffffff811a84bb>] drop_buffers+0xab/0xb0
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363] RSP: 0018:ffff88000875f630 EFLAGS: 00010246
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363] RAX: 0200000002001009 RBX: ffffea0004863e40 RCX: 0000000000000024

Bryan Quigley (bryanquigley) wrote :

Are you still experiencing this crash?

Justin Dossey (jbd) wrote :

I continued to experience the crash until I disabled fsc on my NFS mounts. After fsc was disabled, the servers have not crashed once.

Stefan Bader (smb) wrote :

Just discussed this with Colin, somehow (given the hint about fsc), this change in v3.0 sounds suspiciously like it could be fixing the issue:

commit c902ce1bfb40d8b049bd2319b388b4b68b04bc27
Author: David Howells <email address hidden>
Date: Thu Jul 7 12:19:48 2011 +0100

    FS-Cache: Add a helper to bulk uncache pages on an inode

    Add an FS-Cache helper to bulk uncache pages on an inode. This will
    only work for the circumstance where the pages in the cache correspond
    1:1 with the pages attached to an inode's page cache.

    This is required for CIFS and NFS: When disabling inode cookie, we were
    returning the cookie and setting cifsi->fscache to NULL but failed to
    invalidate any previously mapped pages. This resulted in "Bad page
    state" errors and manifested in other kind of errors when running
    fsstress. Fix it by uncaching mapped pages when we disable the inode
    cookie.

    This patch should fix the following oops and "Bad page state" errors
    seen during fsstress testing.

Justin, if we provided a test kernel, would you be able to give that a try?

Changed in linux (Ubuntu):
assignee: Colin King (colin-king) → Stefan Bader (stefan-bader-canonical)
Justin Dossey (jbd) wrote :

Yes, I can try a test kernel.

I can try out a test kernel too.
On May 15, 2013 11:56 AM, "Justin Dossey" <email address hidden>
wrote:

> Yes, I can try a test kernel.

Stefan Bader (smb) wrote :

Going back to this, I realized the hinted patch cannot be the fix, as it went in with 3.0 and this bug is about 3.2 kernels (somehow the LTS releases got a bit mixed up). Anyway, there actually are two other changes that came in after 3.2 and are about fscache and bad pages:

#1: CacheFiles: Fix the marking of cached pages
#2: NFS: nfs_migrate_page() does not wait for FS-Cache to finish with a page

While #1 seems to contain the same top level call (get_page_from_freelist), #2 seems to be more related to NFS. So I put two versions of kernels to [1]: v1 only has #1 and v2 has both #1 and #2. So when testing, try v1 first and if that is sufficient we can ignore v2.

[1] http://people.canonical.com/~smb/lp1007082/
