Bug #1007082 “BUG: Bad page state in process node pfn:8e9d9” : Bugs : linux package : Ubuntu

Brad Figg (brad-figg) wrote on 2012-05-31: Missing required logs.

#1

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1007082

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete
tags:	added: precise

Ken (kenshi) wrote on 2012-05-31: BootDmesg.txt

#2

BootDmesg.txt (16.1 KiB, text/plain)

apport information

tags:	added: apport-collected ec2-images
description:	updated

Ken (kenshi) wrote on 2012-05-31: CurrentDmesg.txt

#3

CurrentDmesg.txt (568 bytes, text/plain)

apport information

Ken (kenshi) wrote on 2012-05-31: ProcCpuinfo.txt

#4

ProcCpuinfo.txt (1.3 KiB, text/plain)

apport information

Ken (kenshi) wrote on 2012-05-31: ProcInterrupts.txt

#5

ProcInterrupts.txt (1.6 KiB, text/plain)

apport information

Ken (kenshi) wrote on 2012-05-31: UdevDb.txt

#6

UdevDb.txt (34.1 KiB, text/plain)

apport information

Ken (kenshi) wrote on 2012-05-31: UdevLog.txt

#7

UdevLog.txt (82.6 KiB, text/plain)

apport information

Ken (kenshi) wrote on 2012-05-31: WifiSyslog.txt

#8

WifiSyslog.txt (27.0 KiB, text/plain)

apport information

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Joseph Salisbury (jsalisbury) wrote on 2012-06-05:

#9

We have noted that there is a newer version of the kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

You can update to the latest kernel for this release by simply running the following commands in a terminal window:

sudo apt-get update
sudo apt-get install linux

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

Changed in linux (Ubuntu):
importance:	Undecided → Medium
status:	Confirmed → Incomplete

Ken (kenshi) wrote on 2012-06-07:

#10

system_log.txt (15.1 KiB, text/plain)

Upgraded kernel version, just got the same (or similar) issue.

Attached is the system log from the EC2 console.

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Ken (kenshi) wrote on 2012-06-11:

#11

Does this look like a Xen issue as well?

Do you need any other information?

Ken (kenshi) wrote on 2012-06-12:

#12

Some additional information:

We're running apache2 under moderately high load (~50% CPU utilization) on m1.large instances, which have 2 virtual cores. Python 2.7 runs under mod_wsgi, with a few node.js services behind a reverse proxy. There is some PHP running under apache as well.

The issue manifests itself once every couple days.

Stefan Bader (smb) wrote on 2012-07-10:

#13

Sorry, it took a while to clean out some other issues. I hope I can concentrate on this one now. But first it would be great to know whether this still happens with the recent kernels (3.2.0-26 or -27; -27 needs -proposed enabled). If yes, please add the page fault output again (to see whether it's still the same).

Ken (kenshi) wrote on 2012-07-11:

#14

Installing 3.2.0-26 on the machines... I'll let it run for a few days and let you know if the problem persists.

Ken (kenshi) wrote on 2012-07-12:

#15

trace.txt (19.7 KiB, text/plain)

Same issue with 3.2.0-26. Machine was under moderate load, about 50% utilization.

Attached is the trace.

Stefan Bader (smb) wrote on 2012-07-13:

#16

Ken, thanks for testing and confirming. So while the overall issue remains the same (system detects that a page from some cache pool is not properly initialized), the latest trace shows at least that the exact nature of the corruption may not be consistent. This likely will be hard to get down to...

Ken (kenshi) wrote on 2012-07-16:

#17

Got it - we're still seeing these issues about once a day, so let me know if I can help out in any way.

Stefan Bader (smb) wrote on 2012-07-20:

#18

So there is one thing which may or may not be related. It would be good to gather information about the Xen version this happens on. This can be found in dmesg (grep Xen). While rebooting an instance keeps you on the same version, starting new instances (even within the same region) can cause it to be run on a different Xen version.
Even if it is not related, it would at least give a bit more confidence that this is guest kernel related if various versions show the issue. And it would be a waste not to take the chance while this happens every few days.

It is very strange to see this only happening on pulling a page from the freelist. Either things get corrupted while on it (since there seem to be the same tests done when giving a page back into the list) or by some weird luck those pages end up on the list wrongly.
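
The Xen version check suggested above can be scripted so that it is easy to repeat on every new instance. This is only a minimal sketch: it assumes the guest kernel logs the hypervisor version on a line containing "Xen version", and the helper name xen_version_lines is made up for illustration.

```shell
# Print the Xen version line(s) found in kernel log text read from stdin,
# with the leading "[timestamp]" prefix removed.
# Hypothetical helper; the "Xen version" pattern is an assumption about how
# the guest kernel reports the hypervisor version.
xen_version_lines() {
    grep -i 'Xen version' | sed 's/^\[[^]]*] *//'
}

# Typical use on a running instance:
#   dmesg | xen_version_lines
```

Recording this output together with each crash report makes it possible to tell whether the failures cluster on one hypervisor version.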

Ken (kenshi) wrote on 2012-07-20:

#19

Seems to be occurring on both versions of Xen that we get put on:

Xen version: 3.4.3-2.6.18 (preserve-AD)
Xen version: 3.0.3-rc5-8.1.14.f

Mark Thornton (mthornton-2) wrote on 2012-09-03:

#20

We also see this problem on (real) machines running KVM. It may be related to this:

http://marc.info/?l=linux-mm&m=134129723504527&w=2

Stefan Bader (smb) wrote on 2012-09-04:

#21

If it is related to the patch in the previous comment, then this should be fixed when running an Ubuntu kernel 3.2.0-30.47 or higher (currently only in the proposed pocket -> https://launchpad.net/ubuntu/precise/+source/linux).

Ken (kenshi) wrote on 2012-09-04:

#22

I will install the new kernel version on a few of the machines and test it out.

Thanks,

Ken (kenshi) wrote on 2012-09-19:

#23

crash log from EC2 console (47.2 KiB, text/plain)

Sorry for the delay... the same issue is occurring, at about the same frequency.

Kernel version, from the pre-proposed PPA: 3.2.0-30.47pre201208200400

Stefan Bader (smb) wrote on 2012-09-20:

#24

Thanks Ken. So at least we can say it is not related to the issue in comment #20. Interesting/weird stack trace; it looks like an oops/panic message runs into a spinlock issue...

Stefan Bader (smb) wrote on 2012-09-20:

#25

Hm, maybe there is rather a relation to bug #1011792... or it is now...

Mark Thornton (mthornton-2) wrote on 2012-09-20:

#26

That does look like it has fixed one bug only to uncover a different bug.

Stefan Bader (smb) wrote on 2012-09-20:

#27

Right, thinking of it, it might be that it was actually the issue that is now solved (no more bad page state) but now running into bug #1011792. At least this other bug now has a reproducer that does not require a production load.

Ken (kenshi) wrote on 2012-10-04:

#28

Let me know if you want me to try anything else

Justin Dossey (jbd) wrote on 2012-11-13:

#29

I'm also seeing this bug (almost exactly the original trace) on two physical servers since upgrading to 12.04 LTS.  The same machines ran 10.04 LTS without any errors for over a year, and since I'm seeing the same BUG on both servers, I believe it to be related to the 3.2.0 kernel and not the hardware.  Notably, the "bad_page.part.61+0x9f/0xf0" line exactly matches the original trace in this bug report.

Generally, the system stays up when this happens, but the baseline load average on the system increases because the apache2 process triggering the bug gets stuck.  Stopping apache, kill -9ing all the apache processes which did not exit when stopping apache, and starting apache again brings the load back down to normal.

About every two weeks, the servers become completely unresponsive and must be reset.

Hope this helps find the issue.  This bug has prevented us from upgrading any further systems until it is resolved, and we may even have to downgrade these computers to 10.04 until a solution becomes available.

The systems are completely up-to-date with 12.04.1 LTS.

Example from today:

[1309944.336646] BUG: Bad page state in process apache2  pfn:1334cc
[1309944.349965] page:ffffea0004cd3300 count:0 mapcount:0 mapping:          (null) index:0x1a9c
[1309944.375260] page flags: 0x200000002001008(uptodate|private_2|0x2000000)
[1309944.388104] Modules linked in: ipt_REJECT xt_tcpudp xt_multiport iptable_filter ip_tables x_tables cachefiles nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext2 vesafb psmouse serio_raw joydev i5100_edac ioatdma dca edac_core mac_hid lp parport pata_it8213 usbhid floppy hid e1000e 3w_9xxx
[1309944.439976] Pid: 11497, comm: apache2 Tainted: G    B D      3.2.0-32-generic #51-Ubuntu
[1309944.465951] Call Trace:
[1309944.478272]  [<ffffffff8111ebff>] bad_page.part.61+0x9f/0xf0
[1309944.490451]  [<ffffffff8111ec68>] bad_page+0x18/0x30
[1309944.502755]  [<ffffffff8111f6ee>] free_pages_prepare+0x10e/0x120
[1309944.514555]  [<ffffffff8111f859>] free_hot_cold_page+0x49/0x1a0
[1309944.526060]  [<ffffffff81012728>] ? __switch_to+0x138/0x360
[1309944.537454]  [<ffffffff8111fbd4>] __pagevec_free+0x54/0xd0
[1309944.548556]  [<ffffffff816588dc>] ? __schedule+0x3cc/0x6f0
[1309944.559282]  [<ffffffff81123c1c>] release_pages+0x24c/0x280
[1309944.569964]  [<ffffffff8116f79a>] ? mem_cgroup_add_lru_list+0x1a/0x20
[1309944.580545]  [<ffffffff81123da0>] ? pagevec_move_tail+0x40/0x40
[1309944.590917]  [<ffffffff81123d2a>] pagevec_lru_move_fn+0xda/0xf0
[1309944.601225]  [<ffffffff81123d57>] ____pagevec_lru_add+0x17/0x20
[1309944.611199]  [<ffffffff81123fd8>] __lru_cache_add+0x68/0x90
[1309944.620860]  [<ffffffff811676f7>] ? __unmap_and_move+0x107/0x270
[1309944.630305]  [<ffffffff8112448d>] lru_cache_add_lru+0x2d/0x50
[1309944.639530]  [<ffffffff8112a709>] putback_lru_page+0x69/0xe0
[1309944.648441]  [<ffffffff811678f4>] unmap_and_move+0x94/0x150
[1309944.657237]  [<ffffffff81167bae>] migrate_pages+0x9e/0x140
[1309944.665861]  [<ffffffff8115b590>] ? isolate_freepages+0x210/0x210
[1309944.674300]  [<ffffffff8115bd91>] compact_zone.part.14+0x121/0x270
[1309944.682777]  [<ffffffff8115bfc7>] compact_zone+0x37/0x50
[1309944.691109]  [<ffffffff8115c153>] compact_zone_order+0x83/0xb0
[1309944.699220]  [<ffffffff8115c24d>] try_to_compact_pages+0xcd/0x100
[1309944.707068]  [<ffffffff81645796>] __alloc_pages_direct_compact+0xb2/0x170
[1309944.714904]  [<ffffffff811208a5>] __alloc_pages_nodemask+0x535/0x8f0
[1309944.722401]  [<ffffffff81157ce6>] alloc_pages_current+0xb6/0x120
[1309944.729869]  [<ffffffff81160c8d>] allocate_slab+0x13d/0x1a0
[1309944.737041]  [<ffffffff81160d20>] new_slab+0x30/0x180
[1309944.743957]  [<ffffffff81647199>] __slab_alloc+0x165/0x269
[1309944.750992]  [<ffffffff81218c26>] ? ext4_get_block+0x16/0x20
[1309944.757996]  [<ffffffffa019ce80>] ? nfs_readdata_alloc+0x20/0xa0 [nfs]
[1309944.765108]  [<ffffffffa019ce80>] ? nfs_readdata_alloc+0x20/0xa0 [nfs]
[1309944.771973]  [<ffffffff81164666>] kmem_cache_alloc+0x136/0x140
[1309944.779081]  [<ffffffffa019a9d1>] ? nfs_create_request+0x41/0x160 [nfs]
[1309944.786319]  [<ffffffffa019cbb0>] ? nfs_return_empty_page+0x70/0x70 [nfs]
[1309944.793434]  [<ffffffffa019ce80>] nfs_readdata_alloc+0x20/0xa0 [nfs]
[1309944.800585]  [<ffffffffa019cf34>] nfs_pagein_one+0x34/0x200 [nfs]
[1309944.807563]  [<ffffffffa019a9d1>] ? nfs_create_request+0x41/0x160 [nfs]
[1309944.814653]  [<ffffffffa019cbb0>] ? nfs_return_empty_page+0x70/0x70 [nfs]
[1309944.821932]  [<ffffffffa019d5f8>] nfs_generic_pagein+0x18/0x30 [nfs]
[1309944.829008]  [<ffffffffa019d639>] nfs_generic_pg_readpages+0x29/0xa0 [nfs]
[1309944.836234]  [<ffffffffa019a742>] __nfs_pageio_add_request+0x22/0xb0 [nfs]
[1309944.843457]  [<ffffffffa019ad03>] nfs_pageio_add_request+0x23/0x40 [nfs]
[1309944.850815]  [<ffffffffa019cc33>] readpage_async_filler+0x83/0x130 [nfs]
[1309944.858064]  [<ffffffffa019cbb0>] ? nfs_return_empty_page+0x70/0x70 [nfs]
[1309944.865326]  [<ffffffff81122e1a>] read_cache_pages+0xba/0x120
[1309944.872611]  [<ffffffffa019db31>] nfs_readpages+0x131/0x1a0 [nfs]
[1309944.880134]  [<ffffffff81122a78>] read_pages+0x48/0x100
[1309944.887456]  [<ffffffff81122c93>] __do_page_cache_readahead+0x163/0x180
[1309944.894820]  [<ffffffff81123001>] ra_submit+0x21/0x30
[1309944.902272]  [<ffffffff81123125>] ondemand_readahead+0x115/0x230
[1309944.909891]  [<ffffffff8152edfd>] ? release_sock+0x6d/0x80
[1309944.917433]  [<ffffffff811232c8>] page_cache_async_readahead+0x88/0xb0
[1309944.924981]  [<ffffffff813108fe>] ? radix_tree_lookup_slot+0xe/0x10
[1309944.932569]  [<ffffffff81117b6e>] ? find_get_page+0x1e/0x90
[1309944.940263]  [<ffffffff811184a9>] do_generic_file_read.constprop.33+0x269/0x440
[1309944.956115]  [<ffffffff8111941f>] generic_file_aio_read+0xef/0x280
[1309944.964185]  [<ffffffff81528ec2>] ? alloc_sock_iocb+0x12/0x60
[1309944.972206]  [<ffffffff8152a443>] ? sock_aio_write+0x63/0x90
[1309944.980355]  [<ffffffffa018ee49>] nfs_file_read+0x89/0x100 [nfs]
[1309944.988937]  [<ffffffff8117792a>] do_sync_read+0xda/0x120
[1309944.997074]  [<ffffffff8129d5f3>] ? security_file_permission+0x93/0xb0
[1309945.005354]  [<ffffffff81177db1>] ? rw_verify_area+0x61/0xf0
[1309945.013604]  [<ffffffff81178290>] vfs_read+0xb0/0x180
[1309945.021840]  [<ffffffff811783aa>] sys_read+0x4a/0x90
[1309945.029904]  [<ffffffff81663442>] system_call_fastpath+0x16/0x1b

Colin Ian King (colin-king) on 2012-11-14

Changed in linux (Ubuntu):
assignee:	nobody → Colin King (colin-king)

Colin Ian King (colin-king) wrote on 2012-11-15:

#30

It's not really possible to determine much more from these bad page traces: we are getting different kinds of random corruption, for example different invalid page flags values and bad count values. So I think the best way forward is to install a debug kernel that I've built.
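
To compare the corruption across reports, it can help to pull the process name, pfn and raw page flags value out of each "Bad page state" report. The following is a rough sketch that assumes the log format seen in the traces posted to this bug; the function name bad_page_summary is made up for illustration and is not part of any kernel tooling.

```shell
# Summarise "Bad page state" reports from kernel log text on stdin.
# Prints one line per report: process name, pfn, raw page flags value.
bad_page_summary() {
    awk '
        /BUG: Bad page state in process/ {
            proc = ""; pfn = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "process") proc = $(i + 1)
                if ($i ~ /^pfn:/)    pfn = substr($i, 5)
            }
        }
        /page flags:/ {
            flags = $NF
            sub(/\(.*/, "", flags)   # drop the decoded "(name|name)" part
            if (proc != "") print proc, pfn, flags
            proc = ""
        }
    '
}

# Typical use:
#   dmesg | bad_page_summary
#   bad_page_summary < /var/log/kern.log
```

Sorting the output by the flags column gives a quick view of whether the same bits are wrong each time or whether the values really are random.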

The debug kernel has VM debugging enabled which will slow the machine a little, but will add some more sanity checking and perhaps will give us some more information.

The kernel .debs are available here: http://kernel.ubuntu.com/~cking/lp-1007082/

We can take this one step further by trying to capture a kernel crash dump, which I will inspect to see if it provides any further information. Crash dump images strip out a lot of unused data, but can be rather large (several hundred MB), and there is of course the risk of sharing data in the kernel that you don't want to upload to Launchpad, so the use of crashdump is up to you. Instructions on how to install, enable and trigger a crash dump image are here: https://wiki.ubuntu.com/Kernel/CrashdumpRecipe

You need to install linux-crashdump, reboot, check that crash kernel is loaded and then wait for the problem to manifest itself. Then trigger a crash and then I need to inspect the dump image saved in /var/crash - the notes for these steps are explained in the wiki page mentioned above.
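
The "check that crash kernel is loaded" step can be done from a shell. This is a small sketch, assuming the kernel exposes the kexec state in /sys/kernel/kexec_crash_loaded (containing 1 when a crash kernel is loaded); the helper name is illustrative and the wiki page remains the authoritative recipe.

```shell
# Succeed (exit status 0) if a crash (kdump) kernel is loaded.
# The optional argument overrides the state file, which is handy for testing;
# the default is the sysfs file provided by kexec-enabled kernels.
crash_kernel_loaded() {
    state_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}

# Typical sequence (shown as comments because the commands need root and the
# last one deliberately crashes the machine):
#   sudo apt-get install linux-crashdump    # then reboot
#   crash_kernel_loaded && echo "crash kernel loaded"
#   echo c | sudo tee /proc/sysrq-trigger   # force a crash once the bug has shown up
```

If the check fails after the reboot, a forced crash will not produce a dump in /var/crash, so it is worth confirming before waiting for the problem to reappear.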

Changed in linux (Ubuntu):
status:	Confirmed → Incomplete

Ken (kenshi) wrote on 2012-11-21:

#31

Thanks Colin - been a little busy but I'll hopefully get both installed next week on one or two of the machines.

Justin Dossey (jbd) wrote on 2012-11-21:

#32

I've installed the crashdump recipe on both of my crashing machines and will report back when I get a crash. After the next crash, I'll install the VM debug kernels and collect the next crash for this bug.

Colin Ian King (colin-king) wrote on 2012-11-22:

#33

Thanks, let's see what kind of extra debug state we get.

Justin Dossey (jbd) wrote on 2012-11-28:

#34

Got a crash today (from the non-VM debug kernel), but the crashdump was not written to /var/crash. Next time, I will look harder before resetting the system via IPMI.

This time, I do see in the kern.log as the last message written before the crash:

Nov 28 12:47:44 pproxy-04 kernel: [769422.996028] kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:3085!
Nov 28 12:47:44 pproxy-04 kernel: [769422.996028] invalid opcode: 0000 [#6] SMP
Nov 28 12:47:44 pproxy-04 kernel: [769422.996028] CPU 0
Nov 28 12:47:44 pproxy-04 kernel: [769422.996028] Modules linked in: ipt_REJECT xt_tcpudp xt_multiport iptable_filter ip_tables x_tables cachefiles nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext2 vesafb i5100_edac psmouse edac_core ioatdma dca serio_raw joydev mac_hid lp parport e1000e usbhid pata_it8213 hid floppy 3w_9xxx
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363]
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363] Pid: 26574, comm: kworker/u:7 Tainted: G B D 3.2.0-33-generic #52-Ubuntu Supermicro X7DCL/X7DCL
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363] RIP: 0010:[<ffffffff811a84bb>] [<ffffffff811a84bb>] drop_buffers+0xab/0xb0
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363] RSP: 0018:ffff88000875f630 EFLAGS: 00010246
Nov 28 12:47:44 pproxy-04 kernel: [769423.019363] RAX: 0200000002001009 RBX: ffffea0004863e40 RCX: 0000000000000024

Bryan Quigley (bryanquigley) wrote on 2013-05-08:

#35

Are you still experiencing this crash?

Justin Dossey (jbd) wrote on 2013-05-09:

#36

I continued to experience the crash until I disabled fsc on my NFS mounts. After fsc was disabled, the servers have not crashed once.
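
For anyone wanting to check whether their own machines use FS-Cache on NFS, the fsc option shows up in the mount options. This is a sketch that lists such mounts, assuming the usual /proc/mounts field layout (device, mount point, type, options); the helper name nfs_fsc_mounts is hypothetical.

```shell
# List mount points of NFS mounts that use FS-Cache (the "fsc" option).
# Reads /proc/mounts-style text on stdin: device mountpoint fstype options ...
nfs_fsc_mounts() {
    awk '$3 ~ /^nfs4?$/ && ("," $4 ",") ~ /,fsc,/ { print $2 }'
}

# Typical use:
#   nfs_fsc_mounts < /proc/mounts
```

Disabling fsc then amounts to removing the option from /etc/fstab (or from the mount command) and unmounting and mounting the share again.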

Stefan Bader (smb) wrote on 2013-05-15:

#37

Just discussed this with Colin, somehow (given the hint about fsc), this change in v3.0 sounds suspiciously like it could be fixing the issue:

commit c902ce1bfb40d8b049bd2319b388b4b68b04bc27
Author: David Howells <email address hidden>
Date: Thu Jul 7 12:19:48 2011 +0100

    FS-Cache: Add a helper to bulk uncache pages on an inode

    Add an FS-Cache helper to bulk uncache pages on an inode. This will
    only work for the circumstance where the pages in the cache correspond
    1:1 with the pages attached to an inode's page cache.

    This is required for CIFS and NFS: When disabling inode cookie, we were
    returning the cookie and setting cifsi->fscache to NULL but failed to
    invalidate any previously mapped pages. This resulted in "Bad page
    state" errors and manifested in other kind of errors when running
    fsstress. Fix it by uncaching mapped pages when we disable the inode
    cookie.

    This patch should fix the following oops and "Bad page state" errors
    seen during fsstress testing.

Justin, if we provided a test kernel, would you be able to give that a try?

Changed in linux (Ubuntu):
assignee:	Colin King (colin-king) → Stefan Bader (stefan-bader-canonical)

Justin Dossey (jbd) wrote on 2013-05-15:

#38

Yes, I can try a test kernel.

Ken (kenshi) wrote on 2013-05-15: Re: [Bug 1007082] Re: BUG: Bad page state in process node pfn:8e9d9

#39

I can try out a test kernel too.
On May 15, 2013 11:56 AM, "Justin Dossey" <1007082@bugs.launchpad.net>
wrote:

> Yes, I can try a test kernel.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1007082
>
> Title:
>   BUG: Bad page state in process node  pfn:8e9d9
>
> Status in “linux” package in Ubuntu:
>   Incomplete
>
> Bug description:
>   Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-virtual x86_64)
>
>   Running Ubuntu 12.04 on 64bit EC2 instance...  occasionally the
>   instance becomes unresponsive and requires a reboot.  Here is the
>   System Log from the EC2 console:
>
>   [16357652.971938] BUG: Bad page state in process node  pfn:8e9d9
>   [16357652.971947] page:ffffea00023a7640 count:0 mapcount:-127 mapping:
>        (null) index:0x7f89dc026
>   [16357652.971954] page flags: 0x100000000000000()
>   [16357652.971960] Modules linked in: isofs acpiphp
>   [16357652.971970] Pid: 14135, comm: node Tainted: G    B D
>  3.2.0-23-virtual #36-Ubuntu
>   [16357652.971976] Call Trace:
>   [16357652.971988]  [<ffffffff8111c19f>] bad_page.part.61+0x9f/0xf0
>   [16357652.971994]  [<ffffffff8111c208>] bad_page+0x18/0x30
>   [16357652.972000]  [<ffffffff8111d3c5>] prep_new_page+0x1d5/0x1e0
>   [16357652.972008]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
>   [16357652.972017]  [<ffffffff8113204f>] ? __inc_zone_state+0x5f/0x70
>   [16357652.972023]  [<ffffffff8111d59f>]
> get_page_from_freelist+0x1cf/0x540
>   [16357652.972031]  [<ffffffff8100a25d>] ?
> xen_force_evtchn_callback+0xd/0x10
>   [16357652.972038]  [<ffffffff8111dba9>]
> __alloc_pages_nodemask+0x109/0x800
>   [16357652.972044]  [<ffffffff81005001>] ? xen_mc_extend_args+0x111/0x150
>   [16357652.972051]  [<ffffffff8100a25d>] ?
> xen_force_evtchn_callback+0xd/0x10
>   [16357652.972059]  [<ffffffff8116b6c0>] ?
> __mem_cgroup_commit_charge+0x70/0xc0
>   [16357652.972066]  [<ffffffff81006739>] ? pte_mfn_to_pfn+0x89/0xf0
>   [16357652.972075]  [<ffffffff8115672a>] alloc_pages_vma+0x9a/0x150
>   [16357652.972081]  [<ffffffff81136f5c>]
> do_anonymous_page.isra.38+0x7c/0x2f0
>   [16357652.972088]  [<ffffffff8113abc1>] handle_pte_fault+0x1e1/0x200
>   [16357652.972094]  [<ffffffff810067be>] ? xen_pmd_val+0xe/0x10
>   [16357652.972100]  [<ffffffff81005209>] ?
> __raw_callee_save_xen_pmd_val+0x11/0x1e
>   [16357652.972108]  [<ffffffff8113af98>] handle_mm_fault+0x1f8/0x350
>   [16357652.972116]  [<ffffffff81658ddb>] do_page_fault+0x14b/0x520
>   [16357652.972122]  [<ffffffff81140e08>] ? do_mmap_pgoff+0x348/0x360
>   [16357652.972129]  [<ffffffff81140f75>] ? sys_mmap_pgoff+0x155/0x230
>   [16357652.972135]  [<ffffffff81655a35>] page_fault+0x25/0x30
>   [16357652.972141] BUG: Bad page state in process node  pfn:593da
>   [16357652.972146] page:ffffea000164f680 count:0 mapcount:-127 mapping:
>        (null) index:0x7f89dc027
>   [16357652.972208] page flags: 0x100000000000000()
>   [16357652.972213] Modules linked in: isofs acpiphp
>   [16357652.972221] Pid: 14135, comm: node Tainted: G    B D
>  3.2.0-23-virtual #36-Ubuntu
>   [16357652.972227] Call Trace:
>   [16357652.972232]  [<ffffffff8111c19f>] bad_page.part.61+0x9f/0xf0
>   [16357652.972238]  [<ffffffff8111c208>] bad_page+0x18/0x30
>   [16357652.972244]  [<ffffffff8111d3c5>] prep_new_page+0x1d5/0x1e0
>   [16357652.972251]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
>   [16357652.972257]  [<ffffffff8113204f>] ? __inc_zone_state+0x5f/0x70
>   [16357652.972264]  [<ffffffff8111d59f>]
> get_page_from_freelist+0x1cf/0x540
>   [16357652.972271]  [<ffffffff8100a25d>] ?
> xen_force_evtchn_callback+0xd/0x10
>   [16357652.972278]  [<ffffffff8111dba9>]
> __alloc_pages_nodemask+0x109/0x800
>   [16357652.972284]  [<ffffffff81005001>] ? xen_mc_extend_args+0x111/0x150
>   [16357652.972291]  [<ffffffff8100a25d>] ?
> xen_force_evtchn_callback+0xd/0x10
>   [16357652.972298]  [<ffffffff8116b6c0>] ?
> __mem_cgroup_commit_charge+0x70/0xc0
>   [16357652.972305]  [<ffffffff81006739>] ? pte_mfn_to_pfn+0x89/0xf0
>   [16357652.972311]  [<ffffffff8115672a>] alloc_pages_vma+0x9a/0x150
>   [16357652.972318]  [<ffffffff81136f5c>]
> do_anonymous_page.isra.38+0x7c/0x2f0
>   [16357652.972325]  [<ffffffff8113abc1>] handle_pte_fault+0x1e1/0x200
>   [16357652.972331]  [<ffffffff810067be>] ? xen_pmd_val+0xe/0x10
>   [16357652.972337]  [<ffffffff81005209>] ?
> __raw_callee_save_xen_pmd_val+0x11/0x1e
>   [16357652.972345]  [<ffffffff8113af98>] handle_mm_fault+0x1f8/0x350
>   [16357652.972351]  [<ffffffff81658ddb>] do_page_fault+0x14b/0x520
>   [16357652.972358]  [<ffffffff81140e08>] ? do_mmap_pgoff+0x348/0x360
>   [16357652.972364]  [<ffffffff81140f75>] ? sys_mmap_pgoff+0x155/0x230
>   [16357652.972371]  [<ffffffff81655a35>] page_fault+0x25/0x30
>   [16357652.972377] BUG: Bad page state in process node  pfn:58a7f
>   [16357652.972382] page:ffffea0001629fc0 count:0 mapcount:-127 mapping:
>        (null) index:0x7f89dc028
>   [16357652.972389] page flags: 0x100000000000000()
>   ---
>   AcpiTables:
>
>   AlsaDevices:
>    total 0
>    crw-rw---T 1 root audio 116,  1 May 31 15:39 seq
>    crw-rw---T 1 root audio 116, 33 May 31 15:39 timer
>   AplayDevices: Error: [Errno 2] No such file or directory
>   ApportVersion: 2.0.1-0ubuntu5
>   Architecture: amd64
>   ArecordDevices: Error: [Errno 2] No such file or directory
>   AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq',
> '/dev/snd/timer'] failed with exit code 1:
>   CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1:
> nl80211 not found.
>   DistroRelease: Ubuntu 12.04
>   Ec2AMI: ami-563b9d3f
>   Ec2AMIManifest: (unknown)
>   Ec2AvailabilityZone: us-east-1a
>   Ec2InstanceType: m1.large
>   Ec2Kernel: aki-825ea7eb
>   Ec2Ramdisk: unavailable
>   IwConfig:
>    lo        no wireless extensions.
>
>    eth0      no wireless extensions.
>   Lspci:
>
>   Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to
> initialize libusb: -99
>   Package: linux (not installed)
>   PciMultimedia:
>
>   ProcEnviron:
>    TERM=xterm
>    LANG=en_US.UTF-8
>    SHELL=/bin/bash
>   ProcFB:
>
>   ProcKernelCmdLine: root=LABEL=cloudimg-rootfs ro console=hvc0
>   ProcModules:
>    acpiphp 24231 0 - Live 0x0000000000000000
>    isofs 40257 0 - Live 0x0000000000000000
>   ProcVersionSignature: User Name 3.2.0-23.36-virtual 3.2.14
>   RelatedPackageVersions:
>    linux-restricted-modules-3.2.0-23-virtual N/A
>    linux-backports-modules-3.2.0-23-virtual  N/A
>    linux-firmware                            1.79
>   RfKill: Error: [Errno 2] No such file or directory
>   Tags:  precise ec2-images
>   Uname: Linux 3.2.0-23-virtual x86_64
>   UpgradeStatus: No upgrade log present (probably fresh install)
>   UserGroups: adm admin audio cdrom dialout dip floppy netdev plugdev video
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1007082/+subscriptions
>

Stefan Bader (smb) wrote on 2013-05-22:

#40

Going back to this, I realized the hinted patch cannot be right, as it went in with 3.0 and this bug is about 3.2 kernels (somehow the LTS releases got a bit mixed up). Anyway, there actually are two other changes that came in after 3.2 and are about fscache and bad pages:

#1: CacheFiles: Fix the marking of cached pages
#2: NFS: nfs_migrate_page() does not wait for FS-Cache to finish with a page

While #1 seems to contain the same top level call (get_page_from_freelist), #2 seems to be more related to NFS. So I put two versions of kernels to [1]: v1 only has #1 and v2 has both #1 and #2. So when testing, try v1 first and if that is sufficient we can ignore v2.

[1] http://people.canonical.com/~smb/lp1007082/
