page allocation failure on machines under heavy network load

Bug #648721 reported by Craig Box
24
This bug affects 4 people
Affects Status Importance Assigned to Milestone
linux-ec2 (Ubuntu)
Invalid
Low
Unassigned

Bug Description

Tuning option:

The error messages seen below indicate failure to allocate a 32K chunk of memory (order 5 allocation = 2^5K), the valid function calls in the stack trace lead to the network stack. The problem mainly is that the virtual network devices allocate and free buffers quicker than the vm subsystem can keep up with. Increasing the amount of memory tried to keep free for allocations (/proc/sys/vm/min_free_kbytes) avoids the messages.

---

This error occurs in many processes running on our web servers: apache2, glusterfsd, kswapd0, rsyslogd, swapper and zabbix_agentd today.

Sep 27 07:39:19 web3 kernel: [2909159.608929] __ratelimit: 35 callbacks suppressed
Sep 27 07:39:19 web3 kernel: [2909159.608933] swapper: page allocation failure. order:5, mode:0x20
Sep 27 07:39:19 web3 kernel: [2909159.608937] Pid: 0, comm: swapper Not tainted 2.6.32-305-ec2 #9-Ubuntu
Sep 27 07:39:19 web3 kernel: [2909159.608939] Call Trace:
Sep 27 07:39:19 web3 kernel: [2909159.608948] [<c052840d>] ? printk+0x18/0x1b
Sep 27 07:39:19 web3 kernel: [2909159.608960] [<c019be6e>] __alloc_pages_slowpath+0x46e/0x4a0
Sep 27 07:39:19 web3 kernel: [2909159.608963] [<c019bfed>] __alloc_pages_nodemask+0x14d/0x160
Sep 27 07:39:19 web3 kernel: [2909159.608968] [<c01c4e65>] T.685+0xe5/0x3a0
Sep 27 07:39:19 web3 kernel: [2909159.608971] [<c01c531c>] cache_alloc_refill+0x1fc/0x230
Sep 27 07:39:19 web3 kernel: [2909159.608975] [<c01c54c6>] __kmalloc+0x176/0x1a0
Sep 27 07:39:19 web3 kernel: [2909159.608979] [<c048d9e3>] pskb_expand_head+0x53/0x170
Sep 27 07:39:19 web3 kernel: [2909159.608982] [<c048f50c>] __pskb_pull_tail+0x4c/0x2d0
Sep 27 07:39:19 web3 kernel: [2909159.608987] [<c04b66d1>] ? nf_iterate+0x71/0x90
Sep 27 07:39:19 web3 kernel: [2909159.608990] [<c04bffd0>] ? dst_output+0x0/0x10
Sep 27 07:39:19 web3 kernel: [2909159.608993] [<c049a743>] dev_queue_xmit+0x143/0x560
Sep 27 07:39:19 web3 kernel: [2909159.608996] [<c04b6829>] ? nf_hook_slow+0x59/0xd0
Sep 27 07:39:19 web3 kernel: [2909159.609000] [<c04c27df>] ip_finish_output+0x12f/0x2c0
Sep 27 07:39:19 web3 kernel: [2909159.609003] [<c04c2a0f>] ip_output+0x9f/0xb0
Sep 27 07:39:19 web3 kernel: [2909159.609006] [<c04bffd0>] ? dst_output+0x0/0x10
Sep 27 07:39:19 web3 kernel: [2909159.609008] [<c04c1bc8>] ip_local_out+0x18/0x20
Sep 27 07:39:19 web3 kernel: [2909159.609011] [<c04c22e7>] ip_queue_xmit+0x147/0x390
Sep 27 07:39:19 web3 kernel: [2909159.609016] [<c012a0dc>] ? enqueue_entity+0x14c/0x1e0
Sep 27 07:39:19 web3 kernel: [2909159.609022] [<c01400d7>] ? lock_timer_base+0x27/0x50
Sep 27 07:39:19 web3 kernel: [2909159.609027] [<c04d4935>] tcp_transmit_skb+0x395/0x6e0
Sep 27 07:39:19 web3 kernel: [2909159.609030] [<c04d252f>] ? tcp_clean_rtx_queue+0x5af/0x790
Sep 27 07:39:19 web3 kernel: [2909159.609033] [<c04d6d47>] tcp_write_xmit+0x167/0x460
Sep 27 07:39:19 web3 kernel: [2909159.609037] [<c04d7421>] __tcp_push_pending_frames+0x31/0x90
Sep 27 07:39:19 web3 kernel: [2909159.609040] [<c04d0166>] tcp_data_snd_check+0x26/0xe0
Sep 27 07:39:19 web3 kernel: [2909159.609043] [<c04d354b>] tcp_rcv_established+0x2eb/0x5d0
Sep 27 07:39:19 web3 kernel: [2909159.609046] [<c04da44d>] tcp_v4_do_rcv+0xbd/0x130
Sep 27 07:39:19 web3 kernel: [2909159.609049] [<c04db1fe>] tcp_v4_rcv+0x5fe/0x6f0
Sep 27 07:39:19 web3 kernel: [2909159.609051] [<c04b66d1>] ? nf_iterate+0x71/0x90
Sep 27 07:39:19 web3 kernel: [2909159.609054] [<c04b6829>] ? nf_hook_slow+0x59/0xd0
Sep 27 07:39:19 web3 kernel: [2909159.609058] [<c04bda87>] ip_local_deliver_finish+0xa7/0x270
Sep 27 07:39:19 web3 kernel: [2909159.609061] [<c04bdcd7>] ip_local_deliver+0x87/0x90
Sep 27 07:39:19 web3 kernel: [2909159.609064] [<c04bd9e0>] ? ip_local_deliver_finish+0x0/0x270
Sep 27 07:39:19 web3 kernel: [2909159.609067] [<c04bd33b>] ip_rcv_finish+0xeb/0x380
Sep 27 07:39:19 web3 kernel: [2909159.609071] [<c050ddb8>] ? packet_rcv_spkt+0x38/0x140
Sep 27 07:39:19 web3 kernel: [2909159.609074] [<c04bd7ee>] ip_rcv+0x21e/0x2e0
Sep 27 07:39:19 web3 kernel: [2909159.609077] [<c050ddb8>] ? packet_rcv_spkt+0x38/0x140
Sep 27 07:39:19 web3 kernel: [2909159.609080] [<c049901f>] netif_receive_skb+0x39f/0x5c0
Sep 27 07:39:19 web3 kernel: [2909159.609084] [<c0453fae>] netif_poll+0x4ce/0x7b0
Sep 27 07:39:19 web3 kernel: [2909159.609088] [<c0499a5a>] net_rx_action+0x12a/0x270
Sep 27 07:39:19 web3 kernel: [2909159.609091] [<c04d82a0>] ? tcp_write_timer+0x0/0x200
Sep 27 07:39:19 web3 kernel: [2909159.609095] [<c01387fb>] __do_softirq+0xab/0x1e0
Sep 27 07:39:19 web3 kernel: [2909159.609099] [<c01775f7>] ? handle_IRQ_event+0x97/0x180
Sep 27 07:39:19 web3 kernel: [2909159.609102] [<c01775c4>] ? handle_IRQ_event+0x64/0x180
Sep 27 07:39:19 web3 kernel: [2909159.609105] [<c01389b5>] do_softirq+0x85/0x90
Sep 27 07:39:19 web3 kernel: [2909159.609108] [<c0138acd>] irq_exit+0x2d/0x40
Sep 27 07:39:19 web3 kernel: [2909159.609111] [<c0443f55>] evtchn_do_upcall+0x1c5/0x1f0
Sep 27 07:39:19 web3 kernel: [2909159.609114] [<c0136ec8>] ? ns_to_timespec+0x38/0x50
Sep 27 07:39:19 web3 kernel: [2909159.609118] [<c0104a06>] hypervisor_callback+0x4a/0x52
Sep 27 07:39:19 web3 kernel: [2909159.609121] [<c0106735>] ? xen_safe_halt+0x15/0x40
Sep 27 07:39:19 web3 kernel: [2909159.609125] [<c0109e09>] xen_idle+0x29/0x80
Sep 27 07:39:19 web3 kernel: [2909159.609127] [<c01034af>] cpu_idle+0x8f/0xc0
Sep 27 07:39:19 web3 kernel: [2909159.609131] [<c051c793>] rest_init+0x53/0x60
Sep 27 07:39:19 web3 kernel: [2909159.609136] [<c06b5c28>] start_kernel+0x379/0x37f
Sep 27 07:39:19 web3 kernel: [2909159.609139] [<c06b56d9>] ? unknown_bootoption+0x0/0x1a0
Sep 27 07:39:19 web3 kernel: [2909159.609144] [<c06b5067>] i386_start_kernel+0x67/0x6e
Sep 27 07:39:19 web3 kernel: [2909159.609150] Mem-Info:
Sep 27 07:39:19 web3 kernel: [2909159.609152] DMA per-cpu:
Sep 27 07:39:19 web3 kernel: [2909159.609154] CPU 0: hi: 0, btch: 1 usd: 0
Sep 27 07:39:19 web3 kernel: [2909159.609156] CPU 1: hi: 0, btch: 1 usd: 0
Sep 27 07:39:19 web3 kernel: [2909159.609157] Normal per-cpu:
Sep 27 07:39:19 web3 kernel: [2909159.609159] CPU 0: hi: 155, btch: 38 usd: 34
Sep 27 07:39:19 web3 kernel: [2909159.609161] CPU 1: hi: 155, btch: 38 usd: 125
Sep 27 07:39:19 web3 kernel: [2909159.609162] HighMem per-cpu:
Sep 27 07:39:19 web3 kernel: [2909159.609164] CPU 0: hi: 155, btch: 38 usd: 27
Sep 27 07:39:19 web3 kernel: [2909159.609166] CPU 1: hi: 155, btch: 38 usd: 11
Sep 27 07:39:19 web3 kernel: [2909159.609170] active_anon:130673 inactive_anon:43799 isolated_anon:48
Sep 27 07:39:19 web3 kernel: [2909159.609171] active_file:69994 inactive_file:114046 isolated_file:0
Sep 27 07:39:19 web3 kernel: [2909159.609172] unevictable:8 dirty:36 writeback:0 unstable:0
Sep 27 07:39:19 web3 kernel: [2909159.609173] free:46836 slab_reclaimable:25018 slab_unreclaimable:2516
Sep 27 07:39:19 web3 kernel: [2909159.609174] mapped:12605 shmem:7770 pagetables:0 bounce:0
Sep 27 07:39:19 web3 kernel: [2909159.609180] DMA free:2880kB min:76kB low:92kB high:112kB active_anon:0kB inactive_anon:0kB active_file:5064kB inactive_file:112kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16256kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:140kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 27 07:39:19 web3 kernel: [2909159.609184] lowmem_reserve[]: 0 696 1710 1710
Sep 27 07:39:19 web3 kernel: [2909159.609191] Normal free:144196kB min:3336kB low:4168kB high:5004kB active_anon:45808kB inactive_anon:11704kB active_file:193280kB inactive_file:178152kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:713224kB mlocked:0kB dirty:56kB writeback:0kB mapped:7844kB shmem:7824kB slab_reclaimable:99932kB slab_unreclaimable:10064kB kernel_stack:1360kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 27 07:39:19 web3 kernel: [2909159.609196] lowmem_reserve[]: 0 0 8111 8111
Sep 27 07:39:19 web3 kernel: [2909159.609203] HighMem free:40268kB min:512kB low:1724kB high:2940kB active_anon:476884kB inactive_anon:163684kB active_file:81632kB inactive_file:277920kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:1038292kB mlocked:32kB dirty:88kB writeback:0kB mapped:42576kB shmem:23256kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:48 all_unreclaimable? no
Sep 27 07:39:19 web3 kernel: [2909159.609207] lowmem_reserve[]: 0 0 0 0
Sep 27 07:39:19 web3 kernel: [2909159.609211] DMA: 458*4kB 1*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 2880kB
Sep 27 07:39:19 web3 kernel: [2909159.609219] Normal: 16083*4kB 5465*8kB 1009*16kB 457*32kB 82*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 144196kB

Sep 27 07:39:19 web3 kernel: [2909159.609227] HighMem: 211*4kB 88*8kB 1554*16kB 407*32kB 7*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 40268kB
Sep 27 07:39:19 web3 kernel: [2909159.609236] 194330 total pagecache pages
Sep 27 07:39:19 web3 kernel: [2909159.609238] 2520 pages in swap cache
Sep 27 07:39:19 web3 kernel: [2909159.609240] Swap cache stats: add 2222885, delete 2220365, find 14893866/14933489
Sep 27 07:39:19 web3 kernel: [2909159.609242] Free swap = 856824kB
Sep 27 07:39:19 web3 kernel: [2909159.609243] Total swap = 917496kB
Sep 27 07:39:19 web3 kernel: [2909159.615308] 447488 pages RAM
Sep 27 07:39:19 web3 kernel: [2909159.615311] 263682 pages HighMem
Sep 27 07:39:19 web3 kernel: [2909159.615312] 10446 pages reserved
Sep 27 07:39:19 web3 kernel: [2909159.615313] 244645 pages shared
Sep 27 07:39:19 web3 kernel: [2909159.615315] 279978 pages non-shared

I can't tell for certain if these errors cause any problem, but we are seeing performance issues on these machines and I want to clear this up before digging too much further.

Similar errors have been reported online related to network drivers (iwlan/madwifi), and connection tracking, none of which should be relevant based on my backtraces.

Also reported relating to EC2: http://developer.amazonwebservices.com/connect/thread.jspa?threadID=43939

According to this thread, similar issues have been fixed in 2.6.35: http://www.spinics.net/lists/netdev/msg139488.html

Ubuntu 10.04
AMI: ami-a94d67dd
Manifest: 099720109477/ebs/ubuntu-images/ubuntu-lucid-10.04-i386-server-20100427.1
AKI: aki-c34d67b7

This particular instance is a c1.medium.

crb@web3:~$ uname -a
Linux web3 2.6.32-305-ec2 #9-Ubuntu SMP Thu Apr 15 04:14:01 UTC 2010 i686 GNU/Linux

Tags: ec2-images
Craig Box (craig.box)
summary: - page allocaiton failure on machines under heavy network load
+ page allocation failure on machines under heavy network load
Revision history for this message
Scott Moser (smoser) wrote :

@Craig, are you able to reproduce this easily ?
I've copied John also, a kernel engineer who handles ubuntu kernels on ec2.

Revision history for this message
Craig Box (craig.box) wrote :

Reproducing: haven't tried, but it seems from other posts that any heavy network load might cause this kind of error.

However, as per Andrew Morton at https://bugzilla.kernel.org/show_bug.cgi?id=12135, it might end up being just a badly written warning message.

Scott Moser (smoser)
tags: added: ec2-images
Revision history for this message
Stefan Bader (smb) wrote :

I would tend to agree that this is more an indication about needing to tweak the system than an actual fault. Has the suggestion of incrementing the value of /proc/sys/vm/min_free_kbytes been tried?

Revision history for this message
Craig Box (craig.box) wrote :

Yes, and the errors stopped as you would expect.

Scott Moser (smoser)
Changed in linux-ec2 (Ubuntu):
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Stefan Bader (smb) wrote :

Since this is agreed to be no actual bug, but a very noisy hint to tune the kernel to a specific use, I am closing this bug. I am swaying between "invalid" and "won't fix" but tending towards the first as the latter would rather imply this being a bug. I added some info as the first thing of the description to help others to quickly solve the issue.

description: updated
Changed in linux-ec2 (Ubuntu):
status: Triaged → Invalid
Revision history for this message
Jason Gunthorpe (jgunthorpe) wrote :

I see this too. This is not similar to kernel bug 12135, this is a failing order 5 allocation, not a failing order 0 allocation. The net stack should not be making high order allocations, something to do with Xen causes them to happen and of course they fail as memory is fragmented.

Revision history for this message
dan (k-dan) wrote :

i have this too. not heavy network usage yet, machine's barely in production. it's an ec2 instance though, with a basic LAMP stack installed. doesn't appear to have affected stability yet either, but getting a lot of
Jun 21 13:20:35 ip-10-227-177-177 kernel: [1376611.883375] swapper: page allocation failure. order:5, mode:0x20
Jun 21 13:20:35 ip-10-227-177-177 kernel: [1376611.883379] Pid: 0, comm: swapper Not tainted 2.6.32-308-ec2 #16-Ubuntu
Jun 21 13:20:35 ip-10-227-177-177 kernel: [1376611.883381] Call Trace:
etc etc..
in /var/log/messages.

Revision history for this message
Henrik (neu242) wrote :

I get the same message now and then, although on a Xen domU Debian installation (kernel 2.6.32-5-xen-686). I've issued a "echo 8192 >/proc/sys/vm/min_free_kbytes" now (as mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=12135#c4), hopefully that's that last I see of this message.

Revision history for this message
Apoorv Birthare (apoorvbirthare) wrote :

Its not actually a bug in kernel but a bug in the application that is trying to acquire large size memory pages and kernel is failing to provide those pages. In our case the application that was misbehaving was the device driver of our network card. Try updating the nic card driver on your system and see if that helps.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.