Comment 1 for bug 1389787

Rafael David Tinoco (rafaeldtinoco) wrote : Re: 3.11 memory consumption leads to HANG (not sure if 3.13 suffers from this).

For this specific backtrace...

Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152776] [<ffffffff8128eb02>] ext4_es_lru_del+0x32/0x80
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152781] [<ffffffff81270595>] ext4_clear_inode+0x45/0x90
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152786] [<ffffffff81257621>] ext4_evict_inode+0x81/0x510
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152792] [<ffffffff811ce490>] evict+0xc0/0x1d0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152795] [<ffffffff811ce5e1>] dispose_list+0x41/0x50
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152797] [<ffffffff8174756f>] ? _raw_spin_trylock+0xf/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152800] [<ffffffff811cf5d5>] prune_icache_sb+0x185/0x340
...
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152842] [<ffffffff81173ff0>] handle_mm_fault+0x2a0/0x3e0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152847] [<ffffffff8174b9ff>] __do_page_fault+0x1af/0x560
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152850] [<ffffffff811b8de7>] ? cp_new_stat+0x107/0x120
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152856] [<ffffffff810ca06c>] ? do_futex+0x7c/0x1b0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152858] [<ffffffff811b91b5>] ? SYSC_newstat+0x25/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152861] [<ffffffff8174bde7>] do_page_fault+0x37/0x70

I DO NOT have a dump for it (and I am not sure whether any old dump exists for this backtrace).
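The kernel prints the innermost frame first, and frames marked with "?" are addresses found on the stack that the unwinder could not confirm as part of the call chain. As a reading aid, here is a minimal sketch that extracts the function names from log lines like the ones above and lists the reliable frames in call order. The helper names (FRAME_RE, parse_frames, call_chain) are made up for this illustration, and the regular expression only assumes the frame format shown in this log.

```python
import re

# One frame of the backtrace as printed in the log above, for example:
#   [<ffffffff8128eb02>] ext4_es_lru_del+0x32/0x80
#   [<ffffffff8174756f>] ? _raw_spin_trylock+0xf/0x30
# The "[<" prefix keeps the "[796493.152776]" timestamp from matching.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(log_text):
    """Return (function, reliable) tuples in the order the kernel printed them."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            reliable = match.group("unreliable") is None
            frames.append((match.group("func"), reliable))
    return frames


def call_chain(log_text):
    """Return the reliable frames only, outermost caller first."""
    return [func for func, reliable in reversed(parse_frames(log_text)) if reliable]


# Lines copied from the backtrace above (innermost frame first).
SAMPLE = """\
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152776] [<ffffffff8128eb02>] ext4_es_lru_del+0x32/0x80
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152781] [<ffffffff81270595>] ext4_clear_inode+0x45/0x90
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152786] [<ffffffff81257621>] ext4_evict_inode+0x81/0x510
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152792] [<ffffffff811ce490>] evict+0xc0/0x1d0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152795] [<ffffffff811ce5e1>] dispose_list+0x41/0x50
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152797] [<ffffffff8174756f>] ? _raw_spin_trylock+0xf/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152800] [<ffffffff811cf5d5>] prune_icache_sb+0x185/0x340
"""

print(" -> ".join(call_chain(SAMPLE)))
```

Read in this order, the chain starts at prune_icache_sb and ends at ext4_es_lru_del, which is the direction the analysis below follows.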

For this backtrace...

Memory pressure appears to be high at the time the page fault occurs. This is also signalled by the fact that the page allocation goes through the slowest path: __alloc_pages_nodemask -> __alloc_pages_slowpath -> __alloc_pages_direct_reclaim -> __perform_reclaim -> try_to_free_pages. Direct reclaim is synchronous: the faulting task itself blocks while it reclaims memory. This may explain why dropping caches seems to help here, since the page allocation then takes a different path.

Moving on a bit...

Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152812] [<ffffffff8115af04>] shrink_slab+0x154/0x300
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152816] [<ffffffff8115dbf8>] do_try_to_free_pages+0x218/0x290
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152819] [<ffffffff8115dfa4>] try_to_free_pages+0xe4/0x1a0

Here the kernel is trying to shrink the slab caches to free some memory, so that a page frame can be allocated for the page fault that just happened.

Moving further, it gets to the point where aged cache pages are scanned (LRU) to check whether they can be freed, AND we probably hit a file-backed page (page cache / demand paging) whose "shrink" function was called (to flush data, free cache, update the filesystem superblock, etc.):

# superblock shrinker (prune_super) pruning the inode cache
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152800] [<ffffffff811cf5d5>] prune_icache_sb+0x185/0x340
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152806] [<ffffffff811b7703>] prune_super+0x193/0x1b0

# calling the ext4 inode evict function (to free the inode)
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152781] [<ffffffff81270595>] ext4_clear_inode+0x45/0x90
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152786] [<ffffffff81257621>] ext4_evict_inode+0x81/0x510

Finally we arrive at ext4_es_lru_del(struct inode *inode):

1023 spin_lock(&sbi->s_es_lru_lock);

The task is hung here, and someone is holding this lock. I would need the core dump to know who and why.
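To illustrate the situation in userspace, here is a rough analogy only, not kernel code: one thread takes a lock and keeps it, and a second thread cannot get past the acquire until the first one lets go. The holder is hypothetical (the log does not show who holds s_es_lru_lock), and the timeout exists only so the example terminates; a real spin_lock() busy-waits with no timeout, which is why the task above never makes progress.

```python
import threading

# Stand-in for sbi->s_es_lru_lock. A threading.Lock sleeps while waiting,
# whereas a kernel spinlock busy-waits; the blocking behaviour is the point.
es_lru_lock = threading.Lock()
holder_ready = threading.Event()
release_holder = threading.Event()


def holder():
    """Hypothetical task that takes the lock and keeps it until told to stop."""
    with es_lru_lock:
        holder_ready.set()
        release_holder.wait()


def try_lru_del(timeout):
    """Stand-in for the ext4_es_lru_del() caller; returns True if it got the lock."""
    acquired = es_lru_lock.acquire(timeout=timeout)
    if acquired:
        es_lru_lock.release()
    return acquired


thread = threading.Thread(target=holder)
thread.start()
holder_ready.wait()

# While the holder keeps the lock, the second caller cannot proceed.
blocked = not try_lru_del(timeout=0.2)

# Once the holder releases the lock, the same call succeeds.
release_holder.set()
thread.join()
after_release = try_lru_del(timeout=0.2)

print("blocked while held:", blocked)
print("acquired after release:", after_release)
```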

Continuing the preliminary analysis of the other backtraces / dumps in another comment...