Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152776] [<ffffffff8128eb02>] ext4_es_lru_del+0x32/0x80
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152781] [<ffffffff81270595>] ext4_clear_inode+0x45/0x90
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152786] [<ffffffff81257621>] ext4_evict_inode+0x81/0x510
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152792] [<ffffffff811ce490>] evict+0xc0/0x1d0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152795] [<ffffffff811ce5e1>] dispose_list+0x41/0x50
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152797] [<ffffffff8174756f>] ? _raw_spin_trylock+0xf/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152800] [<ffffffff811cf5d5>] prune_icache_sb+0x185/0x340
...
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152842] [<ffffffff81173ff0>] handle_mm_fault+0x2a0/0x3e0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152847] [<ffffffff8174b9ff>] __do_page_fault+0x1af/0x560
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152850] [<ffffffff811b8de7>] ? cp_new_stat+0x107/0x120
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152856] [<ffffffff810ca06c>] ? do_futex+0x7c/0x1b0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152858] [<ffffffff811b91b5>] ? SYSC_newstat+0x25/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152861] [<ffffffff8174bde7>] do_page_fault+0x37/0x70
I do NOT have a dump for this one (not sure whether any old dump exists for this backtrace).
For this backtrace...
Memory pressure appears to be high, and the page fault runs right into it: page reclaim is going through the slowest path (__alloc_pages_nodemask -> __alloc_pages_slowpath -> __alloc_pages_direct_reclaim -> __perform_reclaim -> try_to_free_pages), and direct reclaim is a synchronous, blocking way of satisfying the allocation behind the page fault. This also explains why dropping caches might be helping here, since under those conditions the page allocation takes a different path.
Moving a bit...
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152812] [<ffffffff8115af04>] shrink_slab+0x154/0x300
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152816] [<ffffffff8115dbf8>] do_try_to_free_pages+0x218/0x290
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152819] [<ffffffff8115dfa4>] try_to_free_pages+0xe4/0x1a0
trying to shrink the slab caches to free some memory so a frame can be allocated for the page fault that just happened.
Moving further... it gets to the point where aged cache pages are scanned (LRU) to check whether they can be freed, and we probably hit a file-backed page (page cache / demand paging) whose "shrink" function was called (to flush data, free cache, update the filesystem superblock, etc.):
# update filesystem super block
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152800] [<ffffffff811cf5d5>] prune_icache_sb+0x185/0x340
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152806] [<ffffffff811b7703>] prune_super+0x193/0x1b0
# calling inode evict function (to free inode page)
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152781] [<ffffffff81270595>] ext4_clear_inode+0x45/0x90
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152786] [<ffffffff81257621>] ext4_evict_inode+0x81/0x510
Finally we arrive at ext4_es_lru_del(struct inode *inode):
spin_lock(&sbi->s_es_lru_lock);
The task is hung there: someone else is holding this lock. I would need the core dump to know who and why.
Continuing preliminary analysis for other backtraces / dumps on other comment...